Go On, Git!
An Introduction to Git for Version Control In Bioinformatics
Decoding Biology Shorts is a new weekly newsletter sharing tips, tricks, and lessons in bioinformatics.
🧬 An Introduction to Git for Version Control In Bioinformatics
In a previous post, Looping Through Repetitive Tasks, I discussed how automating bioinformatics workflows makes analyses more robust and reproducible. Today’s post builds off the topic of reproducibility. Specifically, we’ll explore the basics of version control with Git, which allows you to manage changes to a project systematically, thereby enhancing reproducibility, especially in collaborative projects.
Whether you’re a seasoned coder or just starting out in bioinformatics, chances are you’ve encountered version control—perhaps without realizing it. Imagine working on a Python script to analyze gene expression data. You start with gene_analysis.py and create modified versions like gene_analysis_v2.py or gene_analysis_with_filters.py as you refine your analysis. This kind of "manual" versioning helps you keep track of changes and revert to earlier versions of your script when needed. However, while useful, this approach doesn’t scale well to complex bioinformatics projects or collaborative environments, where consistent and centralized tracking is crucial.
Fortunately, software engineers have developed tools and best practices to handle these challenges. Git, a widely used version control system, is particularly powerful for both individual and collaborative projects. Paired with platforms like GitHub, Git simplifies sharing code and tracking changes across time.
In this guide, we’ll cover Git fundamentals and best practices. Whether you’re managing personal bioinformatics scripts or preparing for collaborative research, mastering Git will save time and frustration.
🧬 Installing Git and Getting Started
To begin using Git, you first need to install it. If you’re using macOS, you can install Git using Homebrew with the following command:
$ brew install git
Once Git is installed, you’ll need to configure it to identify yourself, which is crucial for keeping track of who made changes in collaborative projects. To configure Git you can use the following commands (you can verify the configurations by running git config --list):
$ git config --global user.name "Evan Peikon"
$ git config --global user.email "notmyrealemail@gmail.com"
Another useful setting, especially when working in the terminal, is color-coded output. This makes the Git experience more user friendly and makes it easier to visually differentiate between changes:
$ git config --global color.ui true
These steps ensure Git is ready to track your contributions and make the command-line interface more intuitive.
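To double-check the settings described above, you can read each one back individually rather than scanning the full git config --list output. The sketch below is only an illustration: the name and email are placeholders, and it temporarily points HOME at a throwaway directory (my assumption, so the demo does not overwrite a real ~/.gitconfig). Drop the HOME lines if you want to configure your actual account.

```shell
# Point HOME at a throwaway directory so this demo leaves your real
# ~/.gitconfig untouched (assumption for demonstration purposes only).
old_home="${HOME:-}"
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME GIT_CONFIG_GLOBAL

git config --global user.name "Ada Example"        # placeholder name
git config --global user.email "ada@example.com"   # placeholder email
git config --global color.ui true

# Read back individual settings with --get.
name="$(git config --global --get user.name)"
email="$(git config --global --get user.email)"
color="$(git config --global --get color.ui)"
echo "name=$name email=$email color.ui=$color"

# Restore the original HOME.
export HOME="$old_home"
```

If any of the three values comes back empty, the corresponding git config command did not take effect (for example, because of a typo in --global).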
🧬 Tracking and Staging Files In Git
To get started with Git, you first need to initialize a directory as a Git repository, which is a directory that is under version control. This repository will store both your working files and snapshots of the project at various time points (commits). Think of these snapshots as milestones that allow you to revisit earlier versions of your work, such as when testing multiple approaches to analyze RNA-Seq data. This may seem like extra work, but it’s important to remember that version control is like an insurance policy. By taking an extra few minutes of time now, you’ll save yourself hours in the future should an issue arise.
To create a new repository, navigate to the directory you want to track and run the following command:
$ git init
For example, let’s say I have a directory on my desktop titled RNASeq_Project. I can change into the RNASeq_Project directory, then type git init, as demonstrated below:
$ cd ~/Desktop/RNASeq_Project
$ git init
The output will then confirm the repository initialization:
Initialized empty Git repository in /Users/evanpeikon/Desktop/RNASeq_Project/.git/
The above output tells me that I’ve now initialized the RNASeq_Project Git repository. However, just because we’ve now initialized this repository does not mean that Git will start automatically tracking everything for us (this is a feature, not a bug). Rather, you have to tell Git what you want to track with the git add command, which we’ll cover shortly.
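Here is a small sketch of what git init leaves behind and how to confirm that nothing is tracked yet. It uses a throwaway directory created with mktemp as a stand-in for RNASeq_Project, so the path is an assumption rather than the one used in this post.

```shell
# Throwaway stand-in for the RNASeq_Project directory.
project="$(mktemp -d)/RNASeq_Project"
mkdir -p "$project"
cd "$project"

git init -q    # -q suppresses the "Initialized empty Git repository" message

# The repository's history and settings live in the hidden .git directory.
ls -d .git

# Ask Git whether the current directory is under version control.
inside="$(git rev-parse --is-inside-work-tree)"
echo "inside work tree: $inside"

# git ls-files lists tracked files; a fresh repository has none.
tracked="$(git ls-files)"
echo "tracked files: '$tracked'"
```

The empty list of tracked files is the point made above: initializing a repository does not start tracking anything on its own.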
However, before tracking anything we’ll cover the git status command, which inspects the contents of our working directory and staging area. Below, I first create a Python script in the RNASeq_Project repository, and then run the git status command:
$ nano differential_expression.py
$ git status
Which produces the following output:
On branch main
No commits yet
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        differential_expression.py
nothing added to commit but untracked files present (use "git add" to track)
As you can see in the output above, git status tells us that we have an untracked file named differential_expression.py. To start tracking this file we can use the following command:
$ git add differential_expression.py
Now, if you run git status again, you’ll notice the file has moved to the "Changes to be committed" section:
On branch main
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   differential_expression.py
As you can see, there are still no commits, but instead of our script being under the section “Untracked files”, it’s now under “Changes to be committed”. As a result, if we made a commit right now, it would take a snapshot of the differential_expression.py script exactly as it was when we used the git add command. Any changes we make to the script after that point will not be included in the upcoming commit unless we run git add again. This distinction ensures that you only commit specific versions of your files.
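To see this snapshot behavior in action, the sketch below stages a one-line script, edits it again, and then compares the staged copy with the copy on disk. It runs in a throwaway directory, and the file contents are made up for illustration.

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q

echo "# Differential expression analysis script" > differential_expression.py
git add differential_expression.py    # stage the one-line version

# Edit the file again AFTER staging it.
echo "# Added filtering for low-expression genes" >> differential_expression.py

# Short status: first column = staging area, second column = working directory.
short_status="$(git status --short)"
echo "$short_status"

# Compare the staged copy (":path" means "the version in the index")
# with the copy on disk.
staged_lines="$(git show :differential_expression.py | wc -l | tr -d ' ')"
disk_lines="$(wc -l < differential_expression.py | tr -d ' ')"
echo "staged: $staged_lines line(s), on disk: $disk_lines line(s)"
```

The status code AM means the file is added in the staging area (A) and also modified in the working directory since it was staged (M). A commit at this point would record only the one-line version.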
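*Note: If you accidentally stage a file with git add that you do not want to commit, you can unstage it with git reset HEAD followed by the filename, which removes that file's changes from the staging area. In a brand-new repository with no commits, HEAD does not exist yet, so use git rm --cached followed by the filename instead.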
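As a quick illustration of unstaging in a repository that has no commits yet, here is a minimal sketch using git rm --cached. The file notes.txt is a hypothetical file staged by mistake, and the repository is a throwaway directory.

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q

echo "# scratch notes" > notes.txt    # hypothetical file staged by mistake
git add notes.txt
before="$(git status --short)"

# Remove the file from the staging area only;
# --cached leaves the copy on disk untouched.
git rm --cached -q notes.txt
after="$(git status --short)"

echo "before: $before"
echo "after:  $after"
```

The file goes from A (staged) back to ?? (untracked), and it is still present in the working directory.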
🧬 Committing Changes In Git
In the last section I introduced the concept of a commit, which is like taking a snapshot of your project at a specific point in time (a commit permanently records the state of the staged files in your repository). To make a commit, use the git commit command with a descriptive message about the changes, as demonstrated below. This message helps you and your collaborators understand the purpose of the commit, which is particularly useful in bioinformatics workflows where multiple scripts and datasets are involved:
$ git commit -m "Added script for differential expression analysis"
The output will summarize the commit:
[main (root-commit) e6e9011] Added script for differential expression analysis
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 differential_expression.py
Up to this point we’ve covered how to stage and commit changes in a repository using git add and git commit, as well as how to inspect the contents of a working directory and staging area with git status.
Another useful command in this process is git diff, which shows the differences between your current working directory and staging area (i.e., what changes have been made to a file). For example, let’s say I make a few additional edits to differential_expression.py. I can then use git diff to see the differences as demonstrated below:
$ git diff
This command will display the exact lines added or modified, as demonstrated below:
diff --git a/differential_expression.py b/differential_expression.py
index e2c6ca3..f1797ca 100644
--- a/differential_expression.py
+++ b/differential_expression.py
@@ -1 +1,3 @@
# Differential expression analysis script
+# Added filtering for low-expression genes
+# Updated visualization parameters
This is especially helpful for bioinformatics projects, where even minor changes in a script can significantly impact downstream results.
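One detail worth seeing for yourself: plain git diff only reports edits that have not been staged yet. Once you run git add, those edits show up under git diff --staged instead. The sketch below demonstrates this in a throwaway repository with a placeholder identity (both are assumptions for the demo).

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Ada Example"         # placeholder identity, local to this repo
git config user.email "ada@example.com"

echo "# Differential expression analysis script" > differential_expression.py
git add differential_expression.py
git commit -q -m "Added script for differential expression analysis"

echo "# Added filtering for low-expression genes" >> differential_expression.py

# Unstaged edit: visible to plain git diff, invisible to git diff --staged.
unstaged_before="$(git diff --name-only)"
staged_before="$(git diff --staged --name-only)"

git add differential_expression.py

# After staging, the two commands swap roles.
unstaged_after="$(git diff --name-only)"
staged_after="$(git diff --staged --name-only)"

echo "before add: diff='$unstaged_before' staged='$staged_before'"
echo "after add:  diff='$unstaged_after' staged='$staged_after'"
```

So if git diff prints nothing after an edit, check whether the change has already been staged.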
Additionally, Git maintains a history of all commits, allowing you to track changes over time. We can use git log to list all of these previous commits, as demonstrated below:
$ git log
This provides a list of commits, newest first, including the commit message, author, date, and a unique identifier (SHA):
commit 8f2fd52cf875becbbbd0f97ed16b41c0e411b495 (HEAD -> main)
Author: Evan Peikon <evanpeikon@gmail.com>
Date: Mon Dec 30 13:39:08 2024 -0500
    Improved filtering for low-expression genes
commit e6e9011a2840b1036c2eb9ac7c6c6b55c9b1a33e
Author: Evan Peikon <evanpeikon@gmail.com>
Date: Mon Dec 30 13:29:21 2024 -0500
    Added script for differential expression analysis
*Note: If you want to see more information, you can also use the git show HEAD command, which displays everything git log displays for the HEAD commit, plus the file changes that were committed.
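For longer histories, a condensed view can be easier to scan. The sketch below rebuilds a two-commit history like the one described here in a throwaway repository (with a placeholder identity) and prints it with git log --oneline, which shows an abbreviated SHA and the commit message on a single line per commit.

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Ada Example"         # placeholder identity, local to this repo
git config user.email "ada@example.com"

echo "# Differential expression analysis script" > differential_expression.py
git add differential_expression.py
git commit -q -m "Added script for differential expression analysis"

echo "# Added filtering for low-expression genes" >> differential_expression.py
git add differential_expression.py
git commit -q -m "Improved filtering for low-expression genes"

# One line per commit: abbreviated SHA plus message, newest first.
git log --oneline

# Messages only, so the ordering is easy to see without the SHAs
# (which will differ on every machine).
subjects="$(git log --format=%s)"
echo "$subjects"

commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count"
```

The abbreviated SHAs printed by --oneline are the same short identifiers you can pass to other Git commands.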
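In the git log output, the long string of letters and numbers after the word commit is called a SHA (a SHA-1 hash). The SHA is a unique identifier for each commit. If you need to return to a previous state, you can pass the SHA, or just its first 7 characters, to the git reset command, as demonstrated below. By default, git reset moves the branch back to that commit and clears the staging area, but it leaves the files in your working directory as they are: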
$ git reset <commit_SHA>
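Before running git reset on real work, it can help to try it somewhere harmless. The following sketch (throwaway repository, placeholder identity, made-up file contents) resets to the first of two commits and then checks what actually changed. It relies on Git's default reset mode, which rewinds the commit history and the staging area but does not touch files on disk.

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Ada Example"         # placeholder identity, local to this repo
git config user.email "ada@example.com"

echo "# Differential expression analysis script" > differential_expression.py
git add differential_expression.py
git commit -q -m "Added script for differential expression analysis"

echo "# Added filtering for low-expression genes" >> differential_expression.py
git add differential_expression.py
git commit -q -m "Improved filtering for low-expression genes"

# Abbreviated SHA of the first commit (the parent of HEAD).
first_sha="$(git rev-parse --short HEAD~1)"

git reset -q "$first_sha"

commit_count="$(git rev-list --count HEAD)"
disk_lines="$(wc -l < differential_expression.py | tr -d ' ')"
status_after="$(git status --short)"
echo "commits: $commit_count, lines on disk: $disk_lines, status: '$status_after'"
```

After the reset, the history contains one commit again, but the file still has both lines and shows up as modified. In other words, the second commit's edits are kept as uncommitted changes rather than deleted.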
🧬 Closing Thoughts
Bioinformatics projects often involve large datasets, custom scripts, and collaborative workflows. Version control provides an "insurance policy," ensuring that you can trace changes, recover prior states, and collaborate effectively. By integrating Git into your workflow, you’ll save hours of troubleshooting and enhance the reproducibility of your analyses.
To quickly summarize the key points from this post, I’ve included a handy list of Git commands for your reference below:
git init creates a new git repository
git status inspects the contents of a working directory and staging area
git add <file name> adds files from the working directory to the staging area
git diff shows the differences between the working directory and staging area
git commit permanently stores file changes from the staging area in the repository
git log shows a list of all previous commits
git show HEAD displays everything the git log command displays for the HEAD commit, plus the file changes that were committed
git checkout HEAD <filename> discards changes to that file in the working directory so it looks exactly the same as it did when you last made a commit
git reset HEAD <filename> unstages file changes in the staging area
git reset <commit SHA> resets to a previous commit in the commit history, identified by the first 7 characters of that commit's SHA
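Of the commands in this list, git checkout HEAD is the only one not demonstrated earlier, so here is a minimal sketch of it. It uses a throwaway repository with a placeholder identity, and the "accidental edit" is invented for the demo. Note that this command permanently discards the uncommitted edits to the named file.

```shell
repo="$(mktemp -d)"    # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Ada Example"         # placeholder identity, local to this repo
git config user.email "ada@example.com"

echo "# Differential expression analysis script" > differential_expression.py
git add differential_expression.py
git commit -q -m "Added script for differential expression analysis"

# Simulate an unwanted edit to the committed file.
echo "accidental edit" >> differential_expression.py
before="$(git status --short)"

# Restore the file to its state in the last commit.
# The "--" separates the commit name from the file path.
git checkout HEAD -- differential_expression.py
after="$(git status --short)"
lines="$(wc -l < differential_expression.py | tr -d ' ')"

echo "before: '$before'"
echo "after:  '$after' (lines on disk: $lines)"
```

An empty status afterwards means the working directory matches the last commit again.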
Note: If you don’t know how to navigate directories from the command line, I recommend reading the following guide first: Bash Fundamentals for Bioinformatics.