In bioinformatics, cutting-edge algorithms and sophisticated bioinformatics tools receive the most interest and attention. Meanwhile, the Unix command-line interface stands as the silent orchestrator, quietly facilitating the efficient manipulation and transformation of data.
In this installment of Decoding Biology, I'll provide an overview of the Unix command line, including what it is and the fundamental commands used by bioinformaticians. While not flashy or attention-grabbing, this piece is part of a broader series aimed at equipping bioinformaticians with fundamental skills, bridging the gap between raw data and meaningful insights. To read the first three installment of this series click the following links: Python Fundamentals For Biologists, Introduction To Biopython, and Data Visualization Fundamentals For Biologists.
🦠 Unix Fundamentals For Bioinformatics
Unix is a computer operating system best known for its powerful command-line interface, which allows users to interact with the system by typing commands. Instead of clicking and pressing buttons, as you would with a graphical user interface (GUI), when using Unix, you enter words and symbols that act as commands and instruct the computer to perform various processes. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks.
Unix is widely used in bioinformatics because of its flexibility, scalability, and powerful command-line tools. Many bioinformatics software tools and pipelines are designed to run in a Unix environment, and the command-line interface allows bioinformaticians to perform complex data analysis and manipulation tasks efficiently. In this newsletter, I’ll provide a crash course on basic Unix commands for bioinformatics, including an overview of essential Unix commands to navigate a file system and move, copy, edit files, and more.
◍◍◍◍◍
🦠 Navigating The Terminal and File System
💻 The Terminal:
A Unix terminal, also known as a command-line interface or shell, is a text-based interface that allows users to interact with the operating system by typing commands. In simple terms, a terminal is like a text-based "conversation" with the computer, where users can instruct it to perform various tasks, such as running programs or viewing files by typing commands.
Being comfortable with the terminal is fundamental in bioinformatics because it empowers bioinformaticians to leverage powerful command-line tools, automate tasks, work with high-performance computing systems, and efficiently manage and analyze biological data. It enhances productivity, flexibility, and the ability to navigate the complexities of bioinformatics workflows. If you open the terminal application on your computer (for Mac users) you should see something like the following:
After accessing your terminal the next thing you’ll want to learn is how to navigate your file system. To do so you’ll need to know three different commands: ls
, cd
, and pwd
.
💻 List Files and Directories:
The ls
command is a fundamental tool in Unix that is regularly used in bioinformatics for navigating directories, checking file contents, and verifying the output of analyses. It is crucial in maintaining organization and ensuring bioinformaticians have the necessary information about their files and directories. The code block below demonstrates the basic syntax for the ls
command:
ls # List files in the current directory
ls -l # List files in long format
ls /path/to/dir # List files in a specific directory
After running
Bioinformaticians often use ls
command to check a directory's contents before running analyses or processing data to ensure that the necessary files are present. Additionally, after running bioinformatics analyses, the ls
command is useful for verifying the output files generated and reviewing their details.
💻 Changing Directories:
In Unix, the command to change directories is cd. Changing directories is a fundamental and frequently used command in bioinformatics. It facilitates efficient organization, navigation, and execution of bioinformatics workflows, ensuring that the necessary data and tools are readily accessible during analyses. The code block below demonstrates the basic syntax for changing directories:
cd # Change to the home directory
cd relative/directory # Change to a relative path
cd /path/to/directory # Change to an absolute path
cd .. # Move up one directory (parent directory)
As you can see in the code block above, if you provide the absolute or relative path, the cd command will change to that directory. However, if no arguments are provided, it will take you to your home directory. The absolute file path describes the location of a file from the root directory. In contrast, the relative path describes the file's location relative to the current working directory. This is important because bioinformatics projects often involve multiple directories for storing raw data, processed data, scripts, and results. As a result, changing directories allows bioinformaticians to navigate to relevant project directories more easily.
💻 Print Current Working Directory:
The pwd
command in Unix stands for "Print Working Directory." It is used to display the absolute path of the current working directory to the terminal The syntax for the pwd
command is as follows:
pwd
The pwd
command is a simple yet valuable tool in bioinformatics for confirming, documenting, and managing file paths during analyses. It provides bioinformaticians with information about their current working directory, aiding in the effective navigation and organization of bioinformatics projects.
◍◍◍◍◍
🦠 Working With Files and Directories:
💻 Copy Files or Directories:
The cp
command in Unix, used to copy files or directories, plays a crucial role in data organization, backup creation, result preservation, and overall project management in bioinformatics workflows. The basic syntax for the cp
command is as follows:
cp [
source] [
destination] # Generic syntax example
cp genome.txt /GenomicData/Project1 # Specific syntax example
In addition to the generic syntax above, there are other common options used with the cp
command, as demonstrated below:
cp -r genome.txt /GenomicData # Recursively copy directories
cp -i genome.txt # Prompt before overwriting files
cp -u genome.txt # Copy only when source file is newer than destination
Bioinformatics projects involve the manipulation and analysis of diverse datasets. The cp
command helps organize data by allowing bioinformaticians to make copies of specific files or directories and arrange them in a structured manner. Additionally, when working with raw data files, it's common to make copies of the original data to avoid accidental modifications or to ensure the original data is preserved in case modifications lead to unexpected results.
💻 Move, Rename, and Remove Files or Directories:
The mv
command in Unix, used to move or rename files, plays an important role in data organization and maintaining a clean and structured workspace throughout the various stages of a bioinformatics workflow. The basic syntax for the mv
command is as follows:
mv [source] [destination] # Generic syntax example
mv genome.txt /GenomicData/Project1 # Specific syntax example
mv old_name.txt new_name.txt # Rename file
mv -i genome.txt # Prompt before overwriting files
mv -u genome.txt # Move only when source file is newer than destination
In bioinformatics, the mv
command comes in handy when handling temporary files. After completing an analysis, you can move the results to an archive to keep your main project directory uncluttered. Another way to keep your project directory uncluttered is to use Unix’s rm command, which removes or deletes files and directories. The basic syntax for the rm
command is as follows:
rm file_name.txt # Remove a file
rm -i file.txt # Prompt before removing a file
rm -f file.txt # Force remove a file without prompting
rm -r directory # Remove a directory and its contents
It's important to use the rm command cautiously, especially with the -r
and -f
options, as it can lead to the irreversible deletion of files and directories. Double-checking and verifying the files to be removed or using the -i
option to prompt for confirmation are good practices to avoid accidental data loss.
💻 Creating and Removing Directories:
Bioinformatics projects often involve creating a directory structure to organize raw data, processed data, scripts, and results. In Unix, the mkdir
command helps create project directories with the specified names. The basic syntax for the mkdir
command is as follows:
mkdir directory_name # Create a new directory
Whereas the mkdir
command is used to create new directories, the rmdir
command removes empty directories.
rmdir directory_name # Remove an empty directory
It's important to note that the rmdir
command can only remove directories that are empty. If a directory contains files or subdirectories, the rmdir
command will not work unless the -p
option is used.
◍◍◍◍◍
🦠 Viewing and Editing Files:
💻 Displaying The Contents Of a File:
In Unix, the cat
command is used to concatenate and display the contents of files. Bioinformaticians often use the cat
command to quickly inspect data files or analyses' outputs. Additionally, when working with bioinformatics data, there may be a need to concatenate the contents of multiple files. The cat
command can be used to combine files into a single stream. In the code block below, you'll learn the basic syntax for the cat
command:
cat file_name.txt # Display the contents of a file
cat -n file_name.txt # Display contents of file w/ line numbers
cat f1.txt f2.txt # Concatenate & display the contents of multiple files
cat f1.txt f2.txt > new_fule.txt # Concatenate files and create new file
It's worth mentioning that while the cat
command is handy for certain tasks, in more complex scenarios, Unix commands like less
or more
may be more appropriate for viewing and navigating through large datasets since they allow users to navigate through the content one screen at a time, making it easier to read and search through extensive datasets. Additionally, both commands provide a way to view text files without loading the entire file into memory, which can be important for large bioinformatics datasets.
In Unix, the more
command displays the contents of a file one screen at a time. After displaying a screen, it waits for user input to continue to the next screen or quit the display. The basic syntax is as follows:
more file_name.txt
# To display the next line press enter
# To display the next screen press space
# To quite the display press q
The less
command is an improved version of more
. It provides more features and allows for both forward and backward navigation through the file. The basic syntax is as follows:
les file_name.txt
# To display the next line press enter
# To display the next screen press space
# To quite the display press q
# To search forward for a specific pattern type /pattern
# To search backward for a specific pattern type ?pattern
💻 Text Editor For Creating and Modifying Files
The nano
command in Unix is a text editor commonly used in the terminal for simple text editing tasks. It is designed to be user-friendly and is especially suitable for users who may not be familiar with more advanced text editors like vim. The nano
command provides a basic and straightforward interface for creating and editing text files, which you can open with the code below:
nano file_name
When you open a file with nano
, you are presented with a text editor interface within the terminal. The bottom of the screen displays various commands that you can use for different operations, as demonstrated below:
Bioinformatics tools and software often use configuration files, and the nano
command provides a simple way to edit these files directly in the terminal. Additionally, bioinformaticians may use the nano
command for quick edits and modifications. However, advanced editors like vim or IDEs are preferred for more extensive coding tasks.
◍◍◍◍◍
🦠 Searching and Filtering:
💻 Searching For Patterns In Files:
The grep
command in Unix is a powerful tool for searching and matching patterns within text files. In bioinformatics, the grep
command is used to search for and extract relevant information from text-based data files efficiently. Additionally, it plays a crucial role in tasks ranging from sequence analysis to quality control and data exploration. The code block below demonstrates the basic syntax for grep
:
grep pattern file # Basic syntax
grep -i pattern file # Case-insensitive search
grep -n pattern file # Display line numbers with matched line(s)
# pattern refers to the tex expression you wish to search for
# file refers to the name of the file(s) to search for the pattern in
When you run the grep
command, it scans the specified file(s) line by line, searching for the specified pattern. If a line contains the specified pattern, grep
will print that line to the terminal. In effect, the grep command works similarly to regex in Python, which I covered in a previous article titled, Identifying DNA Binding Motifs With Regular Expressions.
💻 Filtering and Transforming Data:
The awk
in Unix command is a versatile tool for pattern scanning and text processing. In essence, the awk command utilizes a programming language designed for processing and analyzing text data. The basic syntax for awk
is as follows:
awk 'pattern { action }' file_name
# pattern specifies the condition for executing the action
# action specifies the action to performed when the pattern is matched
The awk
command reads the specified file line-by-line to evaluate the specified pattern. If the pattern is true, then the association action is performed. The sed
command is similar, but it performs search-and-replace operations instead of performing a specified action. The basic syntax for the sed
command is as follows:
sed 's/pattern/replacement/' file_name
# pattern species the text pattern to search for
# replacement species the text to replace the matched pattern
Both awk
and sed
are useful tools in bioinformatics for text processing and manipulation. They provide bioinformaticians with the flexibility to extract, transform, and clean data efficiently, facilitating various aspects of data analysis and interpretation.
◍◍◍◍◍
🦠 Pipelines and Redirection:
💻 Creating A Pipeline:
The |(pipe) command in Unix combines the output of one command and uses it as the input for another command, allowing you to chain multiple commands together, creating a pipeline for data processing. The basic syntax for the pipe command is as follows:
command_1 | command_2
In the code example above, the first command’s (command_1) output is used as the input for the following command (command_2). The second command then processes the output from the first command. The type of modularity the pipe command provides allows bioinformaticians to create efficient and flexible workflows for data processing, filtering, and integrating various tools.
💻 Input and Output Redirection:
In Unix, the > and < symbols are used for input and output redirection, allowing you to control where the input for a command comes from or where the output of a command goes. The > symbol redirects the standard output of a command to a file, and it creates or overwrites the specified file with the output of the command. The syntax for > is as follows:
command > output_file # Generic syntax
echo "Hello, World!" > output_file.txt # Specific example
# command refers to the command whose output you wish to redirect
# output_file refers to the file which the output will be written
The < symbol, on the other hand, is used to redirect the standard input of a command from a file. It takes the contents of the specified file and provides it as input to the command. The syntax for this command is as follows:
command < input_file # Generic syntax
grep "pattern" < input_file.txt # Specific example
# command refers to the command that will take input from the file
# input_file is the file from which the input is read
Output and input redirection are fundamental concepts in Unix, and they play a crucial role in bioinformatics for managing data and results efficiently within various analysis workflows. For example, Bioinformatics analyses often produce results that need to be saved. The >
symbol is used to redirect the output of tools and commands to files, allowing bioinformaticians to store and analyze results, as demonstrated below:
bioinformatics_analysis_result > results.txt
◍◍◍◍◍
🦠 Downloading and Exploring Files:
In this last section, I will show you how to download the contents of files with Unix; then, I'll incorporate concepts from earlier in this newsletter to show you how to view and explore the contents of said files. The file we'll download contains genomic data for Saccharomyces cerevisiae, also known as baker's yeast. To start, we'll want to create a new directory to house the Saccharomyces cerevisiae genomic data, as demonstrated below:
conda activate bioinfo # Active conda environment
mdkir new_analysis # Make new directory called new_analysis
cd new_analysis # Change directory to new_analysis
Next, I’ll download Saccharomyces cerevisiae’s genomic data from the Saccharomyces Genome Database (SGD). After navigating to the website and finding the file I want to download, I select “copy link”, as demonstrated in the image below:
Next, I use Unix’s wget
command, which allows you to download files from the internet using the HTTP, HTTPS, or FTP protocols. It is a non-interactive command-line tool, meaning you provide the URL of the file you want to download, and wget retrieves it for you. In the code block below, i’ll demonstrate the wget
command with my chosen file’s URL:
# Store URL is variable
URL=http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab
# Download URL with wget
wget $URL
#Alternatively, you can use the curl command
curl $URL > SGD_features.tab
The SGD_features.tab file containing Saccharomyces cerevisiae’s genomic data is now downloaded to my directory called new_analysis2. To view the contents of this file one page at a time, I can use Unix’s more command as demonstrated below:
more SGD_features.tab
Which produces the following output:
Now, let’s say I wanted more information on the TEL01L, a telomeric region on the left arm of chromosome 1 in Saccharomyces cerevisiae. To retrieve this information, I can use the following code:
cat SGD_features.tab | grep TEL01L --color=always
The code above uses the cat
command to concatenate and display the contents of our file named SGD_features.tab. The | (pipe)
command is then used to take the output from the cat command and pass it as input to the next command (grep
TEL01L). The grep
command is then used to search for lines in the input that contain the pattern TEL01L. Finally, the —-color=always
command is used to highlight the matched pattern, making it easier to spot visually. The code above then produces the following output:
Finally, If I wanted to save the information display above In a new file I cause use the >
(redirection) command in the following manner to produce a new file called TEL01L.tab as demonstrated below:
cat SGD_features.tab | grep TEL01L --color=always > TEL01L.tab
This is music to my ears. I can now use an old obsession of mine in a whole new way..thanks for reminding us that we don't always have to look for the latest technology to power our projects.