Unix Fundamentals For Bioinformatics

A Beginner's Guide to Unix for Bioinformatics

Dec 07, 2023

In bioinformatics, cutting-edge algorithms and sophisticated tools receive most of the attention. Meanwhile, the Unix command-line interface is the silent orchestrator, quietly facilitating the efficient manipulation and transformation of data.

In this installment of Decoding Biology, I'll provide an overview of the Unix command line, including what it is and the fundamental commands used by bioinformaticians. While not flashy or attention-grabbing, this piece is part of a broader series aimed at equipping bioinformaticians with fundamental skills, bridging the gap between raw data and meaningful insights. To read the first three installments of this series, click the following links: Python Fundamentals For Biologists, Introduction To Biopython, and Data Visualization Fundamentals For Biologists.

🦠 Unix Fundamentals For Bioinformatics

Unix is a computer operating system best known for its powerful command-line interface, which allows users to interact with the system by typing commands. Instead of clicking and pressing buttons, as you would with a graphical user interface (GUI), when using Unix, you enter words and symbols that act as commands and instruct the computer to perform various processes. It may seem archaic to use a keyboard to issue commands today, but it’s much easier to automate keyboard tasks than mouse tasks.

Unix is widely used in bioinformatics because of its flexibility, scalability, and powerful command-line tools. Many bioinformatics software tools and pipelines are designed to run in a Unix environment, and the command-line interface allows bioinformaticians to perform complex data analysis and manipulation tasks efficiently. In this newsletter, I’ll provide a crash course on basic Unix commands for bioinformatics, including the essential commands for navigating a file system and for moving, copying, and editing files.


                                                                  ◍◍◍◍◍

🦠 Navigating The Terminal and File System

💻 The Terminal:

A Unix terminal, also known as a command-line interface or shell, is a text-based interface that allows users to interact with the operating system by typing commands. In simple terms, a terminal is like a text-based "conversation" with the computer, where users can instruct it to perform various tasks, such as running programs or viewing files by typing commands.

Being comfortable with the terminal is fundamental in bioinformatics because it empowers bioinformaticians to leverage powerful command-line tools, automate tasks, work with high-performance computing systems, and efficiently manage and analyze biological data. It enhances productivity, flexibility, and the ability to navigate the complexities of bioinformatics workflows. If you open the terminal application on your computer (for Mac users) you should see something like the following:

After opening your terminal, the next thing you’ll want to learn is how to navigate your file system. To do so, you’ll need to know three commands: ls, cd, and pwd.

💻 List Files and Directories:

The ls command is a fundamental tool in Unix that is regularly used in bioinformatics for navigating directories, checking what a directory contains, and verifying the output of analyses. It is crucial in maintaining organization and ensuring bioinformaticians have the necessary information about their files and directories. The code block below demonstrates the basic syntax for the ls command:

ls                # List files in the current directory
ls -l             # List files in long format
ls /path/to/dir   # List files in a specific directory


Bioinformaticians often use the ls command to check a directory's contents before running analyses or processing data to ensure that the necessary files are present. Additionally, after running bioinformatics analyses, the ls command is useful for verifying the output files generated and reviewing their details.
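
As a rough sketch of that before-and-after check, the snippet below lists a throwaway directory in a few different ways. The directory and the two .fastq file names are made up for illustration; the -h option (human-readable sizes) isn't covered above but is a standard companion to -l.

```shell
# Hypothetical sandbox: a throwaway directory with two made-up files
workdir=$(mktemp -d)
touch "$workdir/sample_1.fastq" "$workdir/sample_2.fastq"

ls "$workdir"              # Names only
ls -l "$workdir"           # Long format: permissions, size, modification time
ls -lh "$workdir"          # Long format with human-readable file sizes
ls "$workdir"/*.fastq      # Only the files matching a wildcard pattern

# Capture the listing so it can be checked or logged later
listing=$(ls "$workdir")
echo "$listing"
```

Running ls with a wildcard such as *.fastq is a quick way to confirm that every expected input file is present before starting an analysis.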

💻 Changing Directories:

In Unix, the command to change directories is cd. Changing directories is a fundamental and frequently used command in bioinformatics. It facilitates efficient organization, navigation, and execution of bioinformatics workflows, ensuring that the necessary data and tools are readily accessible during analyses. The code block below demonstrates the basic syntax for changing directories:

cd                        # Change to the home directory
cd relative/directory     # Change to a relative path
cd /path/to/directory     # Change to an absolute path
cd ..                     # Move up one directory (parent directory)

As you can see in the code block above, if you provide the absolute or relative path, the cd command will change to that directory. However, if no arguments are provided, it will take you to your home directory. The absolute file path describes the location of a file from the root directory. In contrast, the relative path describes the file's location relative to the current working directory. This is important because bioinformatics projects often involve multiple directories for storing raw data, processed data, scripts, and results. As a result, changing directories allows bioinformaticians to navigate to relevant project directories more easily.

💻 Print Current Working Directory:

The pwd command in Unix stands for "Print Working Directory." It displays the absolute path of the current working directory in the terminal. The syntax for the pwd command is as follows:

pwd

The pwd command is a simple yet valuable tool in bioinformatics for confirming, documenting, and managing file paths during analyses. It provides bioinformaticians with information about their current working directory, aiding in the effective navigation and organization of bioinformatics projects.
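
To see cd and pwd working together, here is a minimal sketch that moves around a small project layout using both absolute and relative paths. The layout (raw_data and results inside a temporary directory) is an assumption made purely for illustration.

```shell
# Hypothetical project layout created in a throwaway directory
project=$(mktemp -d)
mkdir "$project/raw_data" "$project/results"

cd "$project"              # Absolute path: works from anywhere
pwd                        # Prints the full path of the project directory
cd raw_data                # Relative path: resolved from the current directory
cd ../results              # Up one level, then down into results

here=$(pwd)                # Store the current location in a variable
echo "$here"               # Ends in /results
```

Running pwd after each cd is a cheap way to confirm you are where you think you are before launching a command that writes files.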


                                                                  ◍◍◍◍◍

🦠 Working With Files and Directories:

💻 Copy Files or Directories:

The cp command in Unix, used to copy files or directories, plays a crucial role in data organization, backup creation, result preservation, and overall project management in bioinformatics workflows. The basic syntax for the cp command is as follows:

cp [source] [destination]             # Generic syntax example
cp genome.txt /GenomicData/Project1   # Specific syntax example

In addition to the generic syntax above, there are other common options used with the cp command, as demonstrated below:

cp -r Project1 /GenomicData       # Recursively copy a directory (here, Project1) and its contents
cp -i genome.txt /GenomicData     # Prompt before overwriting files
cp -u genome.txt /GenomicData     # Copy only when source is newer than destination (GNU cp)

Bioinformatics projects involve the manipulation and analysis of diverse datasets. The cp command helps organize data by allowing bioinformaticians to make copies of specific files or directories and arrange them in a structured manner. Additionally, when working with raw data files, it's common to make copies of the original data to avoid accidental modifications or to ensure the original data is preserved in case modifications lead to unexpected results.
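
The "copy before you modify" habit described above might look something like the sketch below. The file and directory names are hypothetical, and the two-line genome.txt is a stand-in for real data.

```shell
# Hypothetical raw data in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"
mkdir raw_data
printf '>seq1\nATGC\n' > raw_data/genome.txt

cp raw_data/genome.txt genome_working.txt   # Copy a single file under a new name
cp -r raw_data raw_data_backup              # Copy a whole directory and its contents

# Edit only the working copy (>> appends a line); the original stays untouched
echo "GGCC" >> genome_working.txt
```

If the edit goes wrong, the untouched original is still in raw_data, and a second copy sits in raw_data_backup.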

💻 Move, Rename, and Remove Files or Directories:

The mv command in Unix, used to move or rename files, plays an important role in data organization and maintaining a clean and structured workspace throughout the various stages of a bioinformatics workflow. The basic syntax for the mv command is as follows:

mv [source] [destination]               # Generic syntax example
mv genome.txt /GenomicData/Project1     # Specific syntax example
mv old_name.txt new_name.txt            # Rename file
mv -i genome.txt /GenomicData/Project1  # Prompt before overwriting files
mv -u genome.txt /GenomicData/Project1  # Move only when source is newer than destination (GNU mv)

In bioinformatics, the mv command comes in handy when handling temporary files. After completing an analysis, you can move the results to an archive to keep your main project directory uncluttered. Another way to keep your project directory uncluttered is to use Unix’s rm command, which removes or deletes files and directories. The basic syntax for the rm command is as follows:

rm file_name.txt  # Remove a file
rm -i file.txt    # Prompt before removing a file
rm -f file.txt    # Force remove a file without prompting
rm -r directory   # Remove a directory and its contents

It's important to use the rm command cautiously, especially with the -r and -f options, as it can lead to the irreversible deletion of files and directories. Double-checking and verifying the files to be removed or using the -i option to prompt for confirmation are good practices to avoid accidental data loss.
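
Here is a small sketch of the tidy-up routine described above: archive the result, rename it, and delete the temporary file. All names are made up, and everything happens inside a throwaway directory so nothing important is at risk.

```shell
# Hypothetical workspace with one result file and one temporary file
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive
touch results.txt temp_alignment.tmp

mv results.txt archive/                            # Move the finished result into the archive
mv archive/results.txt archive/run1_results.txt    # Rename it in place
rm temp_alignment.tmp                              # Delete the temporary file (there is no undo)

ls archive                                         # Confirm what ended up in the archive
```

When working interactively on real data, swapping rm for rm -i adds a confirmation prompt before each deletion.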

💻 Creating and Removing Directories:

Bioinformatics projects often involve creating a directory structure to organize raw data, processed data, scripts, and results. In Unix, the mkdir command helps create project directories with the specified names. The basic syntax for the mkdir command is as follows:

mkdir directory_name  # Create a new directory

Whereas the mkdir command is used to create new directories, the rmdir command removes empty directories.

rmdir directory_name  # Remove an empty directory

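It's important to note that the rmdir command can only remove directories that are empty. If a directory contains files or subdirectories, rmdir will fail; to delete a directory together with its contents, use rm -r instead. (The -p option does not change this: it removes the named directory and then its parent directories, and each of them must be empty.)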
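
The sketch below sets up a typical project skeleton in one line and then shows rmdir succeeding on an empty directory and refusing a non-empty one. The directory names are assumptions for illustration, and mkdir's -p option (create parent directories as needed) isn't covered above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# -p creates parent directories as needed and does not complain if they exist
mkdir -p project/raw_data project/processed project/scripts project/results
touch project/raw_data/reads.fastq

rmdir project/results                        # Works: the directory is empty

# rmdir refuses to delete a directory that still has files in it
# (2>/dev/null just hides the error message)
if rmdir project/raw_data 2>/dev/null; then
    status="removed"
else
    status="not removed"
fi
echo "$status"                               # not removed
```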


                                                                  ◍◍◍◍◍

🦠 Viewing and Editing Files:

💻 Displaying The Contents Of a File:

In Unix, the cat command is used to concatenate and display the contents of files. Bioinformaticians often use the cat command to quickly inspect data files or analyses' outputs. Additionally, when working with bioinformatics data, there may be a need to concatenate the contents of multiple files. The cat command can be used to combine files into a single stream. In the code block below, you'll learn the basic syntax for the cat command:

cat file_name.txt       # Display the contents of a file
cat -n file_name.txt    # Display contents of file w/ line numbers
cat f1.txt f2.txt                # Concatenate & display the contents of multiple files
cat f1.txt f2.txt > new_file.txt # Concatenate files and create new file
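
As a concrete sketch of concatenation, the snippet below joins two tiny FASTA-style files into one. The file names and sequences are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nATGC\n' > part1.fasta
printf '>seq2\nGGCC\n' > part2.fasta

cat part1.fasta part2.fasta > combined.fasta   # Join the files in the order given
cat -n combined.fasta                          # Show the result with line numbers
```

The order of the arguments determines the order of the content in the combined file.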

It's worth mentioning that while the cat command is handy for short files, the less and more commands are often more appropriate for large datasets. They let you move through the content one screen at a time, which makes extensive files easier to read and search. Additionally, both commands provide a way to view text files without loading the entire file into memory, which can be important for large bioinformatics datasets.

In Unix, the more command displays the contents of a file one screen at a time. After displaying a screen, it waits for user input to continue to the next screen or quit the display. The basic syntax is as follows:

more file_name.txt
# To display the next line press enter
# To display the next screen press space
# To quit the display press q

The less command is an improved version of more. It provides more features and allows for both forward and backward navigation through the file. The basic syntax is as follows:

less file_name.txt
# To display the next line press enter
# To display the next screen press space
# To quit the display press q
# To search forward for a specific pattern type /pattern
# To search backward for a specific pattern type ?pattern

💻 Text Editor For Creating and Modifying Files

The nano command in Unix is a text editor commonly used in the terminal for simple text editing tasks. It is designed to be user-friendly and is especially suitable for users who may not be familiar with more advanced text editors like vim. The nano command provides a basic and straightforward interface for creating and editing text files, which you can open with the code below:

nano file_name

When you open a file with nano, you are presented with a text editor interface within the terminal. The bottom of the screen displays various commands that you can use for different operations, as demonstrated below:

Bioinformatics tools and software often use configuration files, and the nano command provides a simple way to edit these files directly in the terminal. Additionally, bioinformaticians may use the nano command for quick edits and modifications. However, advanced editors like vim or IDEs are preferred for more extensive coding tasks.


                                                                  ◍◍◍◍◍

🦠 Searching and Filtering:

💻 Searching For Patterns In Files:

The grep command in Unix is a powerful tool for searching and matching patterns within text files. In bioinformatics, the grep command is used to search for and extract relevant information from text-based data files efficiently. Additionally, it plays a crucial role in tasks ranging from sequence analysis to quality control and data exploration. The code block below demonstrates the basic syntax for grep:

grep pattern file # Basic syntax
grep -i pattern file # Case-insensitive search
grep -n pattern file # Display line numbers with matched line(s)
# pattern refers to the text or regular expression you wish to search for
# file refers to the name of the file(s) to search for the pattern in

When you run the grep command, it scans the specified file(s) line by line, searching for the specified pattern. If a line contains the pattern, grep prints that line to the terminal. The pattern can be a regular expression, so grep works similarly to regex in Python, which I covered in a previous article titled Identifying DNA Binding Motifs With Regular Expressions.
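
To make this concrete, here is a sketch that runs grep over a tiny FASTA-style file. The file and its sequences are made up, and the -c option (count matching lines instead of printing them) isn't covered above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1 yeast\nATGCATGC\n>seq2 human\nGGCCTTAA\n>seq3 Yeast\nTTTTATGC\n' > seqs.fasta

# The pattern '>' must be quoted, otherwise the shell treats it as redirection
grep '>' seqs.fasta              # Print every header line
grep -i 'yeast' seqs.fasta       # Case-insensitive: matches both yeast and Yeast
grep -n 'ATGC' seqs.fasta        # Show line numbers for lines containing ATGC

# Count the header lines, i.e. the number of sequences in the file
n_seqs=$(grep -c '>' seqs.fasta)
echo "$n_seqs"                   # 3
```

Counting header lines this way is a common sanity check on a FASTA file before and after a processing step.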

💻 Filtering and Transforming Data:

The awk command in Unix is a versatile tool for pattern scanning and text processing. In essence, awk is a small programming language designed for processing and analyzing text data. The basic syntax for awk is as follows:

awk 'pattern { action }' file_name
# pattern specifies the condition for executing the action
# action specifies the action to be performed when the pattern is matched

The awk command reads the specified file line by line and evaluates the pattern against each line. If the pattern matches, the associated action is performed. The sed command is similar, but it is most often used for search-and-replace operations rather than arbitrary actions. The basic syntax for the sed command is as follows:

sed 's/pattern/replacement/' file_name
# pattern specifies the text pattern to search for
# replacement specifies the text to replace the matched pattern

Both awk and sed are useful tools in bioinformatics for text processing and manipulation. They provide bioinformaticians with the flexibility to extract, transform, and clean data efficiently, facilitating various aspects of data analysis and interpretation.
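
Here is a minimal sketch of both tools applied to a tab-separated table. The three-column file (feature name, feature type, chromosome) is a hypothetical stand-in, and awk's -F option, which sets the column separator, isn't covered above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical three-column table: feature name, feature type, chromosome
printf 'geneA\tORF\tchr1\ngeneB\ttRNA\tchr2\ngeneC\tORF\tchr2\n' > features.tab

# awk: for rows whose second column is ORF, print the first column
orfs=$(awk -F'\t' '$2 == "ORF" { print $1 }' features.tab)
echo "$orfs"

# sed: replace the first "chr" on each line with "chromosome_"
sed 's/chr/chromosome_/' features.tab
```

Note that sed prints the modified text to the terminal and leaves features.tab itself unchanged.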


                                                                  ◍◍◍◍◍

🦠 Pipelines and Redirection:

💻 Creating A Pipeline:

The | (pipe) operator in Unix takes the output of one command and uses it as the input for another, allowing you to chain multiple commands together into a pipeline for data processing. The basic syntax for the pipe is as follows:

command_1 | command_2

In the code example above, the first command’s (command_1) output is used as the input for the following command (command_2). The second command then processes the output from the first command. The type of modularity the pipe command provides allows bioinformaticians to create efficient and flexible workflows for data processing, filtering, and integrating various tools.
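
As a sketch of that modularity, the pipeline below chains three of the commands covered earlier, each doing one small job. The two-column table is invented for illustration, and awk's -F option (column separator) is an addition not covered above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical two-column table: feature name, feature type
printf 'geneA\tORF\ngeneB\ttRNA\ngeneC\tORF\ngeneD\tORF\n' > features.tab

# Stage 1: keep the ORF rows
# Stage 2: keep only the first column
# Stage 3: rewrite gene as gene_
orf_names=$(grep 'ORF' features.tab | awk -F'\t' '{ print $1 }' | sed 's/gene/gene_/')
echo "$orf_names"
```

Each stage can be tested on its own and then swapped out without touching the others, which is what makes pipelines easy to build up step by step.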

💻 Input and Output Redirection:

In Unix, the > and < symbols are used for input and output redirection, allowing you to control where the input for a command comes from or where the output of a command goes. The > symbol redirects the standard output of a command to a file, and it creates or overwrites the specified file with the output of the command. The syntax for > is as follows:

command > output_file # Generic syntax
echo "Hello, World!" > output_file.txt # Specific example
# command refers to the command whose output you wish to redirect
# output_file refers to the file to which the output will be written

The < symbol, on the other hand, is used to redirect the standard input of a command from a file. It takes the contents of the specified file and provides it as input to the command. The syntax for this command is as follows:

command < input_file # Generic syntax
grep "pattern" < input_file.txt # Specific example 
# command refers to the command that will take input from the file
# input_file is the file from which the input is read

Output and input redirection are fundamental concepts in Unix, and they play a crucial role in bioinformatics for managing data and results efficiently within various analysis workflows. For example, bioinformatics analyses often produce results that need to be saved. The > symbol redirects the output of tools and commands to files, allowing bioinformaticians to store and analyze results, as demonstrated below:

bioinformatics_analysis_result > results.txt
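
One thing worth seeing in action is that > overwrites an existing file. The sketch below demonstrates this, along with >>, a closely related operator not covered above that appends to a file instead. The file name and its contents are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "first run" > results.txt     # Creates results.txt
echo "second run" > results.txt    # Overwrites it: "first run" is gone
echo "third run" >> results.txt    # >> appends instead of overwriting

grep "run" < results.txt           # Read standard input from the file

# Count the matching lines that survived
line_count=$(grep -c "run" < results.txt)
echo "$line_count"                 # 2
```

Because > silently replaces whatever was in the file, it's worth checking the output file name before rerunning an analysis.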


                                                                  ◍◍◍◍◍

🦠 Downloading and Exploring Files:

In this last section, I will show you how to download the contents of files with Unix; then, I'll incorporate concepts from earlier in this newsletter to show you how to view and explore the contents of said files. The file we'll download contains genomic data for Saccharomyces cerevisiae, also known as baker's yeast. To start, we'll want to create a new directory to house the Saccharomyces cerevisiae genomic data, as demonstrated below:

conda activate bioinfo # Activate conda environment
mkdir new_analysis # Make new directory called new_analysis
cd new_analysis # Change directory to new_analysis

Next, I’ll download Saccharomyces cerevisiae’s genomic data from the Saccharomyces Genome Database (SGD). After navigating to the website and finding the file I want to download, I select “copy link”, as demonstrated in the image below:

Next, I use Unix’s wget command, which allows you to download files from the internet using the HTTP, HTTPS, or FTP protocols. It is a non-interactive command-line tool, meaning you provide the URL of the file you want to download, and wget retrieves it for you. In the code block below, I’ll demonstrate the wget command with my chosen file’s URL:

# Store URL in a variable
URL=http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab

# Download URL with wget
wget "$URL"

# Alternatively, you can use the curl command
curl "$URL" > SGD_features.tab

The SGD_features.tab file containing Saccharomyces cerevisiae’s genomic data is now downloaded to my directory called new_analysis. To view the contents of this file one page at a time, I can use Unix’s more command as demonstrated below:

more SGD_features.tab

Which produces the following output:

Now, let’s say I wanted more information on TEL01L, a telomeric region on the left arm of chromosome 1 in Saccharomyces cerevisiae. To retrieve this information, I can use the following code:

cat SGD_features.tab | grep TEL01L --color=always

The code above uses the cat command to display the contents of our file named SGD_features.tab. The | (pipe) operator then takes the output from cat and passes it as input to the next command (grep TEL01L), which searches for lines in the input that contain the pattern TEL01L. Finally, the --color=always option highlights the matched pattern, making it easier to spot visually. The code above then produces the following output:

Finally, if I wanted to save the information displayed above in a new file, I can use the > (redirection) operator to produce a new file called TEL01L.tab. Because --color=always would write the color escape codes into the file along with the text, I drop that option when redirecting:

cat SGD_features.tab | grep TEL01L > TEL01L.tab

