Introduction to Python For Bioinformatics
Foundational skills for aspiring bioinformaticians.
An ask: If you liked this piece, I’d be grateful if you’d consider tapping the “heart” 🖤 in the header above. It helps me understand which pieces you like most and supports this newsletter’s growth. Thank you!
🧬 Introduction to Bioinformatics
Bioinformatics is an interdisciplinary field that merges biology, computer science, and data analysis to extract valuable information from biological data. Additionally, bioinformatics is playing an increasingly important role in the biomedical field, genetics, and other biological research domains. In this article, we will explore what bioinformatics is and the most basic Python coding skills necessary for aspiring bioinformaticians.
🧬 What is Bioinformatics?
Bioinformatics involves using computational techniques to collect, process, and analyze biological data. These data sources may include DNA and protein sequences, gene expression profiles, structural data of biomolecules, and more. Bioinformatics techniques are applied to extract insights, discover patterns, and answer biological questions. Some key areas of bioinformatics include sequence analysis, where bioinformaticians use algorithms to align, compare, and annotate DNA, RNA, and protein sequences, and genomics, where bioinformatics is paving the way for predicting a gene's location in the genome and performing functional annotation of genes, revealing their molecular functions.
🧬 Basic Python Coding Skills for Bioinformatics
Python is one of the most popular programming languages in bioinformatics due to its simplicity, versatility, and extensive libraries and tools. In bioinformatics, Python is used for data analysis, scripting, and automating repetitive tasks. Additionally, Python's readability and ease of use make it a preferred choice for both beginners and experts in the field.
With a solid foundation in Python coding, aspiring bioinformaticians can unlock the potential to explore the vast realm of biological data, contribute to scientific discoveries, and address complex biological questions through computational analysis. One of the most basic Python coding skills essential for bioinformatics is manipulating lists, dictionaries, and strings. You'll also need a basic understanding of the most common algorithms used in bioinformatics. This article will cover these foundational skills before building upon them in future newsletters.
🧬 Data Manipulation with Lists and Dictionaries
In bioinformatics, you'll often work with biological data stored in lists and dictionaries. Understanding how to extract, manipulate, and store data in these data structures is crucial. For example, a common task involves translating a DNA sequence to a protein sequence using a codon table. This is a simple task that can, in theory, done manually, but when working with biological datasets becomes untenable. In the example below, i’ll show you how to create a loop that processes a DNA sequence and translates it into a protein sequence using codons:
To start, I translate the DNA sequence into an RNA sequence, which simply entails swapping the T’s (thymine) in the DNA sequences for U’s (uracil). I then create an empty list labeled protein_sequence, to which we’ll add our proteins with our ‘for loop’. The ‘for loop’ consists of the following code:
for i in range(0, len(rna_sequence),3):
codon = rna_sequence[i:(i + 3)]
aa = codon_table[codon]
if aa == '*':
break
else:
protein_sequence.append(aa)
Now, lets break the code above down one step at a time to better understand how it works:
for i in range(0, len(rna_sequence),3): initiates the for loop that iterates through the RNA sequence in steps of three nucleotides. I specifically chose the number three because genetic codons are made of three nucleotides.
codon = rna_sequence[i:(1+3)] means that within each iteration, i, a substring of the RNA sequence is extracted. Each substring consist of the nexts tree nucleotides after the current position, which is done to obtain a single codon.
aa = codon_table[codon] looks up the codon table, which is a dictionary that associates codons with their corresponding amino acids. For example, if we have the codon UUU we can see it corresponds to the amino acid F, which is phenylalanine.
if aa==’*’ is a conditional statement that checks whether the amino acid assigned to aa is a stop codon, which indicates the end of gene translation in genetic code.
break is used to exit the loop and is triggered if the current codon is a stop codon. This ‘breaks’ the gene translation process.
protein_sequence.append(aa) adds the codon to the protein_sequence list, which is used to accumulate the amino acids translated from the DNA sequence.
In the example above, the ‘for loop’ continues until it encounters a stop codon, at which point it breaks the loop, and the translation process is complete, or the sequence ends. After this process is done, the protein_sequence list contains the sequence of amino acids translated from the DNA sequence, and it can be used to represent the resulting protein sequence.
🧬 String Manipulation
In bioinformatics, biological data such as DNA and RNA sequences are often represented as strings. As a result, knowing how to manipulate and search for patterns within strings is an essential skill. For example, you may have to calculate the frequency of a given amino acid in a protein sequence or find the most frequently occurring protein, as demonstrated below:
🐍 Example 1 — Calculating The Frequency Of An Amino Acid
In the sample code below i’ll show you how to count the occurrence of each amino acid in a protein sequence and store the counts in a dictionary called aa_counter:
Now, lets break the code above down one step at a time to better understand how it works:
aa_counter = {} is an empty dictionary that we will use to store the counts of each amino acid in our protein sequence.
for aa in protein_sequence: creates a for loop that iterates through each amino acid in the list labeled protein_sequence one by one.
if aa in aa_counter: is a conditional statement within the for loop that checks if the current amino acid aa is already a key in the aa_counter. If so, it means the amino acid has been encountered before and we continue to the next step below.
aa_counter[aa]+=1 is activated if the amino acid aa is already in the aa_counter dictionary. In this case the count associated with that specific amino acid increases by one, which creates a running tally of how many times that amino acid has appeared in the protein sequence.
else: is triggered if the amino acid aa is not already in the aa_counter dictionary. In this case the following portion of the code is executed:
aa_counter[aa]=1, which creates a new key in the aa_counter dictionary and assigns it a count of one.
After counting the occurrence of each amino acid in a protein sequence and storing the counts in a dictionary called aa_counter, we can determine the most common amino acid in the sequence. To do this, we can use the sample code below:
Now, analyze the code above one step at a time:
most_common_aa = None is used to store the most frequent amino acid in our protein sequence.
for aa in aa_counter is used to initiate a for loop, which iterates through each key (i.e., amino acid) in the aa_counter dictionary. In this loop, aa represents the current amino acid being considered.
if most_common_aa is None or aa_counter[aa] > aa_counter[most_common_aa] is an if statement within the for loop that checks the following two conditions:
if most_common_aa is None checks if the most_common_aa is still None, which will only be true for the first iteration in the loop.
The second condition, aa_counter[aa] > aa_counter[most_common_aa], checks if the count of the current amino acid aa is greater than the current most_common_aa’s count.
most_common_aa=aa is triggered if the previous two conditions are true, which in that case the current amino acid (aa) is the most common amino acid.
The above loop continues until it has iterated through all of the amino acids stored in the aa_counter dictionary, effectively determining the most common amino acid in the protein sequence and the number of times it occurs.
🐍 Example 2 — Finding An Open Reading Frame
Before we move forward, I want to show you another common example where you’ll need to know how to search for patterns within strings. In the code below, I’ll show you how to find possible open reading frames (ORFs) by searching for start codons (ATG) in a DNA sequence and translating the sequences downstream of these start codons into amino acids until a stop codon is found.
An open reading frame (ORF) is a stretch of DNA or RNA that contains a sequence of nucleotides that can potentially be translated into a protein. In simpler terms, it's like a "start" and "stop" signal for a protein-making process. An ORF is the portion of genetic code that can be used to protein a protein when a cell reads it. ORFs are important in genetics and molecular biology because they help identify genes and their functions within a genome.
To start, we can use the following code to find the location of the start sites:
Essentially, I’m doing in the code above is doing is creating an empty list called start_sites, which will house the positions of ATG start codons in the DNA sequence. I then create a for loop to iterate over the DNA sequence and check if a start codon is found at the current position, i, in the DNA sequence. If a start codon is found, then the position, i, is added to the start_sites list using start_sites.append(i). Next, I use the code below to calculate the length of each DNA sequence and their reading frames (1, 2 or 3).
Now, before we review the output from the code, let's break the code above down one step at a time to understand how it works:
First, we have a list called start_sites that stores the position of the ATG start codons in our DNA sequence.
Next, we have a for loop that states for start_site in start_sites:, which iterates over the positions stores in start_sites.
protein_sequence=[] is nested within this first for loop, and initializes an empty list that will store the translated amino acid sequences for a particular open reading frame.
The next for loop, for i in range(start_site, len(dna_sequence), 3):, iterates through the DNA sequence starting from an identified start codon position stored in start_site list and in steps of three to read codons.
codon = rna_sequence[i:(i+3)] then extracts a three nucleotide codon from the RNA sequence and aa = codon_table[codon] loops up the corresponding amino acid in the codon table.
The code, if aa==’*’:, then checks if the amino acid is a stop codon. If so, the loop is broken and the translation process ends.
Alternatively, if the codon represents an amino acid it is added to the protein sequence list with the following code: protein_sequence +=aa
protein_sequence=’ ‘.join(protein_sequence) then converts the list of amino acids into a single string, which represents the translated protein sequence for a specific open reading frame.
The code then prints then reports information about each identifies open reading frame in the inputted DNA sequence.
The output from the previous code, shown below, provides information about the identified open reading frames (ORFs) in the input DNA sequence:
Let's break down the results step by step:
First, we can see that the DNA sequence has four possible start sites at 4, 44, 104, and 317. These start sites indicate the location of a start codon and a potential open reading frame. To better understand how we identify these start sites, let’s look at the first 20 nucleotides in our DNA sequence:
We can see that there is an ATG start codon present in position 4, which signals out first start site and open reading frame.
Next, we can see that this first ORF contains the protein sequence MRERQY. We derive the protein sequence by looking at the nucleotides in our ORF and translating them with the codon table. For example, you’ll notice that all of our protein sequences start with the letter M. The reason is that the nucleotide sequence ATG codes for the amino acid methionine.
To derive our whole protein sequence, we work through all of the amino acids in our open reading frame, which starts with a start codon and ends with a stop codon in pairs of three. For example, the following letter in our first protein sequence is R because the amino acid sequence CGA codes for the amino acid arginine.
Next, we see that each row of our output shows the length of our DNA sequence and how many amino acids it was translated into.
The number of amino acids can easily be calculated from the length of our protein sequence. For example, the protein sequence MRERQY has 6 amino acids.
Next, we can determine the length of our DNA sequence by subtracting the starting position of our ORF from the number of amino acids in our protein sequence. For example, the protein sequence MRERQY contains 6 amino acids and starts at position 4 in the DNA sequence. Thus, the length = 6-4 = 2.
Finally, each row of our output categorizes our open reading frame as frame 1, 2, or 3. This can be determined with the code (start_site%3)+1.
If (start_site % 3) equals 0, then (start_site % 3)+1 =1, indicating the first reading frame.
If (start_site % 3) equals 1, then (start_site % 3)+1 =2, indicating the second reading frame.
If (start_site % 3) equals 3, then (start_site % 3)+1 =3, indicating the third reading frame.
🧬 Basic Bioinformatics Algorithms
After reading this article you should understand the basic Python coding skills needed to begin exploring biological data and answering complex biological questions through computational analysis.
The last thing you’ll need to begin working on bioinformatics projects is a foundational knowledge of basic algorithms like sequence alignment (i.e., BLAST, Smith-Waterman, etc.), clustering, and phylogenetic tree construction, which will be the topics of future articles.