Discovering Reverse Complements in DNA Sequences
Finding Reverse Complements In DNA Sequences With Python
🧬 What Is The Reverse Complement Of A DNA Sequence?
In molecular biology, the reverse complement of a DNA sequence is the sequence obtained by reversing the order of the nucleotides and then replacing each nucleotide with its complementary base according to standard base-pairing rules. For example, let’s take the following DNA sequence:
A A T G C C A T G C A G
If we were to reverse this DNA sequence, we’d end up with the following:
G A C G T A C C G T A A
Now, to take the complement of this reversed DNA sequence, we replace each adenine (A) with thymine (T), thymine with adenine, cytosine (C) with guanine (G), and guanine with cytosine, resulting in the following:
C T G C A T G G C A T T
The string of nucleotides above is the reverse complement of our starting DNA sequence, A A T G C C A T G C A G.
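If you’d like to check the worked example yourself, here is a minimal Python sketch of the same two steps. The variable names (dna, pairs, and so on) are just illustrative choices.

```python
# The worked example from above, checked step by step.
dna = "AATGCCATGCAG"

# Step 1: reverse the order of the nucleotides.
reversed_dna = dna[::-1]

# Step 2: swap each base for its partner (A<->T, C<->G).
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
reverse_complement = "".join(pairs[base] for base in reversed_dna)

print(reversed_dna)        # GACGTACCGTAA
print(reverse_complement)  # CTGCATGGCATT
```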
🧬 What Is The Purpose Of Finding Reverse Complements Of DNA Sequences?
I’ve now defined reverse complements and demonstrated how you can find the reverse complement of a DNA sequence. However, that still leaves the question of why one would care to do so.
There are several reasons why a bioinformatician may find the reverse complement of a DNA sequence and then translate it into a protein sequence. One common use case is gene prediction. Protein-coding genes begin with a start codon and end with a stop codon, and a gene can lie in any of the six reading frames (three on the forward strand and three on its reverse complement). By searching both the forward strand and its reverse complement, bioinformaticians can identify potential genes that would be missed if only the forward strand were considered.
Additionally, bioinformaticians may want to find the reverse complement of a DNA sequence to identify its open reading frames (ORFs) and coding regions. To do so, it’s essential to examine both strands of DNA, since either can encode functional proteins. By translating both the forward strand and its reverse complement, you can ensure you don’t miss protein-coding sequences.
A bioinformatician may even want to find the reverse complement of an RNA sequence to discover non-coding RNAs. While coding sequences are crucial, non-coding RNAs (ncRNAs) also play essential roles in gene regulation. Some ncRNAs are located on the complementary strand, so examining both strands can help identify such elements.
To perform the aforementioned tasks, bioinformaticians often use bioinformatics tools and programming languages like Python to find the reverse complement of a DNA sequence, identify ORFs, translate codons into amino acids, and analyze the potential functions of the resulting protein sequences. By considering both strands and frames, they gain a more comprehensive view of the genetic information contained within a DNA sequence. The remainder of this article will be dedicated to showing you how you can use Python to automate these tasks.
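To make the idea of six reading frames concrete, here is a minimal sketch that splits a sequence into codons for the three forward frames and the three frames of its reverse complement. The helper names reverse_complement and six_frames, and the frame labels (+1 to +3 and -1 to -3), are my own assumptions for illustration and don’t come from any library.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def six_frames(seq):
    """Split seq into codons for all six reading frames (hypothetical helper)."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Stop at len - 2 so that only complete 3-base codons are kept.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


frames = six_frames("AATGCCATGCAG")
for name, codons in frames.items():
    print(name, codons)
```

Notice that the start codon ATG only shows up as a whole codon in some of the frames, which is why a gene search that looks at a single frame can miss it.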
🧬 Finding Reverse Complements Of DNA Sequences In Python
For the first tutorial in this article, I’ll show you how to convert a DNA sequence into its reverse complement and then translate the resulting DNA sequence into a protein sequence. To accomplish this, we first need to reverse the DNA sequence. For example, turning ATG into GTA. Next, we need to find the complement of the reversed DNA sequence. In this example, the complement of GTA is CAT. After we have the reverse complement of our DNA sequence, we'll convert it to an RNA sequence. For example, converting CAT to CAU. Finally, we can use a codon table to determine which amino acids our RNA sequence codes for. In this example, CAU codes for the amino acid histidine.
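Here are those four steps for the ATG example as a minimal sketch. The one-entry mini_codon_table is an assumption made to keep the example short; a real codon table has 64 entries.

```python
dna = "ATG"

# Step 1: reverse the DNA sequence.
reversed_dna = dna[::-1]

# Step 2: take the complement of the reversed sequence.
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
reverse_complement = "".join(pairs[base] for base in reversed_dna)

# Step 3: convert the DNA sequence to RNA by swapping T for U.
rna = reverse_complement.replace("T", "U")

# Step 4: look the codon up. Only the single codon needed here is listed.
mini_codon_table = {"CAU": "H"}  # H is the one-letter code for histidine
amino_acid = mini_codon_table[rna]

print(reversed_dna, reverse_complement, rna, amino_acid)  # GTA CAT CAU H
```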
As you can see, these tasks can easily be performed by hand for a short sequence. However, with a very long DNA sequence, this becomes impractical. As a result, we rely on programming languages such as Python to automate these types of repetitive tasks.
🐍 Part I- Converting A DNA Sequence To Its Reverse Complement
The code sample below will show you how to convert a DNA sequence to its reverse complement:
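Here is a minimal sketch of that code, following the same steps as the breakdown in the rest of this section. The example value assigned to DNA_seq is an assumption chosen for illustration (it is the sequence from the worked example earlier in this article), so swap in your own sequence.

```python
# Example input; this particular sequence is an assumption for illustration.
DNA_seq = "AATGCCATGCAG"

# Map each nucleotide to its complementary base.
Complimentary_Basepairs = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

# Build the reverse complement one nucleotide at a time,
# walking through the DNA sequence from its last base to its first.
reverse_compliment_DNA_seq = []
for nucleotide in DNA_seq[::-1]:
    reverse_compliment_DNA_seq.append(Complimentary_Basepairs[nucleotide])

# Join the list of single bases into one string.
reverse_compliment_DNA_seq = ''.join(reverse_compliment_DNA_seq)

print(reverse_compliment_DNA_seq)  # CTGCATGGCATT
```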
Now, let’s break the code above down one step at a time to better understand how it works:
DNA_seq = [….etc.]
This defines our DNA sequence as a long string of nucleotides (i.e., A’s, T’s, G’s, and C’s).
Complimentary_Basepairs = {'A':'T', 'C':'G', 'G':'C', 'T':'A',}
This creates a dictionary that maps each nucleotide to its complementary base, using standard base-pairing rules.
reverse_compliment_DNA_seq =[]
This creates an empty list that will be used to store the nucleotides of the reverse complement of our DNA sequence.
for nucleotide in DNA_seq[::-1]:
    reverse_compliment_DNA_seq.append(Complimentary_Basepairs[nucleotide])
First, a for loop is used to iterate over the DNA sequence in reverse order. Then, inside the for loop we have the following:
Complimentary_Basepairs[nucleotide] retrieves the complementary base for the current nucleotide from the predefined dictionary. It is then appended to the reverse_compliment_DNA_seq list.
In effect, this for loop finds the reverse complement of the DNA sequence.
reverse_compliment_DNA_seq= ''.join(reverse_compliment_DNA_seq)
After the for loop is finished, we can use the code above to convert reverse_compliment_DNA_seq from a list to a string.
print(….etc.)
Finally, the code prints the reverse complement of the DNA sequence using a print statement.
In summary, the code above takes a DNA sequence, iterates through it in reverse order, and constructs the reverse complement by looking up the complementary base for each nucleotide in the provided dictionary. The reverse complement is then printed as the output.
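As an aside, the same result can be reached without an explicit loop. The sketch below swaps in a different technique, Python’s built-in str.maketrans and str.translate, in place of the dictionary-and-loop approach described above. The names COMPLEMENT_TABLE and reverse_complement are my own, chosen for illustration.

```python
# Translation table that maps each base to its complementary base.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


print(reverse_complement("AATGCCATGCAG"))  # CTGCATGGCATT
```

One difference to be aware of: the dictionary lookup raises a KeyError on an unexpected character (such as N or a lowercase letter), whereas str.translate silently leaves characters it doesn’t know unchanged. Which behavior you want depends on how clean your input is.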
🐍 Part II- Transforming The Reverse Complement Of A DNA Sequence Into A Protein Sequence
After converting our DNA sequence to its reverse complement using the code above, we can find the protein sequence it codes for. To accomplish this, we first need to convert our reverse complement DNA sequence into an RNA sequence; then, we can translate that RNA sequence into a protein sequence. The code sample below shows you how:
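A minimal sketch of this step is shown here, following the same steps as the breakdown in the rest of this section. Two things in it are assumptions: the example value of reverse_compliment_DNA_seq (standing in for the output of part I), and the way codon_table is built, which packs the standard genetic code into a compact string rather than writing out all 64 entries by hand.

```python
from itertools import product

# Stand-in for the output of part I (an assumed example value).
reverse_compliment_DNA_seq = "CTGCATGGCATT"

# Convert the DNA sequence to RNA by replacing thymine (T) with uracil (U).
reverse_complement_RNA_seq = reverse_compliment_DNA_seq.replace('T', 'U')

# Standard genetic code. Codons are generated in U, C, A, G order and paired
# with the matching one-letter amino acid codes; '*' marks a stop codon.
bases = "UCAG"
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    ''.join(codon): aa
    for codon, aa in zip(product(bases, repeat=3), amino_acids)
}

# Translate one codon at a time, starting from the first nucleotide.
protein_sequence = []
for i in range(0, len(reverse_complement_RNA_seq), 3):
    codon = reverse_complement_RNA_seq[i:(i + 3)]
    aa = codon_table[codon]
    if aa == '*':
        break  # stop codon reached, translation is complete
    else:
        protein_sequence.append(aa)

protein_sequence = ''.join(protein_sequence)
print(protein_sequence)  # LHGI
```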
The code sample above continues from where the previous code, in part I, left off. It converts the reverse complement of a DNA sequence into an RNA sequence and then translates that RNA sequence into a protein sequence. Now, let’s break the code above down one step at a time to better understand how it works:
reverse_complement_RNA_seq = reverse_compliment_DNA_seq.replace('T', 'U')
The code above converts the reverse complement of a DNA sequence into an RNA sequence by replacing all of the 'T' (thymine) nucleotides with 'U' (uracil).
codon_table={….etc.}
This code creates a dictionary, which serves as a look-up table for translating RNA codons into amino acids. Specifically, the codon table maps each RNA codon to its corresponding amino acid.
protein_sequence=[]
This code creates an empty list to contain the amino acids that will be translated from the RNA sequence with the following for loop.
for i in range(0, len(reverse_complement_RNA_seq), 3):
First, a for loop is created that iterates over the reverse complement RNA sequence in steps of 3, so each iteration starts at the first nucleotide of a codon.
codon = reverse_complement_RNA_seq[i:(i + 3)]
This code extracts a 3-nucleotide codon from the RNA sequence.
aa = codon_table[codon]
The previously mentioned 3-nucleotide codon is then used as a key in the codon_table dictionary, resulting in the extraction of its corresponding amino acid.
if aa == '*':
    break
If the codon is a stop codon (represented by '*' in the codon table), the for loop is terminated and translation is complete.
else:
    protein_sequence.append(aa)
Otherwise, the amino acid is appended to the protein_sequence list.
protein_sequence = ''.join(protein_sequence)
After the for loop is completed, the protein sequence list is converted to a string using the code above. This string represents the translated protein sequence.
print(…etc.)
Finally, the code prints the resulting protein sequence, which is the translation of the reverse complement of the given DNA sequence into protein.
In summary, the code sample above performs the essential steps of RNA transcription (i.e., converting a DNA sequence to an RNA sequence) and translating an RNA sequence into a protein sequence using a predefined codon table.
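One edge case worth knowing about: if the length of the RNA sequence is not a multiple of 3, the last slice taken by the loop above is a 1- or 2-nucleotide fragment, and looking that fragment up in a 64-entry codon table raises a KeyError. The sketch below is one possible way to guard against that by only visiting complete codons. The function name translate_rna and the compact way the codon table is built are assumptions for illustration.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_rna(rna_seq):
    """Translate from position 0 until a stop codon or the last full codon."""
    protein = []
    # Stopping at len - 2 means a trailing 1- or 2-base fragment is skipped.
    for i in range(0, len(rna_seq) - 2, 3):
        aa = CODON_TABLE[rna_seq[i:i + 3]]
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)


print(translate_rna("CUGCAUGGCAUU"))    # LHGI
print(translate_rna("CUGCAUGGCAUUGA"))  # LHGI (the trailing "GA" is skipped)
print(translate_rna("AUGUAAGGC"))       # M (translation stops at UAA)
```

Silently dropping the leftover bases is a design choice. If an incomplete final codon would indicate a problem in your data, raising an error instead may be the better option.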