Exploring Codon Frequency and Unraveling Usage Bias in DNA sequences
Identifying Codon Frequency And Usage Bias In DNA With Python
🧬 What Is A Codon?
A codon is a sequence of three DNA or RNA nucleotides that encodes a specific amino acid, signals the beginning of protein translation, or terminates the translation process. Whereas nucleotides are the building blocks of DNA and RNA, codons are the building blocks of the genetic code, the set of rules that determines how the information encoded in DNA and RNA is translated into a protein sequence. For example, the DNA codon ACT corresponds to the RNA codon ACU, one of 64 possible codons, which codes for the amino acid threonine.
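To make those numbers concrete, here is a minimal sketch that enumerates all 64 codons and rewrites a DNA codon in RNA letters. The helper name dna_to_rna is hypothetical and chosen only for illustration.

```python
from itertools import product

# Every ordered triple of the four DNA bases: 4 ** 3 = 64 codons.
DNA_BASES = "ACGT"
all_codons = ["".join(triple) for triple in product(DNA_BASES, repeat=3)]

def dna_to_rna(codon):
    """Hypothetical helper: rewrite a DNA codon in RNA letters (T -> U)."""
    return codon.upper().replace("T", "U")

print(len(all_codons))    # 64
print(dna_to_rna("ACT"))  # ACU
```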
If you take any sufficiently long DNA sequence, you'll find that codons are not used equally often. Codon frequency analysis has essential uses in molecular biology and genetics. For example, deviations from the expected codon usage may provide insights into the function and evolution of genes. Additionally, codons can have usage biases. For example, let's take the amino acid threonine, encoded by the following RNA codons: ACG, ACA, ACU, and ACC. If we had a dozen threonines in a protein sequence and every codon were used equally, we would expect to see each of the codons mentioned above three times. In practice, however, some codons appear more or less frequently than expected, given the frequency of the amino acids they code for. Understanding codon usage and deviations from the expected frequencies can provide valuable insights into an organism's genetic code, gene expression, and evolutionary history. It also has practical applications in fields like genetic engineering and synthetic biology.
In this article, we'll use Python to uncover codon frequency and codon usage bias within a DNA sequence.
🧬 Codon Frequency
Codon frequency refers to how often a given codon, or nucleotide triplet (e.g., ATC), occurs within a given DNA or RNA sequence. The raw number of occurrences is the codon count; dividing that count by the total number of codons in the sequence gives the codon frequency.
A high codon frequency for a specific codon implies that it is used more often for coding a particular amino acid, while a low codon frequency indicates the opposite. Furthermore, analyzing codon frequency is valuable for understanding and optimizing gene expression in genetic engineering and biotechnology. For example, codon optimization is used in synthetic biology and genetic engineering to design DNA sequences for maximum protein expression in a host organism. By choosing codons that are more frequently used in a host organism's genome, researchers can thus enhance translation efficiency and protein production.
In the sample code below, I'll show you how to calculate codon counts and codon frequency within a given DNA sequence:
Now, let's break the code above down one step at a time to better understand how it works:
DNA_sequence = [….etc.]
codon_counts = {}
This section of code defines our DNA sequence as a long string of nucleotides (i.e., A’s, T’s, G’s, and C’s). Then, it creates a dictionary named codon_counts, which will store the counts of each unique codon in our DNA sequence (DNA_sequence).
for i in range(0, len(DNA_sequence), 3):
    codon = DNA_sequence[i:(i + 3)]
    if codon in codon_counts:
        codon_counts[codon] += 1
    else:
        codon_counts[codon] = 1
This next section of code creates a for loop, which iterates over the DNA sequence in steps of three, extracting each 3-nucleotide codon from the DNA sequence. If the extracted codon is already in the codon_counts dictionary, the code increments its count by 1. If the extracted codon is not yet in the codon_counts dictionary, the code initializes its count to 1.
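One detail worth noting: if the sequence length is not a multiple of three, the final slice is shorter than three nucleotides and gets counted as if it were a codon. Below is a minimal sketch of one way to guard against that, using collections.Counter from the standard library. The function name count_codons and the sample sequence are made up for illustration and are not the article's data.

```python
from collections import Counter

def count_codons(dna_sequence):
    """Count complete, non-overlapping codons starting at position 0."""
    # Stop early enough that every slice is exactly three nucleotides long.
    last_start = len(dna_sequence) - len(dna_sequence) % 3
    return Counter(dna_sequence[i:i + 3] for i in range(0, last_start, 3))

sample = "ATGACTACTGA"  # 11 nucleotides: three codons plus a leftover "GA"
print(count_codons(sample))
```

The leftover "GA" is dropped here. Whether to drop it, warn about it, or treat it as an error depends on where the sequence came from.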
total_codons = sum(codon_counts.values())
codon_frequencies = {}
This section of code sums the values in the dictionary named codon_counts, then assigns the total to a new variable named total_codons. Then, a dictionary named codon_frequencies is created to store the frequencies of each unique codon in our DNA sequence.
for codon, codon_count in codon_counts.items():
    codon_frequencies[codon] = codon_count / total_codons
print('Codon frequencies: %s' % codon_frequencies)
Then, another for loop is created to iterate over the codons and their counts stored in the codon_counts dictionary. For each codon, this code calculates its frequency by dividing its count by total_codons, then stores this frequency in the codon_frequencies dictionary. In this case, the codon is the key in the dictionary and its frequency is the corresponding value. Finally, this code prints the calculated frequency of each codon.
In summary, the code reviewed above calculates the counts of individual codons within a DNA sequence, then computes the frequency of each codon relative to the total number of codons in the DNA sequence.
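Putting the steps together, the whole calculation fits in one small function. This is a sketch rather than the article's exact code: the function name codon_frequencies, the 15-nucleotide test sequence, and the handling of incomplete trailing codons are all assumptions added for illustration.

```python
def codon_frequencies(dna_sequence):
    """Return {codon: fraction of all complete codons} for one reading frame."""
    counts = {}
    usable_length = len(dna_sequence) - len(dna_sequence) % 3
    for i in range(0, usable_length, 3):
        codon = dna_sequence[i:i + 3]
        counts[codon] = counts.get(codon, 0) + 1
    total = sum(counts.values())
    # A sequence shorter than three nucleotides has no codons, so return an
    # empty dictionary rather than dividing by zero.
    if total == 0:
        return {}
    return {codon: count / total for codon, count in counts.items()}

# Hypothetical sequence with five codons: ATG ACT ACT ACC TAA
frequencies = codon_frequencies("ATGACTACTACCTAA")
for codon, frequency in sorted(frequencies.items()):
    print(f"{codon}: {frequency:.2f}")
```

Because each frequency is a count divided by the same total, the values always sum to 1 for a non-empty result.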
🧬 Codon Usage Bias
In many genomes you'll observe codon usage bias, meaning that some synonymous codons (codons that code for the same amino acid) are used more frequently than others. For example, let's take the amino acid leucine, which is encoded by six DNA codons: TTA, TTG, CTT, CTC, CTA, and CTG. If we have twelve leucines in a protein sequence and every codon were used equally, we would expect each of these codons to appear twice. In practice, however, some codons appear more or less frequently than expected, given the frequency of the amino acids they code for. This bias can result from natural selection and evolutionary forces. Additionally, some codons may be preferred due to translation efficiency, tRNA availability, or other factors.
In the sample code below, I'll show you how to observe codon usage bias within a given DNA sequence:
The provided code above calculates codon usage bias for a given DNA sequence. Here's how it works:
DNA_sequence = [....etc.]
def translate_codon(codon):
    codon_table = {...etc.}
This section of code defines our DNA sequence as a long string of nucleotides (i.e., A’s, T’s, G’s, and C’s). It then creates a function called translate_codon() that takes a codon as input and returns the corresponding amino acid using a predefined DNA codon table, which is a dictionary that maps each codon to its corresponding amino acid.
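The article's own codon table is not reproduced here. As a stand-in, the sketch below builds the standard genetic code compactly from a 64-character string of one-letter amino acid codes. It assumes the conventional T, C, A, G base ordering and uses "*" for stop codons; the names BASES, AMINO_ACIDS, and CODON_TABLE are assumptions for illustration.

```python
from itertools import product

# Assumption: the standard genetic code, with bases ordered T, C, A, G and
# "*" marking stop codons. The article's own table is not shown.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triple): amino_acid
    for triple, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_codon(codon):
    """Return the one-letter amino acid for a DNA codon, or None if unknown."""
    return CODON_TABLE.get(codon.upper())

print(translate_codon("ACT"))  # T (threonine)
print(translate_codon("TAA"))  # * (stop)
```

Returning None for an unrecognized codon (for example, one containing an ambiguous base such as N) is one possible design choice; raising an error would be equally reasonable.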
def calculate_codon_usage_bias(DNA_sequence):
    codon_counts = {}
    for i in range(0, len(DNA_sequence), 3):
        codon = DNA_sequence[i:(i + 3)]
        amino_acid = translate_codon(codon)
        if amino_acid not in codon_counts:
            codon_counts[amino_acid] = {}
        if codon in codon_counts[amino_acid]:
            codon_counts[amino_acid][codon] += 1
        else:
            codon_counts[amino_acid][codon] = 1
    codon_frequencies = {}
    for amino_acid, counts in codon_counts.items():
        total_counts = sum(counts.values())
        codon_frequencies[amino_acid] = {codon: count / total_counts for codon, count in counts.items()}
    return codon_frequencies
First, this code block begins by defining a function called calculate_codon_usage_bias, which takes a DNA sequence as input and returns a dictionary containing the codon usage bias. Then, inside the function, an empty dictionary named codon_counts is created to store the counts of each codon for each amino acid.
Next, a for loop is created to iterate through the DNA sequence in steps of 3 nucleotides. For each codon, the translate_codon function is used to determine its corresponding amino acid. The code then checks whether the amino acid is already in the codon_counts dictionary. If not, an empty inner dictionary is created for that amino acid. The code then checks whether the codon is already in that inner dictionary. If so, its count is incremented by one. If not, a new entry is created with a count of 1.
After iterating through the entire DNA sequence, the codon_counts dictionary contains the count of each codon for each amino acid. Following that, a new dictionary called codon_frequencies is defined to store the frequencies of each codon for each amino acid.
Another for loop is then used to calculate the codon frequencies for each amino acid. For each amino acid, the for loop calculates the total counts of all codons for that amino acid. Then, it computes the frequency of each codon by dividing its count by the total counts. The function then returns the codon_frequencies dictionary.
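The nested-dictionary bookkeeping described above can also be written with defaultdict and Counter from the standard library, which removes the explicit "is this key present yet?" checks. The sketch below is an alternative formulation, not the article's code. The function name, the translate parameter, and the four-entry tiny_table are assumptions made so the example runs on its own.

```python
from collections import Counter, defaultdict

def codon_usage_by_amino_acid(dna_sequence, translate):
    """Group codon counts by amino acid, then normalise within each group."""
    counts = defaultdict(Counter)
    usable_length = len(dna_sequence) - len(dna_sequence) % 3
    for i in range(0, usable_length, 3):
        codon = dna_sequence[i:i + 3]
        counts[translate(codon)][codon] += 1
    return {
        amino_acid: {codon: n / sum(codon_counts.values())
                     for codon, n in codon_counts.items()}
        for amino_acid, codon_counts in counts.items()
    }

# Hypothetical two-amino-acid table, just large enough for the demo below.
tiny_table = {"AAT": "N", "AAC": "N", "ACT": "T", "ACC": "T"}
usage = codon_usage_by_amino_acid("AATAATAATAACACTACC", tiny_table.get)
print(usage)
```

In this made-up sequence, asparagine (N) is split 0.75/0.25 between AAT and AAC, while threonine (T) is split evenly between ACT and ACC.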
codon_usage_bias = calculate_codon_usage_bias(DNA_sequence)
for amino_acid, frequencies in codon_usage_bias.items():
    print(f'Amino Acid: {amino_acid}')
    for codon, frequency in frequencies.items():
        print(f' Codon: {codon}, Frequency: {frequency:.2f}')
Finally, the calculate_codon_usage_bias function is called with DNA_sequence as an argument, and the result is assigned to the variable codon_usage_bias. Then, an outer for loop iterates through the items in the codon_usage_bias dictionary, using the variables amino_acid and frequencies to represent each amino acid and its associated codon frequencies. For each amino acid, the code prints the label of that amino acid.
Next, the inner for loop iterates through the items in the frequencies dictionary, which contains the codon-frequency pairs for the current amino acid. For each codon, the code prints its name and its associated frequency, formatted to two decimal places, resulting in the following outputs:
The image above shows a sample of the output from the previous code block. The full output includes twenty-one entries, one per translation product, each listing its codons and their associated frequencies. Even with this limited snapshot, we can see codon usage bias in action. For example, we can see that the amino acid N (Asparagine) is encoded by the codons AAT and AAC. However, instead of a 0.5/0.5 split, we have the codon AAT occurring ~4.8x more often than the codon AAC. This bias can result from natural selection and evolutionary forces, translation efficiency, tRNA availability, or other factors, and by uncovering codon usage bias in a DNA sample, we can begin to probe its origins.
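One simple way to put a number on the comparison against an even split is to divide each observed frequency by the uniform expectation of 1/n, where n is the number of synonymous codons for that amino acid. The sketch below does this; the function name bias_ratios and the example frequencies are hypothetical and are not taken from the article's output.

```python
def bias_ratios(frequencies_for_amino_acid, synonymous_codon_count):
    """Divide each observed frequency by the uniform expectation 1 / n."""
    expected = 1 / synonymous_codon_count
    return {codon: observed / expected
            for codon, observed in frequencies_for_amino_acid.items()}

# Hypothetical frequencies for asparagine (2 synonymous codons: AAT, AAC).
asparagine = {"AAT": 0.75, "AAC": 0.25}
print(bias_ratios(asparagine, 2))  # {'AAT': 1.5, 'AAC': 0.5}
```

A ratio above 1 means the codon is used more often than an even split would predict, and a ratio below 1 means it is used less often. The number of synonymous codons is passed in explicitly because a codon that never appears in the sequence will be missing from the frequencies dictionary.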