Identifying DNA Binding Motifs With Regular Expressions
Finding Cra binding motifs in the E. coli genome with regular expressions
Decoding Biology is a DIY guide to bioinformatics and biomedical data science read by researchers at Harvard, MIT, and the NIH. If you enjoy reading Decoding Biology, please share it with a friend.
◈ ◈ ◈
🧬 Introduction to Regex 💻
Regular expressions, or regex, are powerful search pattern tools that offer a flexible and efficient way to analyze biological data in text format. Additionally, regular expressions simplify complex pattern-matching tasks, enabling researchers to extract, analyze, and interpret information from diverse and extensive datasets. For example, in bioinformatics, DNA, RNA, and protein sequences can be represented as strings, and regular expressions concisely express search patterns within these sequences. Regular expressions also enable the extraction of specific details, such as gene names, accession numbers, or protein sequences, from large biological data files stored in formats such as FASTA and GenBank. Below, I'll demonstrate some of the most basic regex functions you may encounter:
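As a minimal sketch of these basic functions, the snippet below uses re.findall() and re.sub() on a short sequence. The contents of dna_seq and the 'NNN' replacement string are made up for illustration, and the result variable names are assumptions:

```python
import re

# Literal pattern to search for
regex = 'CAC'

# Illustrative search space (a made-up sequence, not real genomic data)
dna_seq = 'ACACGTCTCAGCCCATCGCA'

# Find every non-overlapping occurrence of the literal pattern
exact_hits = re.findall(regex, dna_seq)

# The dot matches any single character, so C.C matches CAC, CTC, CCC, CGC
wildcard_hits = re.findall(r'C.C', dna_seq)

# Replace every match of the pattern with a replacement string
masked_seq = re.sub(regex, 'NNN', dna_seq)

print(exact_hits)
print(wildcard_hits)
print(masked_seq)
```

Note that the dot is not limited to nucleotides: it would also match an ambiguity code such as N, or any other character except a newline.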
In the code above, regex = 'CAC' defines the regex parameter, and dna_seq='…etc.' defines the search space. Next, the re.findall() function takes the regex parameter and search space as inputs and returns all non-overlapping occurrences of the regex parameter. The second time I call the re.findall() function, I give it the following input: (r'C.C', dna_seq). The dot in 'C.C' matches any single character, so the regex parameter can match CAC, CTC, CCC, or CGC. Finally, the function re.sub() replaces every match of the regex parameter within the search space with a replacement string.
You'll notice that the re.findall() function simply finds all the occurrences of the regex parameter without providing indices. In the code example below, I'll show you how to find specific locations for each regex parameter occurrence within the search space:
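One way to get locations is re.finditer(), which yields match objects that carry their own positions. This is a sketch on a made-up sequence, and the variable names are assumptions:

```python
import re

regex = 'CAC'
dna_seq = 'ACACGTCACAGT'  # illustrative sequence

# Each match object knows where it starts and ends in the search space.
# Positions are 0-based and the end index is exclusive, like Python slices.
positions = [(m.start(), m.end(), m.group()) for m in re.finditer(regex, dna_seq)]

for start, end, matched in positions:
    print(f'{matched} found at {start}-{end}')
```

Because the end index is exclusive, dna_seq[start:end] returns exactly the matched text.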
In the examples above, I defined a regex parameter as a plain string, which works well for one-off use cases. However, if you plan to reuse a regex parameter in future code, it's worth compiling it into a regex object. When you compile a regex parameter, Python translates the pattern into an internal representation once and gives you a reusable object with its own matching methods. The re module's functions cache recently used patterns, so the speed gain is often modest, but a compiled object avoids the repeated lookup and keeps the pattern in one place. In the example below, I'll show you how to compile a regex parameter:
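A minimal sketch of compiling once and reusing the resulting object on two search spaces follows. The contents of seq1 and seq2 are made up, and the names compiled_regex, hits_seq1, and hits_seq2 are assumptions:

```python
import re

regex = 'CAC'

# Two illustrative search spaces
seq1 = 'ATGCACGTACACTT'
seq2 = 'GGCACCACTA'

# Compile the pattern once, then reuse the compiled object
compiled_regex = re.compile(regex)

# The compiled object exposes the same operations as methods
hits_seq1 = compiled_regex.findall(seq1)
hits_seq2 = compiled_regex.findall(seq2)

print(hits_seq1)
print(hits_seq2)
```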
In the code above I define my regex parameter (regex = 'CAC') followed by two different search spaces, seq1 and seq2. I then use the re.compile() function to compile my regex parameter.
◈ ◈ ◈
🧬 Identifying Binding Motifs With Regex 💻
A DNA binding motif is a short, specific sequence of nucleotides that is recognized and bound by a regulatory protein, such as a transcription factor. Binding motifs play an important role in gene regulation since they are the binding sites for proteins that control gene transcription, and as a result, understanding them can help us unravel the complexity of gene regulation.
Furthermore, by identifying binding motifs and their corresponding regulatory proteins, researchers can gain insights into the control mechanisms governing various cellular processes, such as development and environmental responses.
In the sample code below, I'll show you how to use regular expressions to identify DNA binding motifs in a bacterial genome. Specifically, I'll demonstrate how to find the location of binding motifs for the Cra protein in Escherichia coli, which is depicted below¹:
Cra, also known as the catabolite repressor/activator, is a protein that regulates carbon metabolism and the utilization of sugars as food sources in bacteria. Understanding the regulatory mechanisms controlled by proteins like Cra is important for elucidating bacteria's metabolic strategies in various environments. For example, the ability to finely tune gene expression based on nutrient availability contributes to the adaptability and survival of bacteria like E. coli in diverse ecological niches.
The code above is designed to find occurrences of a specific DNA pattern represented by the regular expression r'TGAATCG.TT..', chosen based on the Cra binding consensus depicted above. Note the usage of dots in positions 8, 11, and 12, which allow for greater flexibility in pattern matching since it's unclear what nucleotides should be in those positions. Next, I'll break the code above down line-by-line to show you how it works:
import re
from Bio import SeqIO
ecoli_record, = SeqIO.parse('file path...etc.', 'fasta')
ecoli_genome_seq = ecoli_record.seq
The block of code above is used to import libraries and load the genome sequence we’ll be working with:
The first line of code, import re, imports Python's regular expression library. Next, ecoli_record, = SeqIO.parse(...etc.) uses Biopython's SeqIO module (imported with from Bio import SeqIO) to parse the FASTA file containing the E. coli genome sequence. The trailing comma unpacks the single record the parser yields, so the line raises an error if the file does not contain exactly one record. Then, the code ecoli_genome_seq = ecoli_record.seq reads the record's seq attribute to extract the E. coli DNA sequence and stores it in the variable called ecoli_genome_seq.
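The trailing-comma unpacking can be illustrated without Biopython. In this sketch, fake_parse is a hypothetical stand-in for a parser that yields records one at a time. It is not part of Biopython, and the record names are made up:

```python
def fake_parse(records):
    """Hypothetical stand-in for a parser that yields records lazily."""
    yield from records

# Exactly one record: the trailing comma unpacks it into a single name
record, = fake_parse(['chromosome'])
print(record)

# More than one record: the same unpacking raises ValueError
unpack_failed = False
try:
    record, = fake_parse(['chromosome', 'plasmid'])
except ValueError:
    unpack_failed = True

print(unpack_failed)
```

This makes the single-record assumption explicit: a multi-record FASTA file stops the script rather than silently using only the first sequence. Biopython also provides SeqIO.read() for files that are expected to hold exactly one record.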
Cra_pattern = re.compile(r'TGAATCG.TT..')
Cra_pattern_matches = []
This next block of code is used to define the regex pattern (i.e., DNA binding motif) we’ll later search for:
Cra_pattern = re.compile(r'TGAATCG.TT..') defines a regular expression pattern based on the consensus Cra binding motif, using the re.compile() function. Then, an empty list called Cra_pattern_matches is created to store the sequence matches from the following for loop.
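Because the dot matches any character, the pattern also accepts ambiguity codes such as N at the flexible positions. The sketch below compares it with a stricter variant that uses the character class [ACGT]. The stricter pattern is an assumption offered for comparison, not part of the original code, and toy_seq is a made-up sequence:

```python
import re

# Pattern from the article: dots at positions 8, 11, and 12
loose_pattern = re.compile(r'TGAATCG.TT..')

# Assumed stricter variant: only unambiguous bases at the flexible positions
strict_pattern = re.compile(r'TGAATCG[ACGT]TT[ACGT]{2}')

# Made-up sequence with one clean site and one site containing an N
toy_seq = 'CCTGAATCGATTCGAATGAATCGNTTAA'

loose_hits = loose_pattern.findall(toy_seq)
strict_hits = strict_pattern.findall(toy_seq)

print(loose_hits)
print(strict_hits)
```

Which variant is preferable depends on whether the genome file contains ambiguous bases and whether you want those sites reported.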
for strand in ['+', '-']:
    if strand == '+':
        strand_sequence = ecoli_genome_seq
    else:
        strand_sequence = ecoli_genome_seq.reverse_complement()
    for match in Cra_pattern.finditer(str(strand_sequence)):
        Cra_pattern_matches.append((strand, match.start(), match.end(), match.group()))
The third major block of code is used to search for the defined regex pattern in the E. coli genome and store the match information:
First, a for loop is created to iterate over the forward and reverse complement strands of the E. coli genome, as denoted by the following code: for strand in ['+', '-']:. Next, an if statement checks which strand is being processed: if it is the forward strand, the strand sequence (strand_sequence) is set to the original sequence (ecoli_genome_seq). Otherwise, for the reverse complement strand, the strand sequence (strand_sequence) is set to the reverse complement (ecoli_genome_seq.reverse_complement()).
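To make the reverse complement step concrete, here is a plain-Python sketch of the operation on an ordinary string. The helper below is an illustration that handles only uppercase A, C, G, and T; it is not Biopython's implementation:

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

forward = 'TGAATCGATT'  # illustrative sequence
reverse = reverse_complement(forward)

print(forward)
print(reverse)
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse complement routine.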
Next, the code for match in Cra_pattern.finditer(str(strand_sequence)): converts the strand sequence to a plain string and uses the compiled pattern's finditer() method to search for non-overlapping matches for the DNA binding motif in the current strand sequence.
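"Non-overlapping" matters when sites can share bases. The sketch below contrasts a plain search with a lookahead-based search that also reports overlapping occurrences. The lookahead approach is a common regex idiom rather than something the original code uses, and the sequence is made up:

```python
import re

seq = 'TATATAT'  # illustrative sequence with overlapping TAT sites

# Plain finditer: after a match, scanning resumes at the end of that match
non_overlapping = [m.start() for m in re.finditer(r'TAT', seq)]

# A lookahead consumes no characters, so every start position is tested;
# the capturing group inside it still records the matched text
overlapping = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(TAT))', seq)]

print(non_overlapping)
print(overlapping)
```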
Then, Cra_pattern_matches.append((strand, match.start(), match.end(), match.group())) stores each match in Cra_pattern_matches as a tuple containing the strand, start position, end position, and matched sequence (i.e., the DNA binding motif).
print('Start-End, (Strand), Sequence')
for strand, start, end, seq in Cra_pattern_matches:
    print('%d-%d (%s): %s' % (start, end, strand, seq))
Finally, the last block of code is to display the results of our search:
The print statement is used to create a header for our results. Then, a for loop iterates over our stored matches, and for each match it prints the start and end positions, whether it's on the forward or reverse strand, and the specific matched sequence (remember, since we used the dot notation in the regex, not all matches will be identical). Note that positions for matches on the '-' strand are measured along the reverse complement sequence, not along the forward strand.
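If you want every hit reported in forward-strand coordinates, reverse-strand positions can be converted using the sequence length. The function below is a sketch under that assumption. It works on plain strings, uses a made-up toy genome, and is not the original article's code:

```python
import re

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def find_motif_both_strands(pattern, genome):
    """Return (strand, start, end, match) tuples in forward-strand coordinates."""
    hits = []
    length = len(genome)
    # Forward strand: positions are already forward coordinates
    for m in pattern.finditer(genome):
        hits.append(('+', m.start(), m.end(), m.group()))
    # Reverse strand: search the reverse complement, then flip the positions
    revcomp = genome.translate(COMPLEMENT)[::-1]
    for m in pattern.finditer(revcomp):
        hits.append(('-', length - m.end(), length - m.start(), m.group()))
    return hits

cra_pattern = re.compile(r'TGAATCG.TT..')

# Toy genome: one site on the forward strand and one on the reverse strand
toy_genome = 'GG' + 'TGAATCGATTCC' + 'AAAA' + 'GGAATCGATTCA' + 'CC'
toy_hits = find_motif_both_strands(cra_pattern, toy_genome)

for strand, start, end, seq in toy_hits:
    print('%d-%d (%s): %s' % (start, end, strand, seq))
```

With this conversion, slicing the forward sequence at a '-' hit's start and end returns the reverse complement of the reported match.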
The sample code above is useful for identifying occurrences of a specific DNA motif or pattern in the genome of E. coli. The Cra pattern represents a specific binding site, or motif, and the code is designed to find instances of this motif in both the forward and reverse complement strands. This kind of analysis is common in bioinformatics, especially when studying gene regulation, transcription factor binding sites, or other functional elements in genomic sequences.
◈ ◈ ◈
¹ You can download the entire genome sequence of the K-12 strain of E. coli as a FASTA file on the NCBI website.