Question

Python: regular expression searching error

2

Entering edit mode

3.7 years ago

jamie.pike ▴ 80

Can anybody work out why my python script is not working properly? I want to search through a FASTA looking for a specific sequence (CAGTGGG..GCAA[TA]AA and the reverse complement), then print the location of the sequence. However, the output only prints the location of the "forward" sequence and the location is incorrect (in both examples I'm sure it should be 2/3) and doesn't seem to identify both locations. I can't work out why. So far I have the following code:

#!/bin/python

#My Mimp_finder

import re

from Bio import SeqIO

   for seq in SeqIO.parse("Focub_mimps.fasta", "fasta"):
        print("\n\n")
        print(seq)
        hit = re.search(r"CAGTGGG..GCAA[TA]AA", str(seq))
        hit_rc = re.search(r"TT[TA]TTGC..CCCACTG", str(seq))
        if hit and hit_rc:
            print(f'complete mimp found at {hit.start()} to {hit_rc.end()}')
        elif hit:
            print(f'Partial forward mimp found at {hit.start()}')
        elif hit_rc:
            print(f'Partial rc mimp found at {hit_rc.end()}')
        else:
            print("No Mimps")

Here is an example of the FASTA (matching sequences in bold):

Focub_II5_mimp_1__contig_1.16(656599:656809) TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA Focub_II5_mimp_2__contig_1.47(41315:41540) TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

The output for these sequences:

ID: Focub_II5_mimp_1__contig_1.16(656599:656809)
Name: Focub_II5_mimp_1__contig_1.16(656599:656809)
Description: Focub_II5_mimp_1__contig_1.16(656599:656809)
Number of features: 0
Seq('TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGA...GTA', SingleLetterAlphabet())
Partial forward mimp found at 187



ID: Focub_II5_mimp_2__contig_1.47(41315:41540)
Name: Focub_II5_mimp_2__contig_1.47(41315:41540)
Description: Focub_II5_mimp_2__contig_1.47(41315:41540)
Number of features: 0
Seq('TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTA...GTA', SingleLetterAlphabet())
Partial forward mimp found at 181

Why are only the "forward" mimp locations printed?

P.S. any recommendations for improvement or helpful tips will be greatly appreciated - I am just starting out.

Thanks,

Jamie

python Regular Expression Error BioPython SeqIO • 1.8k views

ADD COMMENT • link updated 3.7 years ago by Joe 21k • written 3.7 years ago by jamie.pike ▴ 80

0

Entering edit mode

According to your code, you're looking for the reverse complement in the non reverse complement.

ADD REPLY • link 3.7 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Apologies, I don't quite follow your point? I think it is printing the reverse complement location as the non-reverse complement location: "Partial forward mimp found at 181", but I don't see where the issue in the code is to cause this?

ADD REPLY • link 3.7 years ago by jamie.pike ▴ 80

0

Entering edit mode

I have now spotted that the code is:

 if hit and hit_rc:
            print(f'complete mimp found at {hit_rc.start()} to {hit_rc.end()}')

and have changed it to:

 if hit and hit_rc:
        print(f'complete mimp found at {hit.start()} to {hit_rc.end()}')

However the problem remains.

ADD REPLY • link 3.7 years ago by jamie.pike ▴ 80

0

Entering edit mode

When you do str(seq), you have a string containing the forward sequence. Searching the reverse complement in that string will not yield anything. You would need to reverse complement that string yourself (or compute the reverse complement from the BioSeq object and then convert that into a string) and search against that.

ADD REPLY • link 3.7 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Can you please provide the one or few sequences containing "TT[TA]TTGC..CCCACTG"?

ADD REPLY • link 3.7 years ago by Renesh ★ 2.2k

0

Entering edit mode

I believe I have fixed it, but I have provided some sequences in case you may be able to improve it further.

    >Focub_II5_mimp_1__contig_1.16(656599:656809)_Does_no_contain_CAGTGGG..GCAA[TA]AA
    CGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
    >Focub_II5_mimp_2__contig_1.47(41315:41540)_Does_not_conatin_TT[TA]TTGC..CCCACT
   TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCT
    >Focub_II5_mimp_3__contig_1.65(13656:13882)
    TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
    >Focub_II5_mimp_4__contig_1.70(61591:61809)
    TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
    >Focub_II5_mimp_5__contig_1.70(63407:63595)
    TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGGTAGATCTGATGCTCGGAAGCTAGTCAACGGTGGCTTGTCAAGTCCTTGATACCTTACATTACACTTCAATCAGGTGATATCCTAGTATTGAGTACCTAAGCTAAGCTCGGGCAATTTACACACCTGCGAATACTTTTTGCACCCCACTGTA
    >Focub_II5_mimp_6__contig_1.86(922:1140)
    TACAGTGGGATGCAATAAGTTTGAATCCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAATGCTTTAGCTTGATAAAGTCAATGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACATAGGAAACTGGTCTTACTTCGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA

ADD REPLY • link 3.7 years ago by jamie.pike ▴ 80

0

Entering edit mode

3.7 years ago

jamie.pike ▴ 80

I think fixed it!!!

I believe the issue, was with how the sequence was being processed. When using BioPythons module SeqIO to provide the sequence for re.search, no reverse complement was identified. However, when I use the module Bio.Seq to search the sequences the reverse complements are found and the code works. I'm not exactly certain why this works... but it does for now!! If anyone can explain, that would be helpful.

Fixed code:

import re

from Bio import SeqIO
from Bio.Seq import Seq

for sequence in SeqIO.parse("Focub_mimps.fasta", "fasta"):
    print("\n\n")
    print(str(sequence))
    hit = re.search(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    hit_rc = re.search(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
    if hit and hit_rc:
        print(f'complete mimp found at {hit.start()} to {hit_rc.end()}')
    elif hit:
        print(f'Partial forward mimp found at {hit.start()}')
    elif hit_rc:
        print(f'Partial rc mimp found at {hit_rc.end()}')
    else:
        print("No Mimps")

Example output:

ID: Focub_II5_mimp_1__contig_1.16(656599:656809)_Should_report_partial_for_rc
Name: Focub_II5_mimp_1__contig_1.16(656599:656809)_Should_report_partial_for_rc
Description: Focub_II5_mimp_1__contig_1.16(656599:656809)_Should_report_partial_for_rc
Number of features: 0
Seq('CGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCA...GTA', SingleLetterAlphabet())
Partial rc mimp found at 185



ID: Focub_II5_mimp_2__contig_1.47(41315:41540)_should_report_partial_forward
Name: Focub_II5_mimp_2__contig_1.47(41315:41540)_should_report_partial_forward
Description: Focub_II5_mimp_2__contig_1.47(41315:41540)_should_report_partial_forward
Number of features: 0
Seq('TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTA...TCT', SingleLetterAlphabet())
Partial forward mimp found at 2



ID: Focub_II5_mimp_3__contig_1.65(13656:13882)
Name: Focub_II5_mimp_3__contig_1.65(13656:13882)
Description: Focub_II5_mimp_3__contig_1.65(13656:13882)
Number of features: 0
Seq('TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCT...GTA', SingleLetterAlphabet())
complete mimp found at 2 to 224

Example input:

>Focub_II5_mimp_1__contig_1.16(656599:656809)_Should_report_partial_for_rc
CGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)_should_report_partial_forward
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCT
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA

ADD COMMENT • link 3.7 years ago by jamie.pike ▴ 80

0

Entering edit mode

Are you sure the output is correct? All you've done is imported a submodule. With your code as it is, this shouldn't have fixed anything...

Also, you need to search the reverse complement of the sequence, not just use a reverse complemented regex (as one of the other comments points out) if you want to find second strand hits. BioPython stores no RC information by default when you iterate the fasta file, it will only ever read the sequences as they are. If you want to consider the reverse complement of the sequence too, you need to explicitly calculate it. The method is contained within the Seq object, which is encapsulated within the SeqRecord object when you read a file.

ADD REPLY • link 3.7 years ago by Joe 21k

0

Entering edit mode

I have checked the output and it seems to be correct. I'm looking for a DNA transposon using the TIRs. So I would look for a single-stranded sequence of nucleotides (the regex) followed downstream by its reverse complement:

"CAGTGGG..GCAA[TA]AA" and then downstream its reverse complement, "TT[TA]TTGC..CCCACTG"

The sequences I have been using as my input contain this TIR. So, correct me if I am wrong, I don't believe I need to search for the TIR by searching the reverse complement of the sequence. I apologise, I should have stated that I am looking for TIRs and not "I want to search through a FASTA looking for a specific sequence (CAGTGGG..GCAA[TA]AA and the reverse complement)". I hope that this makes sense.

I think by importing the submodule it is actually searching the sequence, rather than the output from the SeqIO module when using re.search(r"CAGTGGG..GCAA[TA]AA", (str(sequence)). For instance, when I use print(str(sequence)) the output is some of the sequence info and then only the first part of the sequence (containing the "forward" regex), whereas when I use print(str(sequence.seq)) the output is the only sequence e.g.:

code:

from Bio import SeqIO
from Bio.Seq import Seq

for sequence in SeqIO.parse("Mimp_finder_test.fasta", "fasta"):
    print("\n\n Bio_SeqIO")
    print(str(sequence))
    print("\n\n_Bio.Seq")
    print(str(sequence.seq))

input:

>Focub_B2_mimp_2__contig_135(1675161:1675386)
TACAGTGGGGGGCAATAAGTATGAATACCCTTTGATGTACTGACACACACCTCTTAGCCTAAAACGAGCTATGTTAACTCCTAATCCTGGTTAGAGGTTTTCCCTAACCAATTAGTATAAGAGATAGCTATGTGTGAGGTTAATATACACACTCTAGCTGTTAAAATGGGCTAGCGGCAGAAAACAATACACGCCCGGTATTCATACTTATTGCCTCCCACTGTA

output:

    Bio_SeqIO
ID: Focub_B2_mimp_2__contig_135(1675161:1675386)
Name: Focub_B2_mimp_2__contig_135(1675161:1675386)
Description: Focub_B2_mimp_2__contig_135(1675161:1675386)
Number of features: 0
Seq('TACAGTGGGGGGCAATAAGTATGAATACCCTTTGATGTACTGACACACACCTCT...GTA', SingleLetterAlphabet())


_Bio.Seq
TACAGTGGGGGGCAATAAGTATGAATACCCTTTGATGTACTGACACACACCTCTTAGCCTAAAACGAGCTATGTTAACTCCTAATCCTGGTTAGAGGTTTTCCCTAACCAATTAGTATAAGAGATAGCTATGTGTGAGGTTAATATACACACTCTAGCTGTTAAAATGGGCTAGCGGCAGAAAACAATACACGCCCGGTATTCATACTTATTGCCTCCCACTGTA

I hope that this makes sense. If there still seems like there is an issue and I have missed your point, I apologise. I am still very new to this.

Thank you

ADD REPLY • link 3.7 years ago by jamie.pike ▴ 80

1

Entering edit mode

I see, thanks for clarifying, in that case, I think you are right to search the same strand for both of the recognition sequences (though it would also be valid to search for the same sequence on the reverse strand as the end result should be the same I think).

I could be wrong, but it would strike me that the latter approach is probably more robust anyway, as it means that you will automatically find the 2 recognition sites which are farthest apart from one another, since you'll scan the sequence from each 5' direction, versus the current approach which would. If, by chance you happen to have 2 of one kind of site, and 1 of the other, one will get missed - but I'm not sure if this is a problem for what your doing so might be academic anyway.

I think by importing the submodule it is actually searching the sequence, rather than the output from the SeqIO module when using re.search(r"CAGTGGG..GCAA[TA]AA", (str(sequence)). For instance, when I use print(str(sequence)) the output is some of the sequence info and then only the first part of the sequence (containing the "forward" regex), whereas when I use print(str(sequence.seq)) the output is the only sequence...

This is where you're partly wrong, and you would be well served in familiarising yourself with BioPythons SeqRecord and Seq object models. Importing Bio.Seq has no bearing on this.

The reason print(str(sequence)) prints:

ID: Focub_B2_mimp_2__contig_135(1675161:1675386)
Name: Focub_B2_mimp_2__contig_135(1675161:1675386)
Description: Focub_B2_mimp_2__contig_135(1675161:1675386)
Number of features: 0
Seq('TACAGTGGGGGGCAATAAGTATGAATACCCTTTGATGTACTGACACACACCTCT...GTA', SingleLetterAlphabet())

Is because you're confusing yourself with the variable names you've used. In this case sequence is a SeqRecord, which contains a Seq object, which is accessed when you call, in your case, sequence.seq. You do not need to import Seq to access this object, so your 'fix' isn't what you think, and it makes me worry that actually you've just managed to conceal the underlying issue without realising it.

As a learning exercise, compare the output of the following (using your nomenclature):

>>> print(sequence)
>>> print(str(sequence)

These are the same. Compared to,

>>> print(sequence.seq)
>>> print(str(sequence.seq)

To really satisfy yourself of this, call both objects without the print():

>>> sequence
SeqRecord(seq=Seq('TACAGTGGGATGCAATAAGTTTGAATCCAGGCTGAAGTACCAGCTGTTGTAATC...GTA', SingleLetterAlphabet()), id='Focub_II5_mimp_6__contig_1.86(922:1140)', name='Focub_II5_mimp_6__contig_1.86(922:1140)', description='Focub_II5_mimp_6__contig_1.86(922:1140)', dbxrefs=[])
>>> sequence.seq
Seq('TACAGTGGGATGCAATAAGTTTGAATCCAGGCTGAAGTACCAGCTGTTGTAATC...GTA', SingleLetterAlphabet())

These are different objects, and SeqRecord contains an instance of Seq which holds the sequence information.

I assure you, importing Bio.Seq has not fixed this in the way you think it has. It may be the case that your code happens to work now anyway, but be wary of this, because it may actually be failing in a way that your test data doesn't capture, and doing so silently so it may ruin your subsequent analysis.

I'll update my code now and see if it matches what you expect.

ADD REPLY • link 3.7 years ago by Joe 21k

1

Entering edit mode

Thank you for clarifying. I have now removed the Bio.seq module from my code posted earlier, and it still works. I'm not sure what "fixed" it, but it appears I was way off! I shall certainly heed your advice and will try to familiarise my self with more of BioPythons modules. Thank you for all your help and guidance, the new code you posted which works tremendously.

ADD REPLY • link 3.7 years ago by jamie.pike ▴ 80

score 4 · Accepted Answer · 2020-08-10

NB, this post is heavily edited from its original content due to length.

I've revised the code with the clarification from the OP, and I think this (and their) code should work - though not for the reasons they thought.

import re, sys
from Bio import SeqIO

def red(text):
    return "\033[0;31m{}\033[0m".format(text)

def blue(text):
    return "\033[0;34m{}\033[0m".format(text)

regex = re.compile("CAGTGGG..GCAA[TA]AA")
regex_rc = re.compile("TT[TA]TTGC..CCCACTG")

for record in SeqIO.parse(sys.argv[1], "fasta"):
    hit = re.search(regex, str(record.seq))
    hit_rc = re.search(regex_rc, str(record.seq))

    print("="*80 + "\n", record)
    if hit and hit_rc:
        record.seq = str(record.seq).replace(hit.group(0), red(hit.group(0))).replace(hit_rc.group(0), blue(hit_rc.group(0)))
        print(f"MIMPs found on forward strand at {hit.start()}:{hit.end()}, and reverse strand at {hit_rc.start()}:{hit_rc.end()}")
        print(record.seq)
    elif hit and not hit_rc:
        record.seq = str(record.seq).replace(hit.group(0), red(hit.group(0)))
        print(f"Partial MIMP found on forward strand only at {hit.start()}:{hit.end()}.")
        print(record.seq)
    elif hit_rc and not hit:
        record.seq = str(record.seq).replace(hit_rc.group(0), blue(hit_rc.group(0)))
        print(f"Partial MIMP found on reverse strand only at {hit_rc.start()}:{hit_rc.end()}")
        print(record.seq)
    elif not hit and not hit_rc:
        print(f"No MIMPs found on either strand")

I still think a more robust method might be to search the reverse strand specifically and adjust the coordinates relative to the forward strand, but the recognition sequence is long enough that the odds of it appearing by chance are relatively low I think.

For clarity of output, the code above colours the result:

I believe this is correct. The code to colour the matches can be thrown away. It seems my previous code searching the reverse strand was correct, and the hits coming up at 2:18 all the time, was because the target sites are (in this test data) always 2 bases in from the 5' end of the sequence.

At least for this sample set, searching the forward strand with the regex and its reverse complement appears to work satisfactorily, however it is worth being aware that the greedy/conservative nature of regex capture will affect which sequence is captured in the event that there are more than one per query sequence, since the current regex strategy does not use `findall`.