Can anybody work out why my python script is not working properly? I want to search through a FASTA looking for a specific sequence (CAGTGGG..GCAA[TA]AA and the reverse complement), then print the location of the sequence. However, the output only prints the location of the "forward" sequence and the location is incorrect (in both examples I'm sure it should be 2/3) and doesn't seem to identify both locations. I can't work out why. So far I have the following code:
#!/bin/python
#My Mimp_finder
import re
from Bio import SeqIO
for seq in SeqIO.parse("Focub_mimps.fasta", "fasta"):
print("\n\n")
print(seq)
hit = re.search(r"CAGTGGG..GCAA[TA]AA", str(seq))
hit_rc = re.search(r"TT[TA]TTGC..CCCACTG", str(seq))
if hit and hit_rc:
print(f'complete mimp found at {hit.start()} to {hit_rc.end()}')
elif hit:
print(f'Partial forward mimp found at {hit.start()}')
elif hit_rc:
print(f'Partial rc mimp found at {hit_rc.end()}')
else:
print("No Mimps")
Here is an example of the FASTA (matching sequences in bold):
Focub_II5_mimp_1__contig_1.16(656599:656809) TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA Focub_II5_mimp_2__contig_1.47(41315:41540) TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
The output for these sequences:
ID: Focub_II5_mimp_1__contig_1.16(656599:656809)
Name: Focub_II5_mimp_1__contig_1.16(656599:656809)
Description: Focub_II5_mimp_1__contig_1.16(656599:656809)
Number of features: 0
Seq('TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGA...GTA', SingleLetterAlphabet())
Partial forward mimp found at 187
ID: Focub_II5_mimp_2__contig_1.47(41315:41540)
Name: Focub_II5_mimp_2__contig_1.47(41315:41540)
Description: Focub_II5_mimp_2__contig_1.47(41315:41540)
Number of features: 0
Seq('TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTA...GTA', SingleLetterAlphabet())
Partial forward mimp found at 181
Why are only the "forward" mimp locations printed?
P.S. any recommendations for improvement or helpful tips will be greatly appreciated - I am just starting out.
Thanks,
Jamie
According to your code, you're looking for the reverse complement in the non reverse complement.
Apologies, I don't quite follow your point? I think it is printing the reverse complement location as the non-reverse complement location: "
Partial forward mimp found at 181
", but I don't see where the issue in the code is to cause this?I have now spotted that the code is:
and have changed it to:
However the problem remains.
When you do
str(seq)
, you have a string containing the forward sequence. Searching the reverse complement in that string will not yield anything. You would need to reverse complement that string yourself (or compute the reverse complement from the BioSeq object and then convert that into a string) and search against that.Can you please provide the one or few sequences containing "TT[TA]TTGC..CCCACTG"?
I believe I have fixed it, but I have provided some sequences in case you may be able to improve it further.