Question: Parse fasta header with regex in python
0
gravatar for David
2.2 years ago by
David160
David160 wrote:

Hi, I would like to parse a fasta file and get all headers and seqs that match some strings (so called pattern below).

What happens is that all the headers match and the script returns all sequences so it´s not working as expected.

Can you help finding what´s wrong ??

# biopython                                                                                                                                                                          
from Bio import SeqIO

# regex library                                                                                                                                                                      
import re

# file with FASTA sequence                                                                                                                                                           
infile = "seq.fa"

# File looks like this
#>1238344 mouse
#aagctacgacatcagctaca
#>1238344 homo sapiens
#ttagcatcagcatcagctacta


# pattern to search for                                                                                                                                                              
#pattern = "sapiens|mouse"                                                                                                                                                           
pattern="Eukaryota|metagenome|[Homo Sapiens]|[Mus musculus]|[Rattus norvegicus]|Rhizobium|Gorilla|beringei|thaliana|[Oryza sativa]|Dictyostelium|mitochondria|Equus caballus|Plasmod\
ium falciparum|Drosophila melanogaster"
# look through each FASTA sequence in the file                                                                                                                                       
for seq_record in SeqIO.parse(infile, "fasta"):


    matches = re.findall(pattern, seq_record.description, re.I)
    if matches:
        #print "Matches = ", len(matches),"\n",                                                                                                                                      
        print ">",seq_record.id," ",seq_record.description, "\n",
        print seq_record.seq,"\n",

Thanks

regex biopython python fasta • 1.6k views
ADD COMMENTlink modified 2.2 years ago by a.zielezinski8.9k • written 2.2 years ago by David160

What happens is that all the headers match and the script returns all sequences so it´s not working as expected.

Providing some examples would help. And your code snippet is sketchy at best.

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Yes, there are definitely important parts missing from that code.

ADD REPLYlink written 2.2 years ago by WouterDeCoster41k

Sorry guys my copy and paste didn´t work as expected. Hope it´s better now.

ADD REPLYlink written 2.2 years ago by David160

if matches in this case will always evaluate to True, therefore it will print everything in the input file.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

If i add the following sequence which is not in my pattern list (sorry guys forgot to add it). It is also returned.

#>1238344 mouse
#aagctacgacatcagctaca

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by David160
1

Running your above script, and adding a print line to print matches (before the if statement), yields this:

[' ', 'm', 'o', 'u', 's', 'e']
> 1238344   1238344 mouse
aagctacgacatcagctaca
[' ', 'h', 'o', 'm', 'o', ' ', 's', 'a', 'p', 'i', 'e', 'n', 's']
> 1238344   1238344 homo sapiens
ttagcatcagcatcagctacta

Don't use re.findall in this instance. Create a list, with each string you want to search for, then for each record in the input fasta, check if it is contained in the list.

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Thanks, Can you give a small example, don´t really know how to do that ?

ADD REPLYlink written 2.2 years ago by David160
5
gravatar for a.zielezinski
2.2 years ago by
a.zielezinski8.9k
a.zielezinski8.9k wrote:

This should do the work:

from Bio import SeqIO                                                                                                                                                         
infile = "seq.fa"

patterns = ["Eukaryota", "metagenome", "Homo Sapiens", 
  "Mus musculus", "Rattus norvegicus", "Rhizobium",
  "Gorilla", "beringei", "thaliana", "Oryza sativa",
  "Dictyosteliu", "mitochondria", "Equus caballus",
  "Plasmodium falciparum", "Drosophila melanogaster"
]

for seq_record in SeqIO.parse(infile, "fasta"):
    for p in patterns:
        if p.lower() in seq_record.description.lower():
            print(seq_record.format("fasta"))
            break
ADD COMMENTlink written 2.2 years ago by a.zielezinski8.9k

Great thanks for the answer

ADD REPLYlink written 2.2 years ago by David160
1

If this answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 2.2 years ago by WouterDeCoster41k

Not really important but this should be even better :

from Bio import SeqIO
infile = "./test.fasta"

patterns = ["Eukaryota", "metagenome", "Homo Sapiens",
  "Mus musculus", "Rattus norvegicus", "Rhizobium",
  "Gorilla", "beringei", "thaliana", "Oryza sativa",
  "Dictyosteliu", "mitochondria", "Equus caballus",
  "Plasmodium falciparum", "Drosophila melanogaster"
]

patterns = [pattern.lower() for pattern in patterns]
for seq_record in SeqIO.parse(infile, "fasta"):
    desc = seq_record.description.lower()
    if any(pattern in desc for pattern in patterns) :
        print(seq_record.format("fasta"))
ADD REPLYlink written 2.1 years ago by jsgounot140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 749 users visited in the last hour