Hi, I would like to parse a FASTA file and get all headers and sequences that match any of a set of strings (the `pattern` variable below).
What actually happens is that every header matches, so the script returns all sequences instead of only the ones I want.
Can you help me find what's wrong?
# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = "seq.fa"
# File looks like this
#>1238344 mouse
#aagctacgacatcagctaca
#>1238344 homo sapiens
#ttagcatcagcatcagctacta

# pattern to search for
#pattern = "sapiens|mouse"
pattern="Eukaryota|metagenome|[Homo Sapiens]|[Mus musculus]|[Rattus norvegicus]|Rhizobium|Gorilla|beringei|thaliana|[Oryza sativa]|Dictyostelium|mitochondria|Equus caballus|Plasmodium falciparum|Drosophila melanogaster"

# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    matches = re.findall(pattern, seq_record.description, re.I)
    if matches:
        #print "Matches = ", len(matches),"\n",
        print ">",seq_record.id," ",seq_record.description, "\n",
        print seq_record.seq,"\n",
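
One likely cause, offered as a minimal sketch rather than a definitive diagnosis: in a regular expression, square brackets define a character class, so `[Homo Sapiens]` matches any single one of those letters (or a space) instead of the literal text, and nearly every header contains at least one of them. The snippet below uses only the standard `re` module and Python 3 `print`. The `zebrafish` header and the `build_pattern` helper are hypothetical, introduced only for illustration, and the sketch assumes the bracketed terms were meant as literal text.

```python
import re

# Fragment of the pattern from the question: each [...] is a character class.
bracketed = "[Homo Sapiens]|[Mus musculus]"

# Hypothetical header that names none of the target organisms.
header = "1238344 zebrafish"

# The character classes match individual characters, so this list is not empty.
print(re.findall(bracketed, header, re.I))


def build_pattern(terms):
    """Join literal search terms into one alternation (hypothetical helper).

    re.escape neutralises regex metacharacters such as [ and ],
    so each term is matched as plain text.
    """
    return "|".join(re.escape(term) for term in terms)


# Terms are treated literally, including the brackets in the last one.
literal = build_pattern(["Homo sapiens", "Mus musculus", "[Rattus norvegicus]"])

# No literal term occurs in the zebrafish header, so nothing matches.
print(re.findall(literal, header, re.I))

# A header that really contains one of the terms still matches.
print(re.findall(literal, "1238344 homo sapiens", re.I))
```

In the original loop, a pattern built this way could stand in for the hand-written `pattern` string. Since only a yes/no answer is needed per record, `re.search` would also be enough in place of `re.findall`.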