Question

PARSE by gene name in biopython

1

Entering edit mode

5.1 years ago

steven.marshall2 ▴ 10

HI

I am trying to parse a fasta file by a key word in the description using biopython (1.73). I am a novice and perhaps this question has been asked and answered but I cant seem to find the solution myself.

My fasta file looks like this. I want to create a file that contains only the sequence with gene=balVIM

>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCGAGTCCGCTCG
CTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCGGTCGGGGAGGTCCGGCTTTA
>lcl|KP749829.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAATTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCGAGTCCGCTCG
CTTT
>lcl|EU118148.2_gene_1 [gene=intI1] [location=complement(<1..231)] [gbkey=Gene]
ATGAAAACCGCCACTGCGCCGTTACCACCGCTGCGTTCGGTCAAGGTTCTGGACCAGTTGCGTGAGCGCA
TACGCTACTTGCATTACAGTTTACGAACCGAACAGGCTTATGTCAACTGGGTTCGTGCCTTCATCCGTTT
CCACGGTGTGCGTCACCCGGCAACCTTGGGCAGCAGCGAAGTCGAGGCATTTCTGTCCTGGCTGGCGAAC
GAGCGCAAGGTTTCGGTCTCC
>lcl|EU118148.2_gene_2 [gene=aacA29a] [location=485..880] [gbkey=Gene]
GTTTCGATCTTACCTGTGAAAGAACAAGACGCTGCCGACTGGCTAGCGCTGCGGAATCTTCTTTGGCTCG

Thank You

sequence gene • 1.4k views

ADD COMMENT • link updated 5.1 years ago by Joe 21k • written 5.1 years ago by steven.marshall2 ▴ 10

0

Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 5.1 years ago by WouterDeCoster 47k

0

Entering edit mode

I am trying to parse a fasta file by a key word in the description using biopython (1.73).

Please show us what you tried.

ADD REPLY • link 5.1 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2019-03-05

It's pretty straightfoward. What you'd probably find yourself caught out by is that biopython discards header info after the first space in the sequence.id field, so instead, you simply need to test the sequence.description field which retains it all.

i.e., if the first sequence is called sequence and is:

>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCG
GTCGGGGAGGTCCGGCTTTA

Then sequence.id becomes: lcl|KP771862.1_gene_1, and sequence.description is lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]

Is just a matter of checking if "blaVIM" is in sequence.description.

>>> from Bio import SeqIO
>>> for rec in SeqIO.parse('bla.fasta', 'fasta'):
...     if "blaVIM" in rec.description:
...         print(rec.format('fasta'))
...
>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCG
GTCGGGGAGGTCCGGCTTTA

>lcl|KP749829.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAATTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTT