PARSE by gene name in biopython
1
1
Entering edit mode
5.1 years ago

HI

I am trying to parse a fasta file by a key word in the description using biopython (1.73). I am a novice and perhaps this question has been asked and answered but I cant seem to find the solution myself.

My fasta file looks like this. I want to create a file that contains only the sequence with gene=balVIM

>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCGAGTCCGCTCG
CTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCGGTCGGGGAGGTCCGGCTTTA
>lcl|KP749829.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAATTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCGAGTCCGCTCG
CTTT
>lcl|EU118148.2_gene_1 [gene=intI1] [location=complement(<1..231)] [gbkey=Gene]
ATGAAAACCGCCACTGCGCCGTTACCACCGCTGCGTTCGGTCAAGGTTCTGGACCAGTTGCGTGAGCGCA
TACGCTACTTGCATTACAGTTTACGAACCGAACAGGCTTATGTCAACTGGGTTCGTGCCTTCATCCGTTT
CCACGGTGTGCGTCACCCGGCAACCTTGGGCAGCAGCGAAGTCGAGGCATTTCTGTCCTGGCTGGCGAAC
GAGCGCAAGGTTTCGGTCTCC
>lcl|EU118148.2_gene_2 [gene=aacA29a] [location=485..880] [gbkey=Gene]
GTTTCGATCTTACCTGTGAAAGAACAAGACGCTGCCGACTGGCTAGCGCTGCGGAATCTTCTTTGGCTCG

Thank You

sequence gene • 1.4k views
ADD COMMENT
0
Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

I am trying to parse a fasta file by a key word in the description using biopython (1.73).

Please show us what you tried.

ADD REPLY
3
Entering edit mode
5.1 years ago
Joe 21k

It's pretty straightfoward. What you'd probably find yourself caught out by is that biopython discards header info after the first space in the sequence.id field, so instead, you simply need to test the sequence.description field which retains it all.

i.e., if the first sequence is called sequence and is:

>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCG
GTCGGGGAGGTCCGGCTTTA

Then sequence.id becomes: lcl|KP771862.1_gene_1, and sequence.description is lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]

Is just a matter of checking if "blaVIM" is in sequence.description.

>>> from Bio import SeqIO
>>> for rec in SeqIO.parse('bla.fasta', 'fasta'):
...     if "blaVIM" in rec.description:
...         print(rec.format('fasta'))
...
>lcl|KP771862.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAACTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTTTTCCGTAGATTCTAGCGGTGAGTATCCGACAGTCAGCGAAATTCCG
GTCGGGGAGGTCCGGCTTTA

>lcl|KP749829.1_gene_1 [gene=blaVIM] [location=1..801] [gbkey=Gene]
ATGTTCAAATTTTTGAGTAAGTTATTGGTCTATTTGACCGCGTCTATCATGGCTATTGCG
AGTCCGCTCGCTTT
ADD COMMENT
0
Entering edit mode

Thank you. This is what I needed.

ADD REPLY
1
Entering edit mode

Great, make sure you go ahead and accept the answer if its resolved to provide closure for the thread :)

ADD REPLY

Login before adding your answer.

Traffic: 2735 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6