Question: Retrieve Protein Id And Description From Fasta Header Using Python?
Martin wrote (8.9 years ago):

I have looked through all of the posts tagged FASTA, but none seem to address this particular issue.

I have the following format for my headers:

>GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690

I would like to learn how to retrieve the "LinJ.35.4080" (let's call this proteinID) as well as the "product=ATP-dependent RNA helicase, putative" (call this proteindescription) fields for all the records in my file and create a two column text file (proteinID, proteindescription).

If known, it would be helpful to remove the "product=" from the field as well.

I have searched through all of the online tutorials (that I could find) related to FASTA, Biopython, Python, etc., and thought (foolishly) my first attempt at this would go a bit smoother.

Thank you in advance for your help and time, this is something that will be used often. I'm hoping I didn't overlook someone else's solution to this.

fasta python protein biopython • 2.8k views
Ijessie wrote (8.9 years ago):

You can use Python's regular expression module (re) for this:

$ python
>>> import re
>>> s = ">GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690"
>>> m = re.search(r'(?m)\>GeneDB\|([^|]*)\s\|.*product=([^|]*)\s\|', s)
>>> if m:
...     pro_id = m.group(1)
...     pro_de = m.group(2)
... 
>>> pro_id, pro_de
('LinJ.35.4080', 'ATP-dependent RNA helicase, putative')
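
To cover every record in a file, the same idea can be wrapped in a small script. This is only a sketch: the function names and the tab-separated output are my own assumptions, and the pattern is a slightly loosened variant of the one above so that it also matches when product= is the last field of the header.

```python
import re

# Group 1: the ID after "GeneDB|". Group 2: the text after "product=".
# Assumes every header of interest starts with ">GeneDB|" as in the question.
HEADER_RE = re.compile(r'^>GeneDB\|([^|]*?)\s*\|.*?product=([^|]*?)\s*(?:\||$)')

def parse_header(line):
    """Return (protein_id, description), or None if the line does not match."""
    m = HEADER_RE.search(line.strip())
    if m is None:
        return None
    return m.group(1), m.group(2)

def headers_to_table(fasta_path, out_path):
    """Write one 'id<TAB>description' line per matching FASTA header."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                parsed = parse_header(line)
                if parsed is not None:
                    out.write("\t".join(parsed) + "\n")
```

Calling headers_to_table("myfile.fa", "proteins.tsv") (both file names are placeholders) would then produce the two-column file. Headers that do not match the pattern are skipped silently, so it is worth comparing the number of output lines with the number of records.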
Neilfws (Sydney, Australia) wrote (8.9 years ago):

No need for Python or any complicated programming. Assuming that all your FASTA headers are of the same form with fields separated by "|" in file myfile.fa, just use grep + awk + sed:

grep "^>" myfile.fa | awk 'BEGIN {FS="|"} {print $2, $4}' | sed 's/product=//'

Result:

LinJ.35.4080   ATP-dependent RNA helicase, putative

Robert Ernst replied (8.9 years ago):

Agreed, but if you really want to use Python, you can use the string split method -> http://docs.python.org/library/string.html#string.split
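
For what it's worth, the split suggestion could look roughly like the sketch below. The function names are made up for illustration, and it assumes that the ID is always the second "|"-separated field and that no field value contains a "|" itself. The product field is looked up by its key rather than by position, so a header with the fields in a different order should still work.

```python
def parse_header_split(line):
    """Split a '|'-delimited FASTA header into (protein_id, description).

    Returns None if the header has fewer than two fields. The description
    is an empty string when there is no 'product=' field.
    """
    fields = [f.strip() for f in line.lstrip(">").split("|")]
    if len(fields) < 2:
        return None
    protein_id = fields[1]
    description = ""
    for field in fields[2:]:
        key, _, value = field.partition("=")
        if key == "product":
            description = value
            break
    return protein_id, description

def rows_from_lines(lines):
    """Yield (protein_id, description) for each header line in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            parsed = parse_header_split(line)
            if parsed is not None:
                yield parsed
```

Since an open file is an iterable of lines, rows_from_lines(open("myfile.fa")) can be looped over directly, writing "\t".join(row) for each row to get the two-column file.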