Question

Retrieve Protein Id And Description From Fasta Header Using Python?

13.3 years ago

Martin ▴ 10

I have looked through all of the posts tagged Fasta, but none seem to address this particular issue.

I have the following format for my headers:

>GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690

I would like to learn how to retrieve the "LinJ.35.4080" (let's call this proteinID) as well as the "product=ATP-dependent RNA helicase, putative" (call this proteindescription) fields for all the records in my file and create a two column text file (proteinID, proteindescription).

If known, it would be helpful to remove the "product=" from the field as well.

I have searched through all of the online tutorials (that I could find) related to FASTA, Biopython, Python, etc., and thought (foolishly) my first attempt at this would go a bit smoother.

Thank you in advance for your help and time, this is something that will be used often. I'm hoping I didn't overlook someone else's solution to this.

fasta biopython protein python • 4.1k views

score 1 · Answer 1 · 2012-03-03

You can use Python's regular expression module (re) for this:

$ python
>>> import re
>>> s = ">GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690"
>>> m = re.search(r'(?m)\>GeneDB\|([^|]*)\s\|.*product=([^|]*)\s\|', s)
>>> if m:
...     pro_id = m.group(1)
...     pro_de = m.group(2)
... 
>>> pro_id, pro_de
('LinJ.35.4080', 'ATP-dependent RNA helicase, putative')
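
To apply the same idea to every record in a file and write the two-column output the question asks for, something like the following might work. It is only a sketch: the pattern is a slightly relaxed variant of the one above (it also accepts a product field at the end of the line), the function names are made up for illustration, and it assumes every header of interest starts with ">GeneDB|".

```python
import re

# Relaxed variant of the pattern above (an assumption, not the original):
# group 1 is the ID after "GeneDB|", group 2 is the text after "product="
# up to the next "|" or the end of the line.
HEADER_RE = re.compile(r'^>GeneDB\|([^|]*?)\s*\|.*?product=([^|]*?)\s*(?:\||$)')


def parse_header(line):
    """Return (protein_id, description), or None if the line does not match."""
    m = HEADER_RE.search(line.strip())
    if m is None:
        return None
    return m.group(1), m.group(2)


def headers_to_table(fasta_path, out_path):
    """Write one 'id<TAB>description' line per matching FASTA header."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if not line.startswith(">"):
                continue  # skip sequence lines
            parsed = parse_header(line)
            if parsed is not None:
                out.write("\t".join(parsed) + "\n")
```

Calling headers_to_table("myfile.fa", "proteins.tsv") would then produce the tab-separated file; both file names are placeholders. Headers that do not match the pattern are silently skipped, which may or may not be what you want.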

score 1 · Answer 2 · 2012-03-03

13.3 years ago

Neilfws 49k

No need for Python or any complicated programming. Assuming that all your FASTA headers are of the same form with fields separated by "|" in file myfile.fa, just use grep + awk + sed:

grep "^>" myfile.fa | awk 'BEGIN {FS="|"} {print $2, $4}' | sed 's/product=//'

Result:

LinJ.35.4080   ATP-dependent RNA helicase, putative

Agreed, but if you really want to use Python, you can use the string split method: http://docs.python.org/library/string.html#string.split

13.3 years ago by Robert Ernst ▴ 60
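
For what it's worth, the split approach might look roughly like this. It is a sketch under the same assumption as the awk answer (fields separated by "|", with the ID in the second field); the function name is made up, and it looks up the "product=" field by key rather than by position.

```python
def parse_header_split(line):
    """Split a FASTA header on '|' and return (protein_id, description).

    Assumes the ID is the second '|'-separated field; raises IndexError
    if the header has fewer than two fields.
    """
    fields = [f.strip() for f in line.lstrip(">").split("|")]
    protein_id = fields[1]
    description = ""  # stays empty if no product= field is present
    for field in fields[2:]:
        key, _, value = field.partition("=")
        if key == "product":
            description = value
            break
    return protein_id, description
```

Because the description is found by its "product" key, this should still work if the order of the later fields changes, which a fixed $4 in awk would not handle.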