Retrieve Protein Id And Description From Fasta Header Using Python?
12.2 years ago
Martin ▴ 10

I have looked through all of the posts tagged FASTA, but none seems to address this particular issue.

I have the following format for my headers:

>GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690

I would like to learn how to retrieve the "LinJ.35.4080" (let's call this proteinID) as well as the "product=ATP-dependent RNA helicase, putative" (call this proteindescription) fields for all the records in my file and create a two column text file (proteinID, proteindescription).

If known, it would be helpful to remove the "product=" from the field as well.

I have searched through all of the online tutorials (that I could find) related to FASTA, Biopython, Python, etc., and thought (foolishly) my first attempt at this would go a bit smoother.

Thank you in advance for your help and time; this is something that will be used often. I hope I haven't overlooked someone else's solution to this.

fasta biopython protein python • 3.8k views
12.2 years ago
Ijessie ▴ 70

You can do this with Python's regular expression module, re:

$ python
>>> import re
>>> s = ">GeneDB|LinJ.35.4080 | organism=Leishmania_infantum | product=ATP-dependent RNA helicase, putative | location=LinJ.35:1596105-1598177(+) | length=690"
>>> m = re.search(r'(?m)\>GeneDB\|([^|]*)\s\|.*product=([^|]*)\s\|', s)
>>> if m:
...     pro_id = m.group(1)
...     pro_de = m.group(2)
... 
>>> pro_id, pro_de
('LinJ.35.4080', 'ATP-dependent RNA helicase, putative')
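
To go from a single header to the whole file, the same pattern can be applied line by line. This is only a minimal sketch: it assumes every header starts with ">GeneDB|" and that the product= field is always followed by another " |" field, as in the example above. The names parse_headers and write_table and the tab-separated output are my own choices for illustration.

```python
import re

# Same pattern as in the session above, anchored at the start of the line.
# Assumption: "product=..." is never the last field, so a " |" follows it.
HEADER_RE = re.compile(r'^>GeneDB\|([^|]*)\s\|.*product=([^|]*)\s\|')


def parse_headers(lines):
    """Yield (protein_id, description) for each header line that matches."""
    for line in lines:
        m = HEADER_RE.search(line)
        if m:
            yield m.group(1), m.group(2)


def write_table(fasta_path, out_path):
    """Write one tab-separated 'proteinID<TAB>description' line per record."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for pro_id, pro_de in parse_headers(fasta):
            out.write(f"{pro_id}\t{pro_de}\n")
```

Calling write_table("myfile.fa", "proteins.tsv") (both file names are placeholders) would then produce the two-column file. Lines that do not match the pattern, including sequence lines, are skipped silently, so it is worth checking the output line count against the number of ">" lines in the input.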
12.2 years ago
Neilfws 49k

No need for Python or any complicated programming. Assuming that all the FASTA headers in your file (say, myfile.fa) have the same form, with fields separated by "|", just use grep + awk + sed:

grep "^>" myfile.fa | awk 'BEGIN {FS="|"} {print $2, $4}' | sed 's/product=//'

Result:

LinJ.35.4080   ATP-dependent RNA helicase, putative

Agreed, but if you really want to use Python, you can split each header on "|" with the string split method -> http://docs.python.org/library/string.html#string.split
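
A minimal sketch of the split approach, assuming the header layout from the question (fields separated by "|", the ID in the second field, and the description in a field starting with "product="). The function name parse_header is made up for illustration.

```python
def parse_header(header):
    """Return (protein_id, description) from one '|'-delimited FASTA header."""
    # Drop the leading '>' and trim the whitespace around each field.
    fields = [f.strip() for f in header.lstrip(">").split("|")]
    protein_id = fields[1]  # assumption: the ID is always the second field

    description = ""
    for field in fields[2:]:
        key, sep, value = field.partition("=")
        if sep and key == "product":
            description = value  # text after 'product=' only
            break
    return protein_id, description
```

Looking the product field up by its key rather than by position means the function still works if the order of the key=value fields changes, and it returns an empty description when no product= field is present.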
