Question

Using Biopython to search a fasta file for specific uniprots

0

Entering edit mode

7.0 years ago

chrisgbeldam ▴ 20

I am trying to use BioPython to search through a fasta file. I can get BioPython to search and return the entire fasta file. However, I want to be able to then specify a specific uniprot in that fasta file and have BioPython search for it and returns its attributes e.g. ID, seq and so on.

This is the code I have to search through the entire fasta file:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
print("%s %s %s %i" % record.id, record, record.seq, len(record)))

Here is what I started writing but doesn't seem to work:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
for record in [P01730]:
    print("%s %s %s %i" %record.id, record, record.seq, len(record)))

uniprot python biopython • 3.9k views

ADD COMMENT • link updated 7.0 years ago by Andrzej Zielezinski 11k • written 7.0 years ago by chrisgbeldam ▴ 20

score 4 · Accepted Answer · 2017-04-24

4

Entering edit mode

7.0 years ago

Andrzej Zielezinski 11k

This should do the job:

from Bio import SeqIO
myuniprots = ['P01730', 'P01731']
for record in SeqIO.parse("CD4.fasta", "fasta"):
    id = record.id.split('|')[1] if '|' in record.id else record.id
    if id in myuniprots:
        print("%s %s %i" % record.id, record.seq, len(record)))

Note: The last line should be print("%s %s %i" % (record.id, record.seq, len(record))). The formating omits the (.

ADD COMMENT • link 7.0 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thanks!

Silly comment to make but the script runs and prints nothing. Am I safe to then assume that uniprot is not in the FASTA file?

ADD REPLY • link 7.0 years ago by chrisgbeldam ▴ 20

0

Entering edit mode

Try to create a small fasta file with id P01730 and then check the script on that file.

ADD REPLY • link 7.0 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Yeap runs but prints nothing

Here is the fasta file format:

sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 OS=Homo sapiens GN=CD4 PE=1 SV=1 M NRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

ADD REPLY • link 7.0 years ago by chrisgbeldam ▴ 20

1

Entering edit mode

This is because biopython sees id as sp|P01730|CD4_HUMAN, rather than P01730. I fixed it and updated my answer.

ADD REPLY • link 7.0 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

Thank you! I didn't realise that's how BioPython see's ids

ADD REPLY • link 7.0 years ago by chrisgbeldam ▴ 20

0

Entering edit mode

Last question, why does that only return P01731 as opposed to both of them?

ADD REPLY • link 7.0 years ago by chrisgbeldam ▴ 20

0

Entering edit mode

Do you mean both sequence records or both ids?

ADD REPLY • link 7.0 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

So now if I want to grab the CD4_HUMAN aspect as the species.

Would I do something like this?

species = record.id.split('|')[2] if '|' in record.id else record.species?

ADD REPLY • link 7.0 years ago by chrisgbeldam ▴ 20

0

Entering edit mode

Yes, I can also rewrite the code for you to check whether biopython's id contains your query id. In this way, you wouldn't have to change the code anytime you look for P01731 or CD4_HUMAN. The script would be universal - it will work on any sequences (e.g. from different databases) without changing the code. However, the code would be slower. Do you want me to rewrite this?

ADD REPLY • link 7.0 years ago by Andrzej Zielezinski 11k