Using Biopython to search a fasta file for specific uniprots
1
0
Entering edit mode
7.0 years ago
chrisgbeldam ▴ 20

I am trying to use BioPython to search through a fasta file. I can get BioPython to search and return the entire fasta file. However, I want to be able to then specify a specific uniprot in that fasta file and have BioPython search for it and returns its attributes e.g. ID, seq and so on.

This is the code I have to search through the entire fasta file:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
print("%s %s %s %i" % record.id, record, record.seq, len(record)))

Here is what I started writing but doesn't seem to work:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
for record in [P01730]:
    print("%s %s %s %i" %record.id, record, record.seq, len(record)))
uniprot python biopython • 3.9k views
ADD COMMENT
4
Entering edit mode
7.0 years ago

This should do the job:

from Bio import SeqIO
myuniprots = ['P01730', 'P01731']
for record in SeqIO.parse("CD4.fasta", "fasta"):
    id = record.id.split('|')[1] if '|' in record.id else record.id
    if id in myuniprots:
        print("%s %s %i" % record.id, record.seq, len(record)))

Note: The last line should be print("%s %s %i" % (record.id, record.seq, len(record))). The formating omits the (.

ADD COMMENT
0
Entering edit mode

Thanks!

Silly comment to make but the script runs and prints nothing. Am I safe to then assume that uniprot is not in the FASTA file?

ADD REPLY
0
Entering edit mode

Try to create a small fasta file with id P01730 and then check the script on that file.

ADD REPLY
0
Entering edit mode

Yeap runs but prints nothing

Here is the fasta file format:

sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 OS=Homo sapiens GN=CD4 PE=1 SV=1 M NRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

ADD REPLY
1
Entering edit mode

This is because biopython sees id as sp|P01730|CD4_HUMAN, rather than P01730. I fixed it and updated my answer.

ADD REPLY
1
Entering edit mode

Thank you! I didn't realise that's how BioPython see's ids

ADD REPLY
0
Entering edit mode

Last question, why does that only return P01731 as opposed to both of them?

ADD REPLY
0
Entering edit mode

Do you mean both sequence records or both ids?

ADD REPLY
0
Entering edit mode

So now if I want to grab the CD4_HUMAN aspect as the species.

Would I do something like this?

species = record.id.split('|')[2] if '|' in record.id else record.species?

ADD REPLY
0
Entering edit mode

Yes, I can also rewrite the code for you to check whether biopython's id contains your query id. In this way, you wouldn't have to change the code anytime you look for P01731 or CD4_HUMAN. The script would be universal - it will work on any sequences (e.g. from different databases) without changing the code. However, the code would be slower. Do you want me to rewrite this?

ADD REPLY

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6