Question: Using Biopython to search a fasta file for specific uniprots
0
gravatar for chrisgbeldam
24 months ago by
chrisgbeldam20
chrisgbeldam20 wrote:

I am trying to use BioPython to search through a fasta file. I can get BioPython to search and return the entire fasta file. However, I want to be able to then specify a specific uniprot in that fasta file and have BioPython search for it and returns its attributes e.g. ID, seq and so on.

This is the code I have to search through the entire fasta file:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
print("%s %s %s %i" % record.id, record, record.seq, len(record)))

Here is what I started writing but doesn't seem to work:

from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
for record in [P01730]:
    print("%s %s %s %i" %record.id, record, record.seq, len(record)))
biopython uniprot python • 1.0k views
ADD COMMENTlink modified 24 months ago by a.zielezinski8.6k • written 24 months ago by chrisgbeldam20
4
gravatar for a.zielezinski
24 months ago by
a.zielezinski8.6k
a.zielezinski8.6k wrote:

This should do the job:

from Bio import SeqIO
myuniprots = ['P01730', 'P01731']
for record in SeqIO.parse("CD4.fasta", "fasta"):
    id = record.id.split('|')[1] if '|' in record.id else record.id
    if id in myuniprots:
        print("%s %s %i" % record.id, record.seq, len(record)))

Note: The last line should be print("%s %s %i" % (record.id, record.seq, len(record))). The formating omits the (.

ADD COMMENTlink modified 24 months ago • written 24 months ago by a.zielezinski8.6k

Thanks!

Silly comment to make but the script runs and prints nothing. Am I safe to then assume that uniprot is not in the FASTA file?

ADD REPLYlink written 24 months ago by chrisgbeldam20

Try to create a small fasta file with id P01730 and then check the script on that file.

ADD REPLYlink written 24 months ago by a.zielezinski8.6k

Yeap runs but prints nothing

Here is the fasta file format:

sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 OS=Homo sapiens GN=CD4 PE=1 SV=1 M NRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

ADD REPLYlink written 24 months ago by chrisgbeldam20
1

This is because biopython sees id as sp|P01730|CD4_HUMAN, rather than P01730. I fixed it and updated my answer.

ADD REPLYlink modified 24 months ago • written 24 months ago by a.zielezinski8.6k
1

Thank you! I didn't realise that's how BioPython see's ids

ADD REPLYlink written 24 months ago by chrisgbeldam20

Last question, why does that only return P01731 as opposed to both of them?

ADD REPLYlink written 24 months ago by chrisgbeldam20

Do you mean both sequence records or both ids?

ADD REPLYlink written 24 months ago by a.zielezinski8.6k

So now if I want to grab the CD4_HUMAN aspect as the species.

Would I do something like this?

species = record.id.split('|')[2] if '|' in record.id else record.species?

ADD REPLYlink written 24 months ago by chrisgbeldam20

Yes, I can also rewrite the code for you to check whether biopython's id contains your query id. In this way, you wouldn't have to change the code anytime you look for P01731 or CD4_HUMAN. The script would be universal - it will work on any sequences (e.g. from different databases) without changing the code. However, the code would be slower. Do you want me to rewrite this?

ADD REPLYlink modified 24 months ago • written 24 months ago by a.zielezinski8.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2040 users visited in the last hour