Protein name in GenBank
12 months ago
Daniel • 0

Hi everyone!

I am writing a Python program that downloads genome sequences from GenBank and extracts the protein sequences. It should collect the same protein from different organisms into one file. The problem is that the same protein can have different names in GenBank, so my program can't recognise those sequences as belonging to the same protein. For example: "photosystem II cytochrome b559 alpha subunit" and "photosystem II cytochrome b559 subunit alpha". The names are almost identical, but a plain string comparison in Python treats them as different. Does anybody have an idea how to merge the same protein from different organisms? Maybe using some feature.qualifiers?

Thanks for any help!

genbank python biopython • 587 views

This task is usually done by sequence comparison rather than by looking at FASTA headers. If proteins are identical above a certain threshold, we consider them related regardless of their annotations.
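A rough sketch of the threshold idea, using only the Python standard library. Here `difflib.SequenceMatcher` is a simple stand-in for a real alignment-based identity (BLAST or a dedicated clustering tool would be the usual choice), the 0.9 threshold is an arbitrary assumption, and the sequences and organism labels are made-up toy data:

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Approximate similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def cluster_by_similarity(sequences, threshold=0.9):
    """Greedy clustering: each sequence joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []
    for name, seq in sequences.items():
        for members in clusters:
            representative = sequences[members[0]]
            if similarity(representative, seq) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([name])
    return clusters


# Toy protein strings (hypothetical, not real GenBank entries).
toy_sequences = {
    "organism_1": "MSGSTGERSFADIITSIRYWVIHSITIPSLFIAGWLFVST",
    "organism_2": "MSGSTGERSFAEIITSIRYWVIHSITIPSLFIAGWLFVST",  # one substitution
    "organism_3": "MTIDRTYPIFTVRWLAVHGLAVPTVFFLGSIS",
}

clusters = cluster_by_similarity(toy_sequences)
print(clusters)
```

The annotations are never consulted here, so two entries with differently worded names still end up together as long as their sequences are close enough.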


I'm not looking at FASTA headers but at GenBank records, because I download the sequences in gb format. I have to do this for almost 1500 genomes, so sequence comparison is not an option because of the time it would take.


I think you either take the time to do it properly, or you will end up missing some proteins. A BLAST sequence search, even on 1500 genomes, can be done in a matter of hours, assuming you have access to a relatively modern desktop computer. Even on a laptop it should be doable in under a day.


If you must do this programmatically, then collect all cases where the names differ in word order (or number of words), like the example above, and examine them manually.
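One possible way to collect the word-order cases is to build an order-insensitive key from each name and list every key that has more than one spelling. This is only a sketch: `name_key` and `variants_to_review` are hypothetical helper names, the third product name is an invented example, and names that differ in the number of words would need a looser key than this one. In Biopython the names would presumably come from the `product` qualifier of each CDS feature.

```python
import re
from collections import defaultdict


def name_key(product):
    """Order-insensitive key: lowercase the name, split it into words, sort them."""
    words = re.findall(r"[a-z0-9]+", product.lower())
    return " ".join(sorted(words))


def variants_to_review(products):
    """Map each key to its distinct spellings, keeping only keys with several."""
    spellings = defaultdict(set)
    for product in products:
        spellings[name_key(product)].add(product)
    return {
        key: sorted(names)
        for key, names in spellings.items()
        if len(names) > 1
    }


# Product names as they might appear in CDS annotations (illustrative list).
product_names = [
    "photosystem II cytochrome b559 alpha subunit",
    "photosystem II cytochrome b559 subunit alpha",
    "ATP synthase CF1 beta subunit",
]

review = variants_to_review(product_names)
for key, names in review.items():
    print(key, "->", names)
```

The output is a short list to check by hand, which keeps the final decision about whether two names really mean the same protein with a person rather than with the script.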
