Question

Protein name in GenBank

0

16 months ago

Daniel • 0

Hi everyone!

I am writing a Python program that downloads genome sequences from GenBank and extracts the protein sequences. It should collect the sequences of the same protein from different organisms into one file. The problem is that the same protein can have different names in GenBank, so my program does not recognise those sequences as belonging to the same protein. For example: "photosystem II cytochrome b559 alpha subunit" and "photosystem II cytochrome b559 subunit alpha". The names are almost identical, but a plain string comparison in Python treats them as different. Does anybody have an idea how to merge the same protein from different organisms? Maybe using some feature.qualifiers?

Thanks for any help!

genbank python biopython • 669 views

ADD COMMENT • link updated 16 months ago by GenoMax 144k • written 16 months ago by Daniel • 0

0

This task is usually done by sequence comparison rather than by looking at FASTA headers. If proteins are identical above a certain threshold, we consider them related regardless of their annotations.
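A minimal sketch of threshold-based grouping, for illustration only. It assumes the protein sequences are already available as (id, sequence) pairs, and it uses `difflib.SequenceMatcher` from the standard library as a rough stand-in for a real alignment-based identity score (in practice a tool such as BLAST would do the comparison). The function names, the 0.9 threshold, and the toy sequences are all hypothetical placeholders, not real GenBank records.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_by_similarity(records, threshold=0.9):
    """Greedy grouping: a sequence joins the first group whose
    representative it matches at or above the threshold; otherwise
    it starts a new group. Annotations (names) are ignored entirely."""
    groups = []  # list of (representative sequence, [record ids])
    for rec_id, seq in records:
        for rep, members in groups:
            if identity(seq, rep) >= threshold:
                members.append(rec_id)
                break
        else:
            groups.append((seq, [rec_id]))
    return [members for _, members in groups]


# Toy placeholder sequences (hypothetical, for illustration only).
records = [
    ("orgA|protein1", "MSGSTGERSFADIITSIRYWVIHSITIPSLFIAGWLFVSTGLAYDVFGSPRPNEYF"),
    ("orgB|protein1", "MSGSTGERSFADIITSIRYWVIHSITIPSLFIAGWLFVSTGLAYDVFGTPRPNEYF"),
    ("orgA|protein2", "MTIDRTYPIFTVRWLAVHGLAVPTVFFLGSISAMQFIQR"),
]
groups = group_by_similarity(records)
print(groups)
```

The greedy pass compares each sequence only against one representative per group, which keeps the number of comparisons down but makes the result depend on input order; it is a sketch of the idea, not a replacement for a proper clustering or search tool.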

ADD REPLY • link 16 months ago by Mensur Dlakic ★ 27k

0

I'm not looking at FASTA headers but at GenBank records, because I download the sequences in gb format. I have to do this for almost 1500 genomes, so sequence comparison is not an option because of the time it would take.

ADD REPLY • link 16 months ago by Daniel • 0

1

I think you either take the time to do it properly, or you will end up missing some proteins. A BLAST sequence search, even on 1500 genomes, can be done in a matter of hours, assuming you have access to a relatively modern desktop computer. Even on a laptop it should be doable in under a day.

ADD REPLY • link 16 months ago by Mensur Dlakic ★ 27k

0

If you must do this programmatically, then collect all cases where the names differ in word order (or number of words), like the example above, and examine them manually.
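One possible way to collect the word-order cases is sketched below. It assumes the product names have already been pulled out of the GenBank records into a list of strings; the helper names `name_key` and `collect_variants` are hypothetical. Each name is reduced to a sorted tuple of lowercase words, so names that differ only in word order land under the same key and can be reviewed by hand.

```python
import re
from collections import defaultdict


def name_key(product: str) -> tuple:
    """Order-insensitive key: lowercase the name, split it into
    alphanumeric words, and sort them."""
    words = re.findall(r"[a-z0-9]+", product.lower())
    return tuple(sorted(words))


def collect_variants(products):
    """Return only the keys that cover more than one distinct spelling,
    mapped to the sorted list of those spellings."""
    by_key = defaultdict(set)
    for product in products:
        by_key[name_key(product)].add(product)
    return {key: sorted(names) for key, names in by_key.items() if len(names) > 1}


products = [
    "photosystem II cytochrome b559 alpha subunit",
    "photosystem II cytochrome b559 subunit alpha",
    "ATP synthase subunit beta",
]
variants = collect_variants(products)
for key, names in variants.items():
    print(names)
```

This only catches reorderings of the same words. Names that differ in the number of words would need a separate, looser comparison, and those are the cases most worth checking manually.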

ADD REPLY • link 16 months ago by GenoMax 144k