I am writing a program in Python that downloads genome sequences from GenBank and extracts the protein sequences. It should collect the same protein from different organisms into one file. The problem is that the same protein can have different names in GenBank, so my program can't recognise those sequences as belonging to the same protein. For example: "photosystem II cytochrome b559 alpha subunit" and "photosystem II cytochrome b559 subunit alpha". To a human these are almost the same, but to Python they are two different strings. Does anybody have an idea how to merge the same protein from different organisms? Maybe using some feature.qualifiers?
Thanks for any help!
This task is usually done by sequence comparison rather than by looking at FASTA headers. If two proteins are identical above a certain threshold, we consider them related regardless of their annotations.
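As a rough illustration of grouping by sequence rather than by name, here is a minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a crude stand-in for a real alignment score, which is an assumption made only to keep the example self-contained; for real data you would use BLAST or a dedicated clustering tool instead. The function names, the 0.9 threshold, and the example sequences are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; NOT a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster(records, threshold=0.9):
    """Greedy clustering of (label, sequence) pairs.

    Each sequence joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters = []
    for label, seq in records:
        for members in clusters:
            if similarity(seq, members[0][1]) >= threshold:
                members.append((label, seq))
                break
        else:
            clusters.append([(label, seq)])
    return clusters


# Illustrative strings only, not verified GenBank records.
records = [
    ("organism_A", "MTIDRTYPIFTVRWLAVHGLAVPTVFFLGSISAMQFIQR"),
    ("organism_B", "MTIDRTYPIFTVRWLAIHGLAVPTVFFLGSISAMQFIQR"),
    ("organism_C", "MSGSTGERSFADIITSIRYWVIHSITIPSLF"),
]

for members in cluster(records):
    print([label for label, _ in members])
```

The annotations are never consulted here, so differently named copies of the same protein still end up together. Comparing every sequence against every cluster representative is slow in pure Python, which is one more reason to hand the comparison to BLAST for 1500 genomes.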
I'm not looking at FASTA headers but at GenBank records, because I download the sequences in gb format. I have to do this for almost 1500 genomes, so sequence comparison is not an option because of the time it would take.
I think you either take the time to do it properly, or you will end up missing some proteins. A BLAST sequence search, even on 1500 genomes, can be done in a matter of hours, assuming you have access to a relatively modern desktop computer. Even on a laptop it should be doable in under a day.
If you must do this programmatically, then collect all cases where the names differ only in word order (or number of words), like the example above, and examine them manually.
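One way to collect those cases could look like the sketch below. It builds an order-insensitive key from each product name (lowercased words, sorted) and reports the names that share a key but are spelled differently. The helper names are hypothetical, and the third product name is just a filler example. Note that this only catches differences in word order; names that differ in the number of words would need a looser key.

```python
from collections import defaultdict


def name_key(product):
    """Order-insensitive key: lowercase the name and sort its words."""
    return tuple(sorted(product.lower().split()))


def ambiguous_groups(products):
    """Return lists of distinct names that share the same word set."""
    groups = defaultdict(set)
    for product in products:
        groups[name_key(product)].add(product)
    # Keep only keys that map to more than one spelling.
    return [sorted(names) for names in groups.values() if len(names) > 1]


products = [
    "photosystem II cytochrome b559 alpha subunit",
    "photosystem II cytochrome b559 subunit alpha",
    "ATP synthase CF1 alpha subunit",
]

for group in ambiguous_groups(products):
    print(group)
```

The printed groups are the candidates to check by hand before deciding to merge them into one file.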