Question: Automatically naming protein sequences based on homology
2.4 years ago

I have a file of about 600 translated protein sequences, which I'd like to name automatically based on their homology. If I BLAST them against the nr database, the hits they return often don't have consistently annotated names. For example, one protein could hit something named 'NADH-ubiquinone reductase' and another 'complex 1 dehydrogenase', which are the same protein (KEGG: ) but have been named differently depending on who submitted them to NCBI.

Is there some kind of 'official' name that can be assigned to proteins, so that I can quickly screen through them and see whether a particular protein is present, or how many copies of it there are? At the moment, it's difficult to know whether my searches are exhaustive, as I might be looking for a protein under one name when it's called something completely different.

Someone suggested assigning each protein to a group of orthologs, for example using something like OrthoMCL or EggNOG, but I'm struggling to understand how to use these, especially for several hundred sequences.

If anyone could suggest a strategy or give some indication of how to use these ortholog databases for the purpose I've described, I'd greatly appreciate the help. Cheers!

The most official names I know of for proteins are those in the UniProt database. You could BLAST against the UniProt database and assign the consensus name it returns. But it is unclear to me why you want to assign a name from a BLAST search to your sequences.
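To make the "consensus name" idea a bit more concrete, here is a rough sketch rather than a definitive recipe. It assumes you ran BLAST+ against a UniProt-derived database with tabular output in the column order `qseqid sseqid bitscore stitle` (for example `-outfmt "6 qseqid sseqid bitscore stitle"`), and that the subject titles follow the usual UniProt FASTA layout, where the protein name comes before ` OS=`. The function names and the `top_n` cutoff are my own choices for illustration.

```python
from collections import Counter, defaultdict


def protein_name(stitle):
    """Pull the protein name out of a UniProt-style subject title.

    Assumes titles such as 'sp|ACC|ID Protein name OS=Species OX=...';
    the leading 'db|accession|id' token is dropped if it is present.
    """
    first, _, rest = stitle.partition(" ")
    if "|" in first:
        stitle = rest
    return stitle.split(" OS=")[0].strip()


def consensus_names(lines, top_n=5):
    """Return {query: most common protein name among its top_n hits}.

    `lines` is any iterable of tab-separated lines with the assumed
    columns: qseqid, sseqid, bitscore, stitle.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        qseqid, _sseqid, bitscore, stitle = line.split("\t", 3)
        hits[qseqid].append((float(bitscore), protein_name(stitle)))

    result = {}
    for query, query_hits in hits.items():
        # Keep the top_n hits by bit score, then take a majority vote.
        best = sorted(query_hits, key=lambda h: h[0], reverse=True)[:top_n]
        counts = Counter(name for _, name in best)
        # Ties are resolved in favour of the higher-scoring hit.
        result[query] = counts.most_common(1)[0][0]
    return result
```

You would call it as `consensus_names(open("hits.tsv"))`, where `hits.tsv` is a placeholder for your own BLAST output file. A simple majority vote like this will still be thrown off when the top hits are split between synonyms, which is the problem described in the question, so treat the result as a first pass.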

I'm trying to identify what proteins I have by comparing them to homologues in other species using BLAST. They have all been sequenced de novo using RNA-seq, and there are no genome sequences available for the species I'm working with. There are some key proteins I expect to find in the dataset, and I'm identifying them by searching through the BLAST results. Currently, if I don't find one of these proteins, I don't know whether it's really absent from the dataset or whether I just haven't found it because the protein is named something else in the BLAST results.

Hopefully this is a bit clearer?

Never mind, I found the sort of thing I need: eggNOG-mapper does the job.
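For anyone wanting to do the presence and copy-number check afterwards, here is a minimal sketch of how the annotation table could be screened. It assumes a tab-separated file with a header row, `##` comment lines, and `-` for missing values; the default column names `#query` and `Preferred_name` are what I believe recent eggNOG-mapper versions write, but check them against the header of your own file. The function name is just for illustration.

```python
import csv
from collections import defaultdict


def group_by_name(lines, query_col="#query", name_col="Preferred_name"):
    """Return {assigned name: [query ids]} from a tab-separated table.

    The column names and the '-' placeholder for missing values are
    assumptions; adjust them to match the header of your own file.
    """
    # Drop '##' comment lines but keep the header row.
    rows = (line for line in lines if not line.startswith("##"))
    reader = csv.DictReader(rows, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        name = row[name_col]
        if name and name != "-":  # skip queries without an assigned name
            groups[name].append(row[query_col])
    return dict(groups)
```

With `groups = group_by_name(open("annotations.tsv"))` (the file name is a placeholder), `"ND1" in groups` tells you whether a given protein is present and `len(groups["ND1"])` gives the number of sequences assigned to it.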

