Dear BioStars forum,
I recently discovered OrthoDB.org, which seems like a great resource that I want to use to facilitate the comparison of gene expression across different mammalian species. My thought was that I could use OrthoDB to assign every gene in my RNAseq dataset (consisting of 7 species: 6 closely related ones and human) to an orthogroup and then, instead of comparing single genes, compare the average expression of all genes in each orthogroup.
My RNAseq dataset has the gene symbol and the Entrez Gene ID, while the OrthoDB dataset doesn't seem to have a proper mapping from its internal gene IDs to Entrez Gene IDs (for around 1/3 of all mammalian genes in the full dataset there is no xref entry to NCBIgid in https://data.orthodb.org/download/odb11v0_gene_xrefs.tab.gz). However, every entry of the OrthoDB genes table has the Entrez protein ID of the original sequence used in their analysis. I was able to map most of these to gene IDs using NCBI's gene2refseq table (https://ftp.ncbi.nlm.nih.gov/gene/DATA/), but a significant number of them don't appear in this mapping table (including most genes of some species I am interested in). When I search for them manually at NCBI I can find them, but they are marked as deleted. I assume the OrthoDB tree was built some time ago on proteomes that are now outdated.
Do you have any advice on how to map these officially deleted protein IDs to gene IDs and gene symbols? Are there "historic" versions of the gene2refseq table from NCBI or, even better, a mapping table from old protein IDs to new ones?
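To make the workflow concrete, here is a minimal sketch of the mapping and averaging steps described above. It runs on made-up toy IDs rather than the real OrthoDB or gene2refseq files; the function names, the toy accessions and orthogroup labels, and the choice to ignore the accession version suffix when matching are all assumptions for illustration, not part of any OrthoDB or NCBI tooling.

```python
from collections import defaultdict
from statistics import mean


def strip_version(accession):
    """Drop the '.N' version suffix, e.g. 'XP_000001.2' -> 'XP_000001'."""
    return accession.split(".", 1)[0]


def map_proteins(odb_rows, refseq_rows):
    """Map OrthoDB gene IDs to Entrez Gene IDs via protein accessions.

    odb_rows:    iterable of (odb_gene_id, protein_accession)
    refseq_rows: iterable of (entrez_gene_id, protein_accession)
    Returns (mapped dict, list of OrthoDB gene IDs without a match).
    """
    # Assumption: versions are ignored, so an older version still matches.
    prot2gene = {strip_version(p): g for g, p in refseq_rows}
    mapped, unmapped = {}, []
    for odb_id, prot in odb_rows:
        gene = prot2gene.get(strip_version(prot))
        if gene is None:
            unmapped.append(odb_id)
        else:
            mapped[odb_id] = gene
    return mapped, unmapped


def orthogroup_means(mapped, gene2og, expression):
    """Average expression over all mapped genes of each orthogroup."""
    values = defaultdict(list)
    for odb_id, gene in mapped.items():
        if odb_id in gene2og and gene in expression:
            values[gene2og[odb_id]].append(expression[gene])
    return {og: mean(v) for og, v in values.items()}


# Toy data (hypothetical IDs, not real records).
odb_rows = [
    ("toy:000001", "XP_000001.2"),
    ("toy:000002", "XP_000002.1"),
    ("toy:000003", "XP_999999.1"),  # stands in for a deleted record
]
refseq_rows = [("101", "XP_000001.3"), ("102", "XP_000002.1")]
gene2og = {"toy:000001": "OG_A", "toy:000002": "OG_A", "toy:000003": "OG_B"}
expression = {"101": 10.0, "102": 20.0}

mapped, unmapped = map_proteins(odb_rows, refseq_rows)
means = orthogroup_means(mapped, gene2og, expression)
print(mapped, unmapped, means)
```

The `unmapped` list corresponds to the entries I am asking about: protein IDs that have no row in the current mapping table at all, so ignoring the version suffix does not help for them.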
Best
Niklas
Post a few examples of IDs you are not able to find.
Here is a random sample of OrthoDB entries whose protein IDs don't have any entry in the gene2refseq table. Of the 3847929 mammalian genes in the OrthoDB database, this applies to 160514 genes, but these come from only 38 species (listed below the genes).