NCBI protein id to gene id mapping for deleted entries
2.5 years ago

Dear BioStars forum,

I recently discovered OrthoDB.org, which seems like a great resource that I want to use to facilitate the comparison of gene expression across different mammalian species. My thought was that I could use OrthoDB to assign every gene in my RNA-seq dataset (consisting of 7 species: 6 closely related ones and human) to an orthogroup and then, instead of comparing single genes, compare the average expression of all genes in each orthogroup. My RNA-seq dataset has the gene symbol and the Entrez Gene ID, while the OrthoDB dataset doesn't seem to have a proper mapping from its internal gene ID to the Entrez Gene ID (for around 1/3 of all mammalian genes in their full dataset there is no xref entry to NCBIgid in https://data.orthodb.org/download/odb11v0_gene_xrefs.tab.gz). However, every entry in the OrthoDB genes table has the Entrez protein ID of the original sequence used in their analysis. I was able to map most of these to gene IDs using NCBI's gene2refseq table (https://ftp.ncbi.nlm.nih.gov/gene/DATA/), but a significant number of them don't appear in this mapping table (including most genes of some species I am interested in). When I search for them manually at NCBI I can find them, but they are marked as deleted. I assume the OrthoDB tree was built some time ago on now-outdated proteomes. Do you have any advice on how to map these officially deleted protein IDs to gene IDs and gene symbols? Are there "historic" versions of the gene2refseq table from NCBI or, even better, a mapping table from old protein IDs to new ones?

Best

Niklas

OrthoDB NCBI Entrez • 1.5k views

Post a few examples of IDs you are not able to find.


Here is a random sample of OrthoDB entries whose protein IDs don't have any entry in the gene2refseq table. Of the 3847929 mammalian genes in the OrthoDB database, this applies to 160514 genes, but they come from only 38 species (per-species counts are listed below the sample).

    odb_geneid,ncbi_prid,taxid
    9417_0:0039ea,XP_037021639.1,9417
    9986_0:002857,XP_008265057.1,9986
    9601_0:000523,XP_024087773.1,9601
    9417_0:00472a,XP_036986430.1,9417
    1230840_0:0029b1,XP_007945212.1,1230840
    9417_0:003dad,XP_037023781.1,9417
    9417_0:003e20,XP_037023975.1,9417
    9691_0:002c91,XP_019269535.1,9691
    59538_0:004e04,XP_005978621.1,59538
    286419_0:001585,XP_035575343.1,286419
    59538_0:002852,XP_005966836.1,59538
    9598_0:004e35,XP_009436065.1,9598
    9430_0:0048a7,XP_045042978.1,9430
    9716_0:0027fe,XP_045748034.1,9716
    59538_0:0039df,XP_005972457.1,59538
    9483_0:004953,XP_035136202.1,9483
    29078_0:0016b0,XP_028003900.1,29078
    59538_0:0034c5,XP_005970840.1,59538
    286419_0:0019e5,XP_035577450.1,286419
    42254_0:004a06,XP_004622480.1,42254
    9417_0:004a28,XP_036988092.1,9417
    29078_0:000416,XP_028015567.1,29078
    9691_0:004744,XP_019290452.1,9691
    59538_0:002dac,XP_005968559.1,59538
    9691_0:001da6,XP_019315209.1,9691


    taxid,unmapped_genes
    59538,25476
    9417,16946
    29078,14900
    9430,13719
    9691,12655
    9986,10807
    9601,10804
    9483,9405
    9598,9020
    32536,7632
    286419,6836
    1230840,6552
    9733,6519
    50954,3676
    9716,2610
    118797,948
    10116,382
    9755,298
    9544,259
    9913,237
    9615,209
    42254,136
    9606,96
    591936,76
    10090,74
    10089,61
    9796,42
    186990,42
    9925,39
    192404,38
    9823,10
    9555,3
    9337,2
    9565,1
    447135,1
    34839,1
    336983,1
    30640,1
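
For anyone wanting to reproduce a per-species tally like this, here is a minimal sketch. It is not the code used for the numbers above: `tally_unmapped` is a hypothetical helper, and the small in-memory `rows` and `known` values stand in for the OrthoDB genes table and for the set of protein accessions present in gene2refseq.

```python
from collections import Counter

def tally_unmapped(odb_rows, known_proteins):
    """Count, per taxid, the OrthoDB genes whose protein accession is not in known_proteins.

    odb_rows: iterable of (odb_geneid, ncbi_prid, taxid) tuples.
    known_proteins: set of protein accession.version strings (e.g. taken from gene2refseq).
    """
    counts = Counter()
    for _odb_geneid, ncbi_prid, taxid in odb_rows:
        if ncbi_prid not in known_proteins:
            counts[taxid] += 1
    # most_common() orders by descending count, like the table above
    return counts.most_common()

# Three rows copied from the sample above, plus one made-up row that does map.
rows = [
    ("9417_0:0039ea", "XP_037021639.1", "9417"),
    ("9417_0:00472a", "XP_036986430.1", "9417"),
    ("9986_0:002857", "XP_008265057.1", "9986"),
    ("0000_0:000000", "XP_000000001.1", "0000"),  # hypothetical mapped entry
]
known = {"XP_000000001.1"}  # hypothetical stand-in for gene2refseq content

print(tally_unmapped(rows, known))  # [('9417', 2), ('9986', 1)]
```

With the real tables, `known` would be built once from the protein accession column of gene2refseq, so each lookup stays a constant-time set membership test.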
2.5 years ago
GenoMax 154k

You can use Entrez Direct (EDirect) to get this information. Some of the genes may be deprecated, as you found out.

$ esearch -db protein -query XP_037023975 | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description,ScientificName
FGA     fibrinogen alpha chain  Artibeus jamaicensis

$ esearch -db protein -query XP_045042978 | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description,ScientificName
PPP6R2  protein phosphatase 6 regulatory subunit 2      Desmodus rotundus
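
To run the same lookup over a list of accessions, the pipeline can be wrapped in a small script. This is only a sketch under assumptions: it presumes the EDirect tools are installed and on `PATH`, the helper names `build_pipeline`, `parse_xtract_line` and `lookup` are made up for illustration, and it issues one query per accession, so it will be slow for very large lists because NCBI rate-limits E-utilities requests.

```python
import shlex
import subprocess

def build_pipeline(accession):
    """Return the EDirect pipeline shown above for a single protein accession."""
    return (
        f"esearch -db protein -query {shlex.quote(accession)}"
        " | elink -target gene"
        " | esummary"
        " | xtract -pattern DocumentSummary -element Name,Description,ScientificName"
    )

def parse_xtract_line(line):
    """Split one tab-delimited xtract output line into a tuple of fields."""
    return tuple(line.rstrip("\n").split("\t"))

def lookup(accession):
    """Run the pipeline for one accession; return the first result row, or None if empty."""
    result = subprocess.run(build_pipeline(accession), shell=True,
                            capture_output=True, text=True, check=False)
    lines = [ln for ln in result.stdout.splitlines() if ln.strip()]
    return parse_xtract_line(lines[0]) if lines else None
```

Calling `lookup("XP_037023975")` runs exactly the first command above and returns its output row as a tuple; looping over a file of accessions and writing the tuples to a table is then straightforward.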

That's great, thanks a lot. I can indeed retrieve information about the genes coding for the proteins that lack a mapping in the gene2refseq table. But of course, for 160514 protein IDs it might take quite a while to retrieve all the information that way. That is why I was hoping there is a version of the gene2refseq table that includes all updated or removed proteins, or that there are archives of previous gene2refseq tables.

It seems that some of the protein IDs were just updated rather than deleted, and I might be able to map them by removing the version identifier. That might reduce the number of unmapped protein IDs by a lot, and if no one else has an idea how to map deleted protein IDs quickly, I will try your approach. Again, thanks a lot.
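
The version-stripping idea could look roughly like the sketch below. It is an illustration, not a tested pipeline: the function names are made up, the `pairs` list uses invented accessions and gene IDs standing in for the protein accession and GeneID columns of gene2refseq, and the fallback is only accepted when the unversioned accession points at a single gene.

```python
def strip_version(accession):
    """'XP_037023975.1' -> 'XP_037023975'; accessions without a version come back unchanged."""
    return accession.split(".", 1)[0]

def build_index(gene2refseq_pairs):
    """Index (protein_accession.version, gene_id) pairs by exact and by unversioned accession."""
    exact, by_base = {}, {}
    for prot, gene_id in gene2refseq_pairs:
        exact[prot] = gene_id
        by_base.setdefault(strip_version(prot), set()).add(gene_id)
    return exact, by_base

def map_protein(accession, exact, by_base):
    """Try the exact accession first, then fall back to a version-insensitive match.

    Returns (gene_id, 'exact'), (gene_id, 'version_stripped') or (None, 'unmapped').
    """
    if accession in exact:
        return exact[accession], "exact"
    candidates = by_base.get(strip_version(accession), set())
    if len(candidates) == 1:
        return next(iter(candidates)), "version_stripped"
    return None, "unmapped"

# Made-up pairs; real ones would be read from the gene2refseq table.
pairs = [("XP_000000001.2", "111"), ("XP_000000002.1", "222")]
exact, by_base = build_index(pairs)

print(map_protein("XP_000000001.1", exact, by_base))  # ('111', 'version_stripped')
print(map_protein("XP_000000002.1", exact, by_base))  # ('222', 'exact')
print(map_protein("XP_000000003.1", exact, by_base))  # (None, 'unmapped')
```

Keeping the `how` label alongside each gene ID makes it easy to report how many IDs were rescued by the fallback and how many remain for the slower per-accession lookup.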
