Question

Same NCBI IDs for different Ensembl IDs / Multiple NCBI IDs for an Ensembl ID

6 weeks ago

Maxwell ▴ 20

What should I do when an Ensembl ID maps to multiple NCBI IDs in my RNA-seq differential expression results? The goal is to run downstream functional analyses such as KEGG / Reactome, which need a KEGG ID or an NCBI/ENTREZ ID. However, certain ENSEMBL IDs map to several NCBI IDs, as seen below. In this case I assume the first is commonly chosen, but how much variability could that add, and how does it affect my analysis? The more important problem is that different ENSEMBL IDs map to the same NCBI ID, which could be problematic in GSEA: when using a list of genes ranked by log2 fold change, it could mask enrichment between upregulated and downregulated gene sets.

About 700 ENSEMBL IDs in my dataset (10%) map to several NCBI IDs. As an example:

                ENSEMBL      Symbol      NCBI
841    ENSGALG00010000220             121108680
846    ENSGALG00010000220             112531415
850    ENSGALG00010000220             107049648
1656   ENSGALG00010000581             100859276
1663   ENSGALG00010000581                425562
1895   ENSGALG00010000601             121112239
1896   ENSGALG00010000601             121107771
1899   ENSGALG00010000601             121107782
1941   ENSGALG00010000604             107049738
1942   ENSGALG00010000604             121108665
2319   ENSGALG00010000686                426408
2321   ENSGALG00010000686             101747443
2551   ENSGALG00010000721             107051303
2553   ENSGALG00010000721             112531072
11556  ENSGALG00010001886        NEU3    430542
11557  ENSGALG00010001886        NEU3    419056
11828  ENSGALG00010001909             121112765
11829  ENSGALG00010001909             112530813
11832  ENSGALG00010001909             121107968
12009  ENSGALG00010001939             107049240
12011  ENSGALG00010001939             112530875
12012  ENSGALG00010001939             121107979
12018  ENSGALG00010001939             112530888
12495  ENSGALG00010001992             121107995
12497  ENSGALG00010001992             121108030
12500  ENSGALG00010001992             121112380
12536  ENSGALG00010002004             121112323
12537  ENSGALG00010002004             121108034
12587  ENSGALG00010002014             121108017
12588  ENSGALG00010002014                776232
12878  ENSGALG00010002075             107049244
12879  ENSGALG00010002075             121108011
12882  ENSGALG00010002075             121112396
12884  ENSGALG00010002075             121108008
12886  ENSGALG00010002075             121112384
12892  ENSGALG00010002075             121108002
13131  ENSGALG00010002122             121106930
13132  ENSGALG00010002122             121106929
13382  ENSGALG00010002160       CAPN5 101749160
13383  ENSGALG00010002160       CAPN5    419086
13698  ENSGALG00010002215      YLEC18 121106940
13699  ENSGALG00010002215      YLEC18 100858620
13700  ENSGALG00010002215      YLEC18 100858840

About 70 NCBI gene IDs are duplicated because different ENSEMBL IDs map to them. An example:

                ENSEMBL      NCBI
841    ENSGALG00010029462   121111925 
846    ENSGALG00010029501   121111925
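
Both situations can be counted by grouping the mapping table in each direction. The following is only a minimal sketch in plain Python (not R), using a handful of rows copied from the tables above; the helper name `ambiguous_keys` is made up for illustration and is not part of any package.

```python
from collections import defaultdict

# A few (ENSEMBL, NCBI) pairs copied from the tables above.
pairs = [
    ("ENSGALG00010000220", "121108680"),
    ("ENSGALG00010000220", "112531415"),
    ("ENSGALG00010000220", "107049648"),
    ("ENSGALG00010001886", "430542"),
    ("ENSGALG00010001886", "419056"),
    ("ENSGALG00010029462", "121111925"),
    ("ENSGALG00010029501", "121111925"),
]


def ambiguous_keys(pairs, key_index):
    """Return {key: sorted partner IDs} for keys with more than one partner.

    key_index=0 groups by ENSEMBL ID, key_index=1 groups by NCBI ID.
    (Hypothetical helper, for illustration only.)
    """
    partners = defaultdict(set)
    for pair in pairs:
        partners[pair[key_index]].add(pair[1 - key_index])
    return {key: sorted(ids) for key, ids in partners.items() if len(ids) > 1}


one_to_many = ambiguous_keys(pairs, 0)  # one ENSEMBL ID -> several NCBI IDs
many_to_one = ambiguous_keys(pairs, 1)  # one NCBI ID <- several ENSEMBL IDs

print(len(one_to_many), "ENSEMBL IDs map to several NCBI IDs")
print(len(many_to_one), "NCBI IDs are shared by several ENSEMBL IDs")
```

Keeping the two directions separate matters because they call for different handling: the first is a choice among candidate NCBI IDs, the second is a duplicated name in the ranked gene list.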

KEGG ENTREZ ENSEMBL • 470 views

ADD COMMENT • link 6 weeks ago by Maxwell ▴ 20

Restrict yourself to the canonical chromosomes if you can - that should eliminate a lot of these cases.

ADD REPLY • link 6 weeks ago by Ram 43k

Hey, thank you for your reply. I used this for reference (only chromosomes 1-39, MT, W, Z): https://ftp.ensembl.org/pub/release-111/fasta/gallus_gallus/dna/ with files in this format, for example: Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.dna.primary_assembly.Z.fa.gz. So if you're referring to any extra contigs added to the reference for variation, I don't think that applies, correct? If you could expand on this, it might help me see the issue more clearly.

ADD REPLY • link 6 weeks ago by Maxwell ▴ 20

Also, I'm not sure this addresses the problem, because it is due to multimapping. Are you saying that if I just discard the ENSEMBL IDs that have multiple mappings, my problem goes away? I'm pretty sure gseKEGG is already doing that, and I don't believe that is really what we want.

ADD REPLY • link 6 weeks ago by Maxwell ▴ 20

I don't get the multimapping thing - I think your database is outdated. Look at these entries:

11556  ENSGALG00010001886        NEU3    430542
11557  ENSGALG00010001886        NEU3    419056

While https://ncbi.nlm.nih.gov/gene/430542 is NEU3, https://ncbi.nlm.nih.gov/gene/419056 is SPCS2.

ADD REPLY • link 6 weeks ago by Ram 43k

I updated the database and package. For some reason, biomart (I double-checked that it is the latest version) and the GTF file I downloaded straight from ENSEMBL are inconsistent with the gene names that NCBI lists for a given ENSEMBL ID.

When I look up some of the gene names/NCBI IDs, the Ensembl IDs don't match the gene names on NCBI.

clusterProfiler bitr() worked best. However, there are still duplicate ENSEMBL IDs because they map to multiple NCBI IDs, so again I just took the first entry for each duplicated ENSEMBL ID.
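
To make the "take the first" step concrete, here is a rough sketch in plain Python (not the R code actually run, and not what bitr() does internally). It contrasts keeping the first NCBI ID per ENSEMBL ID with keeping only unambiguous ENSEMBL IDs; the function names `first_mapping` and `unique_mapping` are hypothetical, and the rows are copied from the tables in the question.

```python
# (ENSEMBL, NCBI) rows copied from the question.
pairs = [
    ("ENSGALG00010001886", "430542"),
    ("ENSGALG00010001886", "419056"),
    ("ENSGALG00010002160", "101749160"),
    ("ENSGALG00010002160", "419086"),
    ("ENSGALG00010029462", "121111925"),
]


def first_mapping(pairs):
    """Keep the first NCBI ID seen for each ENSEMBL ID (depends on row order)."""
    chosen = {}
    for ensembl, ncbi in pairs:
        chosen.setdefault(ensembl, ncbi)
    return chosen


def unique_mapping(pairs):
    """Keep only ENSEMBL IDs that map to exactly one NCBI ID."""
    seen = {}
    for ensembl, ncbi in pairs:
        seen.setdefault(ensembl, set()).add(ncbi)
    return {e: next(iter(ids)) for e, ids in seen.items() if len(ids) == 1}


forward = first_mapping(pairs)
backward = first_mapping(list(reversed(pairs)))

# The "first" choice changes with row order, so it is arbitrary unless
# the table is sorted in a meaningful way beforehand.
print(forward["ENSGALG00010001886"], backward["ENSGALG00010001886"])
print(unique_mapping(pairs))
```

The point of the comparison is that "first" is only reproducible if the row order is fixed, whereas dropping ambiguous IDs is reproducible but loses genes.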

Following this, there were still 152 unique ENSEMBL IDs that map to the same NCBI ID as another ENSEMBL ID. When I manually checked these, some are not correct on NCBI, and some of the ENSEMBL IDs aren't listed on NCBI at all. So I am manually correcting these; I can't think of a better way. I'm not sure what to do about the ENSEMBL IDs that don't exist on NCBI. I suppose I could remove them and rerun DE, but I see potential statistical problems.
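
For the ENSEMBL IDs that share an NCBI ID, one heuristic (an assumption on my part, not an established rule for this dataset) is to keep a single entry per NCBI ID in the ranked list, for example the one with the largest absolute log2 fold change, and to flag NCBI IDs whose duplicates disagree in sign for manual review. A minimal sketch in plain Python follows; the log2FC values are invented for illustration, only the IDs come from this thread, and `collapse_by_ncbi` is a hypothetical helper.

```python
# Hypothetical DE results as (ENSEMBL, NCBI, log2FC).
# The IDs come from the thread; the log2FC values are invented.
results = [
    ("ENSGALG00010029462", "121111925", 1.8),
    ("ENSGALG00010029501", "121111925", -0.4),
    ("ENSGALG00010000686", "426408", -2.1),
]


def collapse_by_ncbi(results):
    """Keep one log2FC per NCBI ID (largest absolute value).

    Returns (ranked, conflicts): ranked is a list of (NCBI, log2FC)
    sorted from most up- to most down-regulated; conflicts is the set of
    NCBI IDs whose duplicate entries have opposite signs.
    """
    best = {}
    conflicts = set()
    for _ensembl, ncbi, lfc in results:
        if ncbi in best:
            if (lfc > 0) != (best[ncbi] > 0):
                conflicts.add(ncbi)  # duplicates point in opposite directions
            if abs(lfc) > abs(best[ncbi]):
                best[ncbi] = lfc
        else:
            best[ncbi] = lfc
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked, conflicts


ranked, conflicts = collapse_by_ncbi(results)
print(ranked)
print("sign conflicts:", conflicts)
```

The conflict set is the part relevant to the masking concern: those are the NCBI IDs where collapsing hides an up/down disagreement, so they are worth checking by hand rather than resolving automatically.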

ADD REPLY • link 6 weeks ago by Maxwell ▴ 20