What should I do when an Ensembl ID in my RNA-seq differential expression results maps to multiple NCBI IDs? The goal is to run downstream functional analyses such as KEGG / Reactome, which need a KEGG ID or an NCBI/ENTREZ ID. However, certain ENSEMBL IDs map to several NCBI IDs, as seen below. In this case I assume the first one is commonly chosen, but how much variability does that add, and how does it affect my analysis? The more important problem is that different ENSEMBL IDs map to the same NCBI ID, which could be problematic in GSEA: with a gene list ranked by log2 fold change, it could mask enrichment of upregulated or downregulated gene sets.
About 700 ENSEMBL IDs in my dataset (10%) map to several NCBI IDs. For example:
      ENSEMBL             Symbol  NCBI
841   ENSGALG00010000220          121108680
846   ENSGALG00010000220          112531415
850   ENSGALG00010000220          107049648
1656  ENSGALG00010000581          100859276
1663  ENSGALG00010000581          425562
1895  ENSGALG00010000601          121112239
1896  ENSGALG00010000601          121107771
1899  ENSGALG00010000601          121107782
1941  ENSGALG00010000604          107049738
1942  ENSGALG00010000604          121108665
2319  ENSGALG00010000686          426408
2321  ENSGALG00010000686          101747443
2551  ENSGALG00010000721          107051303
2553  ENSGALG00010000721          112531072
11556 ENSGALG00010001886  NEU3    430542
11557 ENSGALG00010001886  NEU3    419056
11828 ENSGALG00010001909          121112765
11829 ENSGALG00010001909          112530813
11832 ENSGALG00010001909          121107968
12009 ENSGALG00010001939          107049240
12011 ENSGALG00010001939          112530875
12012 ENSGALG00010001939          121107979
12018 ENSGALG00010001939          112530888
12495 ENSGALG00010001992          121107995
12497 ENSGALG00010001992          121108030
12500 ENSGALG00010001992          121112380
12536 ENSGALG00010002004          121112323
12537 ENSGALG00010002004          121108034
12587 ENSGALG00010002014          121108017
12588 ENSGALG00010002014          776232
12878 ENSGALG00010002075          107049244
12879 ENSGALG00010002075          121108011
12882 ENSGALG00010002075          121112396
12884 ENSGALG00010002075          121108008
12886 ENSGALG00010002075          121112384
12892 ENSGALG00010002075          121108002
13131 ENSGALG00010002122          121106930
13132 ENSGALG00010002122          121106929
13382 ENSGALG00010002160  CAPN5   101749160
13383 ENSGALG00010002160  CAPN5   419086
13698 ENSGALG00010002215  YLEC18  121106940
13699 ENSGALG00010002215  YLEC18  100858620
13700 ENSGALG00010002215  YLEC18  100858840
There are also about 70 duplicate NCBI gene IDs that result from different ENSEMBL IDs. For example:
      ENSEMBL             NCBI
841   ENSGALG00010029462  121111925
846   ENSGALG00010029501  121111925
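To make the two cases concrete, here is a minimal sketch that separates them: ENSEMBL IDs with several NCBI IDs (one-to-many) and NCBI IDs shared by several ENSEMBL IDs (many-to-one). It is plain Python rather than the R packages used in this thread, the pairs are copied from the examples above, and the function name `classify_mappings` is just an illustrative choice.

```python
from collections import defaultdict

# (ENSEMBL ID, NCBI ID) pairs copied from the examples in this post.
pairs = [
    ("ENSGALG00010000581", "100859276"),
    ("ENSGALG00010000581", "425562"),
    ("ENSGALG00010001886", "430542"),
    ("ENSGALG00010001886", "419056"),
    ("ENSGALG00010029462", "121111925"),
    ("ENSGALG00010029501", "121111925"),
]


def classify_mappings(pairs):
    """Split an ID mapping into its one-to-many and many-to-one parts."""
    ens_to_ncbi = defaultdict(set)
    ncbi_to_ens = defaultdict(set)
    for ens, ncbi in pairs:
        ens_to_ncbi[ens].add(ncbi)
        ncbi_to_ens[ncbi].add(ens)
    # ENSEMBL IDs that map to more than one NCBI ID.
    one_to_many = {e: sorted(n) for e, n in ens_to_ncbi.items() if len(n) > 1}
    # NCBI IDs that are reached from more than one ENSEMBL ID.
    many_to_one = {n: sorted(e) for n, e in ncbi_to_ens.items() if len(e) > 1}
    return one_to_many, many_to_one


one_to_many, many_to_one = classify_mappings(pairs)
print(len(one_to_many), "ENSEMBL IDs with several NCBI IDs")
print(len(many_to_one), "NCBI IDs shared by several ENSEMBL IDs")
```

Counting the two groups separately makes it easier to see how much of the gene list each problem actually touches before deciding how to handle it.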
Restrict yourself to the canonical chromosomes if you can - that should eliminate a lot of these cases.
Hey, thank you for your reply. I used this as the reference (only chromosomes 1-39, MT, W, Z): https://ftp.ensembl.org/pub/release-111/fasta/gallus_gallus/dna/ with files in this format, for example: Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.dna.primary_assembly.Z.fa.gz . So if you're referring to extra contigs added to the reference for variation, I don't think that applies, correct? If you could expand on this, it might help me see it more clearly.
Also, I'm not sure this solves the problem, because the problem is caused by multimapping. Are you saying that if I just discard the ENSEMBL IDs that multimap, my problem goes away? I'm pretty sure gseKEGG is already doing that, and I don't believe that is really what we want.
I don't get the multimapping thing - I think your database is outdated. Look at these entries:
While https://ncbi.nlm.nih.gov/gene/430542 is NEU3, https://ncbi.nlm.nih.gov/gene/419056 is SPCS2.
I updated the database and the package. For some reason, biomart (I double-checked that it is the latest version) and the GTF file I downloaded straight from ENSEMBL are inconsistent with the gene names that NCBI lists for a given ENSEMBL ID.
When I look up some of the gene names/NCBI IDs, the Ensembl IDs don't match the gene names on NCBI.
clusterProfiler bitr() worked best; however, there are still duplicate ENSEMBL IDs because they map to multiple NCBI IDs, so again I just kept the first occurrence of each duplicated ENSEMBL ID.
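One caveat with "keep the first occurrence" is that the result depends on the row order of the mapping table. A minimal sketch of an order-independent alternative is below, in plain Python rather than R. The rule used here (keep the numerically smallest NCBI ID) is an arbitrary assumption chosen only for reproducibility; it has no biological justification, and `pick_one_ncbi` is a hypothetical helper name.

```python
def pick_one_ncbi(pairs):
    """Keep one NCBI ID per ENSEMBL ID: the numerically smallest.

    The choice of "smallest" is arbitrary; its only purpose is to make
    the result independent of the row order of the input.
    """
    chosen = {}
    for ens, ncbi in pairs:
        if ens not in chosen or int(ncbi) < int(chosen[ens]):
            chosen[ens] = ncbi
    return chosen


# (ENSEMBL ID, NCBI ID) pairs copied from the table in the question.
pairs = [
    ("ENSGALG00010000581", "100859276"),
    ("ENSGALG00010000581", "425562"),
    ("ENSGALG00010002014", "121108017"),
    ("ENSGALG00010002014", "776232"),
]

chosen = pick_one_ncbi(pairs)
# Reversing the rows gives the same answer, unlike "take the first row".
chosen_reversed = pick_one_ncbi(list(reversed(pairs)))
print(chosen == chosen_reversed)
```

Any fixed rule would do here; the point is that the same input always yields the same gene list regardless of how the table was sorted.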
Following this, there were still 152 unique ENSEMBL IDs that map to the same NCBI ID as another ENSEMBL ID. When I manually checked these, some are not correct on NCBI, and some of the ENSEMBL IDs aren't listed on NCBI at all. So I am correcting these manually; I can't think of a better way. I'm not sure what to do about the ENSEMBL IDs that don't exist on NCBI. I suppose I could remove them and re-run DE, but I see potential statistical problems.
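For the many-to-one case, one possible policy is to collapse the ranked list so that each NCBI ID appears once, keeping the entry with the largest absolute log2 fold change. The sketch below is plain Python, not clusterProfiler code; the log2 fold change values and the `ENSGALG_UNMAPPED_EXAMPLE` ID are made up for illustration, and `collapse_ranked_list` is a hypothetical helper. Note that it filters only at the ranking step, so the DE results themselves are not touched.

```python
def collapse_ranked_list(log2fc_by_ensembl, ensembl_to_ncbi):
    """Return (NCBI ID, log2FC) pairs, one per NCBI ID, sorted descending.

    When several ENSEMBL IDs share an NCBI ID, the one with the largest
    absolute log2 fold change is kept. This is one policy among several
    (averaging or dropping ambiguous IDs are others).
    """
    best = {}
    for ens, lfc in log2fc_by_ensembl.items():
        ncbi = ensembl_to_ncbi.get(ens)
        if ncbi is None:
            continue  # no NCBI ID: left out of the ranked list only
        if ncbi not in best or abs(lfc) > abs(best[ncbi]):
            best[ncbi] = lfc
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Illustrative (made-up) log2 fold changes keyed by ENSEMBL ID.
log2fc = {
    "ENSGALG00010029462": 1.8,
    "ENSGALG00010029501": -0.4,
    "ENSGALG00010001886": -2.1,
    "ENSGALG_UNMAPPED_EXAMPLE": 0.9,  # placeholder ID with no NCBI entry
}
# One NCBI ID per ENSEMBL ID, after resolving the one-to-many cases.
mapping = {
    "ENSGALG00010029462": "121111925",
    "ENSGALG00010029501": "121111925",
    "ENSGALG00010001886": "430542",
}

ranked = collapse_ranked_list(log2fc, mapping)
print(ranked)
```

Keeping the largest absolute value means two ENSEMBL IDs with opposite signs cannot both land in the list under the same name, but it does favor the more extreme estimate, so it is worth checking how many of the shared IDs actually disagree in direction.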