Question: ENSEMBL IDs 2 Entrez Gene IDs - what to do if no match?
3.2 years ago, lech.kaczmarczyk wrote:

Hi All, I have gene expression data with ENSEMBL IDs (ENSG00000XXXXXXX). I tried three different tools to convert them to Entrez IDs (bitr, biomaRt, AnnotationDbi), but I consistently get no match for about 5-6% of the genes. I would like to do GO and GSEA analyses, but most GO and GSEA tools require gene symbols or Entrez IDs. This problem has been bugging me for a while. How should I handle this? I work with mouse genes.

Here is an example of what I am doing:

MyTargetList$entrez <- mapIds(,
                       column ="ENTREZID",

Or with biomaRt:

ensembl = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene"),
              mart = ensembl )

and then use the match function to populate the column.

But there seem to be gaps in the databases:

> head(genemap)
     ensembl_gene_id entrezgene
1 ENSMUSG00000064336         NA
2 ENSMUSG00000064337         NA
3 ENSMUSG00000064338         NA
4 ENSMUSG00000064339         NA
5 ENSMUSG00000064340         NA
6 ENSMUSG00000064341      17716

Cheers, Lech

ensembl entrez biomart annotation
modified 3.2 years ago by Emily_Ensembl • written 3.2 years ago by lech.kaczmarczyk

There's no way to map all Ensembl IDs to Entrez Gene IDs; the latter is a much smaller dataset than the former.

written 3.2 years ago by Devon Ryan

It would probably help if you added a few examples of IDs for which you can't find a match, so that we can replicate your issue.
In addition, showing the code you used with one of those packages could allow us to spot a mistake.

written 3.2 years ago by WouterDeCoster

Hi, thanks for the quick reply. I did add the examples. Since I get most of the records, I would assume these are simply records missing from the database (see the NAs after retrieving the biomaRt annotations).

written 3.2 years ago by lech.kaczmarczyk

Most of the NAs are mitochondrial genes.

written 3.2 years ago by cpad0112

All of them are non-protein-coding.

written 3.2 years ago by WouterDeCoster

Not all of them, it seems:

> IP_toptreatRT0$entrez <- genemap$entrezgene[match(rownames(IP_toptreatRT0), genemap$ensembl_gene_id)]
> IP_toptreatRT0$biotype <- genemap$gene_biotype[match(rownames(IP_toptreatRT0), genemap$ensembl_gene_id)]
> IP_toptreatRT0[is.na(IP_toptreatRT0$entrez) & IP_toptreatRT0$biotype == "protein_coding",]
                        logFC     AveExpr         t      P.Value    adj.P.Val entrez        biotype
ENSMUSG00000068099  1.7856553  6.96331268 16.472108 1.231735e-15 1.097134e-13     NA protein_coding
ENSMUSG00000089665  2.6199978  2.29635921 14.398467 3.279678e-14 1.960328e-12     NA protein_coding
ENSMUSG00000029632  1.5256810  8.62947339 13.462191 1.632542e-13 7.790043e-12     NA protein_coding
ENSMUSG00000058927  1.5661194  7.71168598 12.056272 2.404491e-12 8.048267e-11     NA protein_coding
ENSMUSG00000103034  1.0460229  9.25763535 11.672475 4.484786e-12 1.396829e-10     NA protein_coding
ENSMUSG00000110358  2.8209744 -0.01374836 10.572951 4.100013e-11 9.610454e-10     NA protein_coding
ENSMUSG00000024571  0.8170688  6.72830264  9.971918 1.462597e-10 2.835528e-09     NA protein_coding
ENSMUSG00000091228  1.0110602  7.00300093  9.890613 1.743265e-10 3.272807e-09     NA protein_coding
ENSMUSG00000110086  1.8677990  1.83456194  9.111705 9.784476e-10 1.481520e-08     NA protein_coding
ENSMUSG00000087403 -1.6533448  5.04195971 -8.730238 2.344445e-09 3.166680e-08     NA protein_coding
ENSMUSG00000021708 -1.3807745  6.35653891 -8.448122 4.529814e-09 5.643085e-08     NA protein_coding


modified 3.2 years ago • written 3.2 years ago by lech.kaczmarczyk
3.2 years ago, Alex Reynolds (Seattle, WA USA) wrote:

It might help to retrieve HGNC names:

#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

mg = MyGeneInfo()

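# IDs as listed in the sample run below
genes = ["ENSMUSG00000064336",
         "ENSMUSG00000064337",
         "ENSMUSG00000064338",
         "ENSMUSG00000064339",
         "ENSMUSG00000064340",
         "ENSMUSG00000064341"]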

sys.stdout.write("%s\t%s\t%s\n" % ("ensembl", "hgnc", "entrezgene"))
for gene in genes:
    result = mg.query(gene, fields=["symbol", "entrezgene"], species="mouse", verbose=False)
    for hit in result['hits']:
        if 'symbol' not in hit:
            hit['symbol'] = "NA"
        if 'entrezgene' not in hit:
            hit['entrezgene'] = "NA"
        sys.stdout.write("%s\t%s\t%s\n" % (gene, hit['symbol'], hit['entrezgene']))

Sample run:

$ ./
ensembl                 hgnc    entrezgene
ENSMUSG00000064336      mt-Tf   NA
ENSMUSG00000064337      mt-Rnr1 NA
ENSMUSG00000064338      mt-Tv   NA
ENSMUSG00000064339      mt-Rnr2 NA
ENSMUSG00000064340      mt-Tl1  NA
ENSMUSG00000064341      ND1     17716

Looking at HGNC names in GeneCards or other resources, for example, may help with searching for Entrez Gene records that may not be available directly through these sources.
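
To decide which genes to look up by hand, it may help to separate the rows that came back with an Entrez ID from those that did not. The following is only a rough sketch: the rows are copied from the sample run above rather than fetched live, and `split_by_entrez` is a hypothetical helper name, not part of mygene.

```python
# Rows copied from the sample run above: (ensembl, symbol, entrezgene).
# "NA" marks a gene for which no Entrez Gene ID was returned.
rows = [
    ("ENSMUSG00000064336", "mt-Tf", "NA"),
    ("ENSMUSG00000064337", "mt-Rnr1", "NA"),
    ("ENSMUSG00000064338", "mt-Tv", "NA"),
    ("ENSMUSG00000064339", "mt-Rnr2", "NA"),
    ("ENSMUSG00000064340", "mt-Tl1", "NA"),
    ("ENSMUSG00000064341", "ND1", "17716"),
]


def split_by_entrez(rows):
    """Hypothetical helper: split rows into those with and without an Entrez ID."""
    mapped, unmapped = [], []
    for ensembl, symbol, entrez in rows:
        if entrez == "NA":
            unmapped.append((ensembl, symbol))
        else:
            mapped.append((ensembl, symbol, entrez))
    return mapped, unmapped


mapped, unmapped = split_by_entrez(rows)
unmapped_fraction = len(unmapped) / len(rows)

if __name__ == "__main__":
    print("mapped: %d, unmapped: %d" % (len(mapped), len(unmapped)))
    # Symbols of the unmapped genes, to be searched manually in other resources.
    for ensembl, symbol in unmapped:
        print("%s\t%s" % (ensembl, symbol))
```

The unmapped list can then be searched by symbol, while the mapped rows go straight into the GO or GSEA tool.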

written 3.2 years ago by Alex Reynolds