Question: How To Measure Gene Enrichment (GO, KEGG) Using R And Ensembl IDs?
9.3 years ago, Radhouane Aniba wrote:


Which tool (ideally an R package) would you advise for measuring gene set enrichment in GO/KEGG when I have a list of Ensembl IDs rather than Entrez IDs?

Most of the tools listed use Entrez IDs, and I would also like to know the reason for that.



Modified 9.2 years ago by Michael Kuhn; written 9.3 years ago by Radhouane Aniba

Is there any particular reason to use only Ensembl IDs? I think that if you use an ID converter to map between Ensembl <-> Entrez, you will have more tools available for this type of analysis.

Reply written 9.3 years ago by Khader Shameer

Sometimes mapping Ensembl IDs to Entrez IDs, let's say using BioMart, makes you lose some entries, since this is not a 1-1 relationship: some Ensembl IDs do not necessarily have Entrez equivalents.
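
To see how much is actually lost, it can help to count unmapped and multiply-mapped IDs before running any enrichment. A minimal sketch in base R, assuming the mapping is already available as a two-column data frame (the column names and all IDs below are made-up placeholders, not real annotations):

```r
# Hypothetical Ensembl -> Entrez mapping table, e.g. as returned by a BioMart query.
# All IDs are placeholders for illustration only.
mapping <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez  = c("101",    "202",    "203",    NA,       "404"),
  stringsAsFactors = FALSE
)

# Number of distinct, non-missing Entrez IDs per Ensembl ID
n_entrez <- tapply(mapping$entrez, mapping$ensembl,
                   function(x) length(unique(x[!is.na(x)])))

unmapped <- names(n_entrez)[n_entrez == 0]  # lost when converting
multi    <- names(n_entrez)[n_entrez > 1]   # one-to-many, needs a decision

cat("Ensembl IDs in input:", length(n_entrez), "\n")
cat("Without Entrez ID:   ", length(unmapped), "\n")
cat("With >1 Entrez ID:   ", length(multi), "\n")
```

If the unmapped fraction is small, converting is probably harmless; if it is large, a tool that accepts Ensembl IDs directly is the safer route.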

Reply written 9.3 years ago by Radhouane Aniba
8.6 years ago, Michael Kuhn (EMBL Heidelberg) wrote:

It looks like things have changed in the meantime: it is now possible to work directly with Ensembl gene IDs in topGO. Note the annot = and mapping = arguments, together with ID = "Ensembl":

# biocLite() has since been replaced by BiocManager::install()
# Assumption: the third package name was blank in the post; org.Mm.eg.db fits the mouse (ENSMUSG) IDs below
BiocManager::install(c("biomaRt", "topGO", "org.Mm.eg.db"))
library(topGO)

# Toy scores: the first 10 genes get 0 (selected by geneSel below), the other 20 get 1
all_genes <- rep(c(0,1,1), each=10)

names(all_genes) <- c("ENSMUSG00000031167", "ENSMUSG00000022443", "ENSMUSG00000055762", "ENSMUSG00000031328", "ENSMUSG00000026728", "ENSMUSG00000029722", "ENSMUSG00000059208", "ENSMUSG00000017404", "ENSMUSG00000003970", "ENSMUSG00000019210", "ENSMUSG00000024966", "ENSMUSG00000024197", "ENSMUSG00000030847", "ENSMUSG00000004264", "ENSMUSG00000041959", "ENSMUSG00000032643", "ENSMUSG00000042520", "ENSMUSG00000028484", "ENSMUSG00000022186", "ENSMUSG00000027593", "ENSMUSG00000019795", "ENSMUSG00000021134", "ENSMUSG00000026956", "ENSMUSG00000022982", "ENSMUSG00000027162", "ENSMUSG00000024668", "ENSMUSG00000000168", "ENSMUSG00000022234", "ENSMUSG00000024991", "ENSMUSG00000018697")

# Assumption: annot and mapping were blank in the post; annFUN.org with org.Mm.eg.db is the topGO route that accepts ID = "Ensembl"
GOdata <- new("topGOdata", ontology = "BP", allGenes = all_genes, geneSel = function(p) p < 1e-2, description = "Test", annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "Ensembl")

resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
GenTable(GOdata, classicFisher = resultFisher, topNodes = 10)
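
For intuition about what a Fisher-type enrichment test evaluates for each GO term, here is a minimal base-R sketch of a one-sided Fisher's exact test on a single term, which is the usual form for over-representation analysis. The counts are invented for illustration, and topGO's own bookkeeping (gene universe, node size filters) is not reproduced here:

```r
# Invented counts for a single GO term (illustration only)
n_universe    <- 1000  # genes with any annotation in the chosen ontology
n_selected    <- 50    # genes passing the selection cut-off
n_term        <- 40    # genes in the universe annotated to the term
n_sel_in_term <- 8     # selected genes annotated to the term

# 2 x 2 contingency table: selection status vs. term membership
tab <- matrix(
  c(n_sel_in_term,          n_selected - n_sel_in_term,
    n_term - n_sel_in_term, n_universe - n_selected - (n_term - n_sel_in_term)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("selected", "not selected"), c("in term", "not in term"))
)

# One-sided test: is the term over-represented among the selected genes?
res <- fisher.test(tab, alternative = "greater")
res$p.value

# The same tail probability from the hypergeometric distribution
p_hyper <- phyper(n_sel_in_term - 1, n_term, n_universe - n_term, n_selected,
                  lower.tail = FALSE)
p_hyper
```

The two p-values agree because the one-sided Fisher test is the upper tail of the hypergeometric distribution.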
9.3 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

Try using the biomaRt Bioconductor package to map your Ensembl IDs to Entrez Gene IDs. Then you can apply the GOstats package or topGO.
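
A rough sketch of that mapping step with biomaRt follows. The wrapper function name is hypothetical, and the dataset and attribute names are assumptions for human genes on a recent Ensembl release (older releases named the Entrez attribute differently), so it is worth checking them with listAttributes() first:

```r
# Sketch only: requires the Bioconductor package biomaRt and network access.
# map_ensembl_to_entrez() is a hypothetical helper, not part of biomaRt.
# The dataset and attribute names are assumptions; verify them with
# biomaRt::listDatasets(mart) and biomaRt::listAttributes(mart).
map_ensembl_to_entrez <- function(ensembl_ids,
                                  dataset = "hsapiens_gene_ensembl") {
  mart <- biomaRt::useMart("ensembl", dataset = dataset)
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "entrezgene_id"),
    filters    = "ensembl_gene_id",
    values     = ensembl_ids,
    mart       = mart
  )
}

# Usage (not run here because it queries the Ensembl server):
# mapping <- map_ensembl_to_entrez(c("ENSG00000111704", "ENSG00000115232"))
# head(mapping)
```

The result is a data frame with one row per Ensembl/Entrez pair, so IDs with several Entrez equivalents appear more than once.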


Nice. I have the same concern that I raised in my reply to Khader; please read it.

Reply written 9.3 years ago by Radhouane Aniba
9.3 years ago, Giulietta - Ensembl Helpdesk (Cambridge, UK) wrote:

GO terms are actually mapped to Ensembl IDs at the transcript level. You can get the associated mappings through BioMart by querying with your Ensembl IDs. The mappings are made through the UniProt ID. Ensembl does not map to KEGG, but Reactome is a pathway database that cross-links to KEGG and can be queried using Ensembl IDs. They also have a Mart:

I hope this helps.

Modified 12 months ago by RamRS; written 9.3 years ago by Giulietta - Ensembl Helpdesk
8.9 years ago, Duff (United Kingdom) wrote:

To get the Entrez Gene IDs from the Bioconductor annotation packages you can do:

library(org.Hs.eg.db) # human annotation package; provides org.Hs.egENSEMBL2EG
ENSEMBL <- c('ENSG00000206454', 'ENSG00000111704', 'ENSG00000115232') # example IDs
egIDs <- unlist(mget(ENSEMBL, org.Hs.egENSEMBL2EG, ifnotfound = NA))

Personally I've had some problems with 'unlist' where more than one Ensembl ID links to a single Entrez ID or vice versa, as illustrated in this example with ENSG00000111704. This maps to two Entrez IDs, and because unlist() numbers the names of multi-valued entries to keep them unique, these Entrez IDs end up named ENSG000001117041 and ENSG000001117042 in the result. This makes it harder to map back from Entrez to Ensembl.

Now I prefer:

egIDs <- stack(mget(ENSEMBL, org.Hs.egENSEMBL2EG, ifnotfound = NA))
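
The difference between the two approaches can be reproduced without any annotation package. A small sketch, using a plain named list as a stand-in for the mget() result (the Entrez IDs are made-up placeholders, not the real mappings):

```r
# A plain named list standing in for the result of mget();
# the Entrez IDs are made-up placeholders.
hits <- list(
  ENSG00000206454 = "1111",
  ENSG00000111704 = c("2222", "3333"),  # one Ensembl ID, two Entrez IDs
  ENSG00000115232 = NA                  # no Entrez equivalent found
)

# unlist() numbers the names of multi-valued entries, so they no longer
# match the original Ensembl IDs
flat <- unlist(hits)
names(flat)

# stack() returns one row per Ensembl/Entrez pair, with the untouched
# Ensembl ID in column 'ind' and the Entrez ID in column 'values'
tbl <- stack(hits)
tbl$ind <- as.character(tbl$ind)
tbl
```

With the stacked table, mapping back from Entrez to Ensembl is a simple lookup on the 'values' column.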

Or you could use SQL directly on the database (these packages are SQLite databases; see the AnnotationDbi documentation).

library(DBI) # provides dbGetQuery()
dbCon <- org.Hs.eg_dbconn() # set up the database connection
sql <- 'SELECT * FROM ensembl, genes WHERE genes._id = ensembl._id;' # join Ensembl IDs to Entrez Gene IDs
result <- dbGetQuery(dbCon, sql) # query the database

ENTREZ <- result[which(result[,2] %in% ENSEMBL), c(2,4)] # keep the Ensembl ID and Entrez ID columns for your IDs

Once you have the mapping you can use the Entrez-centric tools, although, as others point out, you may lose a few genes in the process. How this will affect the results is unknown, though.



9.3 years ago, David Quigley (San Francisco) wrote:

Warning: anecdotal opinion not supported by hard data.

Why are many tools targeted at Entrez IDs? I personally think Ensembl is a very rich environment and a great toolset, but I spend 99% of my time in the NCBI universe, where Entrez IDs are the common denominator. Most academic software packages grow out of what someone needed to write for themselves. My SuperUltraGO package would likely use Entrez IDs because I would write against the toolset I personally use. I would need a particular reason to make it eat Ensembl identifiers as well, even though that would be a good idea in theory.

(There is no SuperUltraGO package. Yet. I think.)


+1. I like Ensembl too for its rational way of organising genes, transcripts and proteins.

Reply written 9.3 years ago by Khader Shameer

To answer the question with regard to R/Bioconductor, that is historical, largely. The annotation packages are NCBI-centric and it is the organism-specific annotation packages that allow such tasks as GO or KEGG enrichment. It would be possible (and we have talked about doing it) to make annotation packages that are Ensembl or even UCSC centric, but we haven't gone there yet. That said, it IS possible for a user to build an annotation package using AnnotationDbi that would use Ensembl IDs as the keys.

Reply written 9.3 years ago by Sean Davis
9.2 years ago, Tommy C wrote:

I recommend the romer or roast functions for gene set analysis. They fit nicely into limma workflows and avoid the problem of choosing a suitable cut-off for DE genes.



8.9 years ago, Mushal (Cape Town) wrote:

Hi, you can modify CORNA.
