Question

Converting Ensembl gene IDs to Entrez IDs for indexing gene set collections for doing "Ensemble of Gene Set Enrichment Analyses"

0

Entering edit mode

4.7 years ago

Farah ▴ 80

Hello. I am doing ensemble gene set testing with EGSEA (Ensemble of Gene Set Enrichment Analyses) from the edgeR output as follows:

> dge <- edgeR::calcNormFactors(dge, method = "TMM")
> design <- model.matrix(~ 0 + cell + trt, data = dge$samples)
> contr.matrix = makeContrasts(
           BT20vsBT474 = BT20 - BT474,
           BT20vsBT549 = BT20 - BT549,
           BT474vsBT549 = BT474 - BT549,
           levels = colnames(design)[1:3])
> v <- voom(dge, design, plot=FALSE)

However, I found that after voom transformation, I need to build an index for each gene set collection using the EGSEA indexing functions (buildIdx) which relies on Entrez gene IDs only. However, in my dge and v, I have ensemble gene IDs. So, to convert my ensemble gene IDs to entrez gene IDs, I used biomaRt library as below:

To trim off the version of ensembl gene ids before biomaRt and then do biomaRt:

> library(biomaRt)
> v$genes <- gsub('\\..+$', '', v$genes$gene.id)
> ensembl.genes <- v$genes
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> genes <- getBM(filters = "ensembl_gene_id",
                  attributes = c("ensembl_gene_id","entrezgene_id"),
                  values = ensembl.genes, 
                  mart = mart)

Build an index:

> library(EGSEA)
> entrez_id <- data.frame(genes$entrezgene_id)
> gs.annots = buildIdx(entrezIDs= entrez_id, species="human", 
             msigdb.gsets=c("c2", "c5"), go.part = TRUE)

But, I faced with the below Error which shows there are many ensemble IDs that do not have corresponding entrez IDs:

[1] "Loading MSigDB Gene Sets ... "
[1] "Loaded gene sets for the collection c2 ..."
[1] "Indexed the collection c2 ..."
None of the genes in c2 are mapped to your gene IDs
[1] "Loaded gene sets for the collection c5 ..."
[1] "Indexed the collection c5 ..."
None of the genes in c5 are mapped to your gene IDs
[1] "Building KEGG pathways annotation object ... "
Error in data.frame(ID = gsets.ids, GeneSet = gsets.names, NumGenes = paste0(sapply(gsets,  :
  arguments imply differing number of rows: 0, 1

Would you please help me how can I fix this problem to have the same number of rows from ensemble and entrez ID columns (with no NAs)? Or should I have obtained gene entrez IDs, instead of ensemble IDs, from previous steps of the RNAseq analysis like read mapping or Differential gene expression analysis? Thank you so much.

RNA-Seq edgeR EGSEA limma biomaRt • 2.9k views

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 4.7 years ago by Farah ▴ 80

2

Entering edit mode

You have to understand that Ensembl and Entrez are different resources with possibly different notion of what a gene is so it is not surprising that some genes in one don't exist in the other. Pick a reference and stick to it. This means working from the beginning with one resource or the other and avoiding going back and forth. From the doc, EGSEA should be able to work with Ensembl gene IDs if you use buildCustomIdx().

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thanks a lot for your advice. I know that different databases have different gene notions and it is expected not to have annotations for all genes. But, as you suggested, I tried to use buildCustomIdx() for my list of ensembl genes (v$genes$gene.id). However, again I got the same Error:

> library(EGSEA)
> library(EGSEAdata)
> egsea.data("human")
> info = egsea.data("human", returnInfo = TRUE)
> gsets = list(info$msigdb$info$collections[c(3, 6)])
> gs.annots <- buildCustomIdx(geneIDs=v$genes$gene.id, gsets=gsets, species = "human")

Error in data.frame(ID = paste0(label, seq(1, length(gsets.idx))), GeneSet = gsets.names) : arguments imply differing number of rows: 2, 0

The class of v$genes$gene.id is "character" and the class of gsets = list(info$msigdb$info$collections[c(3, 6)]) is "list".

I searched a lot, but I could not find how to fix this Error. I will be grateful for any help to fix this problem. Many thanks.

ADD REPLY • link 4.7 years ago by Farah ▴ 80

0

Entering edit mode

The error is generated because gsets doesn't contain what it should (i.e. a list of gene sets). If you want to use MSigDB, it seems you need buildMSigDBIdx() or you have to extract/build the gene sets in gsets yourself for buildCustomIdx().

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thank you very much for your reply. I tried to extract the "c2" and "c5" gene sets of MSigDB as "list" class through gsets = list(info$msigdb$info$collections[c(3, 6)]), and class(gsets) is also a list as required for gsets argument of buildCustomIdx(), but it does not work. I tried many ways to extract them with the class of list, but I could not find how to do it. I would highly appreciate if you can give me any clues and advice about it. Thanks a lot.

ADD REPLY • link 4.7 years ago by Farah ▴ 80

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY • link 4.7 years ago by Ram 43k

0

Entering edit mode

Note: Crossposted on the Bioconductor forum.

Avoid crossposting or at the very least link the posts.

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thanks for your message, and sorry that I did not know I can not ask my question in different forums. I thought Bioconductor team may also guide me in using the EGSEA Bioconductor package.

ADD REPLY • link 4.7 years ago by Farah ▴ 80