Question

GO enrichment analysis with human ENSEMBL ids

2

6.9 years ago

zhang.jianhai ▴ 30

Dear BioStar communities,

I am analyzing RNA-seq counts from the GTEx website with edgeR. The goana() function requires Entrez and RefSeq ids.

The original count data only contains Ensembl ids, so I need to map them to Entrez and RefSeq. The problem is that one Ensembl id can map to multiple Entrez ids, and one Entrez id can map to multiple RefSeq ids. This makes it difficult to annotate "genes" in the "DGEList".

E.g.: Ensembl "ENSG00000223972" mapped to 4 Entrez ids: "84771" "727856" "100287102" and Entrez "84771" mapped to two RefSeq ids: "NR_024004" "NR_024005"

So my question is: how should I handle these mappings, which are not one-to-one?

Alternatively, how can I perform GO enrichment analysis with given human Ensembl ids?

Thanks a lot.

Regards,

Jianhai

R RNA-Seq ensembl GO enrichment edgeR • 6.6k views

ADD COMMENT • link updated 6.8 years ago by Nan Xiao • 0 • written 6.9 years ago by zhang.jianhai ▴ 30

1

Hi,

If you have gene symbols (HGNC), use GeneSCF to get complete annotation for all your input genes. Every ENSG (Ensembl gene) has a corresponding gene symbol, which you can find in the GTF or GFF3 file from Ensembl. To avoid this problem, I personally prefer to use only Ensembl IDs and gene symbols throughout the analysis, and to keep the same annotation version throughout.

ADD REPLY • link 6.9 years ago by EagleEye 7.5k

0

Where are you getting your mappings? ENSG00000223972 is only Entrez ID 100287102, the rest are different genes.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Hello Devon Ryan,

The mapping is as below:

    library(org.Hs.eg.db)
    x <- as.list(org.Hs.egENSEMBL2EG)
    x["ENSG00000223972"]

Thanks. Jianhai

ADD REPLY • link 6.9 years ago by zhang.jianhai ▴ 30

0

That R package apparently has some errors, since the example mapping is incorrect. Please report this upstream to the package maintainer.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Maybe you can use the biomaRt R package. When I use org.Hs.eg.db to map Ensembl ids to gene symbols, I sometimes get two gene symbols for one Ensembl id.

ADD REPLY • link 6.9 years ago by Javier Zhang • 0

0

That is why, to avoid confusion, I suggested using the GTF/GFF3 file to convert your ENSGs to gene symbols.

ADD REPLY • link 6.9 years ago by EagleEye 7.5k

0

Hello everyone,

Thanks for all your replies.

In the original data, I already have gene symbols alongside the Ensembl ids, but no Entrez or RefSeq ids. My fundamental goal is GO enrichment analysis, preferably with these Ensembl ids. Does anyone have ideas?

Thanks.

Regards, Jianhai

ADD REPLY • link 6.9 years ago by zhang.jianhai ▴ 30

2

Your problem seems to stem from mixing two different gene sets. You have to understand that different resources have different notions of what a gene is. EnsEMBL provides one set of genes as part of its annotation of the human genome. RefSeq on the other hand is just a collection of sequences, some of them assigned to genes. RefSeqGene is a subset of RefSeq that "defines genomic sequences to be used as reference standards for well-characterized genes". While EnsEMBL has at least an operational definition of what a gene is (roughly, a locus producing a set of related, overlapping transcripts), I still haven't found anything explaining what a gene is in RefSeq. As already suggested by EagleEye, my advice for data analysis is to decide on which genome reference you want to use for the project and stick to it.
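If you do have to cross between two gene sets, one conservative option is to keep only the unambiguous pairs and drop the rest. Below is a minimal base-R sketch of that idea, not a recommendation from any particular package. The table `map` is a toy example: the first three rows reuse the IDs quoted in the question, while the `ENSG_TOY_*` IDs and the Entrez IDs "1001" and "1002" are made up for illustration.

```r
# Toy Ensembl-to-Entrez mapping table (hypothetical except for the first
# three rows, which reuse the IDs quoted in the question).
map <- data.frame(
  ensembl = c("ENSG00000223972", "ENSG00000223972", "ENSG00000223972",
              "ENSG_TOY_A", "ENSG_TOY_B", "ENSG_TOY_C"),
  entrez  = c("84771", "727856", "100287102",
              "1001", "1002", "1002"),
  stringsAsFactors = FALSE
)

# Remove exact duplicate rows so that repeated pairs are not counted twice.
map <- unique(map)

# Flag rows whose Ensembl ID (or Entrez ID) appears in more than one pair.
multi_ens <- map$ensembl %in% map$ensembl[duplicated(map$ensembl)]
multi_ent <- map$entrez  %in% map$entrez[duplicated(map$entrez)]

# Keep only strict one-to-one pairs.
one_to_one <- map[!multi_ens & !multi_ent, ]

# Ensembl IDs that were dropped because their mapping is ambiguous.
ambiguous <- setdiff(unique(map$ensembl), one_to_one$ensembl)

print(one_to_one)
print(ambiguous)
```

Dropping ambiguous IDs loses some genes, so it is worth checking how many end up in `ambiguous` before going ahead. Whether that loss is acceptable depends on the analysis.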

ADD REPLY • link 6.9 years ago by Jean-Karim Heriche 27k

1

Since you already have your gene symbols, you can use GeneSCF to perform the enrichment analysis, as I mentioned in my last comment. If you have any difficulties using GeneSCF, I am happy to help.

ADD REPLY • link 6.9 years ago by EagleEye 7.5k

0

Please use ADD COMMENT or ADD REPLY to respond to previous posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

score 2 · Answer 1 · 2017-05-28

6.9 years ago

Whoknows ▴ 960

Hi

You could use DAVID Functional Annotation or Enrichr. Enrichr works with gene symbols, but you can easily convert Ensembl IDs to gene symbols with Ensembl BioMart.
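For orientation, here is roughly what a GO over-representation test computes once the IDs are in a single namespace. It is a minimal base-R sketch using a one-sided hypergeometric test, and it is not the exact procedure of DAVID or Enrichr, whose statistics and background sets differ in detail. The function name `go_overrep_p` and all gene names are hypothetical, and the membership of the GO term is made up.

```r
# Hypothetical toy inputs: a background universe, a list of genes of
# interest, and the genes annotated to one GO term.
universe   <- paste0("GENE", 1:100)
de_genes   <- paste0("GENE", 1:20)
term_genes <- paste0("GENE", c(1:8, 51:52))

# One-sided hypergeometric p-value for over-representation of a term
# among the genes of interest (hypothetical helper, base R only).
go_overrep_p <- function(de_genes, term_genes, universe) {
  universe <- unique(universe)
  de   <- intersect(unique(de_genes), universe)    # genes of interest in background
  term <- intersect(unique(term_genes), universe)  # term members in background
  k <- length(intersect(de, term))                 # genes of interest in the term
  m <- length(term)                                # term members ("white balls")
  n <- length(universe) - m                        # non-members ("black balls")
  # P(X >= k) when drawing length(de) genes without replacement
  phyper(k - 1, m, n, length(de), lower.tail = FALSE)
}

p <- go_overrep_p(de_genes, term_genes, universe)
print(p)
```

The same p-value comes out of a one-sided `fisher.test` on the corresponding 2x2 table. In a real analysis the test is repeated for every GO term and the p-values are adjusted for multiple testing, for example with `p.adjust`.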

ADD COMMENT • link 6.9 years ago by Whoknows ▴ 960

0

Dear Whoknows,

Thanks for your reply. I already have both Ensembl ids and gene symbols in the count table. My fundamental goal is GO enrichment analysis. Do you have ideas on how to perform GO enrichment with human Ensembl ids?

Regards, Jianhai

ADD REPLY • link 6.9 years ago by zhang.jianhai ▴ 30

0

Try DAVID Functional Annotation; it works for human.

ADD REPLY • link 6.9 years ago by Whoknows ▴ 960

score 0 · Answer 2 · 2017-07-06

6.8 years ago

Nan Xiao • 0

You could try the R package grex that I wrote.

The grex package offers a fast and minimal dependency solution for mapping Ensembl gene IDs to Entrez IDs, HGNC gene symbols, and UniProt IDs, specifically designed for GTEx data. See the package vignette here to get started.

ADD COMMENT • link 6.8 years ago by Nan Xiao • 0