Question

GO term analysis with Ensembl gene IDs

0

7.2 years ago

yuxinghai ▴ 10

I got a list of Ensembl gene IDs from a differential expression analysis with DESeq2. I want to perform GO enrichment analysis, but almost half of the IDs are not recognized by DAVID. Some people said I could use BioMart in Ensembl to get the corresponding GO terms for each gene, but what should I do next?

RNA-Seq gene go ensembl • 7.6k views

ADD COMMENT • link updated 7.2 years ago by Jean-Karim Heriche 27k • written 7.2 years ago by yuxinghai ▴ 10

1

Give GeneSCF a try. ~~It supports Ensembl ID's.~~

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

Sorry to say this, but GeneSCF does not support Ensembl IDs directly. You can, however, convert them into gene symbols or Entrez IDs and use those in GeneSCF.

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

It's a pity that it doesn't work with EnsEMBL. In my work I find EnsEMBL a much better resource than NCBI.

ADD REPLY • link 7.2 years ago by Jean-Karim Heriche 27k

0

It was a problem when I tried to support Ensembl IDs in GeneSCF, because for some gene symbols the Ensembl ID (ENSG) varies depending on the Ensembl version.

For example, KCNQ1OT1 has different ENSG IDs in an older Ensembl release (ENSG00000258492.1, GRCh37.66, GENCODE v11) and a newer one (ENSG00000269821.1, GRCh37.74-75, GENCODE v19). The only thing that stayed constant for this gene was the gene symbol or the Entrez ID.

At least if I have something fixed, like gene symbols (I can easily deal with multiple aliases) or Entrez IDs, I can use it confidently; otherwise the results might be misleading.

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

Don't use the .x version suffix of EnsEMBL IDs; the unversioned IDs should be more stable. Gene symbols are not stable either (although I must say they change less often than they did a few years ago). The underlying problem is to define what a gene is and to work with that definition consistently. It seems that for you a gene is defined by whatever shares the same symbol. This is reasonable, as it is more or less the definition biologists use, but as you've already experienced, it can create computational problems. It is also not always the best definition to use, especially when the underlying genome matters. The problem with Entrez is that it is unclear what a gene is. From this paper:

A GeneID is usually assigned to what is annotated as a gene on a RefSeq record. ... A GeneID may also be assigned when no RefSeq exists.

And from the RefSeq book section on curation:

A sequence record unambiguously associated with a Gene record may be propagated into a RefSeq record.

This looks very circular and ad hoc to me.

A RefSeq record is suppressed if it is found to represent a transcribed repeat element, ... or not to represent a "gene".

Notice the quotes around the word gene, which I take to indicate that there is no formal definition of the term.

Anyway, the conclusion is that there are different definitions of what a gene is and that one should pick a reference and stick to it for the duration of a project or risk inconsistent results.
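As a small illustration of the advice about dropping the version suffix, here is a minimal Python sketch. The helper name `strip_version` is made up for this example; the first two IDs are the ones quoted earlier in the thread, and the third simply shows that an ID without a suffix is left alone.

```python
import re

# An Ensembl stable ID may carry a trailing ".N" version suffix.
_VERSION_SUFFIX = re.compile(r"\.\d+$")

def strip_version(ensembl_id: str) -> str:
    """Return the ID without its trailing version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())

ids = ["ENSG00000269821.1", "ENSG00000258492.1", "ENSG00000141510"]
unversioned = [strip_version(i) for i in ids]
print(unversioned)
```

Note that this only removes the version. It does not reconcile cases like the KCNQ1OT1 example, where the stable ID itself differs between releases, so sticking to one reference release for the whole project still matters.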

ADD REPLY • link 7.2 years ago by Jean-Karim Heriche 27k

score 2 · Answer 1 · 2017-02-21

2

7.2 years ago

Jean-Karim Heriche 27k

You could use an R package like topGO or one of the Babelomics enrichment tools.

ADD COMMENT • link 7.2 years ago by Jean-Karim Heriche 27k

score 1 · Answer 2 · 2017-02-21

1

7.2 years ago

EagleEye 7.5k

Suggestion:

1) Using BioMart, convert your Ensembl (ENSG) IDs into gene symbols or Entrez GeneIDs (check steps here).

2) Use GeneSCF to do enrichment analysis.
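To make the second step less of a black box, here is a rough Python sketch of what an over-representation test does once you have a gene-to-GO mapping (for example from a BioMart export). It is a generic one-sided hypergeometric test, not GeneSCF's actual implementation, and the function names, gene IDs, and term labels are all invented toy values.

```python
from math import comb

def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def enrich(de_genes, gene2go, background):
    """Return {term: (hits, p_value)} for every term hit by a DE gene."""
    background = set(background)
    de = set(de_genes) & background
    # Invert gene -> terms into term -> genes, restricted to the background.
    term2genes = {}
    for gene, terms in gene2go.items():
        if gene in background:
            for term in terms:
                term2genes.setdefault(term, set()).add(gene)
    results = {}
    for term, genes in term2genes.items():
        hits = len(genes & de)
        if hits:
            p = hypergeom_sf(hits, len(background), len(genes), len(de))
            results[term] = (hits, p)
    return results

# Toy data: 10 background genes, 3 of them differentially expressed.
background = [f"g{i}" for i in range(1, 11)]
gene2go = {g: {"term_B"} for g in background}   # term_B annotates every gene
for g in ("g1", "g2", "g3"):
    gene2go[g] = {"term_A", "term_B"}           # term_A annotates only g1-g3
results = enrich(["g1", "g2", "g3"], gene2go, background)
for term, (hits, p) in sorted(results.items()):
    print(term, hits, round(p, 4))
```

In a real analysis the background should be the set of genes that were actually tested, and the p-values need a multiple-testing correction, which dedicated tools normally handle for you.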

ADD COMMENT • link 7.2 years ago by EagleEye 7.5k

0

But many Ensembl gene IDs don't have corresponding Entrez IDs.

ADD REPLY • link 7.2 years ago by yuxinghai ▴ 10

0

All Ensembl IDs will have corresponding GeneSymbols. You can use that information.
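How many IDs actually come back with a symbol is easy to check on your own export. The Python sketch below reads a tab-separated table and splits the IDs into those with and without a symbol. The column names are an assumption about how the export is labelled, the helper `split_by_symbol` is hypothetical, and the `ENSG_TOY_*` rows are invented placeholders, so adjust them to match your file.

```python
import csv
import io

# Toy stand-in for a tab-separated BioMart export (column names assumed).
biomart_tsv = (
    "Gene stable ID\tGene name\n"
    "ENSG00000269821\tKCNQ1OT1\n"
    "ENSG_TOY_0001\t\n"            # invented row with no symbol
    "ENSG_TOY_0002\tTOYGENE2\n"    # invented row with an invented symbol
)

def split_by_symbol(handle, id_col="Gene stable ID", symbol_col="Gene name"):
    """Return ({id: symbol}, [ids without a symbol]) from a TSV file handle."""
    mapped, unmapped = {}, []
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_id = row[id_col]
        symbol = (row.get(symbol_col) or "").strip()
        if symbol:
            mapped[gene_id] = symbol
        else:
            unmapped.append(gene_id)
    return mapped, unmapped

mapped, unmapped = split_by_symbol(io.StringIO(biomart_tsv))
print(f"{len(mapped)} mapped, {len(unmapped)} without a symbol: {unmapped}")
```

With a real file you would pass `open("export.tsv", newline="")` instead of the `io.StringIO` object; the file name here is only a placeholder.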

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

score 1 · Answer 3 · 2017-02-21

1

7.2 years ago

Benn 8.3k

With goseq in R you can use Ensembl IDs.

Or clusterProfiler, which has a good tutorial:

http://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html

ADD COMMENT • link 7.2 years ago by Benn 8.3k