How Ensembl assigns parent GO terms and the consequences for GO analysis
2.0 years ago

This is a follow-up to, and crosspost of, my question on the Bioconductor list. I have two related questions:

  • On Ensembl, if a gene is annotated with a specific GO term (e.g. GO:0005634 nucleus), how does Ensembl decide whether parent GO terms are also included (e.g. the parents of GO:0005634)?

  • Before GO enrichment, e.g. GSEA, shouldn't the gene annotations be augmented to include all the parent terms?

Here is an example to illustrate my first question. Ensembl/biomart tells me that gene ENSG00000281813 has the term GO:0005634, nucleus, from the cellular component ontology. However, the parents of GO:0005634 are not included in the annotation, or at least not all of them. For example, the topmost parent, GO:0005575 (cellular component), is not there:

library(biomaRt)

mart <- useEnsembl("ensembl", "hsapiens_gene_ensembl", version=107)

gos <- getBM(filters=c('ensembl_gene_id'), attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'), values=list(c('ENSG00000281813')), mart=mart)
gos
   ensembl_gene_id      go_id                                                           name_1006     namespace_1003
1  ENSG00000281813                                                                                                  
2  ENSG00000281813 GO:0005634                                                             nucleus cellular_component
3  ENSG00000281813 GO:0046872                                                   metal ion binding molecular_function
4  ENSG00000281813 GO:0016740                                                transferase activity molecular_function
5  ENSG00000281813 GO:0006355                          regulation of transcription, DNA-templated biological_process
6  ENSG00000281813 GO:0003677                                                         DNA binding molecular_function
7  ENSG00000281813 GO:0006325                                              chromatin organization biological_process
8  ENSG00000281813 GO:0016746                                            acyltransferase activity molecular_function
9  ENSG00000281813 GO:0006334                                                 nucleosome assembly biological_process
10 ENSG00000281813 GO:0000786                                                          nucleosome cellular_component
11 ENSG00000281813 GO:0043966                                              histone H3 acetylation biological_process
12 ENSG00000281813 GO:0000123                                   histone acetyltransferase complex cellular_component
13 ENSG00000281813 GO:0045893                 positive regulation of transcription, DNA-templated biological_process
14 ENSG00000281813 GO:0004402                                  histone acetyltransferase activity molecular_function
15 ENSG00000281813 GO:0016573                                                 histone acetylation biological_process
16 ENSG00000281813 GO:0005515                                                     protein binding molecular_function
17 ENSG00000281813 GO:0042393                                                     histone binding molecular_function
18 ENSG00000281813 GO:0045892                 negative regulation of transcription, DNA-templated biological_process
19 ENSG00000281813 GO:0061629 RNA polymerase II-specific DNA-binding transcription factor binding molecular_function
20 ENSG00000281813 GO:0005654                                                         nucleoplasm cellular_component
21 ENSG00000281813 GO:0045944           positive regulation of transcription by RNA polymerase II biological_process
22 ENSG00000281813 GO:0003712                                  transcription coregulator activity molecular_function
23 ENSG00000281813 GO:0070776                          MOZ/MORF histone acetyltransferase complex cellular_component
24 ENSG00000281813 GO:0050793                                 regulation of developmental process biological_process
25 ENSG00000281813 GO:1903706                                           regulation of hemopoiesis biological_process
26 ENSG00000281813 GO:0016407                                          acetyltransferase activity molecular_function

One would think that Ensembl includes only the most specific terms, since the parents are implied automatically. However, this is not the case. For example, ENSG00000276595 includes the topmost term GO:0005575 as well as some of its descendants:

gos <- getBM(filters=c('ensembl_gene_id'), attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'), values=list(c('ENSG00000276595')), mart=mart)
gos
  ensembl_gene_id      go_id                      name_1006     namespace_1003
1 ENSG00000276595 GO:0016020                       membrane cellular_component
2 ENSG00000276595 GO:0016021 integral component of membrane cellular_component
3 ENSG00000276595 GO:0005783          endoplasmic reticulum cellular_component
4 ENSG00000276595 GO:0005515                protein binding molecular_function
5 ENSG00000276595 GO:0003674             molecular_function molecular_function
6 ENSG00000276595 GO:0005575             cellular_component cellular_component ****
7 ENSG00000276595 GO:0097225                 sperm midpiece cellular_component

Is there a reason for this seemingly inconsistent behaviour?

Regarding my second question, I believe that data taken straight from Ensembl/biomart is not suitable for GSEA as implemented in e.g. fgsea, since each gene's annotation should first be augmented to include all parent terms of the terms it carries. Am I right...?
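To make the "augment with parent terms" idea concrete, here is a minimal base-R sketch. It rests on several assumptions: the `parents` map is a small hand-copied excerpt of is_a edges above GO:0005634 and is not the full ontology, and `get_ancestors()` and `propagate()` are hypothetical helper names. For a real analysis the ancestor sets would more likely come from a complete ontology source, for example the GOBPANCESTOR, GOCCANCESTOR and GOMFANCESTOR maps in the Bioconductor GO.db package.

```r
# Toy child -> direct-parent map (is_a edges only). Hand-copied excerpt for
# illustration; it may be incomplete or out of date relative to the live ontology.
parents <- list(
  "GO:0005634" = "GO:0043231",  # nucleus
  "GO:0043231" = "GO:0043227",
  "GO:0043227" = "GO:0043226",
  "GO:0043226" = "GO:0110165",
  "GO:0110165" = "GO:0005575"   # cellular_component (root)
)

# Collect every ancestor of one term by walking the parent map upwards.
get_ancestors <- function(term, parents) {
  seen <- character(0)
  queue <- term
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    up <- setdiff(parents[[current]], seen)  # NULL if the term has no parent
    seen <- c(seen, up)
    queue <- c(queue, up)
  }
  seen
}

# Expand a long gene/term table so each gene also carries all ancestor terms.
propagate <- function(annot, parents) {
  annot <- annot[annot$go_id != "", ]  # biomaRt can return rows with an empty go_id
  per_gene <- split(annot$go_id, annot$ensembl_gene_id)
  expanded <- lapply(per_gene, function(terms) {
    unique(c(terms, unlist(lapply(terms, get_ancestors, parents = parents))))
  })
  data.frame(
    ensembl_gene_id = rep(names(expanded), lengths(expanded)),
    go_id = unlist(expanded, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Two rows taken from the ENSG00000281813 output shown earlier in this post.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000281813", "ENSG00000281813"),
  go_id = c("", "GO:0005634"),
  stringsAsFactors = FALSE
)
full <- propagate(annot, parents)
```

With this toy map, `full` contains GO:0005634 plus the five terms above it, including the root GO:0005575.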

gsea ensembl fgsea biomart go • 1.3k views
2.0 years ago

That particular case is an error, as an annotation to a root term (‘GO:0008150 biological_process’, ‘GO:0003674 molecular_function’, or ‘GO:0005575 cellular_component’) implies that no information could be obtained at the time of annotation. If there is a more recent annotation to any non-root term, the root annotation should be removed at that time.

You can tell these are old annotations as they are from 20110107. I have contacted UniProt to let them know about these and asked them to at least remove the CC (the non-root MF annotation was made by IntAct).

As a rule, annotations for a single entity (gene, gene product, complex, etc.) made to the same reference and evidence should always be made to the most specific term possible.
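As a rough way to spot root annotations like these in a biomaRt result, the sketch below flags a root term whenever the same gene also has a non-root term in the same namespace. The data frame reproduces the ENSG00000276595 rows from the question; the column `redundant_root` and the object `cleaned` are names made up for this example, and the check only looks at co-occurrence, not at annotation dates or evidence.

```r
# The three GO root terms named above.
roots <- c("GO:0008150", "GO:0003674", "GO:0005575")

# Rows copied from the ENSG00000276595 output in the question.
gos <- data.frame(
  ensembl_gene_id = "ENSG00000276595",
  go_id = c("GO:0016020", "GO:0016021", "GO:0005783", "GO:0005515",
            "GO:0003674", "GO:0005575", "GO:0097225"),
  namespace_1003 = c("cellular_component", "cellular_component",
                     "cellular_component", "molecular_function",
                     "molecular_function", "cellular_component",
                     "cellular_component"),
  stringsAsFactors = FALSE
)

is_root <- gos$go_id %in% roots

# Count non-root, non-empty terms per gene and namespace.
key <- paste(gos$ensembl_gene_id, gos$namespace_1003)
n_specific <- ave(as.integer(!is_root & gos$go_id != ""), key, FUN = sum)

# A root term is flagged when a more specific term exists in the same namespace.
gos$redundant_root <- is_root & n_specific > 0
cleaned <- gos[!gos$redundant_root, ]
```

Here both GO:0003674 and GO:0005575 are flagged, because the gene also has protein binding (molecular_function) and several cellular_component terms.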

2.0 years ago
Ben Moore ★ 2.4k

Hi Dariober,

Ensembl associates GO terms to genes via UniProt mappings. The three-letter evidence codes found in the GO tables on the Ensembl web pages (https://www.ensembl.org/Homo_sapiens/Gene/Ontologies/molecular_function?db=core;g=ENSG00000221914;r=8:26291508-26372680) refer to the evidence used for the initial assignment of GO terms to UniProt records.


Thanks Ben, but I'm not sure this answers my (first) question, or if it does it just moves it somewhere outside ensembl. I'm not asking where the GO terms come from for a given gene. Instead, I'm wondering why some genes contain a specific term AND some of the ancestors of that term while other genes contain only the specific terms. In my opinion, it would be more user-friendly if a gene annotated with a specific term also included all the ancestor terms. If I'm not mistaken (and this is my second question), data retrieved from ensembl/biomart is not suitable for fgsea since genes are not fully annotated with parent terms.
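For context on the fgsea side: fgsea takes its gene sets as a named list of character vectors (the `pathways` argument), so a long gene/term table from biomaRt has to be reshaped first. A minimal sketch, using a few rows from the two genes in my question; whether the table has been augmented with parent terms beforehand is a separate step that this reshaping does not perform.

```r
# A few gene/term rows taken from the two biomaRt outputs in the question.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000281813", "ENSG00000281813", "ENSG00000281813",
                      "ENSG00000276595", "ENSG00000276595"),
  go_id = c("", "GO:0005634", "GO:0005515", "GO:0005515", "GO:0005783"),
  stringsAsFactors = FALSE
)

# Drop rows without a GO term, then build one gene vector per GO term.
annot <- annot[annot$go_id != "", ]
pathways <- lapply(split(annot$ensembl_gene_id, annot$go_id), unique)

# `pathways` can then be passed to fgsea together with a named numeric vector
# of gene-level statistics (`ranks` is a hypothetical object, not defined here):
# res <- fgsea::fgsea(pathways = pathways, stats = ranks)
```

Any term missing from a gene's annotation at this point, parent or otherwise, is simply absent from the corresponding gene set.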
