This is a follow-up/cross-post of my question on the Bioconductor list. I have two related questions:

1. On Ensembl, if a gene is annotated with a specific GO term (e.g. GO:0005634, nucleus), how does Ensembl decide whether parent GO terms are also included (e.g. the parents of GO:0005634)?
2. Before GO enrichment analysis, such as GSEA, shouldn't the gene annotations be augmented to include all the parent terms?
This is an example to illustrate my first question. Ensembl/biomart tells me that gene `ENSG00000281813` has the term `GO:0005634`, nucleus, from ontology cellular component. However the parents of `GO:0005634` are not included in the annotation, at least not all of them. For example, the topmost parent, `GO:0005575` (cellular component), is not there:
```r
library(biomaRt)

mart <- useEnsembl("ensembl", "hsapiens_gene_ensembl", version=107)

# Retrieve all GO annotations for one gene
gos <- getBM(filters=c('ensembl_gene_id'),
             attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'),
             values=list(c('ENSG00000281813')),
             mart=mart)
gos

   ensembl_gene_id  go_id       name_1006                                                            namespace_1003
 1 ENSG00000281813
 2 ENSG00000281813  GO:0005634  nucleus                                                              cellular_component
 3 ENSG00000281813  GO:0046872  metal ion binding                                                    molecular_function
 4 ENSG00000281813  GO:0016740  transferase activity                                                 molecular_function
 5 ENSG00000281813  GO:0006355  regulation of transcription, DNA-templated                           biological_process
 6 ENSG00000281813  GO:0003677  DNA binding                                                          molecular_function
 7 ENSG00000281813  GO:0006325  chromatin organization                                               biological_process
 8 ENSG00000281813  GO:0016746  acyltransferase activity                                             molecular_function
 9 ENSG00000281813  GO:0006334  nucleosome assembly                                                  biological_process
10 ENSG00000281813  GO:0000786  nucleosome                                                           cellular_component
11 ENSG00000281813  GO:0043966  histone H3 acetylation                                               biological_process
12 ENSG00000281813  GO:0000123  histone acetyltransferase complex                                    cellular_component
13 ENSG00000281813  GO:0045893  positive regulation of transcription, DNA-templated                  biological_process
14 ENSG00000281813  GO:0004402  histone acetyltransferase activity                                   molecular_function
15 ENSG00000281813  GO:0016573  histone acetylation                                                  biological_process
16 ENSG00000281813  GO:0005515  protein binding                                                      molecular_function
17 ENSG00000281813  GO:0042393  histone binding                                                      molecular_function
18 ENSG00000281813  GO:0045892  negative regulation of transcription, DNA-templated                  biological_process
19 ENSG00000281813  GO:0061629  RNA polymerase II-specific DNA-binding transcription factor binding  molecular_function
20 ENSG00000281813  GO:0005654  nucleoplasm                                                          cellular_component
21 ENSG00000281813  GO:0045944  positive regulation of transcription by RNA polymerase II            biological_process
22 ENSG00000281813  GO:0003712  transcription coregulator activity                                   molecular_function
23 ENSG00000281813  GO:0070776  MOZ/MORF histone acetyltransferase complex                           cellular_component
24 ENSG00000281813  GO:0050793  regulation of developmental process                                  biological_process
25 ENSG00000281813  GO:1903706  regulation of hemopoiesis                                            biological_process
26 ENSG00000281813  GO:0016407  acetyltransferase activity                                           molecular_function
```
One would think that ensembl includes only the most specific terms since the parents are automatically implied. However, this is not the case. For example, `ENSG00000276595` does include the topmost term `GO:0005575` but also some of its offspring:
```r
# Same query for a second gene (reuses the `mart` object created above)
gos <- getBM(filters=c('ensembl_gene_id'),
             attributes=c('ensembl_gene_id', 'go_id', 'name_1006', 'namespace_1003'),
             values=list(c('ENSG00000276595')),
             mart=mart)
gos

  ensembl_gene_id  go_id       name_1006                       namespace_1003
1 ENSG00000276595  GO:0016020  membrane                        cellular_component
2 ENSG00000276595  GO:0016021  integral component of membrane  cellular_component
3 ENSG00000276595  GO:0005783  endoplasmic reticulum           cellular_component
4 ENSG00000276595  GO:0005515  protein binding                 molecular_function
5 ENSG00000276595  GO:0003674  molecular_function              molecular_function
6 ENSG00000276595  GO:0005575  cellular_component              cellular_component  ****
7 ENSG00000276595  GO:0097225  sperm midpiece                  cellular_component
```
Is there a reason for this seemingly inconsistent behaviour?
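To make the inconsistency concrete, one could flag the terms annotated to a gene that are already implied by a more specific term annotated to the same gene. The following is only a minimal base-R sketch: `toy_ancestors` and `implied_terms` are hypothetical names, and the ancestor map is hand-written and heavily simplified rather than taken from the real ontology. In practice the ancestor lists could presumably come from a resource such as the `GOCCANCESTOR` map in the GO.db package.

```r
# Toy ancestor map: term -> all of its ancestors.
# Hand-written and simplified for illustration; NOT the real GO graph.
toy_ancestors <- list(
  "GO:0016021" = c("GO:0016020", "GO:0005575"),  # integral component of membrane
  "GO:0016020" = c("GO:0005575"),                # membrane
  "GO:0005783" = c("GO:0005575"),                # endoplasmic reticulum
  "GO:0005575" = character(0)                    # cellular_component (root)
)

# Return the annotated terms that are ancestors of another term
# annotated to the same gene, i.e. terms that are already implied.
implied_terms <- function(terms, ancestors) {
  implied <- unique(unlist(ancestors[terms], use.names = FALSE))
  intersect(terms, implied)
}

# Cellular component terms of one gene (subset of the output above)
gene_terms <- c("GO:0016020", "GO:0016021", "GO:0005783", "GO:0005575")
redundant <- implied_terms(gene_terms, toy_ancestors)
print(redundant)
```

With this toy map, both `GO:0016020` and `GO:0005575` are flagged, because each is an ancestor of at least one other term annotated to the same gene.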
Regarding my second question, I believe that data straight from Ensembl/biomart is not suitable for GSEA as implemented in, e.g., fgsea, since each gene's annotation should first be augmented to include all parent terms of its annotated terms. Am I right?
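If such augmentation is needed, it could look roughly like the sketch below, which expands each (gene, term) pair to the term plus all of its ancestors and then builds a named list of gene sets, the shape that fgsea accepts as its `pathways` argument. Everything here is an assumption for illustration: the gene IDs `geneA` and `geneB`, the function `propagate`, and the simplified `toy_ancestors` map are made up, and real input would come from `getBM()` and an actual ontology source.

```r
# Hypothetical toy annotation table, same column names as the getBM() result.
annot <- data.frame(
  ensembl_gene_id = c("geneA", "geneA", "geneB"),
  go_id           = c("GO:0005654", "GO:0016020", "GO:0005634"),
  stringsAsFactors = FALSE
)

# Toy ancestor map (simplified, illustrative only; not the real GO graph).
toy_ancestors <- list(
  "GO:0005654" = c("GO:0005634", "GO:0005575"),  # nucleoplasm
  "GO:0005634" = c("GO:0005575"),                # nucleus
  "GO:0016020" = c("GO:0005575"),                # membrane
  "GO:0005575" = character(0)                    # cellular_component (root)
)

# Expand every (gene, term) row to the term itself plus all its ancestors,
# then drop duplicate (gene, term) pairs.
propagate <- function(annot, ancestors) {
  expanded <- lapply(seq_len(nrow(annot)), function(i) {
    term <- annot$go_id[i]
    data.frame(
      ensembl_gene_id = annot$ensembl_gene_id[i],
      go_id = c(term, ancestors[[term]]),
      stringsAsFactors = FALSE
    )
  })
  unique(do.call(rbind, expanded))
}

full <- propagate(annot, toy_ancestors)

# Named list: GO term -> genes annotated to it directly or via a descendant.
gene_sets <- split(full$ensembl_gene_id, full$go_id)
print(gene_sets)
```

After propagation, the root term `GO:0005575` contains both toy genes, and `geneA` also appears under `GO:0005634` even though it was only annotated with the more specific nucleoplasm term.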
Thanks Ben, but I'm not sure this answers my (first) question, or if it does it just moves it somewhere outside ensembl. I'm not asking where the GO terms come from for a given gene. Instead, I'm wondering why some genes contain a specific term AND some of the ancestors of that term while other genes contain only the specific terms. In my opinion, it would be more user-friendly if a gene annotated with a specific term also included all the ancestor terms. If I'm not mistaken (and this is my second question), data retrieved from ensembl/biomart is not suitable for fgsea since genes are not fully annotated with parent terms.