Question: Making sense of Gene Ontology graph relations
Asked 6.1 years ago by LankyCyril (Russian Federation):

I'm trying to understand the logic behind the Gene Ontology annotations.

Let's take one gene, for example: ENSG00000198570. When I pass it to BioMart, it reports three GO term accession IDs. Visualized within the GO tree, they look like this:

All three of them are offspring of biological_process; two of them are offspring of single-organism process.

Ultimately, I want to analyze quite a big set of genes and see whether they cluster into big and/or small groups with the same function. For example, if two genes are both reported as offspring of protein binding, I would immediately know that they are both protein binding and biological_process themselves.


Right now the only option seems to be to traverse the GeneOntology XML, bottom to top, for each GO term, but it's stupidly inefficient. Maybe there's something obvious I'm missing or there's a piece of software out there that can do just what I need?

I hope what I'm saying is making sense to you.

Answered 6.1 years ago by mikhail.shugay (Czech Republic, Brno, CEITEC):

The reported terms are the most specific categories (i.e. the lowest-level internal vertices or leaves in the GO tree) for that gene. If I understood you correctly, you're asking why parent vertices are not included. Well, this would make the GO annotation far more redundant, increase the data size, and make it harder (in some ways) to analyze.
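Since the parent vertices are implied rather than stored, they can be recovered once per term and cached instead of re-walking the ontology for every gene. Below is a minimal sketch of that idea, not a definitive implementation: the `PARENTS` table, the term names and the gene names are hypothetical toy data, and in practice the child-to-parent table would have to be built from the actual ontology file.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical toy ontology: term -> direct parents.
# GO is a DAG, so a term may have more than one parent.
PARENTS = {
    "root": (),
    "A": ("root",),
    "B": ("root",),
    "C": ("A", "B"),
    "D": ("C",),
}


@lru_cache(maxsize=None)
def ancestors(term):
    """Return the term itself plus every term reachable via parent links.

    The cache means each term is expanded only once, however many genes use it.
    """
    result = {term}
    for parent in PARENTS.get(term, ()):
        result |= ancestors(parent)
    return frozenset(result)


def propagate(annotations):
    """Map each gene to its reported terms plus all implied ancestor terms."""
    return {
        gene: frozenset().union(*(ancestors(t) for t in terms))
        for gene, terms in annotations.items()
    }


def genes_by_term(expanded):
    """Invert the mapping: term -> set of genes annotated to it or to a descendant."""
    index = defaultdict(set)
    for gene, terms in expanded.items():
        for term in terms:
            index[term].add(gene)
    return dict(index)


# Hypothetical annotations: only the most specific terms are reported per gene.
ANNOTATIONS = {"gene1": {"D"}, "gene2": {"B"}, "gene3": {"A"}}
expanded = propagate(ANNOTATIONS)
index = genes_by_term(expanded)
```

With the inverted index, two genes sharing any high-level category show up under the same key, even if their reported terms differ.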

To see how the task of grouping genes while taking into account the relationships between GO categories can be solved, this paper could be a good starting point. It describes the EASE algorithm, which is implemented in the DAVID web service. Of course, there might be newer approaches in this field.
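For a rough idea of what such tools compute per category, here is a sketch of a plain one-sided hypergeometric over-representation test. This is the generic test, not EASE itself, which is a modified and more conservative variant; the function name and parameter names are my own and do not come from DAVID.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes,
    drawn without replacement from a background of N genes of which K carry
    the term (one-sided hypergeometric test)."""
    total = comb(N, n)
    upper = min(n, K)  # cannot observe more hits than the list or the term allows
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Toy numbers (hypothetical): background of 10 genes, 4 carry the term,
# and 2 of the 3 genes in our list carry it.
p_value = hypergeom_tail(2, 3, 4, 10)
```

A small value suggests the category is over-represented in the list. When many categories are tested, the usual multiple-testing caveats apply.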


You're right; including parent vertices would bulk up the database considerably... The source of my confusion was probably the fact that for some genes in my dataset it did report just stuff like "growth" or "protein binding", not some more specific function, so at first I thought that it would report parent vertices as well.

Thanks for the links!

I have edited the original question a bit in the wake of your answer... Just in case someone else comes along and proposes a rival solution.

Reply written 6.1 years ago by LankyCyril

I suggest that the situation you've described, with more high-level categories being reported for some genes, could be explained as follows: assigning a more concrete category to a gene is harder and demands more evidence, so less-studied genes are classified with lower specificity.

Reply written 6.1 years ago by mikhail.shugay