Question

Making sense of Gene Ontology graph relations

10.0 years ago

LankyCyril • 0

I'm trying to understand the logic behind the Gene Ontology annotations.

Let's take one gene, for example: ENSG00000198570. When I pass it to BioMart, it reports three GO term accession IDs. Visualized within the GO tree, they look like this:

All three of them are offspring of biological_process; two of them are offspring of single-organism process.

Ultimately, I want to be able to analyze quite a big set of genes and see whether they cluster into big and/or small groups with the same function. So, e.g., if two genes are both reported as offspring of protein binding, I would immediately know that both are protein binding, and therefore molecular_function, themselves.

Right now the only option seems to be to traverse the Gene Ontology XML, bottom to top, for each GO term, which is stupidly inefficient. Maybe there's something obvious I'm missing, or there's a piece of software out there that can do just what I need?
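For illustration, the bottom-to-top traversal does not have to be repeated for every term: the ancestors of each term can be computed once and cached. Below is a minimal sketch of that idea in Python. The `PARENTS` table is a tiny hand-written subset of "is_a" edges, and the gene names `geneA`, `geneB`, `geneC` and their annotations are hypothetical; a real analysis would load the full ontology and the real annotations instead.

```python
from collections import defaultdict
from functools import lru_cache

# Toy subset of GO "is_a" edges (child -> parents), hand-written for this
# sketch. A real run would build this table from the full ontology file.
PARENTS = {
    "kinase binding": ["enzyme binding"],
    "enzyme binding": ["protein binding"],
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "growth": ["biological_process"],
    "molecular_function": [],
    "biological_process": [],
}


@lru_cache(maxsize=None)
def ancestors(term):
    """Return every ancestor of a term; each term is resolved only once."""
    found = set()
    for parent in PARENTS.get(term, ()):
        found.add(parent)
        found |= ancestors(parent)
    return frozenset(found)


def propagate(annotations):
    """Extend each gene's direct terms with all of their ancestors."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t)}
        for gene, terms in annotations.items()
    }


def group_by_term(propagated):
    """Invert gene -> terms into term -> genes."""
    groups = defaultdict(set)
    for gene, terms in propagated.items():
        for term in terms:
            groups[term].add(gene)
    return dict(groups)


# Hypothetical direct annotations (most specific terms only).
direct = {
    "geneA": ["kinase binding"],
    "geneB": ["enzyme binding"],
    "geneC": ["growth"],
}
groups = group_by_term(propagate(direct))
print(sorted(groups["protein binding"]))
```

With the ancestor sets cached, two genes annotated with different narrow terms end up in the same group for any ancestor they share, which is the kind of grouping asked about above.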

I hope what I'm saying is making sense to you.

function gene gene-ontology • 2.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by LankyCyril • 0

Ram · Answer 1 · 2014-05-08

10.0 years ago

mikhail.shugay 3.5k

The reported terms are the most specific categories (i.e. internal vertices/leaves in the GO tree of the lowest level) for that gene. If I understood you correctly, you're asking why parent vertices are not included. Well, this would make GO annotation far more redundant, increase data size, and make it harder (in some ways) to analyze.

To see how the task of grouping genes while taking into account the interrelationships between GO categories can be solved, this paper could be a good starting point. It describes the EASE algorithm, which is implemented in the DAVID web service. Of course, there may be newer, cooler approaches in this field.
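To give a rough idea of what such an enrichment score looks like, here is a minimal sketch in Python. It assumes, as far as I understand the EASE approach, that the score is a conservative variant of a one-sided Fisher exact (hypergeometric) test in which one gene is removed from the category hits before the p-value is computed. The function names and the exact way the counts are adjusted are my own simplification, not DAVID's actual implementation.

```python
from math import comb


def upper_tail(hits, list_size, pop_hits, pop_size):
    """P(X >= hits) for X ~ Hypergeometric(pop_size, pop_hits, list_size).

    This is the one-sided Fisher exact p-value for over-representation of
    a category among the genes in the list.
    """
    total = comb(pop_size, list_size)
    top = min(list_size, pop_hits)
    return sum(
        comb(pop_hits, i) * comb(pop_size - pop_hits, list_size - i)
        for i in range(hits, top + 1)
    ) / total


def ease_like(hits, list_size, pop_hits, pop_size):
    """Conservative variant: drop one category gene from the list first.

    Assumption: this mirrors the "remove one gene" penalty only in
    spirit; the real tool may construct its 2x2 table differently.
    """
    return upper_tail(max(hits - 1, 0), max(list_size - 1, 0),
                      pop_hits, pop_size)


# Hypothetical counts: 4 genes in the list, 2 of them in the category;
# 10 genes in the background, 3 of them in the category.
p_plain = upper_tail(2, 4, 3, 10)
p_ease = ease_like(2, 4, 3, 10)
print(p_plain, p_ease)
```

The penalized score is never smaller than the plain p-value, so categories supported by only one or two genes are less likely to come out as significant.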

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by mikhail.shugay 3.5k

You're right; including parent vertices would bulk up the database considerably... The source of my confusion was probably the fact that for some genes in my dataset it did report just stuff like "growth" or "protein binding", not some more specific function, so at first I thought that it would report parent vertices as well.

Thanks for the links!

I have edited the original question a bit in the wake of your answer... Just in case someone else comes along and proposes a rival solution.

ADD REPLY • link 10.0 years ago by LankyCyril • 0

I suspect that the situation you've described, where only high-level categories are reported for some genes, can be explained as follows: assigning a more specific category to a gene is harder and requires more evidence, so less-studied genes are classified with lower specificity.

ADD REPLY • link 10.0 years ago by mikhail.shugay 3.5k