Question: How Much Do You Trust Geneontology Annotations?
GeneOntology is a nice project to provide a standard terminology for genes and gene functions, to help avoid the use of synonyms and wrong spelling when describing a gene.

I have been using the GeneOntology for a while, but honestly I think that it contains many errors and that many terms have not enough terms associated. Moreover, the terminology they use is not always clear and there are some duplications.

It is common to see, in articles or slideshows, charts where the GO classification is used to infer the properties of a set of genes... But I wonder whether the authors check the GO annotations they use.

What is your experience with GO?

First, sorry if my English is not good!

many terms have not enough terms associated

I presume that you mean many [genes] have not... There are two things to take into account:

  • GO uses the True Path Rule: if a gene is annotated with a term, it is also implicitly annotated with all the parents of that term, up to the root. Making this extension is crucial in terms of inference (Seung Yon Rhee, Valerie Wood, Kara Dolinski, and Sorin Draghici. Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7):509-515, 2008).

  • Not all species and not all metabolisms are equally well annotated: the more a gene is studied, the more annotations it has.
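To make the True Path Rule concrete, here is a minimal Python sketch. The ontology below is a toy graph with made-up term names and gene names (they are assumptions for illustration, not real GO identifiers or a real GO API); it only shows how a direct annotation expands to every ancestor term.

```python
# Toy ontology: each term maps to its direct parents.
# Term names are illustrative placeholders, not real GO entries.
PARENTS = {
    "root": [],
    "metabolic process": ["root"],
    "cellular process": ["root"],
    "lipid metabolic process": ["metabolic process"],
    "cellular lipid metabolic process": ["lipid metabolic process", "cellular process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links, up to the root."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Apply the True Path Rule: add all ancestors of each directly annotated term."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Hypothetical genes with their direct annotations only.
direct = {
    "geneA": {"cellular lipid metabolic process"},
    "geneB": {"metabolic process"},
}
full = propagate(direct)
```

Note that a term can have several parents, so the traversal has to follow every branch and not a single chain.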

there are some duplications

I asked GO about this, and they answered that each duplicated annotation has a different Evidence Code, reflecting different levels of study. So if you use GO for semantic enrichment or inference, remember to remove all duplicates. But if you are interested in Evidence Codes, the duplicates may be useful to you.
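As a rough sketch of that advice in Python, assuming the annotations have already been loaded as (gene, term, evidence code) tuples (the rows, gene names, and term names below are invented for illustration):

```python
# Hypothetical annotation rows: (gene, term, evidence code).
annotations = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term1", "IEA"),  # same gene/term pair, different evidence code
    ("geneA", "term2", "IMP"),
    ("geneB", "term1", "IEA"),
]


def unique_pairs(rows):
    """Drop the evidence code and keep each (gene, term) pair once."""
    return sorted({(gene, term) for gene, term, _code in rows})


def evidence_by_pair(rows):
    """Keep the duplicates' information: list the evidence codes per pair."""
    codes = {}
    for gene, term, code in rows:
        codes.setdefault((gene, term), set()).add(code)
    return {pair: sorted(found) for pair, found in codes.items()}


pairs = unique_pairs(annotations)
evidence = evidence_by_pair(annotations)
```

The first function is what you would feed to an enrichment or inference step; the second keeps the evidence codes if those are what you are interested in.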

Evidence Codes are a delicate point in Gene Ontology. I quote the GO documentation: "Evidence codes are not statements of the quality of the annotation. Within each evidence code classification, some methods produce annotations of higher confidence or greater specificity than other methods, in addition the way in which a technique has been applied or interpreted in a paper will also affect the quality of the resulting annotation. Thus evidence codes cannot be used as a measure of the quality of the annotation."

So an evidence code does carry information, but it cannot be used to quantify the quality of an annotation. It is a matter of higher confidence or greater specificity... The nuance is subtle.

I really recommend reading the article by Rhee et al. that I cited above to make better use of GO.

In my experience it's case by case. In other words, just because you are getting significant p-values does not mean the results are biologically significant. I once submitted clusters of microarray data and received a bunch of hits that were significant by p-value but didn't really have a theme. The GO terms I saw came from many different processes, with no overall term (besides biological process) linking them together. When I've looked at published GO term searches, I generally see a strong theme among many of the terms (though that doesn't necessarily mean it has biological significance until tested empirically). So seeing themes among your terms may suggest higher significance, but it should make biological sense too.
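For context on what such a p-value measures, here is a small sketch of a one-sided hypergeometric test, which is one common way to score over-representation of a term in a gene list. The post does not say which test was actually used, so the choice of test, the function name, and the numbers are all assumptions for illustration.

```python
from math import comb


def enrichment_pvalue(population, annotated, sample, hits):
    """One-sided hypergeometric p-value: P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the term
    sample:     number of genes in the cluster being tested
    hits:       cluster genes carrying the term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical numbers: 10 background genes, 5 carry the term,
# a cluster of 5 genes in which all 5 carry the term.
p_all_hits = enrichment_pvalue(10, 5, 5, 5)
p_no_hits = enrichment_pvalue(10, 5, 5, 0)
```

The test only says how surprising the overlap is given the counts. It knows nothing about whether the term fits the biology, which is why a list of significant but unrelated terms can still come out of it.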

If you find an error or have a suggestion for a GO term, you can submit it on SourceForge.

You may also find this chart useful: it lists the evidence codes and how the annotators arrive at them.

If you're wary of the results, you could try only using certain conservative evidence codes. Just like any biological database, it's a work in progress, but I think a lot of people have found it quite useful.
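A minimal sketch of that kind of filtering, assuming annotations are held as (gene, term, evidence code) tuples with invented gene and term names. Which codes count as "conservative" is a judgment call; here the experimental codes are kept and everything else, including the electronically inferred IEA, is dropped.

```python
# Assumption: treat the experimental evidence codes as the "conservative" set.
CONSERVATIVE_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotation rows: (gene, term, evidence code).
annotations = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term2", "IEA"),
    ("geneB", "term1", "ISS"),
    ("geneB", "term3", "IMP"),
]


def keep_conservative(rows, allowed=CONSERVATIVE_CODES):
    """Keep only rows whose evidence code is in the allowed set."""
    return [row for row in rows if row[2] in allowed]


filtered = keep_conservative(annotations)
```

Filtering this way shrinks the annotation set, sometimes a lot for less-studied species, so it is worth checking how many genes are left before running any statistics.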

Thank you for the answer. I have done that already, and I can assure you they are very keen to respond.

While I think that GO is very useful, especially for exploratory data analysis, it does have a number of problems (its weird graph structure, for one). I found it very useful for getting a feel for my data and what was going on (in the case of GSEA of differential gene expression data).

I read this paper a while ago:

Quantifying the biological significance of gene ontology biological processes: implications for the analysis of systems-wide data

and although I have a few minor issues with some of the methods, it has a very clear message: the gene ontology does contain some terms and relationships which are artifacts of human annotation and should be removed prior to analysis (depending, of course, on the analysis), otherwise they will bias any statistics or conclusions.

Annotation bias of GO terms is problematic... but unfortunately that's the way it works.

Statistically, data can be quantitative or qualitative and, like Istan said, GO is more subjective, which means its terms are qualitative. The p-value just measures how extreme the observed result is under the given test. So we can't say GO terms are right or wrong; instead, we can only rely on their accuracy.

I assure you that some terms can simply be wrong, as there are mistakes in any dataset. The good thing about GO is that you can look at their bug tracker on sf and see that. For example, I once found a wrong association between a gene and its localization GO term.

The GO terms and classifications are primarily based on opinions: a small group of people's interpretation of what the current state of knowledge is. Thus they are more subjective than, say, experimental measurements would be.

In fact it is surprising that it works at all, and it does indeed. We just need to be careful not to read too much into it.

It works because they are very active in developing it and the review process is very transparent. If you look at their bug tracker on sf, there is a lot of discussion there.

