Question

Why does each GO enrichment method give different results?

9.6 years ago

pixienon ▴ 150

I'm new to GO terms. In the beginning it was fun, as long as I stuck to one algorithm. But then I found that there are many out there, each with its own advantages and caveats (the quality of graphic representation, for instance).

I soon realised that each GO enrichment method gives different results! I would tolerate small differences, especially in the p-values, but why doesn't term number 1 from the first algorithm even appear on the second algorithm's list? Why does one tool recognise 213 genes in my list as "Defense response", while the other recognises only 17? I'm not a bioinformatician. I acknowledge that different calculation methods can give different results, but I thought that GO annotations were pretty robust. I'm also aware of the issue of different levels in the GO hierarchy, but that isn't what is happening here.

As a biologist, what should I trust? Deciding on this or that algorithm may change the whole story/article!

(Algorithms I've used so far: AgriGO, BinGO, ClueGO, VirtualPlant and PANTHER).

GO-term GO-annotation • 12k views

"In the beginning it was fun, as long as I stuck to one algorithm."

Ok that is really funny. Quote of the month. Also sadly it is mostly true.

I agree and feel your pain - IMHO this field of bioinformatics is a bit of a mess.

written 9.6 years ago by Istvan Albert

Follow-up question: how on earth can you compare lists of GO results? Some items will probably be found in common, but others will be inexact matches, such as parent/child nodes in the tree. Is there any quantifiable metric for the similarity of GO lists?

I think this is an open question for Biostars in another thread, but might be worth asking here when people are thinking about the different kinds of results different GO tools make.
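One rough way to put a number on the parent/child problem, sketched here under stated assumptions: expand each tool's enriched terms with all of their ancestors, then take the Jaccard index of the two expanded sets. The tiny `PARENTS` map and the term IDs in it are made up for illustration; in practice the parent relations would come from the ontology file, and this is only one of several possible similarity measures.

```python
# Hypothetical toy ontology: child term -> set of direct parents.
PARENTS = {
    "GO:C": {"GO:B"},
    "GO:D": {"GO:B"},
    "GO:B": {"GO:A"},
    "GO:A": set(),
    "GO:E": {"GO:A"},
}

def with_ancestors(terms, parents):
    """Return the terms plus every ancestor reachable through `parents`."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def jaccard(a, b):
    """Intersection size over union size; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

tool1 = {"GO:C"}  # one tool reports a child term
tool2 = {"GO:D"}  # the other reports its sibling

exact = jaccard(tool1, tool2)
expanded = jaccard(with_ancestors(tool1, PARENTS), with_ancestors(tool2, PARENTS))
print(exact, expanded)
```

With exact matching the two lists share nothing, while the expanded sets overlap through the shared parent and root, which is closer to how a biologist would read them.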

written 9.6 years ago by karl.stamm

A few ways:

Compare the size of the clusters from each tool. See where clusters are expanded or contracted, and which genes differ between the two.
Look at the overall number of clusters and at their specificity. Is a term very detailed, or more general? Does one tool return more or fewer terms of one type?
Look at all of the genes each tool returns. Are they in multiple clusters? Are those clusters small or large?
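The first of these checks could be sketched roughly as follows, assuming you have exported each tool's results into a mapping from term to the set of your genes assigned to it. The two result dictionaries and gene names are invented for illustration.

```python
# Hypothetical results: term -> set of input genes each tool assigned to it.
tool_a = {
    "defense response": {"g1", "g2", "g3", "g4"},
    "response to stress": {"g1", "g5"},
}
tool_b = {
    "defense response": {"g1", "g2"},
    "cell wall organization": {"g6"},
}

def compare_clusters(a, b):
    """For terms both tools report, give cluster sizes and the genes unique to each."""
    report = {}
    for term in sorted(a.keys() & b.keys()):
        report[term] = {
            "size_a": len(a[term]),
            "size_b": len(b[term]),
            "only_a": sorted(a[term] - b[term]),
            "only_b": sorted(b[term] - a[term]),
        }
    return report

shared = compare_clusters(tool_a, tool_b)
only_in_a = sorted(tool_a.keys() - tool_b.keys())  # terms reported by tool A alone
only_in_b = sorted(tool_b.keys() - tool_a.keys())  # terms reported by tool B alone
print(shared, only_in_a, only_in_b)
```

The `only_a` and `only_b` lists are the genes worth looking up by hand, since they show where one tool's cluster is larger than the other's.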

written 9.6 years ago by pld

This could be partially due to gene identifiers that are "valid" with one tool and not another. See if different tools are throwing out swaths of genes prior to the test because they can't find a match for that ID in whatever their definition of a gene is.
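A minimal sketch of that check, assuming you can get the set of gene IDs a tool's annotation actually contains. The identifiers below are illustrative only, and the case-insensitive comparison is an assumption: a real tool may be stricter or more lenient.

```python
def split_by_recognition(input_ids, known_ids):
    """Split input IDs into those found in the annotation and those that would be dropped."""
    known = {i.upper() for i in known_ids}
    matched = [i for i in input_ids if i.upper() in known]
    dropped = [i for i in input_ids if i.upper() not in known]
    return matched, dropped

# Illustrative inputs: a transcript-style suffix and a gene symbol both fail to match.
my_genes = ["AT1G01010", "AT1G01020.1", "at1g01030", "NAC001"]
annotated = {"AT1G01010", "AT1G01020", "AT1G01030"}

matched, dropped = split_by_recognition(my_genes, annotated)
print(f"{len(matched)} matched, {len(dropped)} dropped: {dropped}")
```

If the dropped fraction differs a lot between tools, the tools are effectively testing different gene lists.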

written 9.6 years ago by Madelaine Gogol

Ram · Answer 1 · 2014-12-08

To start, you should understand what Gene Ontology actually is. GO is a means of providing consistent descriptions of genes and gene products across databases and projects. GO is a language, not data. The data you need are the annotation sets: the genes of a species annotated with GO terms.

Second, this brings me to PANTHER. PANTHER isn't GO; it is a totally different set of annotations, so it makes sense that you're getting different results from the tools that rely on GO-annotated genes.

Next, you want to make sure that there aren't any differences in the annotation datasets used between platforms. Some websites may be using different versions of the annotation data; others may be using custom or non-standard data.

AgriGO:

Raw GO annotation data is generated using BLAST, Pfam, InterproScan by agriGO or obtained from B2G-FAR center or from Gene Ontology.

BinGO:

Q : The default annotations/ontologies in BiNGO are already several months old. I would like to use more recent annotations.

A : Download the most recent annotation and ontology files from the GO website. You can use these as custom annotation/ontology files.

So you need to understand that there might be differences in the GO associations.
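To see how much two annotation sources differ, one option is to count genes per term in each and list the terms whose counts changed. This is a sketch that assumes GAF-style text (tab-separated, gene ID in column 2, GO ID in column 5, `!` marking header lines); the two snapshots are made up, and `genes_per_term` is a helper written for this example.

```python
from collections import defaultdict

def genes_per_term(gaf_text):
    """Map GO ID -> set of gene IDs from GAF-style text.

    Assumes tab-separated lines with the gene ID in column 2 and the
    GO ID in column 5, and '!' at the start of header/comment lines.
    """
    term_genes = defaultdict(set)
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) < 5:
            continue  # skip malformed lines
        term_genes[cols[4]].add(cols[1])
    return term_genes

# Two made-up annotation snapshots; the newer one has extra rows.
old = "!gaf-version: 2.0\nDB\tgeneA\tA\t\tGO:0006952\nDB\tgeneB\tB\t\tGO:0006952\n"
new = old + "DB\tgeneC\tC\t\tGO:0006952\nDB\tgeneC\tC\t\tGO:0009607\n"

old_counts = {t: len(g) for t, g in genes_per_term(old).items()}
new_counts = {t: len(g) for t, g in genes_per_term(new).items()}

# Terms whose gene count differs between snapshots: term -> (old, new).
changed = {
    t: (old_counts.get(t, 0), n)
    for t, n in new_counts.items()
    if old_counts.get(t, 0) != n
}
print(changed)
```

A term that gained or lost many genes between versions will also shift in any enrichment test built on top of it.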

Finally, you need to understand the process that is going on here. The GO annotations are simple curations; GO enrichment is a totally different thing. This is where the specific algorithm and parameters each tool uses really start to matter. Are you getting 17 instead of 213 because some parameter is different? Or is it just because they're different algorithms?
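As an illustration of how much a single parameter can matter, here is a sketch of a one-sided hypergeometric test, a common choice for over-representation analysis (the tools above may use other tests and multiple-testing corrections). All counts are invented; the only thing that changes between the two calls is the size of the background gene set.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from a background of N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Same 17 hits in a 300-gene list, same 500 annotated genes,
# but the two hypothetical tools assume different backgrounds.
p_wide = hypergeom_pvalue(17, 300, 500, 20000)    # whole-genome background
p_narrow = hypergeom_pvalue(17, 300, 500, 10000)  # smaller background
print(p_wide, p_narrow)
```

With the smaller background, 17 hits is close to what chance alone would give, so the same term can look significant in one tool and unremarkable in another.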

Never assume that two algorithms will give you the same results just because they claim to solve the same problem. They do different things, make different assumptions, require different input data and so on. The semantics also really matter: what each algorithm is actually telling you can be different. Only assume equivalence if you have evidence and knowledge to support it.

Also, look back at your data and see which result makes sense. Maybe the program that returned 213 genes didn't account for p-value or fold change in your differential expression. What do the p-values and fold changes of those 213 genes look like? Are they all very strong, or just a few? If only a dozen of the 213 are strongly DE, maybe 17 is better than 213.
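That check might look something like this, assuming you have your differential expression table as a mapping from gene to adjusted p-value and log2 fold change. The thresholds and the numbers are placeholders, not recommendations.

```python
def strongly_de(genes, stats, max_padj=0.05, min_abs_log2fc=1.0):
    """Return the genes that pass both the p-value and fold-change cutoffs.

    `stats` maps gene -> (adjusted p-value, log2 fold change); genes
    without an entry are treated as not differentially expressed.
    """
    strong = []
    for gene in genes:
        if gene not in stats:
            continue
        padj, log2fc = stats[gene]
        if padj <= max_padj and abs(log2fc) >= min_abs_log2fc:
            strong.append(gene)
    return strong

# Made-up DE statistics for the genes a tool assigned to one term.
stats = {
    "g1": (0.001, 2.5),
    "g2": (0.04, -1.2),
    "g3": (0.20, 3.0),   # large change, but not significant
    "g4": (0.01, 0.3),   # significant, but tiny change
}
term_genes = ["g1", "g2", "g3", "g4", "g5"]  # g5 has no DE statistics

strong = strongly_de(term_genes, stats)
print(f"{len(strong)} of {len(term_genes)} genes are strongly DE: {strong}")
```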

A word of advice: never judge the performance of an algorithm by the plots the programmer makes from its output. What the algorithm is doing is the important thing. You can always make pretty plots with good data, but a pretty plot of bad data is still bad. Find the tool that gives you the best-quality results and deal with the plots later.

Pull the publications on these tools and the methods they use, and figure out what they're actually doing. Look at the results you get versus the data you put in. Do things make sense? Look at the annotation data used by each tool. Is it the same? Can you upload the same GOA data into each tool so that aspect is controlled for? Check and experiment with parameters. How does performance across tools vary with different parameters?

A final bit of advice with these sorts of tools is to always look at the version history and release dates. BinGO was last updated 4 years ago, but AgriGO was updated this past August. You should always make sure the software you're using references current databases and that bug-fix updates have been released. The GOA version is very important: GO, the species genome, the annotation of genes in that genome and the final GOA product are all moving targets.

These are all good tips for any bioinformatics tools or use cases.