Why does each GO enrichment method give different results?
7.1 years ago
pixienon ▴ 120

I'm new to GO terms. In the beginning it was fun, as long as I stuck to one algorithm. But then I found that there are many out there, each with its own advantages and caveats (the quality of graphic representation, for instance).

I soon realised that each GO enrichment method gives different results! I could tolerate small differences, especially in the p-values, but why doesn't the top-ranked term from the first algorithm even appear on the second algorithm's list? Why does one recognise 213 genes in my list as "Defense response", while the other recognises only 17? I'm not a bioinformatician. I accept that different calculation methods can give different results, but I thought that GO annotations were pretty robust. I'm also aware of the issue of different levels in the GO hierarchy, but that isn't it.

As a biologist, what should I trust? Deciding on this or that algorithm may change the whole story/article!

(Algorithms I've used so far: AgriGO, BinGO, ClueGO, VirtualPlant and PANTHER).

GO term GO annotation • 7.2k views
4

"In the beginning it was fun, as long as I stuck to one algorithm."

Ok, that is really funny. Quote of the month. Sadly, it is also mostly true.

I agree and feel your pain - IMHO this field of bioinformatics is a bit of a mess.

0

Follow-up question: how on earth can you compare lists of GO results? Some terms are probably found in common, but others will be inexact matches, like parent/child nodes in the tree. Is there any quantifiable metric for the similarity of GO lists?

I think this is an open question for Biostars in another thread, but it might be worth asking here while people are thinking about the different kinds of results these GO tools produce.
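One simple, quantifiable starting point is set overlap between the two term lists, e.g. the Jaccard index. A minimal sketch (the term IDs below are just illustrative; a proper comparison would also need semantic similarity over the GO graph to give credit for parent/child matches):

```python
# Minimal sketch: Jaccard similarity between two tools' enriched GO term lists.
# This only measures exact-term overlap; parent/child near-matches count as
# misses, which is exactly the limitation raised above.

def jaccard(terms_a, terms_b):
    """Fraction of terms shared between two GO result lists (0..1)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical result lists from two tools:
tool1 = ["GO:0006952", "GO:0009607", "GO:0006950"]
tool2 = ["GO:0006952", "GO:0006950", "GO:0042221"]
print(round(jaccard(tool1, tool2), 2))  # 2 shared / 4 total = 0.5
```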

0

A few ways:

-Compare the size of clusters from each tool; see where clusters are expanded or contracted, and which genes differ between the two.

-Look at the overall number of clusters and their specificity. Is a cluster a very detailed term, or does it seem more general? Does one tool return more or fewer of one type?

-Look at all of the genes each tool returns: are they in multiple clusters, and are those clusters small or large?
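The comparisons above can be sketched as a small script; the term IDs and gene names are hypothetical placeholders:

```python
# Sketch: for each GO term reported by both tools, show how the cluster size
# changed and which genes each tool adds relative to the other.

def compare_clusters(tool_a, tool_b):
    """tool_a/tool_b: dict mapping GO term -> set of genes."""
    report = {}
    for term in tool_a.keys() & tool_b.keys():   # terms found by both tools
        a, b = tool_a[term], tool_b[term]
        report[term] = {
            "size_a": len(a),
            "size_b": len(b),
            "only_a": sorted(a - b),   # genes only the first tool clusters here
            "only_b": sorted(b - a),   # genes only the second tool clusters here
        }
    return report

a = {"GO:0006952": {"g1", "g2", "g3"}}
b = {"GO:0006952": {"g2", "g3", "g4"}, "GO:0009607": {"g5"}}
print(compare_clusters(a, b)["GO:0006952"]["only_a"])  # ['g1']
```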

0

This could be partially due to gene identifiers that are "valid" in one tool but not another - check whether different tools are throwing out swaths of genes before the test because they cannot match those IDs to whatever their definition of a gene is.
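A quick way to check this, sketched below; `tool_universe` stands in for whatever identifier set a given tool actually recognises (e.g. the gene IDs in its annotation file), and the IDs shown are hypothetical:

```python
# Sketch: before blaming the statistics, check how many of your input IDs
# each tool can actually map. Genes a tool cannot map are often dropped
# silently before the enrichment test is even run.

def unmapped(query_ids, known_ids):
    """Return the input IDs a tool would silently drop."""
    known = set(known_ids)
    return [g for g in query_ids if g not in known]

my_genes = ["AT1G01010", "AT1G01020", "SOLYC01G005000"]  # hypothetical mix
tool_universe = {"AT1G01010", "AT1G01020"}               # Arabidopsis-only tool
print(unmapped(my_genes, tool_universe))  # ['SOLYC01G005000']
```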

10
7.1 years ago
pld 5.0k

To start, you should understand what the Gene Ontology actually is. GO is a means of providing consistent descriptions of genes/gene products across various databases and projects. GO is a language, not data. The annotation sets that link a species' genes to GO terms are the data you need.

Second, this brings me to PANTHER. PANTHER isn't GO; it is a totally different set of annotations, so it makes sense that you're getting different results than with tools relying on GO-annotated genes.

Next, make sure there aren't any differences in the annotation datasets used between platforms. Some websites may use different versions of the annotation data; some may use custom or non-standard data.

AgriGO:

Raw GO annotation data is generated using BLAST, Pfam, InterproScan by agriGO or obtained from B2G-FAR center or from Gene Ontology.

BinGO:

Q : The default annotations/ontologies in BiNGO are already several months old. I would like to use more recent annotations.

A : Download the most recent annotation and ontology files from the GO website. You can use these as custom annotation/ontology files.

So you need to understand that there might be differences in the GO associations.

Finally, you need to understand the process that is going on here. The GO annotations are simply curations; GO enrichment is a totally different thing. This is where the specific algorithm and parameters each tool uses really start to matter. Are you getting 17 instead of 213 because some parameter differs, or just because they're different algorithms?
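For intuition, here is the statistic many enrichment tools build on: a one-sided hypergeometric (Fisher's exact) test, implemented from scratch with made-up gene counts. Tools differ in the background universe they assume, the tail they test, and the multiple-testing correction they apply, any of which can move a term on or off the significant list:

```python
import math

# Sketch of a basic GO over-representation test: the probability of seeing
# at least k term-annotated genes in your list by chance, given the
# background. All numbers below are hypothetical.

def hypergeom_pval(N, K, n, k):
    """P(X >= k): N genes in the background, K annotated with the term,
    n genes in your list, k of them annotated with the term."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# 20000-gene background, 400 "defense response" genes,
# a 500-gene DE list containing 30 of them (expected by chance: 10).
p = hypergeom_pval(20000, 400, 500, 30)
print(p < 0.001)  # strongly enriched under these made-up numbers
```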

Never assume that just because two algorithms claim to solve the same problem you will get the same results. They do different things, make different assumptions, require different input data and so on. The semantics also really matter: what each algorithm is actually telling you can differ. Never make this assumption unless you have evidence and knowledge to support it.

Also, look back at your data and see which result makes sense. Maybe the program that returned 213 results didn't account for p-value or fold change in your differential expression. What do the p-values and fold changes of those 213 genes look like? Are they all very strong, or just a few? If only a dozen of the 213 are strongly DE, maybe 17 is the better answer.
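This sanity check can be sketched in a few lines; the thresholds and the DE table here are hypothetical:

```python
# Sketch: of the genes a tool placed in an enriched cluster, how many are
# actually strongly differentially expressed? Genes missing from the DE
# table are treated as not strongly DE.

def strong_members(cluster_genes, de_stats, p_cut=0.01, lfc_cut=1.0):
    """de_stats: gene -> (adjusted p-value, log2 fold change)."""
    return [g for g in cluster_genes
            if g in de_stats
            and de_stats[g][0] <= p_cut
            and abs(de_stats[g][1]) >= lfc_cut]

de = {"g1": (0.001, 2.5), "g2": (0.20, 0.3), "g3": (0.005, -1.8)}
cluster = ["g1", "g2", "g3", "g4"]
print(strong_members(cluster, de))  # ['g1', 'g3']
```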

A word of advice, never judge the performance of an algorithm based on the plots the programmer makes with the algorithm's output. In other words, what the algorithm is doing is the important thing. You can always make pretty plots with good data, but a pretty plot with bad data is just bad. Find the tool that gives you the best quality results, deal with the plots later.

Pull the pubs on these tools and the methods they use and figure out what they're actually doing. Look at the results you get versus the data you put in, do things make sense? Look at the annotation data used by each tool, is it the same? Can you upload the same GOA data into each tool so that aspect is controlled for? Check and experiment with parameters, how does performance across tools vary with different parameters?

A final bit of advice with these sorts of tools is to always look at version history and release dates. BinGO was last updated 4 years ago, but AgriGO was updated this past August. You should always make sure the software you're using references current databases and that updates for bug fixes have occurred. The GOA version is very important, GO, the species genome, the annotation of genes in that genome and the final GOA product are all moving targets.

These are all good tips for any bioinformatics tool or use case.

0

Thanks Joe, very clearly put. Two comments:

First, I believe that PANTHER does rely on GOA. See here, Section IV:

http://www.pantherdb.org/help/PANTHERhelp.jsp

Second, how do you assess the situation in the published literature with regard to this discussion? Is it plausible to assume that authors often choose the GO enrichment method that provides results fitting their story best? After all, all algorithms are legitimate. I can even envision using an older annotation version just to skew the message. If GO is so malleable, what good is it at all?

(I'm not even touching the subject of term redundancy, to which there are also various solutions that provide a whole gamut of results...)

6

See the PANTHER publication:

http://genome.cshlp.org/content/13/9/2129.full

The PANTHER Classifications are the result of human curation as well as sophisticated bioinformatics algorithms. Details of the methods can be found in (Thomas et al., Genome Research 2003; Mi et al. NAR 2005).

GO is basically adjectives, the annotations (GOA) are applying adjectives to nouns. The difference between gene ontology (GO) and gene ontology annotations (GOA) is critical. GO is standard, but GOAs are not.

Imagine if you were to judge your lab staff/students and could call each of them one of the following: (Excellent, Good, Satisfactory, Needs Improvement, Terrible). Then another PI judges your staff using the same words. Even though you and the other PI use the same set of terms, you may not give the same judgements. In other words, GO is what is called a controlled vocabulary: it provides a structured list of possible terms for genes. What GO does not do is regulate the means by which a gene gets annotated with a term.

This means that PANTHER can still annotate genes with GO terms, but their set of annotations may not be the same as that of other groups. Additionally, PANTHER has its own protein family ontologies, designed to extend the vocabulary of GO in areas where the PANTHER crew felt GO was insufficient (PANTHER/X).

Then see this on how the GO consortium generates their "reference" GOAs:

http://www.ncbi.nlm.nih.gov/pubmed/19578431

This is what I mean by different data. Group A may have used a less strict method to annotate genes, Group B may have only annotated genes where experimental evidence was present, Group C could have taken a hybrid approach, Group D could have done the same as C but with different tools.

The literature is good to read for three reasons. First, pubs and online documents are where you find details of how a group built its annotations and exactly what it uses for a vocabulary/ontology (GO only, GO plus extras, GO Slim, how old is the GO version?). Second, they tell you how the GOA was produced, which is important. Finally, the part you're most concerned with: the enrichment. Even if two enrichment tools use the same GOA set, they'll come up with different results; to understand why, and how you should respond to those differences, you have to know how the enrichment was determined.
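One concrete source of GOA differences is evidence codes: the same association file filtered to experimental codes only versus also keeping electronic (IEA) annotations can give very different gene counts per term. A sketch with made-up rows mimicking the gene/term/evidence columns of a GAF file:

```python
# Sketch: filtering GO associations by evidence code. The experimental
# code set below follows the GO Consortium's experimental evidence codes;
# the rows themselves are hypothetical.

EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def filter_by_evidence(rows, allowed=EXPERIMENTAL):
    """rows: (gene, go_term, evidence_code) tuples."""
    return [r for r in rows if r[2] in allowed]

goa = [("g1", "GO:0006952", "IDA"),   # direct assay
       ("g2", "GO:0006952", "IEA"),   # electronic annotation only
       ("g3", "GO:0006952", "IMP")]   # mutant phenotype
print(len(filter_by_evidence(goa)))  # 2 of 3 survive an experimental-only filter
```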

This is also why sanity checks (e.g. checking p-values/fold changes against enrichment sets) are so important. Tools work well in general, but there are spots where they fall apart and cases where one tool is better. Doing these checks is a great way to see if the tool is being too strict (sets are very small, false negatives) or it isn't strict enough (large sets, false positives).

As far as choosing enrichment tools goes, I'd hope that they wouldn't, but it may be the case that people cherry-pick. However, I know many (including myself) do what you've done and compare different tools. Reviewers would probably catch someone trying to slip in huge enrichment groups full of weakly DE genes/proteins. Using an out-of-date version of GO or an old GOA would (hopefully) be caught immediately by reviewers.

GO isn't super malleable; usually changes involve old terms being removed to reduce redundancy, or split into new terms to provide greater detail. The annotations (GOA) are what change, especially for newer genomes or less popular organisms: as the quality of the genome sequence improves and the experimental or in silico characterisation of genes improves, annotations have to change. It is a pain, but it has to happen.
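One practical consequence: terms that were valid in an old GO release may be obsolete in a new one. A minimal sketch of spotting obsolete terms in OBO-format text (a real check would parse the full go-basic.obo, e.g. with a library such as goatools; the snippet below just imitates the stanza layout):

```python
# Sketch: find obsolete term IDs in OBO-format text. The snippet imitates
# the go-basic.obo stanza format.

OBO_SNIPPET = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
is_obsolete: true
"""

def obsolete_ids(obo_text):
    out, current = [], None
    for line in obo_text.splitlines():
        if line.startswith("id: "):
            current = line[4:].strip()
        elif line.startswith("is_obsolete: true") and current:
            out.append(current)
    return out

print(obsolete_ids(OBO_SNIPPET))  # ['GO:0000005']
```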

You're correct: there are lots of tools out there, and you have to look at what the differences are. Some tools use different algorithms to calculate enrichment; some exist simply because they produce cute figures. Some tools are a means of getting data from multiple ontology/pathway/etc. annotations. Not all algorithms are legitimate: some quantitatively perform worse than others, and specific features of your desired usage and/or data can affect each algorithm's performance differently. Not all tools are legitimate either; they can be stuck with outdated GOA/GO terms, be poorly maintained, and so on.

When it comes to enrichment, the biggest difference I've seen is in how tools deal with false positives. Some tools have very good corrections; other tools don't. Check which tools compute corrected p-values and what methods they use. You may have seen 17 instead of 213 because the tool with 213 genes in one cluster didn't apply any p-value correction, or used a less strict method.
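To see why the correction method matters, here is a hand-rolled Benjamini-Hochberg adjustment on made-up raw p-values; a tool applying BH (or the stricter Bonferroni) will call far fewer terms significant than one reporting raw p-values:

```python
# Sketch: Benjamini-Hochberg (FDR) adjusted p-values computed by hand.
# Each p-value is scaled by m/rank, then monotonicity is enforced from
# the largest p-value downward.

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end           # 1-based rank of this p-value
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj

raw = [0.001, 0.009, 0.04, 0.05]
print([round(p, 3) for p in bh_adjust(raw)])  # [0.004, 0.018, 0.05, 0.05]
```

Note that a raw p-value of 0.04 becomes 0.05 after adjustment, right on a typical significance cutoff: whether a tool corrects at all can easily be the difference between a 17-term and a 213-gene result.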

You've done a good job in testing multiple programs. To make sense of the differences, you need to check how each works and how well the size and strength of the enriched functions/terms match what you see in your data. If you really wanted, you could report the results of two different tools (that use different algorithms).

This is a good lesson in bioinformatics: never "fire and forget". Always check results from different programs, try different settings, be sure you understand what each program is really doing, and always spot-check your results against your data.

You'd never run a western blot to test a new antibody without controls for cross-reactivity and a known working antibody; don't run bioinformatics programs blindly.