Seeking practical advice for choosing an algorithm for performing GO enrichment.
2
4
10.2 years ago
Eric Fournier ★ 1.4k

The most common way of performing GO enrichment (hypergeometric tests on selected subsets of genes) is straightforward enough, but I'm finding a lot of papers which propose alternate methods which take into account the hierarchy of GO terms or gene scores:

Cao & Zhang, 2014, A Bayesian Extension of the Hypergeometric Test for Functional Enrichment Analysis, Biometrics 70, 84-94

Alexa, 2006, Improved scoring of functional groups from gene expression data by decorrelating GO graph structure, Bioinformatics 22, 1600-1607

Grossman, 2007, Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis, Bioinformatics 23, 3024-3031

I don't quite understand the math behind each method, and naturally each paper claims that its method is better than the previous ones. I used the topGO package to test a couple, and the enrichment lists they generate show little similarity.

Could anyone provide practical guidelines on which method(s) would give the most biologically relevant results? One caveat is that I am integrating this analysis within a larger automated pipeline, so tools that cannot be automated are out.

microarray enrichment GO • 4.4k views
ADD COMMENT
2

Here is another one to make your life even harder: ermineJ

Publication: http://www.nature.com/nprot/journal/v5/n6/full/nprot.2010.78.html

ADD REPLY
1

I've come to really like ermineJ

ADD REPLY
1
10.2 years ago

Mine is a very empirical and practical opinion: I quite like DAVID, as it is flexible and produces very readable output. The R package RDAVIDWebService allows querying DAVID from R and is very handy (but note the limits DAVID imposes on the number of jobs a single user can submit). DAVID has also been used very extensively, so it is well tried and tested.

In general, I find the results of GO analyses so open to interpretation that I'm less concerned about finding the very best algorithm; I prefer a practical approach. Nevertheless, for RNA-seq data you might want to control for gene length (see http://genomebiology.com/2010/11/2/R14 and the goseq package).

ADD COMMENT
1

Just remember DAVID was last updated in 2009 and none of the authors have been responding to calls/emails.

ADD REPLY
0

Mmm, yes... The current release is dated Jan 2010 but their forums seem reasonably active.

ADD REPLY
0
8.0 years ago
prasadhendre ▴ 20

I would suggest you download a specific GO database and use a spreadsheet such as Excel to calculate your own hypergeometric p-values. Some online tools seem to use the 'total GO space' and the GO class (BP, MF, CC) irrespective of the organism. I also think some tools, although they separate by organism, mix all three GO classes when testing for a p-value. If you have control over these choices, you are sure of what you are looking at.
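
For readers who want the same control in a script rather than a spreadsheet, here is a minimal Python sketch of the one-sided hypergeometric (over-representation) test for a single GO term. The function name and the example counts are hypothetical, and the sketch assumes the background has already been restricted to one organism and one GO class (for example, BP only); that restriction is a choice you make when preparing the counts, not something the code does for you.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """One-sided p-value P(X >= overlap) for a single GO term.

    population: genes in the background (one organism, one GO class)
    annotated:  background genes annotated with the term
    selected:   genes in the study set (drawn from the background)
    overlap:    study-set genes annotated with the term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 annotated with the term,
# 4 genes selected, 2 of which carry the annotation.
p = hypergeom_enrichment_p(population=20, annotated=5, selected=4, overlap=2)
print(f"p = {p:.4f}")
```

Because every count is passed in explicitly, it is easy to see which background was used, which is the point being made above.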

ADD COMMENT
1

I think this thread has sufficient answers already, and in my honest opinion your advice is, well, bad. Using spreadsheets such as MS Excel for analysis is likely to run into problems sooner or later through manual errors or undesirable conversions. Furthermore, it undermines reproducible research.

I'm not sure why you would think doing this work manually, in an error-prone way, would be better than using commonly used tools.

ADD REPLY
1

Well, I do it this way and find it handy. I can see the numbers myself, and I have a way to verify the p-value. Whenever I did it using AmiGO, I suspected it used the 'whole GO space', and that was difficult to verify. I work with plant data (mainly Arabidopsis), so I download the GO database from EMBL/TAIR maybe once every three months, and I already have a process established to fit it into my spreadsheet. Calculations don't take very long, but I am OK to wait even when they do, as I am sure of what I am looking at. I can also apply single-step or two-step FDR correction with full knowledge of how it affects my data. It is often difficult to verify and cross check the output from online tools.
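
The post does not say which FDR procedure is used, so the following Python sketch assumes the Benjamini-Hochberg step-up procedure as one common choice; the function name is hypothetical. It returns adjusted p-values in the original input order, which makes it straightforward to compare them side by side with the raw values, much as one would in adjacent spreadsheet columns.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw={p:.3f}  adjusted={q:.3f}")
```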

ADD REPLY
1

"It is often difficult to verify and cross check the output from online tools."

I fully agree with that, and your reason for preferring the manual approach (full control) is justified and very clear; thanks for explaining. However, for (re)usability I would still put this approach into a Python or R script (which is also easier to share).

ADD REPLY
0

Rereading my statement above, I realize it was overly harsh, and I feel I should apologize for that; you do indeed have good reasons to work the way you do.

ADD REPLY
0

I for one think this is an awesome idea. This way you have full control and understand exactly what is going on.

As much as we would like to claim otherwise, nobody really knows what is going on deep in the bowels of GO enrichment tools. I myself never managed to reproduce their values, and I don't understand what they do; I only know what they claim to do.

As bioinformaticians we need to stop assuming that EXCEL = BAD. The damage is always done by people not understanding what a command line (or any tool, for that matter) does, not by the tool itself.

For any error caused via Excel there are more insidious and stupid behaviors in R, for example. Did you know that when operating on two series of unequal length (vectors, for example), the shorter one will be SILENTLY reused in R? How many people know that? If you accidentally have two objects of unequal length and perform an element-wise addition, then when the shorter one runs out it will start again from the beginning. It is hard to fathom how many errors that causes.
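
To make the recycling behavior concrete, here is a small Python emulation of it next to a strict alternative that refuses mismatched lengths. This is an illustration in Python, not R code, and both function names are hypothetical. Note that in R itself a warning is issued only when the longer length is not a multiple of the shorter one; when it is a multiple, recycling happens without any message.

```python
from itertools import cycle, islice


def recycled_add(a, b):
    """Element-wise addition that restarts the shorter input when it runs out."""
    n = max(len(a), len(b))
    # cycle() repeats each input endlessly; islice() stops at the longer length.
    return [x + y for x, y in islice(zip(cycle(a), cycle(b)), n)]


def strict_add(a, b):
    """Element-wise addition that fails loudly on a length mismatch."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return [x + y for x, y in zip(a, b)]


# The shorter list [10, 20] is reused from its start for the 3rd and 4th elements.
print(recycled_add([1, 2, 3, 4], [10, 20]))
```

The strict version turns an accidental mismatch into an immediate error instead of a plausible-looking result.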

ADD REPLY
1

I use Excel a lot, and it definitely has a place in science. However, I would argue the room for error is smaller when executing a decent Rscript than when 'tampering' with spreadsheets. But spreadsheets don't kill data, people kill data ;-)

And vector recycling is usually a handy feature, although if unexpected it can definitely cause you a major (silent) headache...

ADD REPLY
3

The key word is "decent" R script. The vast majority of R scripts are not, and that is because R itself is not designed to help write decent scripts; everything about it encourages quick and dirty, interactive actions.

IMHO those "handy" features cause damage that we only see reported as irreproducible research. Whatever effort they saved at an individual level, they cost us scientists far more.

R is an improperly designed language, yet it won the data analysis battle and has already grown into a system that simply cannot be replaced by a well designed language. As our needs and complexities grow, it starts to weigh more and more heavily.

ADD REPLY
0

Exactly, and because tampering with data formats interactively in R is as risky as (if not riskier than) Excel manipulations, I prefer to do work in R using 'custom' Rscript command line tools. But unpredictable data types in R drive me crazy when writing those scripts...

I should spend more time with Julia, if only I had the time.

ADD REPLY
0

Oh, and cryptolockers. Those also kill data. #BadDayAtTheLab

ADD REPLY
0

Would you care to share the sheet? I would be interested to see how it works.

ADD REPLY
0

I am working on these spreadsheets (separate ones for the three GOs and two pathways) at the moment to improve the analysis pipeline, but I only use simple commands like countif, index-match, lookup, hypergeo, rank etc. I don't need to be a core programmer for this. Once I am done I will share.

ADD REPLY
