I am new to gene ontology (GO) analysis and need help with the following question. We can use the standard hypergeometric method to find the GO categories that rank at the very top (say, the top 10) or, as in the paper by Young et al. (2010) in Genome Biology, use GOseq to identify the top-ranked categories. My question is: what are the criteria for ranking these categories? Are they based on p-values? I would appreciate it if someone could explain GO analysis briefly or point me to other sources for reference. Thanks!
Before addressing your specific question, I would like to give a short overview of Gene Ontology based enrichment analysis.
To perform biological enrichment analysis using ontologies, you need the following data:
- A list of genes perturbed in an experiment (say, microarray, next-gen sequencing, proteomics, etc.)
- A background list of genes for your study (the list of genes from which the perturbed genes were derived; for example, all genes on a microarray or all genes in a given genome)
- An ontology (in this case, gene_ontology)
- A Gene Ontology association file (in this file you can find the GO terms assigned to the genes in the first two lists above)
Enrichment calculations are classified by Huang et al. into three categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA). The basic difference between these three classes of enrichment algorithms is the way the enrichment p-values are calculated.
In the SEA-based approach, the annotation terms of a subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance, and individual terms that pass the p-value cut-off (for example, P-value ≤ 0.05) are reported as enriched. FuncAssociate and Onto-Express are two SEA-based enrichment analysis tools.
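To make the SEA calculation concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test for a single term, using only the Python standard library. The function names and the toy gene identifiers are made up for illustration; real tools differ in details such as how the background is defined.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_study, n_overlap):
    """P(X >= n_overlap) when n_study genes are drawn without replacement
    from n_background genes, of which n_annotated carry the term."""
    upper = min(n_annotated, n_study)
    total = comb(n_background, n_study)
    # math.comb returns 0 for impossible draws, so no extra bounds are needed.
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_study - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def term_enrichment(study_genes, background_genes, term_genes):
    """Enrichment p-value of one annotation term for a study gene list."""
    background = set(background_genes)
    study = set(study_genes) & background      # ignore genes outside the background
    annotated = set(term_genes) & background
    overlap = len(study & annotated)
    return hypergeom_enrichment_p(
        len(background), len(annotated), len(study), overlap
    )


# Toy example with hypothetical gene names.
background = [f"gene{i}" for i in range(10)]
term = ["gene0", "gene1", "gene2", "gene3"]
study = ["gene0", "gene1", "gene7"]
p_value = term_enrichment(study, background, term)
```

Running this once per term gives one p-value per term, which is what the ranking is then based on.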
GSEA approaches are similar, but consider all genes during the enrichment analysis rather than only the genes that pass a pre-defined threshold, as in the SEA approach. GSEA from the Broad Institute is an example of a GSEA-based tool.
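The idea of using the whole ranked gene list instead of a thresholded subset can be sketched with a simplified running-sum score. This is an unweighted, Kolmogorov-Smirnov-like statistic written for illustration only; the Broad GSEA tool weights each step by the gene's correlation with the phenotype and assesses significance by permutation, neither of which is shown here. The function name and gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: all genes, ordered from most to least perturbed.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up on a set member, down otherwise; the walk ends at zero.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4"]
top_heavy = enrichment_score(ranked, {"g1", "g2"})     # set sits at the top
bottom_heavy = enrichment_score(ranked, {"g3", "g4"})  # set sits at the bottom
```

A set concentrated at the top of the ranking gets a large positive score, and one concentrated at the bottom gets a large negative score.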
MEA-based programs such as Ontologizer 2.0 and topGO use the relationships that exist between the annotations. These programs were reported to attain better sensitivity and specificity because they take GO term relationships into account.
These tools are based on similar statistical and algorithmic concepts. See a review of 68 tools published in 2008 here; you can see minor-to-medium differences in the way the nodes are treated, how the statistics are computed, etc. Statistical methods used to derive the P-value include Fisher’s exact test, the hypergeometric test, the binomial test, the χ2 test, or a combination of these methods.
You can use one of the R packages, servers, or command-line tools to perform such an analysis. See the list of GO-based tools compiled by the AmiGO team here.
Now to your specific question: Q: what are the criteria for ranking these categories? Are they based on p-values?
A: Yes. They are P-value based. See the descriptions of SEA, GSEA and MEA above for the various methods used to derive the P-value.
Typically a gene ontology enrichment analysis tests each category in the ontology using a statistic such as the hypergeometric test you mention. Results are then typically ranked by the strength of the statistic, translated into a P value. Some tools attempt to provide more useful results by looking for results that are significant but farther from the root of the tree, working from the idea that if two results are called significant and one is more specific, the more specific one will be more helpful than knowing that a more general term is enriched. If you're using a P value, your package should correct for the number of tests performed. There's a straightforward overview of this in the user guide for BiNGO, a GO enrichment tool that works as a plug-in for Cytoscape. See also this Nature Reviews Genetics article for some cautionary information.
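As a rough illustration of ranking by P value with a correction for the number of tests, here is a minimal sketch. It assumes Benjamini-Hochberg false discovery rate adjustment, which is only one common choice (Bonferroni is another); the term labels and P values below are invented for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_terms(term_pvalues, top=10):
    """Rank terms by raw p-value; return (term, p, adjusted p) for the top hits."""
    terms = list(term_pvalues)
    raw = [term_pvalues[t] for t in terms]
    adj = benjamini_hochberg(raw)
    table = sorted(zip(terms, raw, adj), key=lambda row: row[1])
    return table[:top]


# Hypothetical terms and p-values.
results = {"term_a": 0.01, "term_b": 0.04, "term_c": 0.03, "term_d": 0.005}
top_hits = rank_terms(results, top=3)
```

The ranking itself uses the raw P value; the adjusted value is what you would compare against a significance cut-off.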
Before you write your own tool it might be a good idea to check what is already out there. There are a great many GO approaches, tools and algorithms. This BioStar question describes some of these.
Apart from looking further away from the root of the tree (as David mentioned), you might also want to take into account that effects found in specific terms far from the root reoccur in the larger parent classes. You might not want that, since they are already accounted for, in which case you should prune the results. The GO_Elite tool that I mentioned in the question above does just that, and it is in fact open source, so you could also use it as a starting point if you want to do even fancier stuff.
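To show what pruning could look like, here is a minimal sketch that keeps only the most specific significant terms and drops any significant term that is an ancestor of another significant term. This is a generic illustration and not the algorithm GO_Elite actually uses; the `parents` mapping and term labels are a hypothetical toy hierarchy.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def prune_to_most_specific(significant_terms, parents):
    """Drop significant terms that are ancestors of other significant terms."""
    significant = set(significant_terms)
    redundant = set()
    for term in significant:
        redundant |= ancestors(term, parents) & significant
    return significant - redundant


# Toy hierarchy: C is a child of B, and B and D are children of the root A.
parents = {"C": {"B"}, "B": {"A"}, "D": {"A"}}
kept = prune_to_most_specific({"A", "B", "C", "D"}, parents)
```

Here A and B are dropped because their enrichment is already represented by the more specific terms C and D.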