Question: Gene Ontology Categories
alittleboy (USA) wrote 8.3 years ago:

I am new to gene ontology (GO) analysis and need help with the following question. We can use the standard hypergeometric method to find the GO categories that rank at the very top (say, the top 10) or, as in the paper by Young et al. (2010) in Genome Biology, use GOseq to identify the top-ranked categories. My question is: what are the criteria for ranking these categories? Are they based on p-values? I would appreciate it if someone could explain GO analysis briefly or point me to other sources for reference. Thanks!

gene statistics • 12k views
Khader Shameer (Manhattan, NY) wrote 8.3 years ago:

Before addressing your specific question, I would like to provide a short overview of Gene Ontology based enrichment analysis:

To perform biological enrichment analysis using ontologies you need the following data:

  1. A list of genes perturbed in an experiment (say microarray, next-gen sequencing, proteomics etc.)
  2. A background list of genes for your study (the set from which the perturbed genes were derived; for example, the list of genes on a microarray or the genes in a given genome)
  3. An ontology (in this case, gene_ontology)
  4. A Gene Ontology association file (in this file you can find the GO terms assigned to the genes in the lists mentioned in 1 and 2)

Note: there are several well-defined biological ontologies, but you may not find corresponding association data. For a list of available GO association data, see GOA.

Enrichment analysis:

Enrichment calculations are classified into 3 categories by Huang et al.: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA). The basic difference between these three classes of enrichment algorithms is the way the enrichment p-values are calculated.

In the SEA-based approach, the annotation terms of a subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance, and individual terms that pass the p-value cut-off (for example, P-value ≤ 0.05) are reported as enriched. FuncAssociate and Onto-Express are two SEA-based enrichment analysis tools.
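As a rough illustration of the SEA calculation for a single term, here is a minimal sketch of the upper-tail hypergeometric p-value using only the Python standard library. The function name and the gene counts are made up for the example; real tools add further steps such as multiple-testing correction.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes when n genes
    are drawn without replacement from N background genes, K of which
    carry the annotation term."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 20,000 background genes, 400 of them annotated to
# the term, 200 perturbed genes, 12 of which carry the term.
# By chance we would expect about 200 * 400 / 20000 = 4.
p = hypergeom_enrichment_p(k=12, n=200, K=400, N=20000)
print(f"enrichment p-value: {p:.3g}")
```

Running this once per term and sorting the terms by the resulting p-value gives the kind of ranking asked about in the question.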

GSEA approaches are similar, but consider all genes during the enrichment analysis, instead of only the genes that pass a pre-defined threshold as in the SEA approach. GSEA from the Broad Institute is an example of a GSEA-based tool.
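To make the "consider all genes" idea concrete, here is a minimal sketch of an unweighted running-sum enrichment score over a fully ranked gene list. This is a simplified Kolmogorov-Smirnov-like statistic, not the weighted statistic implemented in the Broad GSEA software, and the gene names are invented. Significance would normally be assessed by permutation, which is not shown.

```python
def running_sum_es(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up at genes in the set and down at
    genes outside it; return the largest deviation from zero (signed)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranking, most up-regulated gene first.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top = running_sum_es(ranked, {"g1", "g2", "g3"})     # set sits at the top
es_bottom = running_sum_es(ranked, {"g6", "g7", "g8"})  # set sits at the bottom
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gives a score near +1, one concentrated at the bottom gives a score near -1, and no threshold on the gene list is needed.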

MEA-based programs like Ontologizer 2.0 and topGO use the relationships that exist between the annotations. These programs were reported to attain better sensitivity and specificity because they take GO term relationships into account.
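The relationships in question are the parent/child links of the GO graph: a gene annotated to a specific term is implicitly annotated to all of that term's ancestors. A minimal sketch of that propagation step follows; the term names and the `parents` mapping are toy values, not real GO identifiers, and this is not the algorithm of Ontologizer or topGO, only the shared starting point.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents):
    """Expand each gene's direct terms to include every ancestor term."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full

# Toy hierarchy (hypothetical names): lipid metabolism -> metabolism -> root
parents = {"lipid metabolism": ["metabolism"], "metabolism": ["root"]}
direct = {"geneA": {"lipid metabolism"}, "geneB": {"metabolism"}}
full = propagate(direct, parents)
print(full)
```

Because of this propagation, enrichment in a child term tends to show up in its parents as well, which is the dependency that relationship-aware methods try to account for.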

These tools are based on similar statistical and algorithmic concepts. See a review of 68 tools published in 2008 here; it shows minor-to-medium differences in the way the nodes are treated, how the statistics are computed, and so on. Statistical methods used to derive the P-value include Fisher's exact test, the hypergeometric test, the binomial test, the χ2 test, or a combination of these methods.

You can use one of the R packages, servers, or command-line tools to perform such an analysis. See the list of GO-based tools compiled by the AmiGO team here.

Now to your specific question: Q: what are the criteria for ranking these categories? Are they based on p-values?

A: Yes. They are P-value based. See the paragraphs on SEA, GSEA and MEA above for the various methods used to derive the P-value.

For a detailed overview of the concepts discussed in this answer see the following articles 1, 2, 3, 4, 5, 6

modified 7.2 years ago • written 8.3 years ago by Khader Shameer

Thank you very much Khader for the explanation and the references! They are very useful for understanding GO analysis. I am interested in the statistical tests commonly used in the available enrichment tools, so background on GO and the related tools is definitely helpful for me. Thanks again!

written 8.3 years ago by alittleboy
David Quigley (San Francisco) wrote 8.3 years ago:

Typically a gene ontology enrichment analysis tests each category in the ontology using a statistic such as the hypergeometric test you mention. Results would then typically be ranked by the strength of the statistic, translated into a P value. Some tools attempt to provide more useful results by looking for results that are significant but farther from the root of the tree, working from the idea that if two results are called significant and one is more specific, then the specific one will be more helpful than knowing that a more general term is enriched. If you're using a P value, your package should correct for the number of tests performed. There's a straightforward overview of this in the user guide for BiNGO, a GO enrichment tool that works as a plug-in for Cytoscape. See also this Nature Reviews Genetics article for some cautionary information.
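As an illustration of correcting for the number of tests, here is a minimal sketch of the Benjamini-Hochberg adjustment, one common choice (Bonferroni is another). Which correction a given package applies is not stated here, and the term names and p-values below are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per tested category.
terms = ["term_a", "term_b", "term_c", "term_d"]
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)

# Rank categories by adjusted p-value, smallest first.
for term, p_raw, p_adj in sorted(zip(terms, raw, adj), key=lambda row: row[2]):
    print(f"{term}\traw={p_raw:.3f}\tadjusted={p_adj:.3f}")
```

With thousands of GO categories tested at once, the adjusted values are what should be compared against the significance cut-off.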


A gene can be annotated to many categories, both because of the hierarchical structure of the ontology and because one gene can do many things. You test by category, not by gene. If S is your set of genes, !S is every other gene in the annotation, and CAT is the category, the 2x2 table would be "S in CAT, !S in CAT, S not in CAT, !S not in CAT." Intuitively, this tests whether the proportion of S in CAT is different from the proportion of everything else in CAT. Usually you test for enrichment, caring only whether the proportion of S in CAT is greater than that of !S.
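A minimal sketch of that per-category table, with a one-sided Fisher's exact test computed from the hypergeometric tail. The gene names, category labels and helper names are invented. Note that a gene annotated to several categories is simply counted in each category's own table.

```python
from math import comb

def contingency(study, background, annotations, category):
    """2x2 counts for one category: (S in CAT, !S in CAT, S not in CAT, !S not in CAT)."""
    study = study & background
    rest = background - study
    in_cat = {g for g in background if category in annotations.get(g, ())}
    return (len(study & in_cat), len(rest & in_cat),
            len(study - in_cat), len(rest - in_cat))

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: is S over-represented in CAT?"""
    n_study, n_cat, total = a + c, a + b, a + b + c + d
    tail = sum(comb(n_cat, i) * comb(total - n_cat, n_study - i)
               for i in range(a, min(n_study, n_cat) + 1))
    return tail / comb(total, n_study)

# Hypothetical data: g1 and g3 each belong to two categories.
background = {f"g{i}" for i in range(1, 11)}
annotations = {"g1": {"A", "B"}, "g2": {"A"}, "g3": {"A", "C"},
               "g4": {"B"}, "g5": {"C"}}
study = {"g1", "g2", "g3"}

for cat in ["A", "B", "C"]:
    table = contingency(study, background, annotations, cat)
    print(cat, table, round(fisher_greater(*table), 4))
```

Each category is tested independently, so no one-to-one correspondence between genes and categories is needed.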

written 8.3 years ago by David Quigley

To add to David's answer: you usually use Fisher's exact test (which is based on the hypergeometric distribution): http://en.wikipedia.org/wiki/Fisher%27s_exact_test

written 8.3 years ago by Pablo

Thank you Dave and Pablo for the detailed explanation! Now I understand GO enrichment analysis better, but I still have a question: genes and GO categories are not one-to-one because of the GO hierarchy structure. If I understand correctly, Fisher's exact test requires a 2x2 contingency table, and I imagine the table should have DE/non-DE for rows and category_i/all other categories for columns. If so, how can we build a table without assuming a 1-1 correspondence of gene vs. category? Please correct me if I misunderstand the method. Thank you!

written 8.3 years ago by alittleboy

Thank you Dave for the clarification! It makes more sense to me now.

written 8.3 years ago by alittleboy
Chris Evelo (Maastricht, The Netherlands) wrote 8.3 years ago:

Before you write your own tool it might be a good idea to check what is already out there. There are really very many GO approaches, tools and algorithms. This BioStar question describes some of these.

Apart from looking further away from the root of the tree (as David mentioned), you might also want to take into account that effects you find in the more specific terms reoccur in the larger parent classes. You might not want that, since they are already taken into account, in which case you should do pruning. The GO-Elite tool that I mentioned in the question above does just that, and it is in fact Open Source, so you could also use it as a starting point if you want to do even fancier stuff.
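To show what pruning could look like in its crudest form, here is a sketch that keeps a significant term only if none of its descendants is also significant. This is an assumption-laden illustration, not GO-Elite's actual procedure, and the term names and `parents` mapping are toy values.

```python
def prune_redundant_parents(significant, parents):
    """Keep only significant terms that have no significant descendant."""
    shadowed = set()  # every ancestor of some significant term
    for term in significant:
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            shadowed.add(parent)
            stack.extend(parents.get(parent, ()))
    return significant - shadowed

# Toy hierarchy (hypothetical names), child -> list of parents.
parents = {"lipid metabolism": ["metabolism"],
           "metabolism": ["root"],
           "transport": ["root"]}
significant = {"lipid metabolism", "metabolism", "transport"}
kept = prune_redundant_parents(significant, parents)
print(sorted(kept))
```

Here "metabolism" is dropped because its enrichment is already explained by the more specific "lipid metabolism". A real tool would typically also compare scores between parent and child before discarding either.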


Thank you Chris! I will look into the GO-Elite tool you mentioned.

written 8.3 years ago by alittleboy

Couldn't edit my own (old) post. Wanted to add that a GO-Elite paper has now been published. It is at: http://dx.doi.org/10.1093/bioinformatics/bts366

written 7.1 years ago by Chris Evelo