Question

Enrichment Analysis Based On Quantitative Annotation Scores

6

12.5 years ago

Andrew Su 4.9k

Annotation enrichment analyses (like GSEA) are of course very common these days for the analysis of genome-scale data. However, they typically are based on qualitative (and absolute) gene annotations. For example, the gene CDK2 is involved in cell cycle with no ambiguity or uncertainty.

Is anyone aware of enrichment approaches that are based on quantitative confidence scores? Such a method would be able to intelligently use data that said CDK2 is involved in cell cycle with >> 99% certainty, whereas TBL1X is associated with autism at only 25% confidence.

Ignore for the moment where exactly those confidence scores might come from, but one can imagine that they might come from some text mining process. Any thoughts or leads?

enrichment statistics • 3.0k views

ADD COMMENT • link updated 12.5 years ago by Qdjm 1.9k • written 12.5 years ago by Andrew Su 4.9k

score 2 · Answer 1 · 2011-11-09

2

12.5 years ago

Istvan Albert 100k

Some of these terms don't seem to lend themselves to quantification: a gene is either involved in cell cycle or not (and of course it all depends on what 'involved' means). A number next to this would at best represent the information/knowledge available to the individual making the statement - but how could that be a generic concept that applies to everyone?

Perhaps it is the word 'involved' that serves as the actual quantification. When people have only a hunch (10-25% certainty), they call it 'involved'; at 25%-50% we are in 'associated' territory; and so on, with even higher certainty bringing the stronger terms that imply causation.

ADD COMMENT • link 12.5 years ago by Istvan Albert 100k

0

Though I'm not sure I agree that "involved" and "associated" differ by a level of confidence, your point is well taken about how one would interpret a quantitative score. This is perhaps a limitation of trying to boil a rich picture of biological knowledge down to a structured gene annotation...

ADD REPLY • link 12.4 years ago by Andrew Su 4.9k

score 2 · Answer 2 · 2011-11-09

I am not aware of any methods that assign a quantitative value to the likelihood that a gene is assigned to a GO term. I think the reason this is the case is that tool developers assume that uncertainty should be quantified at the set level, not the gene level.

The only work I am aware of in which a quantitative value is assigned to genes in GO analysis consists of methods that attempt to correct for non-uniformity in genomic locus length when assigning genomic features to GO categories:

GONOME: measuring correlations between GO terms and genomic positions. http://www.ncbi.nlm.nih.gov/pubmed/16504139
Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. http://www.ncbi.nlm.nih.gov/pubmed/19168912
GREAT improves functional interpretation of cis-regulatory regions. http://www.ncbi.nlm.nih.gov/pubmed/20436461

These are not exactly what you are looking for, but they might be a starting point to think about assigning a variable weight to genes in GO analyses.

Picking up on @Istvan's comment, even if you did have a method, where would these values come from? One place to look is in the inherent uncertainty reported in the literature in terms of contradictions. For example, you could assign the ratio of positive to negative mentions of a gene-GO link using contradiction mining, as has been recently done in the bioNOT system: http://www.ncbi.nlm.nih.gov/pubmed/22032181

score 1 · Answer 3 · 2011-11-10

Three thoughts:

If you want to test for enrichment of an annotation in a gene list (e.g. Fisher's exact test) and if your confidence measures can be interpreted as probabilities that the annotation is correct, you could sum these probabilities up for all the genes in the gene list and compare this sum to what you would expect from a random subset of the same size from the background set. I bet that the null distribution is normal and a Chi-squared test is appropriate but you'd have to check both these guesses. The interpretation of this is that the sum of the gene list probabilities is the "expected number of genes with that annotation" in the gene list.
Again, if your confidence measures are probabilities, you can sample annotations (i.e. give TBL1X an autism annotation with probability 25%) and then apply your favourite enrichment test. Then resample and redo it, etc. Not sure how to combine the P-values -- maybe take the median? -- but to be rigorous you should calculate the null distribution of the P-values through randomization.
If the confidence measures are not probabilities, you could try calculating a Spearman correlation between the quantitative annotation score and any quantitative measurement associated with each gene.
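
For what it's worth, here is a rough, standard-library-only Python sketch of the three ideas above. It is only an illustration under some assumptions of mine: the function names are made up, `scores` is a hypothetical dict mapping every background gene to its annotation score, the gene list contains no duplicates, the first idea uses an empirical permutation null rather than the normal/Chi-squared guess, and the second idea uses a one-sided hypergeometric tail as the "favourite enrichment test". The toy data at the bottom is invented.

```python
import math
import random
import statistics


def expected_count_test(scores, gene_list, n_perm=10000, seed=0):
    """Thought 1: compare the summed probabilities in the gene list with
    sums from random same-size subsets of the background (empirical null)."""
    rng = random.Random(seed)
    background = list(scores)
    k = len(gene_list)
    observed = sum(scores[g] for g in gene_list)
    null = [sum(scores[g] for g in rng.sample(background, k))
            for _ in range(n_perm)]
    # One-sided empirical p-value; the +1 terms keep it from being exactly 0.
    # The small tolerance guards against floating-point summation order.
    hits = sum(s >= observed - 1e-12 for s in null)
    return observed, statistics.fmean(null), (hits + 1) / (n_perm + 1)


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn).
    This is the one-sided Fisher's exact test p-value for enrichment."""
    top = sum(math.comb(K, i) * math.comb(N - K, n - i)
              for i in range(k, min(K, n) + 1))
    return top / math.comb(N, n)


def sampled_annotation_test(scores, gene_list, n_draws=1000, seed=0):
    """Thought 2: sample hard annotations from the probabilities, run a
    standard enrichment test on each draw, and report the median p-value."""
    rng = random.Random(seed)
    chosen = set(gene_list)
    pvals = []
    for _ in range(n_draws):
        annotated = {g for g, p in scores.items() if rng.random() < p}
        overlap = len(annotated & chosen)
        pvals.append(hypergeom_upper_tail(
            overlap, len(scores), len(annotated), len(chosen)))
    return statistics.median(pvals)


def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Thought 3: Spearman correlation, i.e. Pearson correlation of ranks.
    Fails with a division by zero if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): 100 background genes, the first 10 confidently annotated.
scores = {f"gene{i}": (0.9 if i < 10 else 0.05) for i in range(100)}
gene_list = [f"gene{i}" for i in range(10)]

observed, null_mean, p_perm = expected_count_test(scores, gene_list)
median_p = sampled_annotation_test(scores, gene_list)
print(observed, null_mean, p_perm, median_p)
```

The median P-value from the sampling approach is only a summary; as noted above, a rigorous version would still need its own null distribution, for example by repeating the whole procedure on random gene lists.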