How To Compute Gene Enrichment P-Value? (Significantly More Than Expected)
3
7
Entering edit mode
9.2 years ago
gundalav ▴ 370

Suppose I have a list of genes

mygenes: gene13,gene2,gene111.


And given another list of genes

gene_categoryA: gene1, gene2, gene44, gene111.
gene_categoryB: gene13,gene34.


After comparing mygenes against gene_categoryA we see that there are 2 genes of categoryA in mygenes.

What I want to know whether these 2 genes (gene2 and gene111) occurrence is significantly more than expected.

What's is the best way to go about it.

enrichment statistics • 13k views
14
Entering edit mode
9.2 years ago

Usually a hypergeometric test is used in this situation. Here is a python scrip that I use for this: (requires scipy)

import sys
import scipy.stats as stats

print
print 'total number in population: ' + sys.argv[1]
print 'total number with condition in population: ' + sys.argv[2]
print 'number in subset: ' + sys.argv[3]
print 'number with condition in subset: ' + sys.argv[4]
print
print 'p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]) ,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
print
print 'p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
print


Use it by:

python script.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]


The result will be

• a p-value where by random chance number of genes with both condition A and B will be <= to your number with condition A and B
• a p-value where by random chance number of genes with both condition A and B will be >= to your number with condition A and B

The second p-value is probably what you want.

0
Entering edit mode

Thanks Damian. What if I have more than 2 gene category to compare, e.g. gene_categoryA,...gene_categoryK. Otherway to look at it is that now in the urn the balls are not only red and black, but more colours. How can I modified your code with that? The task is still the same, namely to check whether my set of gene is significantly from gene_categoryA.

1
Entering edit mode

You would have to use a multivariate hypergeometric distribution. I am not sure if scipy has that function.

0
Entering edit mode

if the question is the same (i.e. Check whether the set of genes is significantly from gene_categoryA) then I don't see why it should matter how many categories are there, after all we can abstract all those as "non A categories" and proceed the same way to calculate the probability for category A to be over represented in our the gene list. Am I missing something here?

0
Entering edit mode

@Damian: I was wondering why you are subtracting 1 from arg[4] when you are calculating the survival function. The same type of question can be asked for adding 1 to arg[4] when calculating the CDF? Is it because we are working with discrete values and to include the instance X=x in the calculation we have to either add or subtract 1?

1
Entering edit mode

I always forget how these two functions goes (cdf,sf) in terms of whether it is off by one or not when you are want to do > or >=.

I think I got an e-mail from someone asking this same question earlier this year. It turns out the p-value <= than portion of the above script already calculates <=, so it is unnecessary to add 1. The p-value >= portion is still correct since the sf (survival function) calculates >.

I've edited the post to reflect this. I feel bad now for propagating wrong information.

2
Entering edit mode
9.2 years ago

Are the categories discrete? If so Damian's answer looks good (also check out qhyper on R). If they are not discrete then it might be a bit more complicated (i.e. genes can be in either category). A likelihood ratio test such as Fishers Exact or a Chi-squared should also work. I am under the impression that a 1-tailed Fischer's exact test is equivalent to a hypergeometric test. (I hope that those with better stats knowledge then myself: will correct this if I am wrong).

1
Entering edit mode
9.2 years ago
NextGenSeek ▴ 290

It is also worth knowing about some of the assumptions of these approaches. Here is a good paper to get started "Heading Down the Wrong Pathway: on the Influence of Correlation within Gene Sets" http://www.biomedcentral.com/1471-2164/11/574