Question

How To Compute Gene Enrichment P-Value? (Significantly More Than Expected)

7

12.3 years ago

gundalav ▴ 380

Suppose I have a list of genes

mygenes: gene13,gene2,gene111.

And given another list of genes

gene_categoryA: gene1, gene2, gene44, gene111.
gene_categoryB: gene13,gene34.

After comparing mygenes against gene_categoryA we see that there are 2 genes of categoryA in mygenes.

What I want to know is whether the occurrence of these 2 genes (gene2 and gene111) is significantly more than expected.

What is the best way to go about it?

enrichment statistics • 17k views

ADD COMMENT • link updated 3.9 years ago by Elio • 0 • written 12.3 years ago by gundalav ▴ 380

score 14 · Answer 1 · 2013-03-18

12.3 years ago

Damian Kao 16k

Usually a hypergeometric test is used in this situation. Here is a Python script that I use for this (requires scipy):

import sys
import scipy.stats as stats

# Arguments, in order: population size, population count with the condition,
# subset size, subset count with the condition.
print()
print('total number in population: ' + sys.argv[1])
print('total number with condition in population: ' + sys.argv[2])
print('number in subset: ' + sys.argv[3])
print('number with condition in subset: ' + sys.argv[4])
print()
# cdf(k) is P(X <= k), so it already includes k itself.
print('p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]), int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))))
print()
# sf(k - 1) is P(X > k - 1), which equals P(X >= k).
print('p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1, int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))))
print()

Run it like this:

python script.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]

The result will be two p-values:

the probability that, by random chance, the number of genes with both condition A and B is <= your observed number
the probability that, by random chance, the number of genes with both condition A and B is >= your observed number

The second p-value is probably what you want.
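To connect this to the gene lists in the question, here is a minimal sketch that counts the overlap and computes the second p-value using only the standard library. The function name `enrichment_p_value` and the background size of 10 are assumptions for illustration; the question does not say how many genes are in the background, and the result depends heavily on that number. The sketch also assumes every listed gene belongs to the background.

```python
from math import comb

def enrichment_p_value(my_genes, category, background_size):
    """Upper-tail hypergeometric p-value: P(overlap >= observed overlap)."""
    my_genes, category = set(my_genes), set(category)
    k = len(my_genes & category)  # genes in both lists
    n = len(category)             # genes with the condition in the background
    N = len(my_genes)             # size of the subset being tested
    M = background_size           # total genes in the background (assumed)
    # Sum the hypergeometric probabilities for overlaps k, k + 1, ..., min(n, N).
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return tail / comb(M, N)

mygenes = ["gene13", "gene2", "gene111"]
gene_categoryA = ["gene1", "gene2", "gene44", "gene111"]

# background_size=10 is a made-up value, not taken from the question.
print(enrichment_p_value(mygenes, gene_categoryA, background_size=10))
```

With a realistic background of thousands of genes the same overlap would give a much smaller p-value, so the background should be chosen before looking at the result.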

ADD COMMENT • link 8.9 years ago by Damian Kao 16k

0

Thanks Damian. What if I have more than 2 gene categories to compare, e.g. gene_categoryA, ..., gene_categoryK? Another way to look at it is that the balls in the urn are now not only red and black, but come in more colours. How can I modify your code for that? The task is still the same, namely to check whether my set of genes is significantly enriched for gene_categoryA.

ADD REPLY • link 12.3 years ago by gundalav ▴ 380

1

You would have to use a multivariate hypergeometric distribution. I am not sure if scipy has that function.

ADD REPLY • link 12.3 years ago by Damian Kao 16k

0

If the question is the same (i.e. check whether the set of genes is significantly enriched for gene_categoryA), then I don't see why it should matter how many categories there are. After all, we can abstract all of those as "non-A categories" and proceed the same way to calculate the probability of category A being over-represented in our gene list. Am I missing something here?

ADD REPLY • link 3.9 years ago by Elio • 0
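The collapsing argument above can be checked numerically. The sketch below, using only the standard library, computes the tail probability for category A twice: once by summing a multivariate hypergeometric over every possible split among the other categories, and once after merging them into a single "not A" group. The function names and the category sizes (4, 2, 4) are assumptions for illustration, and the categories are assumed not to overlap.

```python
from fractions import Fraction
from itertools import product
from math import comb

def multivariate_tail(k, sizes, draws):
    """P(at least k draws from category 0) under a multivariate hypergeometric."""
    favourable = 0
    # Enumerate every way the draws could be split across the categories.
    for counts in product(*(range(s + 1) for s in sizes)):
        if sum(counts) == draws and counts[0] >= k:
            ways = 1
            for c, s in zip(counts, sizes):
                ways *= comb(s, c)
            favourable += ways
    return Fraction(favourable, comb(sum(sizes), draws))

def collapsed_tail(k, sizes, draws):
    """Same probability after merging every other category into 'not A'."""
    n, M = sizes[0], sum(sizes)
    hits = sum(comb(n, i) * comb(M - n, draws - i)
               for i in range(k, min(n, draws) + 1))
    return Fraction(hits, comb(M, draws))

# Hypothetical sizes: category A, category B, and all remaining genes.
sizes = (4, 2, 4)
print(multivariate_tail(2, sizes, 3) == collapsed_tail(2, sizes, 3))
```

Exact fractions are used so that the two results can be compared with `==` rather than with a floating-point tolerance.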

0

@Damian: I was wondering why you are subtracting 1 from arg[4] when you are calculating the survival function. The same type of question can be asked for adding 1 to arg[4] when calculating the CDF? Is it because we are working with discrete values and to include the instance X=x in the calculation we have to either add or subtract 1?

ADD REPLY • link 8.9 years ago by Dataman ▴ 380

1

I always forget how these two functions (cdf, sf) work in terms of whether they are off by one when you want > or >=.

I think I got an e-mail from someone asking this same question earlier this year. It turns out the "p-value <=" portion of the above script already calculates <=, so it is unnecessary to add 1. The "p-value >=" portion is still correct: the sf (survival function) calculates >, so subtracting 1 turns it into >=.

I've edited the post to reflect this. I feel bad now for propagating wrong information.

ADD REPLY • link 8.9 years ago by Damian Kao 16k
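The off-by-one behaviour discussed here is easy to verify on a small case. The sketch below defines hand-rolled stand-ins for the hypergeometric pmf, cdf and sf with the standard library (they are not scipy's functions, only the same definitions) and uses the same argument order as the script: k, then population size M, condition count n, and subset size N. The numbers are made up for illustration.

```python
from fractions import Fraction
from math import comb

def pmf(k, M, n, N):
    """P(X == k) for a hypergeometric variable."""
    return Fraction(comb(n, k) * comb(M - n, N - k), comb(M, N))

def cdf(k, M, n, N):
    """P(X <= k): the sum includes k itself."""
    return sum((pmf(i, M, n, N) for i in range(0, min(k, N) + 1)), Fraction(0))

def sf(k, M, n, N):
    """P(X > k): the survival function excludes k."""
    return 1 - cdf(k, M, n, N)

M, n, N, k = 10, 4, 3, 2  # hypothetical counts

# sf(k) misses the observed value; sf(k - 1) includes it.
print("P(X >  k) =", sf(k, M, n, N))
print("P(X >= k) =", sf(k - 1, M, n, N))
print("P(X <= k) =", cdf(k, M, n, N))
```

The gap between `sf(k - 1)` and `sf(k)` is exactly the probability of the observed count, which is why dropping the `- 1` would understate the enrichment p-value.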

score 2 · Answer 2 · 2013-03-18

Are the categories discrete (non-overlapping)? If so, Damian's answer looks good (also check out phyper in R; qhyper gives quantiles rather than probabilities). If they are not discrete (i.e. genes can be in either category), then it might be a bit more complicated. A test such as Fisher's exact or chi-squared should also work. I am under the impression that a 1-tailed Fisher's exact test is equivalent to a hypergeometric test. (I hope that those with better stats knowledge than myself will correct this if I am wrong.)
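The equivalence between a one-tailed Fisher's exact test and the hypergeometric test can be checked on a small example. The sketch below lays the four counts out as a 2x2 contingency table, sums Fisher's factorial formula over tables at least as extreme as the observed one, and compares that with the hypergeometric upper tail. All function names and the counts are assumptions for illustration, and this is a plain standard-library sketch rather than any library's implementation.

```python
from fractions import Fraction
from math import comb, factorial

def contingency_table(M, n, N, k):
    """Rows: in subset / not in subset. Columns: with / without the condition."""
    return [[k, N - k], [n - k, M - n - N + k]]

def table_probability(table):
    """Fisher's formula for one 2x2 table with fixed margins."""
    (a, b), (c, d) = table
    num = factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    den = (factorial(a) * factorial(b) * factorial(c) * factorial(d)
           * factorial(a + b + c + d))
    return Fraction(num, den)

def fisher_greater(M, n, N, k):
    """One-tailed p-value: tables whose top-left cell is >= k."""
    total = Fraction(0)
    for a in range(k, min(n, N) + 1):
        table = contingency_table(M, n, N, a)
        if min(min(row) for row in table) < 0:
            continue  # this overlap is impossible with the given margins
        total += table_probability(table)
    return total

def hypergeom_upper_tail(M, n, N, k):
    """P(X >= k) from the hypergeometric formula."""
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return Fraction(hits, comb(M, N))

# Hypothetical counts: 10 genes, 4 with the condition, subset of 3, overlap of 2.
print(fisher_greater(10, 4, 3, 2) == hypergeom_upper_tail(10, 4, 3, 2))
```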

score 1 · Answer 3 · 2013-03-18

12.3 years ago

NextGenSeek ▴ 290

It is also worth knowing about some of the assumptions of these approaches. Here is a good paper to get started: "Heading Down the Wrong Pathway: on the Influence of Correlation within Gene Sets" http://www.biomedcentral.com/1471-2164/11/574

ADD COMMENT • link 12.3 years ago by NextGenSeek ▴ 290