Question: Hypergeometric test for overlapped genes
6.1 years ago by
Nitin150 wrote:


I compared 1370 genes obtained from a ChIP-seq analysis with 652 differentially regulated genes obtained from an Affymetrix Mouse 430 2.0 array analysis. When I intersected the two lists, I got 37 genes in common.

Now I would like to calculate the significance of this overlap. I was thinking of using phyper in R, but it requires the total number of genes, and I am not sure which number to give. Should it be the total number of probes on the Affymetrix chip, or the number of genes in the whole mouse genome used for the ChIP-seq data?

One more question: can anybody suggest exactly how to perform this significance test in R?





chip-seq gene expression • 3.5k views
modified 3.2 years ago by Gema Sanz70 • written 6.1 years ago by Nitin150

Sorry for the big letters... I don't know why they look like that...

written 3.2 years ago by Gema Sanz70

I think that whatever the result of a hypergeometric test might be in this case, it should be interpreted with a lot of caution.

It is well known that ChIP-seq, RNA-seq and other NGS-based assays of the epigenome and transcriptome measure many correlated outcomes (e.g. shared epigenetic programs affecting many genes), whereas the hypergeometric test treats each gene as an independent draw.

It is quite conceivable that, in an example like the one you describe, a much smaller number of underlying factors leads you to detect 37 genes in the overlap. If so, a test like the hypergeometric would be anti-conservative, and many readers and reviewers might distrust the result for that reason.

written 3.2 years ago by Vincent Laufer1.1k
6.1 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

I recommend using the number of genes on the Affy array as the total (that would be the smaller set). But first, filter the 1370 ChIP-seq genes so that they only contain genes that are also present on the Affy array. And count genes, not probes, as there are several probes per gene. Then compute 1 - F(37, 652, x, X) + 0.5 * P(37, 652, x, X), where x is the number of ChIP-seq genes, X is the total number of genes, F(.) is the cumulative distribution function and P(.) is the probability mass function of the hypergeometric distribution.
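The recipe above can be sketched in R roughly as follows. This is only an illustration: the gene vectors are toy placeholders rather than real data, `hyper_mid_p` is a hypothetical helper name, and the universe size of 18000 in the last line is an assumed figure, not a value given in this thread.

```r
# Toy identifiers for illustration only; substitute your own gene lists.
array_genes <- unique(c("g1", "g2", "g3", "g4", "g5", "g6", "g6"))  # genes (not probes) on the array
chip_genes  <- c("g2", "g3", "g9")   # ChIP-seq genes, one of them absent from the array
de_genes    <- c("g3", "g4")         # differentially expressed genes

# Keep only ChIP-seq genes that the array could have measured.
chip_on_array <- intersect(chip_genes, array_genes)

# Mid-p value for an overlap of size k:
# P(overlap > k) + 0.5 * P(overlap = k) under the hypergeometric null,
# which equals 1 - F(k) + 0.5 * P(k).
hyper_mid_p <- function(k, n_de, n_chip, n_total) {
  upper <- phyper(k, n_chip, n_total - n_chip, n_de, lower.tail = FALSE)  # P(overlap > k)
  point <- dhyper(k, n_chip, n_total - n_chip, n_de)                      # P(overlap = k)
  upper + 0.5 * point
}

# Toy example: overlap of the filtered ChIP-seq list with the DE list.
k_toy <- length(intersect(chip_on_array, de_genes))
p_toy <- hyper_mid_p(k_toy, length(de_genes), length(chip_on_array), length(array_genes))

# Numbers from the question (1370 is the unfiltered ChIP-seq count),
# with an ASSUMED universe of 18000 genes.
p_question <- hyper_mid_p(37, 652, 1370, 18000)
```

In `phyper` and `dhyper` the arguments are the overlap, the number of "marked" genes, the number of unmarked genes, and the number of genes drawn; swapping the roles of the two gene lists gives the same probability.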

PS: In any case, the p-value looks likely to be non-significant. Have you tried checking up- and down-regulated genes separately or, better, using GSEA as suggested by Devon Ryan?

modified 6 months ago by RamRS27k • written 6.1 years ago by mikhail.shugay3.4k
6.1 years ago by
Freiburg, Germany
Devon Ryan95k wrote:

Take as the total number of genes (N) the intersection of the genes you looked at via ChIP-seq (all of them in the annotation you used, though realistically this will be a bit of an overestimate) and those probed on the array (N.B., genes, not probes, as pointed out by mikhail.shugay). In R, that could be done with:

# N = size of the gene universe described above; set it before running
m <- matrix(c(37, 652-37, 1370-37, N-652-1370+37), ncol=2,
    dimnames=list(c("ChIP.Sig", "ChIP.NoSig"), c("RNA.Sig","RNA.NoSig")))
fisher.test(m)

You'll find that if N is ~18000 or bigger then this isn't significant.
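Since the answer hinges on N, one way to see how sensitive it is would be to rerun the test over several candidate background sizes. This is only a sketch: `overlap_fisher` is a hypothetical helper, the N values are arbitrary illustrations rather than recommended backgrounds, and it uses the one-sided `alternative = "greater"` (enrichment) instead of the two-sided default of `fisher.test`.

```r
# One-sided Fisher test for enrichment of the overlap, as a function of N.
overlap_fisher <- function(N, overlap = 37, n_rna = 652, n_chip = 1370) {
  m <- matrix(c(overlap, n_rna - overlap,
                n_chip - overlap, N - n_rna - n_chip + overlap),
              ncol = 2,
              dimnames = list(c("ChIP.Sig", "ChIP.NoSig"),
                              c("RNA.Sig", "RNA.NoSig")))
  fisher.test(m, alternative = "greater")$p.value
}

# Arbitrary candidate universe sizes, for illustration only.
candidate_N <- c(18000, 25000, 40000)

p_by_N <- sapply(candidate_N, overlap_fisher)
names(p_by_N) <- candidate_N

# Overlap expected by chance if the two lists were independent.
expected_by_N <- 652 * 1370 / candidate_N

print(rbind(expected = expected_by_N, p.enrichment = p_by_N))
```

The expected overlap under independence, 652 * 1370 / N, shrinks as N grows, so the same 37 shared genes look more like enrichment against a larger background. That is why the choice of N should be settled before looking at the p-value.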

BTW, you might consider something like GSEA, where you use the genes showing peaks in their promoter (or wherever else you're looking) as the gene set. This has the advantage that it looks at the genes as a group, rather than relying on individually significant adjusted p-values. One could easily consider even more complicated tests (e.g., what if many of the DE genes have promoters with low mappability?), but I'd probably avoid going whole hog without a reason.

modified 6 months ago by RamRS27k • written 6.1 years ago by Devon Ryan95k


I followed your suggestion to calculate the p-value of my microarray and ChIP-seq data:

# bound chip-seq = 5673
# down microarray = 3975
# overlap = 1156
# N = 17970

m <- matrix(c(1156, 3975-1156, 5673-1156, 17970-3975-5673+1156), ncol=2,
            dimnames=list(c("ChIP.Sig", "ChIP.NoSig"), c("RNA.Sig","RNA.NoSig")))
fisher.test(m)

	Fisher's Exact Test for Count Data

data:  m
p-value = 0.0001287
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7958616 0.9299321
sample estimates:
odds ratio 

Then I wanted to check whether a random subset of the same size could be significant too. The overlap of the random set with the bound genes was 683, but when I do the calculation the p-value is much more significant (p-value < 2.2e-16, the lowest value reported by the Fisher test in R). I don't understand why a smaller overlap can be more significant...

Am I missing something? Maybe I don't really need to check against a random gene set? This is for a reply to reviewers, who suggest that the binding of my TF won't be significant compared with a random subset. Is the p-value I got from the Fisher test enough for the reply?

Thanks in advance, Gema

modified 3.2 years ago by genomax85k • written 3.2 years ago by Gema Sanz70
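One thing worth checking with the numbers in the comment above is the direction of the effect. The reported confidence interval for the odds ratio lies below 1, and the overlap expected by chance, 3975 * 5673 / 17970, is about 1255, which is more than the 1156 observed. A hedged sketch of that check, plus a simulated random-subset baseline, follows. The seed and the 10000 draws are arbitrary choices, and the simulation assumes random subsets are drawn from the same 17970-gene universe.

```r
N <- 17970; n_chip <- 5673; n_down <- 3975; overlap <- 1156

# Overlap expected by chance if binding and down-regulation were independent.
expected <- n_down * n_chip / N

m <- matrix(c(overlap, n_down - overlap,
              n_chip - overlap, N - n_down - n_chip + overlap),
            ncol = 2,
            dimnames = list(c("ChIP.Sig", "ChIP.NoSig"),
                            c("RNA.Sig", "RNA.NoSig")))

p_two_sided <- fisher.test(m)$p.value                           # any departure from chance
p_enriched  <- fisher.test(m, alternative = "greater")$p.value  # more overlap than chance
p_depleted  <- fisher.test(m, alternative = "less")$p.value     # less overlap than chance

# Random-subset baseline: overlaps of 10000 random gene sets of size n_down
# with the bound genes, drawn from the same universe (an assumption).
set.seed(1)
random_overlaps <- rhyper(10000, n_chip, N - n_chip, n_down)
p_empirical <- mean(random_overlaps >= overlap)  # share of random sets overlapping at least as much

print(c(expected = expected, two.sided = p_two_sided,
        enriched = p_enriched, depleted = p_depleted,
        empirical = p_empirical))
```

With a two-sided test, a small p-value only says the overlap differs from chance, not that it is larger. An overlap of 683 would sit even further below the chance expectation than 1156 does, which is one plausible reason its two-sided p-value is smaller still. For a claim of enrichment, the one-sided `alternative = "greater"` result is the relevant one.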

I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing.


How To Ask Good Questions On Technical And Scientific Forums

written 3.2 years ago by genomax85k