Question: Comparing gene expression overlap with hypergeometric distribution
gravatar for norty89
2.5 years ago by
norty8910 wrote:

Hi, I got stuck with an issue in analyzing my RNA-seq data. I have three tratments (Control, D1 and D2), and within each tratment I calculate DE genes between two groups (G1 and G2). Then I identify genes that become DE in both D1 and D2 compared to the control. Summarizing, I have (see also Figure):

  • 15,000 genes representing the entire transcriptome
  • 1,000 genes representing those that become DE as effect of treatment, filtering those that are DE in the control. This group is made as union of D1 and D2 DE genes, respectively 450 and 600 genes.
  • Intersecting the DE genes, I found that 50 of them are common to both D1 and D2, 400 are specific to D1 and 550 are specific to D2.

enter image description here

Now I want to understand if this overlap is due by chance or not. So I'd like to have a feedback on my solution and, if wrong, a possible answer to my problem.

I have my observation: out of 1,000 DE genes common to both D1 and D2, 50 of them overlap. Then, I drawn 100 genes from both D1 and D2, and I counted how many overlap (bootstrapped 1,000 times). Let's say that the result of bootstrapped drawing is 10. Finally, I tested if the overlap observed and the one got by random drawing is different through a hypergeometric distribution.

I've done everything with R, using phyper as follows:

Here I have to use 4 values:

x = 10 (overlapping genes from drawing)
m = 50 (overlapping observed)
n = 950 (total DE genes - m)
k = 100 (number of genes drawn from D1 and D2)

I got 0.007 which, if I understood correctly and my procedure is correct as well, means that the overlap I observe between D1 and D2 is not due by chance.

Any opinion about my approach?


test rna-seq overlap • 2.1k views
ADD COMMENTlink modified 2.5 years ago by Devon Ryan97k • written 2.5 years ago by norty8910
gravatar for Devon Ryan
2.5 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:
m = matrix(c(10, 550, 400, 15000-(10+550+400)), nrow=2)
fisher.test(m, alternative='greater')

You need to get the rates in each group from the total number of genes tested, so 15000 rather than 1000. You're wanting to know if the overlap is greater than expected by chance, thus the alternative='greater'. By chance you expect 450/15000 * 600 = 18 overlapping genes, which is more than you actually see. So since the p-value is near 1 you can't reject that you're seeing the overlap by chance (since you did a one-sided test, you can't then turn around and check the other one-sided test, since that'd be p-hacking).

ADD COMMENTlink written 2.5 years ago by Devon Ryan97k

There also exists a Bioconductor package for these type of tests, GeneOverlap

ADD REPLYlink written 2.5 years ago by venu6.7k

Thanks for your reply! Can you clarify me which is the weakness of my approach: comparing the overlap between D1 and D2 to the overlap between two subgroups randmoly drawn from both D1 and D2? My supervisor suggested an approach similar to yours, so I acknoledge that my way is wrong, but I'm still failing to see the weakness of my approach.

ADD REPLYlink written 2.5 years ago by norty8910

If you randomly draw an infinite number of times then you'll approach the p-value from fisher's test.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Devon Ryan97k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1445 users visited in the last hour