Question: Comparing gene expression overlap with hypergeometric distribution
gravatar for norty89
17 months ago by
norty890 wrote:

Hi, I got stuck with an issue in analyzing my RNA-seq data. I have three tratments (Control, D1 and D2), and within each tratment I calculate DE genes between two groups (G1 and G2). Then I identify genes that become DE in both D1 and D2 compared to the control. Summarizing, I have (see also Figure):

  • 15,000 genes representing the entire transcriptome
  • 1,000 genes representing those that become DE as effect of treatment, filtering those that are DE in the control. This group is made as union of D1 and D2 DE genes, respectively 450 and 600 genes.
  • Intersecting the DE genes, I found that 50 of them are common to both D1 and D2, 400 are specific to D1 and 550 are specific to D2.

enter image description here

Now I want to understand if this overlap is due by chance or not. So I'd like to have a feedback on my solution and, if wrong, a possible answer to my problem.

I have my observation: out of 1,000 DE genes common to both D1 and D2, 50 of them overlap. Then, I drawn 100 genes from both D1 and D2, and I counted how many overlap (bootstrapped 1,000 times). Let's say that the result of bootstrapped drawing is 10. Finally, I tested if the overlap observed and the one got by random drawing is different through a hypergeometric distribution.

I've done everything with R, using phyper as follows:

Here I have to use 4 values:

x = 10 (overlapping genes from drawing)
m = 50 (overlapping observed)
n = 950 (total DE genes - m)
k = 100 (number of genes drawn from D1 and D2)

I got 0.007 which, if I understood correctly and my procedure is correct as well, means that the overlap I observe between D1 and D2 is not due by chance.

Any opinion about my approach?


test rna-seq overlap • 1.1k views
ADD COMMENTlink modified 17 months ago by Devon Ryan92k • written 17 months ago by norty890
gravatar for Devon Ryan
17 months ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:
m = matrix(c(10, 550, 400, 15000-(10+550+400)), nrow=2)
fisher.test(m, alternative='greater')

You need to get the rates in each group from the total number of genes tested, so 15000 rather than 1000. You're wanting to know if the overlap is greater than expected by chance, thus the alternative='greater'. By chance you expect 450/15000 * 600 = 18 overlapping genes, which is more than you actually see. So since the p-value is near 1 you can't reject that you're seeing the overlap by chance (since you did a one-sided test, you can't then turn around and check the other one-sided test, since that'd be p-hacking).

ADD COMMENTlink written 17 months ago by Devon Ryan92k

There also exists a Bioconductor package for these type of tests, GeneOverlap

ADD REPLYlink written 17 months ago by venu6.3k

Thanks for your reply! Can you clarify me which is the weakness of my approach: comparing the overlap between D1 and D2 to the overlap between two subgroups randmoly drawn from both D1 and D2? My supervisor suggested an approach similar to yours, so I acknoledge that my way is wrong, but I'm still failing to see the weakness of my approach.

ADD REPLYlink written 17 months ago by norty890

If you randomly draw an infinite number of times then you'll approach the p-value from fisher's test.

ADD REPLYlink modified 17 months ago • written 17 months ago by Devon Ryan92k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 740 users visited in the last hour