Question

Comparing gene expression overlap with hypergeometric distribution

1

Entering edit mode

6.0 years ago

norty89 ▴ 10

Hi, I got stuck with an issue in analyzing my RNA-seq data. I have three tratments (Control, D1 and D2), and within each tratment I calculate DE genes between two groups (G1 and G2). Then I identify genes that become DE in both D1 and D2 compared to the control. Summarizing, I have (see also Figure):

15,000 genes representing the entire transcriptome
1,000 genes representing those that become DE as effect of treatment, filtering those that are DE in the control. This group is made as union of D1 and D2 DE genes, respectively 450 and 600 genes.
Intersecting the DE genes, I found that 50 of them are common to both D1 and D2, 400 are specific to D1 and 550 are specific to D2.

enter image description here

Now I want to understand if this overlap is due by chance or not. So I'd like to have a feedback on my solution and, if wrong, a possible answer to my problem.

I have my observation: out of 1,000 DE genes common to both D1 and D2, 50 of them overlap. Then, I drawn 100 genes from both D1 and D2, and I counted how many overlap (bootstrapped 1,000 times). Let's say that the result of bootstrapped drawing is 10. Finally, I tested if the overlap observed and the one got by random drawing is different through a hypergeometric distribution.

I've done everything with R, using phyper as follows:

Here I have to use 4 values:

x = 10 (overlapping genes from drawing)
m = 50 (overlapping observed)
n = 950 (total DE genes - m)
k = 100 (number of genes drawn from D1 and D2)
lower.tail=FALSE

I got 0.007 which, if I understood correctly and my procedure is correct as well, means that the overlap I observe between D1 and D2 is not due by chance.

Any opinion about my approach?

Thanks!

RNA-Seq overlap test • 4.6k views

ADD COMMENT • link updated 6.0 years ago by Devon Ryan 104k • written 6.0 years ago by norty89 ▴ 10

score 3 · Accepted Answer · 2018-04-14

3

Entering edit mode

6.0 years ago

Devon Ryan 104k

m = matrix(c(10, 550, 400, 15000-(10+550+400)), nrow=2)
fisher.test(m, alternative='greater')

You need to get the rates in each group from the total number of genes tested, so 15000 rather than 1000. You're wanting to know if the overlap is greater than expected by chance, thus the alternative='greater'. By chance you expect 450/15000 * 600 = 18 overlapping genes, which is more than you actually see. So since the p-value is near 1 you can't reject that you're seeing the overlap by chance (since you did a one-sided test, you can't then turn around and check the other one-sided test, since that'd be p-hacking).

ADD COMMENT • link 6.0 years ago by Devon Ryan 104k

1

Entering edit mode

There also exists a Bioconductor package for these type of tests, GeneOverlap

ADD REPLY • link 6.0 years ago by venu 7.1k

0

Entering edit mode

Thanks for your reply! Can you clarify me which is the weakness of my approach: comparing the overlap between D1 and D2 to the overlap between two subgroups randmoly drawn from both D1 and D2? My supervisor suggested an approach similar to yours, so I acknoledge that my way is wrong, but I'm still failing to see the weakness of my approach.

ADD REPLY • link 6.0 years ago by norty89 ▴ 10

0

Entering edit mode

If you randomly draw an infinite number of times then you'll approach the p-value from fisher's test.

ADD REPLY • link 6.0 years ago by Devon Ryan 104k