Question: Comparing gene expression overlap with hypergeometric distribution
0
norty890 wrote:

Hi, I got stuck with an issue in analyzing my RNA-seq data. I have three tratments (Control, D1 and D2), and within each tratment I calculate DE genes between two groups (G1 and G2). Then I identify genes that become DE in both D1 and D2 compared to the control. Summarizing, I have (see also Figure):

• 15,000 genes representing the entire transcriptome
• 1,000 genes representing those that become DE as effect of treatment, filtering those that are DE in the control. This group is made as union of D1 and D2 DE genes, respectively 450 and 600 genes.
• Intersecting the DE genes, I found that 50 of them are common to both D1 and D2, 400 are specific to D1 and 550 are specific to D2. Now I want to understand if this overlap is due by chance or not. So I'd like to have a feedback on my solution and, if wrong, a possible answer to my problem.

I have my observation: out of 1,000 DE genes common to both D1 and D2, 50 of them overlap. Then, I drawn 100 genes from both D1 and D2, and I counted how many overlap (bootstrapped 1,000 times). Let's say that the result of bootstrapped drawing is 10. Finally, I tested if the overlap observed and the one got by random drawing is different through a hypergeometric distribution.

I've done everything with R, using `phyper` as follows:

Here I have to use 4 values:

``````x = 10 (overlapping genes from drawing)
m = 50 (overlapping observed)
n = 950 (total DE genes - m)
k = 100 (number of genes drawn from D1 and D2)
lower.tail=FALSE
``````

I got `0.007` which, if I understood correctly and my procedure is correct as well, means that the overlap I observe between D1 and D2 is not due by chance.

Any opinion about my approach?

Thanks!

test rna-seq overlap • 1.1k views
modified 17 months ago by Devon Ryan92k • written 17 months ago by norty890
1
Devon Ryan92k wrote:
``````m = matrix(c(10, 550, 400, 15000-(10+550+400)), nrow=2)
fisher.test(m, alternative='greater')
``````

You need to get the rates in each group from the total number of genes tested, so 15000 rather than 1000. You're wanting to know if the overlap is greater than expected by chance, thus the `alternative='greater'`. By chance you expect `450/15000 * 600 = 18` overlapping genes, which is more than you actually see. So since the p-value is near 1 you can't reject that you're seeing the overlap by chance (since you did a one-sided test, you can't then turn around and check the other one-sided test, since that'd be p-hacking).

1

There also exists a Bioconductor package for these type of tests, GeneOverlap

Thanks for your reply! Can you clarify me which is the weakness of my approach: comparing the overlap between D1 and D2 to the overlap between two subgroups randmoly drawn from both D1 and D2? My supervisor suggested an approach similar to yours, so I acknoledge that my way is wrong, but I'm still failing to see the weakness of my approach.