Question: Probability of gene list overlap
13
6.5 years ago by
Nasir140
Nasir140 wrote:

Is this the correct way to calculate the probability of overlap occurring by chance between two lists of upregulated genes generated from two independent experiments on the same tissue using the same platform?

I am using R phyper.

phyper(q, m, n, k, lower.tail = FALSE, log.p = FALSE)

where:
q=size of overlap-1;
m=number of upregulated genes in experiment #1;
n=(total number of genes on platform-m);
k=number of upregulated genes in experiment #2.

overlap R gene • 9.9k views
ADD COMMENTlink modified 5 weeks ago by Biostar ♦♦ 20 • written 6.5 years ago by Nasir140
2

Here is a good post on the stackexchange stats Q&A about hypergeometric for list overlap.

ADD REPLYlink modified 3 months ago by zx87544.5k • written 6.5 years ago by Damian Kao14k

Yes, this is the correct way to do this.

ADD REPLYlink written 6.5 years ago by Gjain5.2k

Hello,

I have a similar question: can I use R's phyper in the above way to calculate the probability of overlap occurring by chance between two lists of hit genes identified by two different algorithms on the same tissue?

Thank you very much!

ADD REPLYlink written 4.1 years ago by E.G.C.0

I think yes, as long as the total number of genes is identical in both runs. Note that the probability is that of two independent random draws ('totally random'), which may not be such a good benchmark for comparing two algorithms.

ADD REPLYlink written 4.1 years ago by Michael Dondrup44k
3
6.5 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

Yes, this seems correct to me. Imagine an urn model with black and white balls, from which balls are drawn without replacement. If the two experiments are independent, you can use either experiment to label the balls in the urn: m = number of white balls in the urn, n = number of black balls. Then repeat the drawing. What is the probability of drawing at least the observed number of white balls (genes significant in exp. A, gene-set size) in k draws (number of significant genes in experiment B, gene-set size)? "At least" is why lower.tail = FALSE is used, and also why q is the overlap minus 1: with lower.tail = FALSE, phyper returns P[X > q], which excludes q itself; the distribution includes q only for lower.tail = TRUE.
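The urn model can be checked with a small simulation. The following is only a sketch: the platform size, list sizes and observed overlap are hypothetical numbers chosen for illustration, not values from the question. It compares the phyper tail probability with the fraction of random draws that reach at least the same overlap.

```r
set.seed(1)        # only to make the simulation reproducible
n_total <- 2000    # hypothetical number of genes on the platform
m <- 100           # hypothetical: genes upregulated in experiment A (white balls)
k <- 80            # hypothetical: genes upregulated in experiment B (number of draws)
overlap <- 8       # hypothetical observed overlap

# Analytic tail probability P(X >= overlap)
p_analytic <- phyper(overlap - 1, m, n_total - m, k, lower.tail = FALSE)

# Simulation: genes 1..m are "white"; draw k genes without replacement
# and count how many of the drawn genes are white
sim_overlap <- replicate(20000, sum(sample.int(n_total, k) <= m))
p_simulated <- mean(sim_overlap >= overlap)

cat("analytic: ", p_analytic, "\n")
cat("simulated:", p_simulated, "\n")
```

The two numbers should agree up to simulation noise; increasing the number of replicates tightens the agreement.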

ADD COMMENTlink written 6.5 years ago by Michael Dondrup44k
1
6.5 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

I think that is mostly correct except that you probably want 1 - phyper(...).

Examples

Out of ~20K genes (not accounting for -m), the probability of having 10 shared between 2 random subsets of 100 should be very small.

> 1 - phyper(10, 100, 20000, 100, log.p=F)
[1] 2.582823e-12

The probability for an overlap of only 1 should be much larger:

> 1 - phyper(1, 100, 20000, 100, log.p=F)
[1] 0.08868589

Also search for hypergeometric here on Biostar.

EDIT: to get the probability of 10 or more shared genes, you actually need to use:

> phyper(9, 100, 20000 - 100, 100, lower.tail=F)

ADD COMMENTlink modified 3.6 years ago • written 6.5 years ago by brentp22k
1

Michael, thanks! I missed that. :)

ADD REPLYlink written 6.5 years ago by brentp22k

Instead of using 1 - phyper you could also set lower.tail=F, which is what the OP did, so both are correct.

ADD REPLYlink written 6.5 years ago by Michael Dondrup44k

Yes, I just double-checked: phyper with lower.tail=FALSE and (1 - phyper) give the same results, as Michael mentioned. Thank you all.
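For reference, the equivalence can be confirmed with a short check. This is a minimal sketch using the list and platform sizes from the example in this thread, with q = 1 chosen so that the tail probability is not tiny.

```r
# Parameters based on the example in this thread (q = 1 chosen for illustration)
q <- 1; m <- 100; n <- 20000 - 100; k <- 100

p_complement <- 1 - phyper(q, m, n, k)                   # 1 - P(X <= q)
p_upper_tail <- phyper(q, m, n, k, lower.tail = FALSE)   # P(X > q)

# all.equal() compares with a numerical tolerance rather than exact equality
print(isTRUE(all.equal(p_complement, p_upper_tail)))
```

For very small tail probabilities the lower.tail = FALSE form is probably the safer choice, since subtracting a number very close to 1 from 1 loses floating-point precision.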

ADD REPLYlink written 6.5 years ago by Nasir140

Hi Brent, I am a bit confused: in the query, q = overlap - 1, while in your answer you assume 10 shared out of 2 random subsets. You put in 10 as it is and do not set q = 9?

Who is right here?
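The difference between the two choices of q can be made concrete. The sketch below uses the numbers from the example in this thread and cross-checks each tail against a sum of point probabilities from dhyper: q = 9 corresponds to "10 or more shared", while q = 10 corresponds to "more than 10 shared".

```r
m <- 100; n <- 20000 - 100; k <- 100   # values from the example in this thread
overlap <- 10

p_at_least  <- phyper(overlap - 1, m, n, k, lower.tail = FALSE)  # P(X >= 10)
p_more_than <- phyper(overlap, m, n, k, lower.tail = FALSE)      # P(X > 10)

# Cross-check by summing the point probabilities P(X = x)
check_at_least  <- sum(dhyper(overlap:k, m, n, k))
check_more_than <- sum(dhyper((overlap + 1):k, m, n, k))

cat("P(X >= 10):", p_at_least, " check:", check_at_least, "\n")
cat("P(X >  10):", p_more_than, " check:", check_more_than, "\n")
```

Which one is "right" depends on the question being asked; for the chance of an overlap at least as large as the observed one, q = overlap - 1 is the matching choice.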

ADD REPLYlink written 4.9 years ago by ChIP480
1
2.7 years ago by
Netherlands
sjorsvanheuveln40 wrote:

If you want to know the chance of exactly q+1 overlapping genes, you should subtract two phyper calls like this:

phyper(q, m, n, k, lower.tail = F) - phyper(q+1, m, n, k, lower.tail = F)

(with q the number of overlaps -1)

This is because phyper with lower.tail = F gives the chance of q+1 overlaps OR MORE. So subtracting the chance of q+2 overlaps OR MORE gives you the probability of EXACTLY q+1 overlaps.
I checked it with the following script, which prints the probability distribution. The probabilities sum to 1 here.

tot <- 0
cat('o', 'p', '\n', sep=' ')  # header: overlap, probability
for (hits in 0:100) {
  # P(X = hits) = P(X > hits - 1) - P(X > hits)
  p <- phyper(q=hits-1, m=100, n=20000-100, k=100, lower.tail=FALSE) -
    phyper(q=hits, m=100, n=20000-100, k=100, lower.tail=FALSE)
  tot <- tot + p
  cat(hits, p, '\n', sep=' ')
}
tot

I thought I'd mention this because, while reading this thread, I was under the impression that the original question asked for the chance of an EXACT number of overlaps.
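As a possible shortcut, the probability of an exact overlap is also available directly from dhyper, the hypergeometric point-probability function in base R. The sketch below reuses the numbers from the script above and compares dhyper with the difference of the two upper tails.

```r
m <- 100; n <- 20000 - 100; k <- 100   # values from the script above
hits <- 0:100

# Point probabilities P(X = hits), computed directly
p_exact <- dhyper(hits, m, n, k)

# The same quantities as differences of upper tails
p_diff <- phyper(hits - 1, m, n, k, lower.tail = FALSE) -
  phyper(hits, m, n, k, lower.tail = FALSE)

print(sum(p_exact))                 # should be 1 up to floating-point error
print(max(abs(p_exact - p_diff)))   # should be close to 0
```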

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by sjorsvanheuveln40