Question: Calculating statistical significance of overlap between two groups of genes - adapting hypergeometric test for my situation?
gravatar for Rubal
3.2 years ago by
Rubal250 wrote:


I have several lists of genes from a test I have run on genomes of different organisms. I have calculated the percentage of sharing between these lists in a pairwise fashion. I want to test whether the pairwise sharing is greater than expected by chance. The hypergeometric test appears to be the standard approach to do this. However, the versions of the test I have seen require you to input as background the number of genes in the genome. As this varies between each of lists (each is from a different species) I do not think I can implement the standard versions of the test available online.

I have been trying to generate distributions of expected sharing using resampling from my actual background genelists as I think this is the most appropriate solution (see ) but calculating so many pairwise comparisons is currently beyond my coding ability. To me resampling lists of the same size as those I observe from my test from the appropriate genomic background for each species then calculating the overlap from these subsamplings is the best possible approach to test whether the overlaps I observe in the real data are significant.

Any advice of tools I might be able to use, comments on the validity of my approach, or help with the code would be very appreciated.

Thanks for your assistance.

overlap significance genes R genome • 3.0k views
ADD COMMENTlink modified 3.2 years ago by Nicolas Rosewick8.3k • written 3.2 years ago by Rubal250
gravatar for Nicolas Rosewick
3.2 years ago by
Belgium, Brussels
Nicolas Rosewick8.3k wrote:

In R it's quiet straghtforward in fact. Something like this should work (not tested)

geneListA # vector of genes from organism A
geneListB # vectof of genes from organism B
N <- 1000 # simulation iteration

gA # list of genes of interest from organism A
gB # list of genes of interest from organism B

inter.r <- length(intersect(gA,gB)) # number of gene in common 

getSimIntersect <- function(gA,gB,geneListA,geneListB){
  gA.sim <- sample(geneListA,length(gA))
  gB.sim <- sample(geneListB,length(gB))

inter.sim <- replicate(N,getSimIntersect(gA,gB,geneListA,geneListB))

p <- sum(inter.sim>=inter.r)/N
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Nicolas Rosewick8.3k

thanks so much, I'll play around with this.

ADD REPLYlink written 3.2 years ago by Rubal250

One issue I have is that for some lists the sampling overlap is never as high as the observed overlap, so P-values are reported as 0. I would still like a way to assess whether some lists have a lot more sharing compared to random sampling than others. For example if in some cases up to 50% of genes are overlapping in the resampling compared to 60% in the real data, that is different than if a maximun of are 2% observed as overlapping in the resampling, but in both cases p-values are reported as 0. Is it straightforward to modify the code to output something like the mean percentage sharing between lists from across the resamplings? Then one could assess the magnitude of difference between the sampling and the observed data in cases where p-values are <0.001. Many thanks

ADD REPLYlink written 3.2 years ago by Rubal250

p value is not = 0 . If you do N=100,000 iterations the only thing you can say is that p < 1e-06 if you do not see any simulated intersection bigger than the real one. Nevertheless you can report the mean simulated intersection also.

ADD REPLYlink written 3.2 years ago by Nicolas Rosewick8.3k

I completely agree, just meant that R reports it as 0 but ofcourse you can only say it's smaller than your number of iterations.

ADD REPLYlink written 3.2 years ago by Rubal250

exactly. Maybe you could restrict your geneListA and B by taking into account for expression (only take expressed genes for example) in order to be more close to the reality. But that depends on your original question you want to answer.

ADD REPLYlink written 3.2 years ago by Nicolas Rosewick8.3k

Good thought, though this is for genomic analyses of DNA not RNA, so I have to include the entire genomic background as in theory any gene could appear. I think all sets are exposed to the same underlying biases that might results in a base level of sharing that is above the amount seen from random sampling. Therefore I think it's important to quantify which lists show an even higher level of sharing than the others even if all of them are technically significantly more than observed by chance.

ADD REPLYlink written 3.2 years ago by Rubal250

just to confirm, I think for the line starting 'gBsim <-' you mean to sample from geneListB not geneListA correct? Also , for the subsampling section I get the error: Error in FUN(X[[i]], ...) : unused argument (X[[i]]) Trying to troubleshoot it now

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Rubal250

do you have any idea what might be causing this error? I checked that all the input files are present and correct.

ADD REPLYlink written 3.2 years ago by Rubal250

I modify the code to use replicate instead of sapply. It should work now

ADD REPLYlink written 3.2 years ago by Nicolas Rosewick8.3k

great this works. Thanks a lot! I have 10 lists and want to do all pairwise comparisons between them. I would plan to reasign geneListA ,geneListB, gA and gB each time and rerun this. Is there a much more efficient way? Don't worry if not, I'm happy to do it this way. Thanks again

ADD REPLYlink written 3.2 years ago by Rubal250
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1874 users visited in the last hour