Question: Calculating statistical significance of overlap between two groups of genes - adapting hypergeometric test for my situation?
1
2.6 years ago by
Rubal210
Germany
Rubal210 wrote:

Hello!

I have several lists of genes from a test I have run on genomes of different organisms. I have calculated the percentage of sharing between these lists in a pairwise fashion. I want to test whether the pairwise sharing is greater than expected by chance. The hypergeometric test appears to be the standard approach to do this. However, the versions of the test I have seen require you to input as background the number of genes in the genome. As this varies between each of lists (each is from a different species) I do not think I can implement the standard versions of the test available online.

I have been trying to generate distributions of expected sharing using resampling from my actual background genelists as I think this is the most appropriate solution (see http://stats.stackexchange.com/questions/232627/want-to-calculate-significance-of-pairwise-sharing-between-lists-standard-hyper ) but calculating so many pairwise comparisons is currently beyond my coding ability. To me resampling lists of the same size as those I observe from my test from the appropriate genomic background for each species then calculating the overlap from these subsamplings is the best possible approach to test whether the overlaps I observe in the real data are significant.

Any advice of tools I might be able to use, comments on the validity of my approach, or help with the code would be very appreciated.

overlap significance genes R genome • 2.3k views
modified 2.6 years ago by Nicolas Rosewick7.4k • written 2.6 years ago by Rubal210
0
2.6 years ago by
Belgium, Brussels
Nicolas Rosewick7.4k wrote:

In R it's quiet straghtforward in fact. Something like this should work (not tested)

``````geneListA # vector of genes from organism A
geneListB # vectof of genes from organism B
N <- 1000 # simulation iteration

gA # list of genes of interest from organism A
gB # list of genes of interest from organism B

inter.r <- length(intersect(gA,gB)) # number of gene in common

getSimIntersect <- function(gA,gB,geneListA,geneListB){
gA.sim <- sample(geneListA,length(gA))
gB.sim <- sample(geneListB,length(gB))
return(length(intersect(gA.sim,gB.sim)))
}

inter.sim <- replicate(N,getSimIntersect(gA,gB,geneListA,geneListB))

p <- sum(inter.sim>=inter.r)/N
``````

thanks so much, I'll play around with this.

One issue I have is that for some lists the sampling overlap is never as high as the observed overlap, so P-values are reported as 0. I would still like a way to assess whether some lists have a lot more sharing compared to random sampling than others. For example if in some cases up to 50% of genes are overlapping in the resampling compared to 60% in the real data, that is different than if a maximun of are 2% observed as overlapping in the resampling, but in both cases p-values are reported as 0. Is it straightforward to modify the code to output something like the mean percentage sharing between lists from across the resamplings? Then one could assess the magnitude of difference between the sampling and the observed data in cases where p-values are <0.001. Many thanks

p value is not = 0 . If you do N=100,000 iterations the only thing you can say is that p < 1e-06 if you do not see any simulated intersection bigger than the real one. Nevertheless you can report the mean simulated intersection also.

I completely agree, just meant that R reports it as 0 but ofcourse you can only say it's smaller than your number of iterations.

exactly. Maybe you could restrict your geneListA and B by taking into account for expression (only take expressed genes for example) in order to be more close to the reality. But that depends on your original question you want to answer.

Good thought, though this is for genomic analyses of DNA not RNA, so I have to include the entire genomic background as in theory any gene could appear. I think all sets are exposed to the same underlying biases that might results in a base level of sharing that is above the amount seen from random sampling. Therefore I think it's important to quantify which lists show an even higher level of sharing than the others even if all of them are technically significantly more than observed by chance.

just to confirm, I think for the line starting 'gBsim <-' you mean to sample from geneListB not geneListA correct? Also , for the subsampling section I get the error: Error in FUN(X[[i]], ...) : unused argument (X[[i]]) Trying to troubleshoot it now

do you have any idea what might be causing this error? I checked that all the input files are present and correct.

I modify the code to use replicate instead of sapply. It should work now