I am interested in calculating the significance of the overlap among an arbitrary number of gene sets.
For the case of 2 gene sets, I was able to do this with the cdf function of Python's scipy.stats.hypergeom distribution.
How could I achieve the same for an arbitrary number of gene sets?
This question becomes much easier to answer with a simple resampling test. The null hypothesis is that the overlap among the gene sets is indistinguishable from what you would get by random sampling. To test it, run 10,000 iterations of the following procedure: for each gene set, randomly draw from the genome as many genes as the actual set contains, then compute the overlap among all of the resampled sets. The p-value is then the number of iterations in which the simulated overlap was greater than or equal to the observed overlap, plus one, divided by the number of iterations, plus one.
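
The procedure above could be sketched roughly as follows. This is only a minimal illustration using the standard library, under a few assumptions that are mine rather than part of the original description: "overlap" is taken to mean the number of genes common to all sets, genes are drawn without replacement within each resampled set, and the names overlap_pvalue, genome, n_iter and seed are hypothetical.

```python
import random


def overlap_pvalue(gene_sets, genome, n_iter=10_000, seed=None):
    """Resampling p-value for the overlap shared by all gene sets.

    Assumption: "overlap" means the size of the intersection of all sets.
    Returns (observed_overlap, p_value).
    """
    rng = random.Random(seed)
    # Sort the background so results are reproducible for a given seed.
    background = sorted(set(genome))
    sets = [set(s) for s in gene_sets]
    sizes = [len(s) for s in sets]

    observed = len(set.intersection(*sets))

    hits = 0
    for _ in range(n_iter):
        # Draw one random set per real set, matching its size
        # (sampling without replacement within each set).
        sampled = [set(rng.sample(background, k)) for k in sizes]
        if len(set.intersection(*sampled)) >= observed:
            hits += 1

    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_iter + 1)
```

Calling overlap_pvalue([set_a, set_b, set_c], genome) returns the observed overlap together with its estimated p-value. The smallest p-value it can report is 1 / (n_iter + 1), so increase n_iter if you need finer resolution.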