Tutorial: Finding the significance of the overlap between 2 or more gene sets using simulation in R.
gravatar for rpolicastro
4 months ago by
Bloomington, IN
rpolicastro3.2k wrote:

TLDR: Example R function to calculate significance of overlap of 2 or more gene sets. genes_all is a vector that contains all genes, and gene_sets takes a list of vectors for each gene set. I encourage people to read the full tutorial and attempt to reproduce the code themselves (especially since there are quicker ways to do the simulation).


overlap_significance <- function(genes_all, gene_sets, iterations) {
  observed <- length(reduce(gene_sets, intersect))
  simulated <- map_dbl(seq_len(iterations), function(x) {
    sim <- map(lengths(gene_sets), ~sample(genes_all, .x))
    sim <- length(reduce(sim, intersect))
  pval <- (sum(simulated >= observed) + 1) / (iterations + 1)
  return(list(pval=pval, simulated_values=simulated, observed=observed))

I often see the question asked about calculating the significance of overlap of 2 or more gene sets. There are indeed tools available for this with 2 to 3 gene sets. However, I wanted to take this opportunity to introduce the power of simulation and resampling to easily tackle this question for 2 or more gene sets in R.

First I generate a set of 10,000 genes for use in this tutorial, and then select at random 10 genes that will be guaranteed to overlap in all example datasets.

all_genes <- sprintf("ENSG%08d", seq_len(10000))
common_genes <- sample(all_genes, 10)

> head(all_genes, 5)
[1] "ENSG00000001" "ENSG00000002" "ENSG00000003" "ENSG00000004" "ENSG00000005"

The simple case of 2 Gene Sets

For the significance of overlap between just 2 gene sets, the question is often modeled using a hypergeometric distribution. Using the example given in R, the hypergeometric distribution is concerned with modeling the probability of getting x white marbles if you randomly pick (without replacement) k marbles from a jar with m white marbles and n black marbles. This tutorial will not go over this concept in detail, since I only bring it up to point out that this distribution is used to generate a null hypothesis that the overlap between the gene sets could plausibly occur through random sampling. For the case of 2 gene sets, I will show you that our simulated null distribution will match the hypergeometric null distribution.

From the example gene set I made above, I'll make two gene sets that are guaranteed to have an overlap of at least 10 genes.


# Create 2 gene sets.
sets <- map(seq_len(2), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))

> lengths(sets)
[1] 487 240

# Get the overlapping number of genes.
observed <- length(reduce(sets, intersect))

> observed
[1] 16

To simulate the null distribution for the overlap of two genes sets, for 10,000 iterations I will create two gene sets the same size as the original gene sets by randomly sampling genes without replacement from the full gene list, and then compute the number of overlapping genes. I will also generate a hypergeometric null distribution for comparison based on sampling 10,000 values from a hypergeometric distribution.

# Hypergeometric null distribution.
hyper <- rhyper(
  nn=length(all_genes), m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),

# Simulated null distribution.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

Here is a plot that shows the null distribution methodologies side to side.

data.frame(Simulated=simulated, Hypergeometric=hyper) %>%
  pivot_longer(everything(), names_to="Method", values_to="Overlap") %>%
  ggplot(aes(x=Overlap, fill=Method)) +
    geom_histogram(binwidth=1) +
    geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
    theme_bw() +
    theme(text=element_text(size=12), legend.position="none") +
    facet_grid(Method~.) +
    scale_fill_manual(values=c("dodgerblue", "seagreen")) +

null set of 2

The plot of the two distributions look nearly identical. The grey dotted line in the observed value, and by cursory inspection of its location, it can be assumed that the p-value will be fairly high.

The p-values for the simulated method is simply the number of simulated values that were greater than or equal to the observed value (plus one), over the number of iterations (plus one). I've also included the p-value for the hypergeometric test, which is the probability of getting the observed value or greater under the hypergeometric null distribution.

sim_pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> sim_pval
[1] 0.1330867

hyper_pval <- phyper(
  q=observed, m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),
  k=length(sets[[2]]), lower.tail=FALSE

> hyper_pval
[1] 0.07750621

The Complicated Case of 3 or More Gene Sets

If we were to step up to comparing the overlap of 3 or more gene sets, the problem becomes more difficult since we would need to use a multivariate hypergeomtric distribution. However, the code we wrote for the simulation can used as is with as many gene sets as we want. Here is an example of using 3 gene sets.

# Generate 3 random gene sets with at least 10 genes overlapping.
sets <- map(seq_len(3), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))

> lengths(sets)
[1] 375 341 305

# Observed overlap.  
observed <- length(reduce(sets, intersect))

> observed
[1] 10

# Simulated overlap.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

For the case of 3 gene sets, I've generated another plot below. Comparing this to the example with 2 gene sets from above, the observed value is much higher than most of the simulated values, so you would expect a fairly low p-value.

ggplot() +
  geom_histogram(mapping=aes(x=simulated), binwidth=1, fill="dodgerblue") +
  geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
  theme_bw() +
  theme(text=element_text(size=12)) +
  xlab("Overlap") +

null set 3

pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> pval
[1] 9.999e-05

As expected the p-value is low.

Session info provided for posterity.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.12.8 forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2      
 [5] purrr_0.3.4       readr_1.3.1       tidyr_1.1.2       tibble_3.0.3     
 [9] ggplot2_3.3.2     tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     pillar_1.4.6     compiler_4.0.2   cellranger_1.1.0
 [5] dbplyr_1.4.4     tools_4.0.2      digest_0.6.25    jsonlite_1.7.0  
 [9] lubridate_1.7.9  lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3 
[13] rlang_0.4.7      reprex_0.3.0     cli_2.0.2        rstudioapi_0.11 
[17] DBI_1.1.0        haven_2.3.1      withr_2.2.0      xml2_1.3.2      
[21] httr_1.4.2       fs_1.5.0         generics_0.0.2   vctrs_0.3.4     
[25] hms_0.5.3        grid_4.0.2       tidyselect_1.1.0 glue_1.4.2      
[29] R6_2.4.1         fansi_0.4.1      readxl_1.3.1     farver_2.0.3    
[33] modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.9 
[37] scales_1.1.1     ellipsis_0.3.1   rvest_0.3.6      assertthat_0.2.1
[41] colorspace_1.4-1 labeling_0.3     stringi_1.4.6    munsell_0.5.0   
[45] broom_0.7.0      crayon_1.3.4
tidyverse tutorial R gene • 496 views
ADD COMMENTlink modified 17 days ago by sc@79130 • written 4 months ago by rpolicastro3.2k


Thank you for your wonderful explanation on Finding the significance of the overlap between 2 or more gene sets using simulation in R. I have a three-gene list of Arabidopsis. I understood the part with two gene lists but still not very clear with three or more gene list. I want to know the significance level (p-value) of intersection/overlap genes between these gene lists. The structure of my experiments looks like: Total genes in the Arabidopsis genome is 27000 Geneset_A: 4265 Geneset_B: 9030 Geneset_C: 2969 The common number of genes among the three sets: 327

Any test/script suggestions would be really appreciated.

Thank you! Best


ADD REPLYlink written 17 days ago by sc@79130

First, create a list in R containing the genes from the 3 gene sets, and also get the number of observed overlapping genes.


gene_sets <- list(Geneset_A, Geneset_B, Geneset_C)
observed <- length(reduce(gene_sets, intersect))

You will also need a vector containing all gene names in Arabidopsis. We'll call that all_genes.

When you have this you can then calculate the p-value for overlap between the three sets.

n_iter <- 10000

simulated <- map_dbl(seq_len(n_iter), function(x) {
   sim <- map(lengths(gene_sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

pval <- (sum(simulated >= observed) + 1) / (n_iter + 1)
ADD REPLYlink modified 17 days ago • written 17 days ago by rpolicastro3.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2232 users visited in the last hour