Tutorial:Finding the significance of the overlap between 2 or more gene sets using simulation in R.
Entering edit mode
7 months ago

TLDR: Example R function to calculate significance of overlap of 2 or more gene sets. genes_all is a vector that contains all genes, and gene_sets takes a list of vectors for each gene set. I encourage people to read the full tutorial and attempt to reproduce the code themselves (especially since there are quicker ways to do the simulation).


overlap_significance <- function(genes_all, gene_sets, iterations) {
  observed <- length(reduce(gene_sets, intersect))
  simulated <- map_dbl(seq_len(iterations), function(x) {
    sim <- map(lengths(gene_sets), ~sample(genes_all, .x))
    sim <- length(reduce(sim, intersect))
  pval <- (sum(simulated >= observed) + 1) / (iterations + 1)
  return(list(pval=pval, simulated_values=simulated, observed=observed))

I often see the question asked about calculating the significance of overlap of 2 or more gene sets. There are indeed tools available for this with 2 to 3 gene sets. However, I wanted to take this opportunity to introduce the power of simulation and resampling to easily tackle this question for 2 or more gene sets in R.

First I generate a set of 10,000 genes for use in this tutorial, and then select at random 10 genes that will be guaranteed to overlap in all example datasets.

all_genes <- sprintf("ENSG%08d", seq_len(10000))
common_genes <- sample(all_genes, 10)

> head(all_genes, 5)
[1] "ENSG00000001" "ENSG00000002" "ENSG00000003" "ENSG00000004" "ENSG00000005"

The simple case of 2 Gene Sets

For the significance of overlap between just 2 gene sets, the question is often modeled using a hypergeometric distribution. Using the example given in R, the hypergeometric distribution is concerned with modeling the probability of getting x white marbles if you randomly pick (without replacement) k marbles from a jar with m white marbles and n black marbles. This tutorial will not go over this concept in detail, since I only bring it up to point out that this distribution is used to generate a null hypothesis that the overlap between the gene sets could plausibly occur through random sampling. For the case of 2 gene sets, I will show you that our simulated null distribution will match the hypergeometric null distribution.

From the example gene set I made above, I'll make two gene sets that are guaranteed to have an overlap of at least 10 genes.


# Create 2 gene sets.
sets <- map(seq_len(2), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))

> lengths(sets)
[1] 487 240

# Get the overlapping number of genes.
observed <- length(reduce(sets, intersect))

> observed
[1] 16

To simulate the null distribution for the overlap of two genes sets, for 10,000 iterations I will create two gene sets the same size as the original gene sets by randomly sampling genes without replacement from the full gene list, and then compute the number of overlapping genes. I will also generate a hypergeometric null distribution for comparison based on sampling 10,000 values from a hypergeometric distribution.

# Hypergeometric null distribution.
hyper <- rhyper(
  nn=length(all_genes), m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),

# Simulated null distribution.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

Here is a plot that shows the null distribution methodologies side to side.

data.frame(Simulated=simulated, Hypergeometric=hyper) %>%
  pivot_longer(everything(), names_to="Method", values_to="Overlap") %>%
  ggplot(aes(x=Overlap, fill=Method)) +
    geom_histogram(binwidth=1) +
    geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
    theme_bw() +
    theme(text=element_text(size=12), legend.position="none") +
    facet_grid(Method~.) +
    scale_fill_manual(values=c("dodgerblue", "seagreen")) +

null set of 2

The plot of the two distributions look nearly identical. The grey dotted line in the observed value, and by cursory inspection of its location, it can be assumed that the p-value will be fairly high.

The p-values for the simulated method is simply the number of simulated values that were greater than or equal to the observed value (plus one), over the number of iterations (plus one). I've also included the p-value for the hypergeometric test, which is the probability of getting the observed value or greater under the hypergeometric null distribution.

sim_pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> sim_pval
[1] 0.1330867

hyper_pval <- phyper(
  q=observed, m=length(sets[[1]]),
  n=length(all_genes) - length(sets[[1]]),
  k=length(sets[[2]]), lower.tail=FALSE

> hyper_pval
[1] 0.07750621

The Complicated Case of 3 or More Gene Sets

If we were to step up to comparing the overlap of 3 or more gene sets, the problem becomes more difficult since we would need to use a multivariate hypergeomtric distribution. However, the code we wrote for the simulation can used as is with as many gene sets as we want. Here is an example of using 3 gene sets.

# Generate 3 random gene sets with at least 10 genes overlapping.
sets <- map(seq_len(3), function(x) {
  c(common_genes, sample(all_genes, sample(seq(200, 500), 1)))

> lengths(sets)
[1] 375 341 305

# Observed overlap.  
observed <- length(reduce(sets, intersect))

> observed
[1] 10

# Simulated overlap.
simulated <- map_dbl(seq_len(10000), function(x) {
   sim <- map(lengths(sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

For the case of 3 gene sets, I've generated another plot below. Comparing this to the example with 2 gene sets from above, the observed value is much higher than most of the simulated values, so you would expect a fairly low p-value.

ggplot() +
  geom_histogram(mapping=aes(x=simulated), binwidth=1, fill="dodgerblue") +
  geom_vline(xintercept=observed, lty=2, color="grey", size=1) +
  theme_bw() +
  theme(text=element_text(size=12)) +
  xlab("Overlap") +

null set 3

pval <- (sum(simulated >= observed) + 1) / (10000 + 1)

> pval
[1] 9.999e-05

As expected the p-value is low.

Session info provided for posterity.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.12.8 forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2      
 [5] purrr_0.3.4       readr_1.3.1       tidyr_1.1.2       tibble_3.0.3     
 [9] ggplot2_3.3.2     tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     pillar_1.4.6     compiler_4.0.2   cellranger_1.1.0
 [5] dbplyr_1.4.4     tools_4.0.2      digest_0.6.25    jsonlite_1.7.0  
 [9] lubridate_1.7.9  lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3 
[13] rlang_0.4.7      reprex_0.3.0     cli_2.0.2        rstudioapi_0.11 
[17] DBI_1.1.0        haven_2.3.1      withr_2.2.0      xml2_1.3.2      
[21] httr_1.4.2       fs_1.5.0         generics_0.0.2   vctrs_0.3.4     
[25] hms_0.5.3        grid_4.0.2       tidyselect_1.1.0 glue_1.4.2      
[29] R6_2.4.1         fansi_0.4.1      readxl_1.3.1     farver_2.0.3    
[33] modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.9 
[37] scales_1.1.1     ellipsis_0.3.1   rvest_0.3.6      assertthat_0.2.1
[41] colorspace_1.4-1 labeling_0.3     stringi_1.4.6    munsell_0.5.0   
[45] broom_0.7.0      crayon_1.3.4
R tidyverse gene Tutorial • 885 views
Entering edit mode


Thank you for your wonderful explanation on Finding the significance of the overlap between 2 or more gene sets using simulation in R. I have a three-gene list of Arabidopsis. I understood the part with two gene lists but still not very clear with three or more gene list. I want to know the significance level (p-value) of intersection/overlap genes between these gene lists. The structure of my experiments looks like: Total genes in the Arabidopsis genome is 27000 Geneset_A: 4265 Geneset_B: 9030 Geneset_C: 2969 The common number of genes among the three sets: 327

Any test/script suggestions would be really appreciated.

Thank you! Best


Entering edit mode

First, create a list in R containing the genes from the 3 gene sets, and also get the number of observed overlapping genes.


gene_sets <- list(Geneset_A, Geneset_B, Geneset_C)
observed <- length(reduce(gene_sets, intersect))

You will also need a vector containing all gene names in Arabidopsis. We'll call that all_genes.

When you have this you can then calculate the p-value for overlap between the three sets.

n_iter <- 10000

simulated <- map_dbl(seq_len(n_iter), function(x) {
   sim <- map(lengths(gene_sets), ~sample(all_genes, .x))
   sim <- length(reduce(sim, intersect))

pval <- (sum(simulated >= observed) + 1) / (n_iter + 1)
Entering edit mode

Dear rpolicastro,

initially thank you for the very interesting and nice implementation of comparing two or more gene-sets based on permutation testing-I would like to ask a relative question concerning the putative implementation in a specific case study scenario I'm currently facing:

briefly, I have a DEG list derived from an rnaseq gene expression analysis-then, based on various downstream goals, I would like to compare the aforementioned DEG genes with known gene targets sets from specific TFs, to investigate which of the TFs might be more "similar" to these DEG genes-

in total, except the DEG list, I have additionally 15 gene-sets corresponding to each TF-thus, is it possible to reformulate your above approach, to estimate the significance of the DEG set with each of the 15 gene-sets as pairwise tests ? In order to finally calculate p-values and rank the respective TFs ?

Thank you in advance,


Entering edit mode

In this case you just need to do 15 pairwise hypergeometric tests followed by p-value correction (to avoid the multiple-comparisons problem).

Here is some example data.

# Hypothetical organism with 1000 genes.
all_genes <- sprintf("ENSG%08d", seq_len(1000))

# Randomly picking 250 DEGs.
degs <- sample(all_genes, 250)

# Making 5 sets of genes for 5 TF-associated gene sets (with 100 genes each).
tfs <- lapply(seq_len(5), function(x) sample(all_genes, 100))
names(tfs) <- sprintf("TF%02d", seq_len(5))

We'll now loop over each TF set and calculate the hypergeometric test for the overlap with the DEGs.

# Calculating the hypergeometric p-value.
results <- lapply(tfs, function(x) {
  q <- length(intersect(x, degs))
  m <- length(x)
  n <- length(all_genes) - m
  k <- length(degs)

  p_value <- phyper(q, m, n, k, lower.tail=FALSE)

# p-value correction for multiple comparisons.
results <- p.adjust(unlist(results), "BH")

> results
     TF01      TF02      TF03      TF04      TF05 
0.6775091 0.6775091 0.8016280 0.1071783 0.4896353
Entering edit mode

Thank you very much for your quick and comprehensive explanation !! Just two important comments to fully understand your answer and original post:

1) Regarding the "all_genes" or the relative background-instead of using for example the gene symbols of all the normalized expressed genes in my dataset, I could limit the background, with the union of all the included TF gene targets ? to be more specific for my case scenario ?

2) concerning the very interesting rationale of simulated p-values: the goal, is essentially to compare the simulated p-values with the hypergeometric ones, to investigate after the permutation if the significance still exists ? and thus, strengthens the initial association ? could this also applied in my case for further evaluating the relative resulted p-values you illustrate above ?

Entering edit mode

1) You're "all_gene" at the minimum must be all of the genes not filtered out by DESeq2 because of e.g. low expression, plus all of the genes in your TF sets. This is because the test asks the question - if I pick 250 genes (DEGs) out of 10,000 total genes, what is the probability that 10 of those genes overlap with a set of 100 genes associated with TFs.

2) The reason I discussed simulation is that calculating the overlap between 2 gene sets is simple using a hypergeometric test. However, let's say you have 3 DEG lists, and within these DEG lists you have 10 genes that overlap between all 3 of them. Calculating whether this overlap is significant becomes difficult formally, but simulating a p-value is much easier. For your particular case you are only ever comparing 2 sets of genes, so you can do all of these comparisons with just the hypergeometric test, and don't necessarily need to worry about simulation.


Login before adding your answer.

Traffic: 1208 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6