Tutorial: Gene Set Enrichment Analysis
gravatar for Hamid Ghaedi
3 months ago by
Hamid Ghaedi1.2k
Hamid Ghaedi1.2k wrote:

Gene Set Enrichment Analysis

Beginner level

The original post for this tutorial could be found on my GitHub. Please refer to the very end of the page for the references list that I used to make this thread available.


Assume we performed an RNA-seq (or microarray gene expression) experiment and now want to know what pathway/biological process shows enrichment for our [differentially expressed] genes. There are terminologies someone may need to be familiar with before diving into the topic thoroughly:

Gene set

A gene set is an unordered collection of genes that are functionally related.


A pathway can be interpreted as a gene set by ignoring functional relationships among genes.

Gene Ontology (GO)

GO describe gene function. A gene role/function could be attributed to the three main classes:

Molecular Function (MF) : which defines molecular activities of gene products.

Cellular Component (CC) : This describes where gene products are active/localized.

Biological Process (BP) : which describes pathways and processes that the gene product is active in.

Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG is a collection of manually curated pathway maps representing molecular interaction and reaction networks.

There are different methods widely used for functional enrichment analysis:

1- Over Representation Analysis (ORA): This is the simplest version of enrichment analysis and at the same time the most widely used approach. The concept in this approach is based on a Fisher exact test p-value in a contingency table. There is a relatively large number of web-tools R package for ORA. Personally I am a fan of DAVID web tools however its last update was in 2016 (DAVID 6.8 Oct. 2016).

2- Gene Set Enrichment Analysis (GSEA):

It was developed by Broad Institute. This is the preferred method when genes are coming from an expression experiment like microarray and RNA-seq. However, the original methodology was designed to work on microarray but later modification made it suitable for RNA-seq also. In this approach, you need to rank your genes based on a statistic (like what DESeq2 provide), and then perform enrichment analysis against different pathways (= gene set). You have to download the gene set files into your local system. The point is that here the algorithm will use all genes in the ranked list for enrichment analysis. [in contrast to ORA where only genes passed a specific threshold (like DE ones) would be used for enrichment analysis]. You can find more details about the methodology on the original PNAS paper.. To download these gene sets in a folder go to the MSigDB website, register, and download the data.

Steps in this tutorial:

(1)differential expression analysis,

(2) doing GSEA

and finally (3) Visualization.

    #_______________Loading packages______________________________#
#________________diffrential expression analysis______________#
# reading expression data matrix and getting rid of what we dont like

# my data is raw count data comming from RNA-seq on 476 bladder cancer sample
rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",")

#  [1] 60488   477
head(rna, 5)
# setting row name for expression matrix
rownamefordata <- substr(rna$genes,1,15)
rownames(rna) <- rownamefordata
# removing column with gene name
rna <- rna[, -1]

# reading clinical data 
# read uromol clinical
uromol_clin <- read.table("uromol_clinic.csv", sep = ",", header = T)
#Setting row name for clinical data
ids <- uromol_clin$UniqueID
rownames(uromol_clin) <- ids

# I want to compare samples with Ta stage against those in higher stage (non_Ta)
# making column for comparison


# CIS   T1 T2-4   Ta 
# 3     112   16  345 

# selecting only Ta samples
uromol_clin$isTa <- as.factor(ifelse(uromol_clin$Stage == "Ta", "Ta", "non_Ta"))
# [1] "non-Ta" "Ta" 

# making sure dataset are compatible regrding to sample order
rna <- rna[, row.names(uromol_clin)]
all(rownames(uromol_clin) %in% colnames(rna))

#Making DESeqDataSet object which stores all experiment data
dds <- DESeqDataSetFromMatrix(countData = rna,
                              colData = uromol_clin,
                              design = ~ isTa)

# prefilteration: it is not necessary but recommended to filter out low expressed genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# data tranfromation
vsd <- vst(dds, blind=FALSE)

dds <- DESeq(dds) #This would take some time
res <- results(dds, alpha=0.05)

# Steps toward doing gene set enrichment analysis (GSEA):

# 1- obtaining a stats for ranking genes in your experiment,
# 2- creating a named vector out of the DESeq2 result
# 3- Obtaining a gene set from mysigbd
# 4- doing analysis

# already we performed DESeq2 analysis and have statistics for working on it
res$row <- rownames(res)

# important notice: if you have not such stats in your result (say comming from edgeR),
# you may need to create a rank metric for your genes. To do this:
# metric = -log10(pvalue)/sign(log2FC)

# Map Ensembl gene IDs to symbol. First create a mapping table.
ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
names(ens2symbol)[1] <- "row"

ens2symbol <- as_tibble(ens2symbol)
# joining
res <- merge(data.frame(res), ens2symbol, by=c("row"))

# remove the NAs, averaging statitics for a multi-hit symbol
res2 <- res %>% 
  select(SYMBOL, stat) %>% 
  na.omit() %>% 
  distinct() %>% 
  group_by(SYMBOL) %>% 

# creating  a named vector [ranked genes]
ranks <- res2$stat
names(ranks) <- res2$SYMBOL

# Load the pathway (gene set) into a named list
# downloaded mysigdb were located in my "~" directory:
pathways.hallmark <- gmtPathways("~/mysigdb/h.all.v7.2.symbols.gmt")

# show a few lines from the pathways file

#Running fgsea algorithm:
fgseaRes <- fgseaMultilevel(pathways=pathways.hallmark, stats=ranks)

# Tidy the results:
fgseaResTidy <- fgseaRes %>%
  as_tibble() %>%
  arrange(desc(NES)) # order by normalized enrichment score (NES)

# To see what genes are in each of these pathways:
gene.in.pathway <- pathways.hallmark %>% 
  enframe("pathway", "SYMBOL") %>% 
  unnest(cols = c(SYMBOL)) %>% 
  inner_join(res, by="SYMBOL")


#__________bar plot _______________#
# Plot the normalized enrichment scores. 
#Color the bar indicating whether or not the pathway was significant:
fgseaResTidy$adjPvalue <- ifelse(fgseaResTidy$padj <= 0.05, "significant", "non-significant")
cols <- c("non-significant" = "grey", "significant" = "red")
ggplot(fgseaResTidy, aes(reorder(pathway, NES), NES, fill = adjPvalue)) +
  geom_col() +
  scale_fill_manual(values = cols) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  coord_flip() +
  labs(x="Pathway", y="Normalized Enrichment Score",
  title="Hallmark pathways Enrichment Score from GSEA")

enter image description here

#__________Enrichment  Plot_______#
# Enrichment plot for E2F target gene set
plotEnrichment(pathway = pathways.hallmark[["HALLMARK_E2F_TARGETS"]], ranks)

enter image description here


plotGseaTable(pathways.hallmark[fgseaRes$pathway[fgseaRes$padj < 0.05]], ranks, fgseaRes, 

enter image description here

#________ Heatmap Plot_____________#
# pathways with significant enrichment score
sig.path <- fgseaResTidy$pathway[fgseaResTidy$adjPvalue == "significant"]
sig.gen <- unique(na.omit(gene.in.pathway$SYMBOL[gene.in.pathway$pathway %in% sig.path]))

### create a new data-frame that has '1' for when a gene is part of a term, and '0' when not
h.dat <- dcast(gene.in.pathway[, c(1,2)], SYMBOL~pathway)
rownames(h.dat) <- h.dat$SYMBOL
h.dat <- h.dat[, -1]

h.dat <- h.dat[rownames(h.dat) %in% sig.gen, ]
h.dat <- h.dat[, colnames(h.dat) %in% sig.path]

# keep those genes with 3  or more occurnes

# 1       2    3    4    5    6 
# 1604  282   65   11    1    1 
h.dat <- h.dat[data.frame(rowSums(h.dat)) >= 3, ]

topTable <- res[res$SYMBOL %in% rownames(h.dat), ]
rownames(topTable) <- topTable$SYMBOL
# match the order of rownames in toptable with that of h.dat
topTableAligned <- topTable[which(rownames(topTable) %in% rownames(h.dat)),]
topTableAligned <- topTableAligned[match(rownames(h.dat), rownames(topTableAligned)),]
all(rownames(topTableAligned) == rownames(h.dat))

# colour bar for -log10(adjusted p-value) for sig.genes
dfMinusLog10FDRGenes <- data.frame(-log10(
 topTableAligned[which(rownames(topTableAligned) %in% rownames(h.dat)), 'padj']))
dfMinusLog10FDRGenes[dfMinusLog10FDRGenes == 'Inf'] <- 0

# colour bar for fold changes for sigGenes
dfFoldChangeGenes <- data.frame(
 topTableAligned[which(rownames(topTableAligned) %in% rownames(h.dat)), 'log2FoldChange'])

# merge both
dfGeneAnno <- data.frame(dfMinusLog10FDRGenes, dfFoldChangeGenes)
colnames(dfGeneAnno) <- c('Gene score', 'Log2FC')
dfGeneAnno[,2] <- ifelse(dfGeneAnno$Log2FC > 0, 'Up-regulated',
                        ifelse(dfGeneAnno$Log2FC < 0, 'Down-regulated', 'Unchanged'))
colours <- list(
 'Log2FC' = c('Up-regulated' = 'royalblue', 'Down-regulated' = 'yellow'))
haGenes <- rowAnnotation(
 df = dfGeneAnno,
 col = colours,
 width = unit(1,'cm'),
 annotation_name_side = 'top')

# Now a separate color bar for the GSEA enrichment padj. This will 
# also contain the enriched term names via annot_text()

# colour bar for enrichment score from fgsea results
dfEnrichment <- fgseaRes[, c("pathway", "NES")]
dfEnrichment <- dfEnrichment[dfEnrichment$pathway %in% colnames(h.dat)]
dd <- dfEnrichment$pathway
dfEnrichment <- dfEnrichment[, -1]
rownames(dfEnrichment) <- dd
colnames(dfEnrichment) <- 'Normalized\n Enrichment score'
haTerms <- HeatmapAnnotation(
 df = dfEnrichment,
 Term = anno_text(
   rot = 45,
   just = 'right',
   gp = gpar(fontsize = 12)),
 annotation_height = unit.c(unit(1, 'cm'), unit(8, 'cm')),
 annotation_name_side = 'left')
# now generate the heatmap
hmapGSEA <- Heatmap(h.dat,
                   name = 'GSEA hallmark pathways enrichment',
                   split = dfGeneAnno[,2],
                   col = c('0' = 'white', '1' = 'forestgreen'),
                   rect_gp = gpar(col = 'grey85'),
                   cluster_rows = TRUE,
                   show_row_dend = TRUE,
                   row_title = 'Top Genes',
                   row_title_side = 'left',
                   row_title_gp = gpar(fontsize = 11, fontface = 'bold'),
                   row_title_rot = 90,
                   show_row_names = TRUE,
                   row_names_gp = gpar(fontsize = 11, fontface = 'bold'),
                   row_names_side = 'left',
                   row_dend_width = unit(35, 'mm'),
                   cluster_columns = TRUE,
                   show_column_dend = TRUE,
                   column_title = 'Enriched terms',
                   column_title_side = 'top',
                   column_title_gp = gpar(fontsize = 12, fontface = 'bold'),
                   column_title_rot = 0,
                   show_column_names = FALSE,
                   show_heatmap_legend = FALSE,
                   clustering_distance_columns = 'euclidean',
                   clustering_method_columns = 'ward.D2',
                   clustering_distance_rows = 'euclidean',
                   clustering_method_rows = 'ward.D2',
                   bottom_annotation = haTerms)

tiff("GSEA_enrichment_2.tiff", units="in", width=13, height=22, res=400)
draw(hmapGSEA + haGenes,
    heatmap_legend_side = 'right',
    annotation_legend_side = 'right')

enter image description here


1- clusterProfiler: universal enrichment tool for functional and comparative study

2- Fast Gene Set Enrichment Analysis

3- Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

4- DAVID Bioinformatics Resources 6.8

5- DESeq results in pathways in 60 Seconds with the fgsea package

6- Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures

7- Clustering of DAVID gene enrichment results from gene expression studies by Kevin Blighe.

ADD COMMENTlink modified 8 days ago • written 3 months ago by Hamid Ghaedi1.2k

Thank you very much for this tutorial. Just one question, when I run

fgseaRes <- fgseaMultilevel(pathways=pathways.hallmark, stats=ranks)

I'm getting the following message:

Error in mcexit(0L) : write error, closing pipe to the master

Do you what this could be?


ADD REPLYlink modified 6 weeks ago by _r_am32k • written 7 weeks ago by brunobsouzaa490

Hi @brunobsouzaa, This error is new to me.These post may be helpful:

1- https://stackoverflow.com/questions/51986674/mclapply-sendmaster-error-only-with-rscript

2- https://r.789695.n4.nabble.com/error-in-parallel-sendMaster-td4760382.html

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Hamid Ghaedi1.2k

Thank you very much for the links... After a deep inspection, it seems that it's just a warning message. Data looks ok!

ADD REPLYlink written 5 weeks ago by brunobsouzaa490

Please correct me if I'm wrong but I don't think that you need

distinct() %>%

if you are combining

 group_by(SYMBOL) %>% 

as it treats every SYMBOL as a grouping variable anyway and will always return the mean value for every repetitive.

ADD REPLYlink modified 2 days ago • written 2 days ago by Adrian Janucik0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1355 users visited in the last hour