Entering edit mode
13 days ago
Emy Alade
▴
20
Hello,
I performed a Gene Set Enrichment Analysis (GSEA) using the clusterProfiler package in R on a ranked gene list.
I want to display the genes names directly on the enrichment plot (like in the example picture below).
library(readr)
library(ggplot2)
library(dplyr)
library(clusterProfiler)
library(biomaRt)
library(org.Hs.eg.db)
library(patchwork)
library(msigdbr)
library(enrichplot)
# Perform GSEA using gseGO function on the ranked gene list
gseGO_result <- gseGO(
geneList = gene_list_unique,
ont = "BP", # Use Biological Process ontology
OrgDb = org.Hs.eg.db, # Human gene annotation database
keyType = "ENTREZID", # Key type for gene IDs
pvalueCutoff = 0.05, # p-value cutoff for significant terms
pAdjustMethod = "BH", # Multiple testing correction method (Benjamini-Hochberg)
minGSSize = 10, # Minimum gene set size
maxGSSize = 500, # Maximum gene set size
verbose = TRUE,
seed = TRUE # Set seed for reproducibility
)
# Simplify GSEA results by removing redundant GO terms based on similarity cutoff
gseGO_result2 <- clusterProfiler::simplify(
gseGO_result,
cutoff = 0.7, # Similarity cutoff to merge redundant terms
by = "p.adjust", # Use adjusted p-value for grouping
select_fun = min # Select term with minimal adjusted p-value
)
# Convert ENTREZ IDs to gene symbols for easier interpretation
gene_df <- bitr(
names(gene_list_unique),
fromType = "ENTREZID",
toType = "SYMBOL",
OrgDb = org.Hs.eg.db
)
# Add gene symbols to the core enrichment column of the gseGO results
gseGO_result2@result$core_enrichment_symbol <- sapply(
gseGO_result2@result$core_enrichment,
function(x) {
entrez_ids <- unlist(strsplit(x, "/"))
symbols <- gene_df$SYMBOL[match(entrez_ids, gene_df$ENTREZID)]
paste(symbols, collapse = "/")
}
)
plot <- gseaplot2(gseGO_result2,
geneSetID = "GO:0071824",
pvalue_table = TRUE,
color = "#86B875",
ES_geom = "line")
# Extraire les gènes "leading edge"
genes <- gseGO_result2@result[gseGO_result2@result$Description == "protein-DNA complex assembly", "core_enrichment"]
genes <- unlist(strsplit(genes, "/"))
deg_in_list <- sigDEGs_ids[sigDEGs_ids$ENTREZID %in% genes, ]
deg_in_list$rank <- match(deg_in_list$ENTREZID, names(gene_list_unique))
label_df <- na.omit(deg_in_list)
p3 <- plot[[3]]
# Préparer le dataframe pour les petites barres bleues à y=0
bars_df <- data.frame(
x = label_df$rank,
y = 0,
xend = label_df$rank,
yend = gene_list_unique[label_df$rank]
)
# Construire le plot
plot[[3]] <- p3 +
# Barres grises de tous les gènes
geom_segment(data = data.frame(
x = seq_along(gene_list_unique),
y = 0,
xend = seq_along(gene_list_unique),
yend = gene_list_unique),
aes(x = x, y = y, xend = xend, yend = yend),
color = "grey70") +
# Ligne horizontale à y=0
geom_hline(yintercept = 0, linetype = "dashed") +
# Petites barres bleues sur la baseline pour les gènes annotés
geom_segment(data = bars_df,
aes(x = x, y = y, xend = xend, yend = yend),
color = "blue",
size = 0.5) +
# Texte annoté avec flèches partant de y=0 vers le texte au-dessus
geom_text_repel(data = label_df,
aes(x = rank, y = 0, label = SYMBOL),
nudge_y = diff(range(gene_list_unique)) * 0.07, # ajuster la hauteur du texte
angle = 90,
size = 3.5,
segment.size = 0.3,
segment.curvature = 0,
segment.ncp = 10,
segment.angle = 20) +
labs(x = "Rank in ordered dataset", y = "Ranked list metric") +
theme_minimal()
# Affichage avec patchwork, augmenter la hauteur du 3e plot si besoin
plot[[1]] / plot[[2]] / plot[[3]] + plot_layout(heights = c(3, 1, 3))
Hi Emy, it looks like from your worked example, you've managed to label some genes. What specifically would you like help with?
I want the genes to be clearly readable and not overlapping (as in the first image). Normally, five genes should be displayed
You're more or less there...
Take a look at the ggrepel example below:
https://ggrepel.slowkow.com/articles/examples#limit-labels-to-a-specific-area
What I would recommend is that you select your genes of interest and feed them in via
label_df
then use a combination ofgeom_text_repel(xlim = ..., ylim = ..., nudge_x = ..., nudge_y = ...)
to correctly define the regions that ggrepel is able to use. You may need to expand the y-axis limits for the plot go provide more space for the labels if you plan to use genes with very long names.As an aside, I assume you want to code this rather than fix it in Illustrator because you plan to be generating many of these plots in a batch. I would still say there is an upper limit to how many of these plots are useful to show in this way. If you are going to be showing more than 3 or 4 of these, then maybe you could consider an alternative visualisation?
If you're able to share
gene_list_unique
,bars_df
,label_df
andgseGO_result2
, or a dummy example of each I could help you with some code.