Need help with fgsea
2
0
Entering edit mode
12 months ago
bio_info ▴ 20

Hi all,

I am trying to run an fgsea analysis using the fgsea package in R for all cluster biomarkers in my scRNA-seq dataset. I have written a short script to do this. However, all the resulting pathways have a very high adjusted-p-val, around 0.9. I have tried playing with the number of top genes in my ranked vector, from 200 till 500 and also playing with the minSize parameter but I still get very high p-values. Can anyone please help me in figuring out where the mistake is?

library(fgsea)

markers <- FindAllMarkers(sobj, verbose = T, only.pos = T)

# Remove mito and ribo genes, filter by adj_p_val
markers_filtered <- markers %>%
filter(!str_detect(gene, "^MT-") & !str_detect(gene, "^RPS") & !str_detect(gene, "^RPL")) %>% 
filter(p_val_adj < 0.05) %>% 
arrange(desc(avg_log2FC))

# Prepare ranked vector
markers_ranked_vector <- list()

for (i in unique(markers_filtered$cluster)) {
cluster_data <- markers_filtered %>% 
filter(cluster == i)

top_genes <- cluster_data %>% 
arrange(desc(avg_log2FC)) %>% 
slice_head(n = 300)

# Create a named vector 
markers_ranked_vector[[as.character(i)]] <- setNames(top_genes$avg_log2FC, top_genes$gene)
}

# Download msigdb hallmark genesets
hallmark <- msigdbr::msigdbr(species = "Homo sapiens", category = "H")

msigdbr_list = split(x = hallmark$gene_symbol, f = hallmark$gs_name)

fgsea_results <- lapply(markers_ranked_vector, function(x) {
  fgsea(
    pathways = msigdbr_list,  
    stats = x,
    scoreType = "pos",
    minSize = 10
  )
})

Thank you for your time!

scRNAseq fgsea • 1.5k views
ADD COMMENT
1
Entering edit mode
12 months ago

You don't need to prefilter GSEA before running it, GSEA is designed to take in the full gene-list ranked by a statistic, not top X genes.

ADD COMMENT
0
Entering edit mode

I see, so I don't need to select the top 300 genes? I can pass the entire result of my DEG to fgsea? This includes around 100 genes per cluster...

ADD REPLY
0
Entering edit mode

I am still getting very high p-values of around 0.9, what can I do in this situation? Should I redo my clustering or use different parameters for the fgsea?

ADD REPLY
1
Entering edit mode

Have you tried using pseudo-bulk analysis for DE and then GSEA? That's generally the more recommended option as it is more stable than the default Seurat FindMarkers()

Otherwise:

markers <- FindAllMarkers(sobj, verbose = T, only.pos = FALSE) ## include negative genes - GSEA needs +ve and -ve ranked enrichment to work properly

Also consider that Hallmarks isn't always the best MSigDB to use especially when the difference in your data is more sublitle. You should try C2 and C5 at as a minimum also

ADD REPLY
0
Entering edit mode
15 days ago
Kevin Blighe ★ 90k

Your issue with high adjusted p-values in the fgsea analysis arises from multiple problems in your code and approach.

First, you set only.pos = TRUE in FindAllMarkers(). This excludes genes with negative fold changes, which are required for proper gene set enrichment analysis. Gene set enrichment analysis relies on a complete ranked list that includes both upregulated and downregulated genes to detect meaningful enrichments. Change this to only.pos = FALSE to include all differentially expressed genes.

Second, you are limiting the ranked vector to the top 300 genes per cluster. Gene set enrichment analysis is designed to use the full list of genes ranked by a statistic, such as the average log2 fold change or a signed p-value metric (for example, -log10(p_val) * sign(avg_log2FC)). Restricting to top genes biases the results and often leads to insignificant p-values. Instead, compute differential expression across all genes in the object and rank them entirely.

Third, filtering by adjusted p-value < 0.05 before ranking is not recommended for gene set enrichment analysis.

Fourth, the Hallmark gene sets may not capture subtle differences in your single-cell RNA sequencing data. Test additional Molecular Signatures Database collections, such as C2 (curated gene sets) or C5 (ontology gene sets), which provide broader coverage.

Finally, the FindAllMarkers() function in Seurat can produce unstable differential expression results in single-cell data due to dropout and variability. Consider switching to a pseudobulk approach for differential expression, aggregating counts per sample or donor before testing, which often yields more reliable enrichments.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3424 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6