Hello,
I’m working on a single-cell RNA-seq dataset where I want to compare pathway activity across different cell groups (e.g., tumor, immune, other) within each tumor sample. These cell groups each include multiple clusters (e.g., macrophages, NK cells for the immune group). I would like to use GSVA or ssGSEA for pathway analysis. I am not sure which algorithm is suitable.
I’ve tried two approaches:
- Run GSVA on the full matrix using GetAssayData(seurat_obj, slot = "scale.data"), then perform grouping in post-analysis visualization with a heatmap.
- Aggregate expression using AggregateExpression() by orig.ident and cell_group, then run GSVA on the resulting matrix, but I’m unsure if it's the correct approach.
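The grouping step of the first approach could be sketched as averaging per-cell scores within each sample and cell group. This is only a minimal base-R sketch under assumptions: `scores` (gene sets x cells) and `meta` (one row per cell, with columns orig.ident and cell_group) are made-up toy objects standing in for real GSVA output and Seurat metadata, and taking the mean is just one possible summary.

```r
set.seed(1)

# Hypothetical per-cell scores: 2 gene sets x 8 cells
scores <- matrix(rnorm(16), nrow = 2,
                 dimnames = list(c("HALLMARK_HYPOXIA", "HALLMARK_GLYCOLYSIS"),
                                 paste0("cell", 1:8)))

# Hypothetical metadata, one row per cell, in the same order as the columns of scores
meta <- data.frame(orig.ident = rep(c("T1", "T2"), each = 4),
                   cell_group = rep(c("tumor", "immune"), times = 4),
                   row.names  = colnames(scores))

# One label per cell, e.g. "T1_tumor"
grp <- paste(meta$orig.ident, meta$cell_group, sep = "_")

# Average the scores of all cells that share a label
group_means <- sapply(split(seq_len(ncol(scores)), grp),
                      function(idx) rowMeans(scores[, idx, drop = FALSE]))

group_means  # gene sets x (sample, cell group) combinations
```

The resulting matrix has the same shape as the output of the second approach (one column per sample and cell group), so the two can be plotted side by side in a heatmap.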
Edit: I have not been able to run the first method, but I have been able to run the second. I am still not sure which one is more appropriate. Doesn't pseudobulking defeat the purpose of scRNA-seq, since you return to a bulk-like state?
Previously, I used the ssGSEA parameter on the whole matrix obtained from the GetAssay function. As I understand it, the algorithm then runs the analysis on every single cell in the matrix. I terminated the analysis because it had not finished after nearly one week of processing. I am sure I am doing something wrong, but I am not sure what. This method was inspired by this GitHub response suggesting the GetAssay option.
The biological question is this: for each tumor sample, which pathways are active in each cell group (tumor, immune, other), where each cell group contains multiple clusters of cell types (for example, macrophages and NK cells in the immune group)? These cell groups were defined from the earlier cluster identification.
Because online guides on GSVA for single-cell data are so scarce, I am not sure whether it is appropriate to define our cell groups before or after the GSVA analysis, or whether to input the pseudobulk matrix obtained from the AggregateExpression function or the scale.data layer, which was normalized and scaled in an earlier Seurat step and is retrieved with GetAssayData(layer = "scale.data").
Here is the relevant code:
object <- GetAssayData(object = seurat_obj, assay = "RNA", layer = "scale.data")
# ssgsea_object <- ssgseaParam(object, geneSets = gene_sets)
gsva_object <- gsvaParam(object, geneSets = gene_sets)
gsva_results <- GSVA::gsva(param = gsva_object)
The alternative is to use the pseudobulk matrix obtained from AggregateExpression, such as:
avg_exp_group <- AggregateExpression(seurat_obj, group.by = c("orig.ident", "cell_group"))
gsva_object <- gsvaParam(avg_exp_group$RNA, geneSets = gene_sets)
gsva_results <- GSVA::gsva(param = gsva_object)
I am not sure which way is correct, or whether there is a methodological misstep in either of them, since the official GSVA vignette doesn't really mention anything about scRNA-seq. Ironically, asking ChatGPT gives two different responses on two different computers, one suggesting the first approach and the other the second. Also, just to be sure, should I use the counts layer or the scale.data layer?
For the list of genes:
library(msigdbr)
library(dplyr)
# HALLMARK: HYPOXIA
hallmark <- msigdbr(species = species, collection = "H")
hallmark_hypoxia <- hallmark %>% filter(gs_name == "HALLMARK_HYPOXIA")
[...]
# Assuming I have multiple gene sets from msigdbr
gene_sets <- list(
HALLMARK_HYPOXIA = unique(hallmark_hypoxia$gene_symbol),
HALLMARK_GLYCOLYSIS = unique(hallmark_glycolysis$gene_symbol)
)
Many thanks in advance, Minh-Anh
Edit2: It turns out I had wrongly based my call on this code from the GitHub response:
GSVA::gsva(expr = object, gset.idx.list = geneset, ...)
The function's arguments have changed since then, so the ones I passed were invalid and the call never completed. I have now been able to run both methods. However, I would still like to ask which input is more appropriate for the gsva function, the full matrix (from data or scale.data) or the pseudobulk matrix, and whether GSVA or ssGSEA is the better choice.
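For reference, the change in interface can be sketched on toy data. This is a minimal sketch assuming GSVA 1.50 or later, where a parameter object selects the method; `toy_expr` and `toy_sets` are made-up placeholders rather than real data, and all other settings are left at their defaults.

```r
library(GSVA)

set.seed(1)

# Hypothetical toy data: 50 genes x 6 pseudo-samples
toy_expr <- matrix(rnorm(50 * 6), nrow = 50,
                   dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))
toy_sets <- list(setA = paste0("gene", 1:10),
                 setB = paste0("gene", 11:25))

# Old interface (no longer valid):
#   gsva(expr = toy_expr, gset.idx.list = toy_sets)

# Current interface: build a parameter object, then pass it to gsva()
gsva_scores   <- gsva(gsvaParam(toy_expr, geneSets = toy_sets))
ssgsea_scores <- gsva(ssgseaParam(toy_expr, geneSets = toy_sets))

dim(gsva_scores)    # gene sets x samples
dim(ssgsea_scores)  # gene sets x samples
```

Both calls return a matrix with one row per gene set and one column per input column, whether those columns are single cells or pseudobulk profiles.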