Yes, the analysis that you describe makes sense statistically. In single-cell RNA sequencing, clusters typically represent distinct cell populations or states. Comparing differential expression between subsets of cells from different clusters under the same condition (for example, Cluster1_WT versus Cluster2_WT) is equivalent to testing for gene expression differences between those populations within that condition. This is valid as long as each subset contains sufficient cells for reliable statistical testing, and the clustering is robust.
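A quick way to check the cell-number caveat is to cross-tabulate clusters against condition before running any test. The sketch below uses a toy metadata table so that it runs on its own; with a real object you would use its metadata instead (for example YourSeuratObject@meta.data). The column names `seurat_clusters` and `Condition` are assumptions, and the threshold of 2 cells is an arbitrary value chosen for the toy data, not a recommendation.

```r
# Toy stand-in for a Seurat object's metadata (hypothetical values)
meta <- data.frame(
  seurat_clusters = factor(c("1", "1", "1", "2", "2", "2", "2", "3")),
  Condition       = c("WT", "WT", "Mutant", "WT", "Mutant", "Mutant", "WT", "Mutant")
)

# Number of cells per cluster (rows) and condition (columns)
cell_counts <- table(cluster = meta$seurat_clusters, condition = meta$Condition)
print(cell_counts)

# List cluster-condition combinations below an arbitrary minimum size
min_cells <- 2
count_df  <- as.data.frame(cell_counts)
too_small <- count_df[count_df$Freq < min_cells, ]
print(too_small)
```

Any combination listed in `too_small` is one where a Cluster_WT or Cluster_Mutant comparison would rest on very few cells.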
The potential issue is not statistical validity but interpretation. If the clusters were identified using all cells (wild-type and mutant together), the cluster assignments already reflect condition-related differences to some extent. However, subsetting by condition and then comparing clusters restricts the comparison to differences between cell populations within that one condition.
In Seurat, you can perform this analysis without splitting the object and re-clustering, which risks altering cluster definitions. Instead, proceed as follows:
library(Seurat)
# Subset to wild-type cells
WT <- subset(YourSeuratObject, subset = Condition == "WT")
# Set cluster identities
Idents(WT) <- "seurat_clusters" # or your cluster column
# Run differential expression between Cluster1 and Cluster2
DE_WT <- FindMarkers(WT, ident.1 = "1", ident.2 = "2", test.use = "wilcox") # adjust test as needed
Repeat the process for mutant cells. This approach uses the original clustering while restricting to the condition of interest.
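An alternative to subsetting is to build combined cluster_condition labels (such as "1_WT") and compare those directly, which keeps both conditions in one object and should give the same comparison. This is only a sketch: `compare_within_condition` is a hypothetical helper, not a Seurat function, and it assumes metadata columns named `seurat_clusters` and `Condition` with condition labels "WT" and "Mutant".

```r
library(Seurat)

# Hypothetical helper (not part of Seurat): compare two clusters within one
# condition, reusing the cluster labels from the original joint clustering.
compare_within_condition <- function(obj, condition, cluster_a, cluster_b,
                                     cluster_col = "seurat_clusters",
                                     condition_col = "Condition") {
  # Build combined labels such as "1_WT" or "2_Mutant"
  obj$cluster_condition <- paste(obj@meta.data[[cluster_col]],
                                 obj@meta.data[[condition_col]],
                                 sep = "_")
  Idents(obj) <- "cluster_condition"

  # Test cluster_a against cluster_b using only cells from `condition`
  FindMarkers(obj,
              ident.1 = paste(cluster_a, condition, sep = "_"),
              ident.2 = paste(cluster_b, condition, sep = "_"),
              test.use = "wilcox")
}

# Usage, assuming the conditions are labelled "WT" and "Mutant":
# DE_WT     <- compare_within_condition(YourSeuratObject, "WT", "1", "2")
# DE_Mutant <- compare_within_condition(YourSeuratObject, "Mutant", "1", "2")
```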
If cell numbers in the subsets are low, consider pseudobulking (aggregating expression across the cells in each cluster-condition combination) before differential expression to improve power, but this is optional.
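To make the pseudobulk idea concrete, here is a minimal base-R sketch that sums raw counts over the cells sharing a label. The count matrix, gene names, and sample labels are all made up for illustration. Note that downstream tools such as DESeq2 or edgeR need replicate samples per group, so the labels here include a hypothetical replicate ID. In Seurat, `AggregateExpression()` serves a similar purpose.

```r
# Toy raw-count matrix: 3 genes x 6 cells (hypothetical values)
counts <- matrix(
  c(1, 0, 2, 3, 0, 1,
    0, 4, 1, 0, 2, 2,
    5, 1, 0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)

# One label per cell: cluster, condition and replicate (hypothetical IDs)
group <- c("C1_WT_rep1", "C1_WT_rep1", "C1_WT_rep2",
           "C2_WT_rep1", "C2_WT_rep2", "C2_WT_rep2")

# rowsum() sums rows by group, so transpose to cells x genes and back
pseudobulk <- t(rowsum(t(counts), group = group))
print(pseudobulk)
```

The result has one column per cluster-condition-replicate combination, which is the shape a bulk-style DE tool expects.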
Kevin
Here are some past threads that may be useful:
Using Pseudobulk Approach for Identifying Marker Genes Within a Single Condition
scRNA-seq: How does cell number in clusters affect the number of DE genes?
Best choices for DGE and pathway enrichment analysis in single cell data using pseudobulk?
scRNAseq Differential expression analysis
https://bioconductor.org/books/3.21/OSCA.multisample/multi-sample-comparisons.html#creating-pseudo-bulk-samples
Thank you, I will check out the links. I'm not certain that pseudobulking is the issue here, though; I've run pseudobulk DE before. What I haven't tried is what the user is requesting: comparing a subset of Cluster1 vs a subset of Cluster2.
Thanks for clarifying. I missed the "subset" part in your question. What criteria will you be using to subset the data? Can you share the basis of this odd request?
It's just as it is. Actually it's not Ctrl vs Treated, it's Wild Type vs Mutant. Almost every cluster contains a mix of WT and Mutant cells.
The user just really really wants to compare Cluster1_WT cells vs Cluster2_WT cells, and the same for Mutant.
I think I've come up with a possible solution. Split the Seurat object into two, "WT" and "Mutant", then run clustering separately on each object. After that the user can compare Cluster1 vs Cluster2 as much as they want.