Hi all,
Looking for some guidance on subclustering. I have 6 samples totaling approx. 30,000 cells, integrated into one Seurat object (I ran SCTransform individually on each sample and then merged them -- after merging, I went through a typical workflow of RunPCA() --> RunHarmony() --> RunUMAP() --> FindNeighbors() to batch correct and generate the integrated UMAP). So far, everything has worked out well, with good clustering, annotation, differential expression analysis (using DESeq2), etc. However, I'm a little unclear on exactly how to subcluster. What I've done is:
# Generating the integrated Seurat object
merged.sets <- merge(sample_1, y = c(sample_2, sample_3, sample_4, sample_5, sample_6), add.cell.ids = c("control", "control", "control", "disease", "disease", "disease"), project = "prelim_analysis")
features <- SplitObject(merged.sets, split.by = "sample_id")
features <- lapply(X = features,
                   FUN = SCTransform,
                   method = "glmGamPoi",
                   vars.to.regress = "percent.mt",
                   return.only.var.genes = FALSE)
var.features <- SelectIntegrationFeatures(object.list = features, nfeatures = 3000)
seurat.object <- merge(x = features[[1]], y = features[2:length(features)], merge.data = TRUE)
VariableFeatures(seurat.object) <- var.features
# Integrated object with 6 samples
seurat.object
# Downstream analysis, including clustering and annotation (not showing code here)
# Subsetting one cluster
cluster_1 <- subset(seurat.object, idents = "cluster_1")
# Changing the default assay from SCT to RNA
DefaultAssay(cluster_1) <- "RNA"
# Going through the guided-clustering tutorial's workflow
cluster_1 <- NormalizeData(cluster_1)
cluster_1 <- FindVariableFeatures(cluster_1)
cluster_1 <- ScaleData(cluster_1, vars.to.regress = "percent.mt")
cluster_1 <- RunPCA(cluster_1)
# Elbow plot to identify dimensionality
ElbowPlot(cluster_1)
# Use the elbow from above to subcluster the cells
cluster_1 <- FindNeighbors(cluster_1, dims = 1:10)
cluster_1 <- FindClusters(cluster_1, resolution = 0.2)
cluster_1 <- RunUMAP(cluster_1, dims = 1:10)
DimPlot(cluster_1, reduction = "umap")
# Then, perform downstream analyses with FindAllMarkers(), etc.
The questions I have are:
Should I use the above workflow, or should I instead run SCTransform on the subsetted cells of cluster_1 (after changing the default assay back to "RNA")? I wasn't sure whether to stick with SCTransform, or whether it's considered a "no-no" to switch to the older log-normalization and scaling workflow instead.
If it is recommended to stick with SCTransform, should I replace NormalizeData(), FindVariableFeatures(), and ScaleData() with SCTransform(cluster_1, vars.to.regress = "percent.mt") and go through the SCTransform vignette's workflow? My sense is that I should not reuse the SCT data stored in the subset and should instead redo the SCTransform step (since the variable genes should be identified among cells within the same cluster after subsetting, and these should differ from those of the entire dataset).
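To make that alternative concrete, here is a minimal sketch of what re-running SCTransform on the subset might look like. It is an illustration under assumptions, not a validated recommendation: `recluster_with_sct` is a hypothetical helper name, the RNA assay is assumed to still hold raw counts, `percent.mt` is assumed to be in the metadata, and `dims` and `resolution` are placeholders carried over from the workflow above. It does not repeat the Harmony step on the subset.

```r
# Sketch only: re-normalize a subsetted cluster with SCTransform, then recluster.
# `recluster_with_sct` is a hypothetical helper, not a Seurat function.
# Assumes Seurat and glmGamPoi are installed, the RNA assay still holds raw
# counts, and `percent.mt` is present in the metadata.
recluster_with_sct <- function(obj, dims = 1:10, resolution = 0.2) {
  # Start from the raw counts rather than the SCT assay carried over
  # from the full dataset
  Seurat::DefaultAssay(obj) <- "RNA"

  # Refit the SCT model on these cells only, so the variable features
  # reflect variation within the cluster
  obj <- Seurat::SCTransform(obj, assay = "RNA", method = "glmGamPoi",
                             vars.to.regress = "percent.mt", verbose = FALSE)

  # SCTransform sets the default assay to "SCT"; PCA then uses its variable features
  obj <- Seurat::RunPCA(obj, verbose = FALSE)
  obj <- Seurat::FindNeighbors(obj, dims = dims)
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  obj <- Seurat::RunUMAP(obj, dims = dims)
  obj
}

# Usage (commented out because it needs the real object):
# cluster_1 <- recluster_with_sct(cluster_1, dims = 1:10, resolution = 0.2)
# Seurat::DimPlot(cluster_1, reduction = "umap")
```

The number of dimensions would still need to be chosen from an elbow plot of the new PCA, as in the workflow above.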
I'm finding that I have to set my resolution low (around 0.1 or 0.2) in FindClusters(), which results in 3-4 subclusters. If I increase the resolution even slightly, however, I get around 10 subclusters back, which does not make sense to me biologically. Since I have some idea of what to expect, I've been keeping the resolution low instead. Is that normal or unusual? For example, it doesn't make sense to expect 12 different subclusters of macrophages, but 3 to 4 makes sense biologically in the disease I'm studying. I'm trying to use biological knowledge to drive the computational parameters here.
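One way to see how sharply the number of subclusters changes with resolution is to run FindClusters() over several resolutions at once and count the clusters at each. FindClusters() accepts a vector of resolutions and stores each result in a metadata column named after the graph and the resolution (for example `RNA_snn_res.0.2` when the default assay is RNA). The sketch below assumes that naming; `count_clusters` is a hypothetical helper, and the toy metadata table is made up purely to show the shape of the output.

```r
# Sketch only: count how many clusters each resolution produced.
# `count_clusters` is a hypothetical helper working on a metadata data.frame.
# `prefix` is an assumption: "RNA_snn_res." if the graph was built on the RNA
# assay, "SCT_snn_res." if it was built on SCT.
count_clusters <- function(meta, prefix = "RNA_snn_res.") {
  cols <- colnames(meta)[startsWith(colnames(meta), prefix)]
  # One integer per resolution column: the number of distinct cluster labels
  vapply(meta[cols], function(x) length(unique(x)), integer(1))
}

# Toy metadata (made-up values) standing in for cluster_1@meta.data
toy_meta <- data.frame(
  RNA_snn_res.0.1 = c("0", "0", "1", "1"),
  RNA_snn_res.0.4 = c("0", "1", "2", "3"),
  percent.mt      = c(1, 2, 3, 4)
)
print(count_clusters(toy_meta))

# With the real object (commented out because it needs Seurat and the data):
# cluster_1 <- Seurat::FindClusters(cluster_1, resolution = c(0.1, 0.2, 0.3, 0.5))
# count_clusters(cluster_1@meta.data)
```

A table like this makes it easier to tell whether the jump from 3-4 to around 10 subclusters happens at a single resolution step or gradually.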
I've been having trouble with the FindSubCluster() function and there isn't much guidance on it, which is the reason I decided on the above workflow.
Thanks all, I appreciate the help. I did try to find help both here and on the Seurat GitHub, but unfortunately there is not much direction on subclustering. At this point, I've spent so many hours trying to figure this out that I'm at my wit's end.
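For reference, a minimal sketch of the FindSubCluster() route mentioned above. FindSubCluster() reclusters one identity class on a nearest-neighbor graph that already exists in the object, so it does not recompute variable features or PCA for the subset. `subcluster_in_place` is a hypothetical wrapper, and the graph name `"SCT_snn"` is an assumption about what FindNeighbors() created on the integrated object; `names(seurat.object@graphs)` lists the graphs actually present.

```r
# Sketch only: subcluster one identity class in place with FindSubCluster().
# `subcluster_in_place` is a hypothetical wrapper, not a Seurat function.
# Assumes Seurat is installed and that `graph.name` matches a graph created
# earlier by FindNeighbors() on the full object.
subcluster_in_place <- function(obj, cluster = "cluster_1",
                                graph.name = "SCT_snn",
                                subcluster.name = "sub.cluster",
                                resolution = 0.2) {
  # Fail early with a readable message if the graph is missing
  if (!graph.name %in% names(obj@graphs)) {
    stop("Graph '", graph.name, "' not found; available graphs: ",
         paste(names(obj@graphs), collapse = ", "))
  }
  # Recluster only the cells of `cluster`; new labels go into the
  # `subcluster.name` metadata column
  obj <- Seurat::FindSubCluster(obj, cluster = cluster,
                                graph.name = graph.name,
                                subcluster.name = subcluster.name,
                                resolution = resolution)
  # Make the new labels the active identities for plotting and marker finding
  Seurat::Idents(obj) <- subcluster.name
  obj
}

# Usage (commented out because it needs the real object):
# seurat.object <- subcluster_in_place(seurat.object, cluster = "cluster_1")
# Seurat::DimPlot(seurat.object, reduction = "umap")
```

Because this reuses the graph from the full dataset, its results can differ from the subset-and-reprocess workflow shown earlier.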