Hi,
I am running trajectory analysis on a single cell dataset using Slingshot. I subsetted the original dataset to focus only in 2 cell types. I run PCA on the subsetted data and then use that for slingshot.
My issue is that, depending if I do the PCA with Seurat's RunPCA, or with scater's runPCA, the slingshot results change wildly, to the point of being incompatible.
This is the data and methods I have:
Seurat object
- 13k cells
- ~32k features
- SCT transformed
- ~21k features.
- 6 cell types - 19 clusters.
- 3 layers: counts, data, scale.data
Subset object
- 2 cell types - 6 clusters
And the code I use:
sub <- subset(sct, subset = cell_type %in% c("A", "B"))
### Seurat's PCA
sub <- RunPCA(sub, features = VariableFeatures(object = sub))
### RunUMAP and other details
...
### convert to SCE
sce.sub <- as.SingleCellExperiment(sub) ### built on the SCTransformed assay
### 3 assays: counts, logcounts, scaledata <- data became renamed to logcounts
colLabels(sce.sub) <-sce.sub$subtype
### Seurat's PCA is stored as "PCA"
### run scater's PCA and store it as "PCAv2"
sce.sub <- runPCA(sce.sub, subset_row=VariableFeatures(object = sub), name="PCAv2",
scale=F, ntop=Inf)
### RunUMAP and other details
...
### run Slingshot
sce.sling <- slingshot(sce.sub,
cluster=sce.sub$subtype,
reducedDim='PCA') ### or 'PCAv2'
I won't show the slingshot results because they get too complex.
I think it all lies in the PCA results being very different between the two methods:
> summary(reducedDims(sce.sub)[["PCA"]][,1]) ### Seurat's
Min. 1st Qu. Median Mean 3rd Qu. Max.
-62.913 -14.476 13.147 5.152 24.626 49.377
> summary(reducedDims(sce.sub)[["PCAv2"]][,1]) ### scater's
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.985 -6.428 -1.922 0.000 6.214 26.375
I tried running scater::runPCA with the parameters that most closely resemble Seurat's RunPCA, but I am still getting the same issues.
What might be going on here? Is this because the data is SCTtransformed and scater/SingleCellExperiment doesn't like that? Is either method scaling or centering the data while the other doesn't?
Thanks for your input.
Yes, the cell types are closely related. Both cell types have already been identified in multiple single cell papers. How they are related is what we are trying to understand.
All the pre-processing, filtering, clustering, marker identification, etc... was done in Seurat. Before
subset
. I ran again Seurat's PCA on the new object after subsetting the data.I did not re-run the clustering in the SingleCellExperiment object or anything else. I just inherited the metadata from the Seurat object. And then ran PCAv2.
AFAIK, both the Seurat and the SCE object are using the same assay and data:
... are
Seurat::RunPCA
andscater::runPCA
using both the default assay and layer? (SCT-data in seurat / SCT-logcounts in SCE).The variable features selected should be identical, because I explicitly called to the same function in both PCAs:
Aren't these two options equivalent?
As for the loadings. Converting the Seurat object into SCE does not transfer the feature loadings. I retrieved them both using:
It seems that scater removed a few (42) genes, but the rest are the same in both methods:
And then I compared them for PCs 1-4:
Seurat's PCA loadings are clearly more dispersed, and with outliers. Most of these genes are very relevant for the cell types studied.
The UMAPs are obviously different, because they come from PCAs with different results. However, when I say that the
slingshot
results are different, I am looking at the distribution of the pseudotime values to each cell / subtype / cell type, and how they are assigned to branches. For example, for one specific branch, in the slingshot results derived from Seurat's PCA, the subtypes are ordered like:A1 > A2 > B1 > A3
. However, the equivalent branch in scater's PCA results hasB1 > A1 > A2 > (a bit of B1 again) > A3
.I did previously some exploration of the slingshot results using scater's PCA, and I could find some "structure" in the data in PC3, driving the proposed trajectories. And the plot of PC3's loadings shows that they are pretty different between PCA methods.
I spent quite a long time analyzing slingshot results using scater::runPCA. We even made some convincing tests and had interesting biological explanations. However, now I see that those results disappear if I don't re-run the PCA after converting to SCE before slingshot.
What is happening here? Which method is right and which is wrong? Are scater::runPCA's underlying assumptions/method incompatible with SCT transformation?
Again, thanks for your input.
Potentially. Could you try doing a dummy test without SCT, just one sample with a few cells, log transform and scaling.
Hi,
It took me a while, but I finally arrived at the end of this (after a ton of debugging and tracing the calls).
Both Seurat and scater PCA methods end up calling, by default, to
irlba::irlba()
. However, they differ in their calls:scale.data
layer. And it disables centering (center = NULL
).logcounts
layer (=data
in Seurat objects). It is possible to manually change the layer used withexprs_values = "scaledata"
. This function also forces the usage of centering of the data. I wasn't able to find how to disable that.I tested this in both my data and the pbmc3k dataset, using the
RNA
slot:pbmc3k first 4 PCs using each function's default layer (scale.data in Seurat's, data in scater's):
And pbmc3k first 4 PCs using manually setting
exprs_values = "scaledata"
in scater::runPCA. Enabling/disabling centering wouldn't have any effect because scale.data is already virtually centered (rowMeans ~< 1e-10).Thus, mystery solved!
Still, the important question remains. Seurat forces the use of scale.data. And, after using SCTransform, they explicitly state that PCA should be run on the Pearson's residuals (scale.data). However, when running other downstream analyses based also on PCA, e.g.,
slingshot
, does the use of one or other PCA method/philosophy violate the assumptions of the algorithms used in downstream analyses?Thanks for the troubleshooting !
The data in scale.data is already centered so no need to center again.
If the logcounts layer is used, it needs to be centered and scaled to run PCA.
scater::runPCA(sce, scale=TRUE)
SingleCellExperiment
generally gives more freedom to the user thanSeurat
which allow you to scale or transform your data as you please before running PCA.But one needs to center around 0 and scale their normalized count matrix before running a PCA. The scaling will prevent few outliers variables from driving each unique principal component.
I found this thread really helpful.