Differeces between Seurat::RunPCA and scater::runPCA affect downstream analyses
1
0
Entering edit mode
4 months ago
txema.heredia ▴ 250

Hi,

I am running trajectory analysis on a single cell dataset using Slingshot. I subsetted the original dataset to focus only in 2 cell types. I run PCA on the subsetted data and then use that for slingshot.

My issue is that, depending if I do the PCA with Seurat's RunPCA, or with scater's runPCA, the slingshot results change wildly, to the point of being incompatible.

This is the data and methods I have:

  • Seurat object

    • 13k cells
    • ~32k features
    • SCT transformed
      • ~21k features.
      • 6 cell types - 19 clusters.
      • 3 layers: counts, data, scale.data
  • Subset object

    • 2 cell types - 6 clusters

And the code I use:

sub <- subset(sct, subset = cell_type %in% c("A", "B"))

### Seurat's PCA
sub <- RunPCA(sub, features = VariableFeatures(object = sub)) 

### RunUMAP and other details
...
### convert to SCE
sce.sub <- as.SingleCellExperiment(sub)  ### built on the SCTransformed assay
                                         ### 3 assays: counts, logcounts, scaledata <- data became renamed to logcounts
colLabels(sce.sub) <-sce.sub$subtype

### Seurat's PCA is stored as "PCA"
### run scater's PCA and store it as "PCAv2"
sce.sub <- runPCA(sce.sub, subset_row=VariableFeatures(object = sub), name="PCAv2",
                   scale=F, ntop=Inf)

### RunUMAP and other details
...
### run Slingshot
sce.sling <- slingshot(sce.sub, 
                    cluster=sce.sub$subtype,
                    reducedDim='PCA') ### or 'PCAv2'

I won't show the slingshot results because they get too complex.

I think it all lies in the PCA results being very different between the two methods:

> summary(reducedDims(sce.sub)[["PCA"]][,1]) ### Seurat's
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-62.913 -14.476  13.147   5.152  24.626  49.377  

> summary(reducedDims(sce.sub)[["PCAv2"]][,1]) ### scater's
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-13.985  -6.428  -1.922   0.000   6.214  26.375 

I tried running scater::runPCA with the parameters that most closely resemble Seurat's RunPCA, but I am still getting the same issues.

What might be going on here? Is this because the data is SCTtransformed and scater/SingleCellExperiment doesn't like that? Is either method scaling or centering the data while the other doesn't?

slingshot PCA Seurat scater single-cell • 1.1k views
ADD COMMENT
0
Entering edit mode
4 months ago

I would inspect the loadings of the PCAs rather than a strict PC1 vs. PC1v2. Check first if the genes driving PC1 are the same as the ones driving PC1v2.

From you chunk of code we can only trust you on the default assay selected as well as the variable features selection.

Before even getting into slingshot results, are your clusters and UMAPs comparable between Seurat and scater ?

Note : Are your 2 cell types from the same lineage ? If you are doing trajectory analysis on complete different cell types the trajectory will make no sense.

ADD COMMENT
0
Entering edit mode

Thanks for your input.

Yes, the cell types are closely related. Both cell types have already been identified in multiple single cell papers. How they are related is what we are trying to understand.

All the pre-processing, filtering, clustering, marker identification, etc... was done in Seurat. Before subset. I ran again Seurat's PCA on the new object after subsetting the data.

I did not re-run the clustering in the SingleCellExperiment object or anything else. I just inherited the metadata from the Seurat object. And then ran PCAv2.

From you chunk of code we can only trust you on the default assay selected as well as the variable features selection.

AFAIK, both the Seurat and the SCE object are using the same assay and data:

### Seurat
> sub
An object of class Seurat 
52986 features across 9084 samples within 2 assays 
Active assay: SCT (20701 features, 5000 variable features)   <------- SCT
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 7 dimensional reductions calculated: pca, umap.unintegrated, integrated.cca, umap, umap.SCT, tsne, ...

### SCE
> sce.sub
class: SingleCellExperiment 
dim: 20701 9084 
metadata(0):
assays(3): counts logcounts scaledata
rownames(20701): Xkr4 Gm1992 ... CAAA01147332.1 AC149090.1
rowData names(0):
colnames(9084): .....
colData names(64): orig.ident nCount_RNA ... ident label
reducedDimNames(7): PCA UMAP.SCT ... UMAP PCAv2
mainExpName: SCT    <-------- SCT
altExpNames(1): RNA

... are Seurat::RunPCA and scater::runPCA using both the default assay and layer? (SCT-data in seurat / SCT-logcounts in SCE).

The variable features selected should be identical, because I explicitly called to the same function in both PCAs:

Seurat's PCA
sub <- RunPCA(sub, features = VariableFeatures(object = sub))

SCE + scater
sce.sub <- runPCA(sce.sub, subset_row=VariableFeatures(object = sub), name="PCAv2", scale=F, ntop=Inf )

Aren't these two options equivalent?

As for the loadings. Converting the Seurat object into SCE does not transfer the feature loadings. I retrieved them both using:

### Seurat
> pca.seurat.loadings <- Loadings(sub, reduction = "pca")
### SCE + scater
> pca.scater.loadings <- attr(reducedDims(sce.sub)[["PCAv2"]],"rotation")

It seems that scater removed a few (42) genes, but the rest are the same in both methods:

> dim(pca.seurat.loadings)
[1] 5000   51
> dim(pca.scater.loadings)
[1] 4958   51
> sum(! pca.scater.loadings$gene %in% pca.seurat.loadings$gene)
[1] 0

And then I compared them for PCs 1-4:

feature loadings PCs 1-4

Seurat's PCA loadings are clearly more dispersed, and with outliers. Most of these genes are very relevant for the cell types studied.

The UMAPs are obviously different, because they come from PCAs with different results. However, when I say that the slingshot results are different, I am looking at the distribution of the pseudotime values to each cell / subtype / cell type, and how they are assigned to branches. For example, for one specific branch, in the slingshot results derived from Seurat's PCA, the subtypes are ordered like: A1 > A2 > B1 > A3. However, the equivalent branch in scater's PCA results has B1 > A1 > A2 > (a bit of B1 again) > A3.

I did previously some exploration of the slingshot results using scater's PCA, and I could find some "structure" in the data in PC3, driving the proposed trajectories. And the plot of PC3's loadings shows that they are pretty different between PCA methods.

I spent quite a long time analyzing slingshot results using scater::runPCA. We even made some convincing tests and had interesting biological explanations. However, now I see that those results disappear if I don't re-run the PCA after converting to SCE before slingshot.

What is happening here? Which method is right and which is wrong? Are scater::runPCA's underlying assumptions/method incompatible with SCT transformation?

Again, thanks for your input.

ADD REPLY
0
Entering edit mode

Are scater::runPCA's underlying assumptions/method incompatible with SCT transformation?

Potentially. Could you try doing a dummy test without SCT, just one sample with a few cells, log transform and scaling.

ADD REPLY
1
Entering edit mode

Hi,

It took me a while, but I finally arrived at the end of this (after a ton of debugging and tracing the calls).

Both Seurat and scater PCA methods end up calling, by default, to irlba::irlba(). However, they differ in their calls:

  • Seurat::RunPCA forces the usage of the scale.data layer. And it disables centering (center = NULL).
  • scater::runPCA uses by default the SingleCellExperiment object logcounts layer (= data in Seurat objects). It is possible to manually change the layer used with exprs_values = "scaledata". This function also forces the usage of centering of the data. I wasn't able to find how to disable that.

I tested this in both my data and the pbmc3k dataset, using the RNA slot:

pbmc3k first 4 PCs using each function's default layer (scale.data in Seurat's, data in scater's):

pbmc3k first 4 PCs using each function's default assay

And pbmc3k first 4 PCs using manually setting exprs_values = "scaledata" in scater::runPCA. Enabling/disabling centering wouldn't have any effect because scale.data is already virtually centered (rowMeans ~< 1e-10).

pbmc3k first 4 PCs using scale.data

Thus, mystery solved!

Still, the important question remains. Seurat forces the use of scale.data. And, after using SCTransform, they explicitly state that PCA should be run on the Pearson's residuals (scale.data). However, when running other downstream analyses based also on PCA, e.g., slingshot, does the use of one or other PCA method/philosophy violate the assumptions of the algorithms used in downstream analyses?

ADD REPLY
0
Entering edit mode

Thanks for the troubleshooting !

Seurat::RunPCA forces the usage of the scale.data layer. And it disables centering (center = NULL)

The data in scale.data is already centered so no need to center again.

scater::runPCA uses by default the SingleCellExperiment object logcounts layer (= data in Seurat objects). It is possible to manually change the layer used with exprs_values = "scaledata". This function also forces the usage of centering of the data

If the logcounts layer is used, it needs to be centered and scaled to run PCA. scater::runPCA(sce, scale=TRUE)

SingleCellExperiment generally gives more freedom to the user than Seurat which allow you to scale or transform your data as you please before running PCA.

But one needs to center around 0 and scale their normalized count matrix before running a PCA. The scaling will prevent few outliers variables from driving each unique principal component.

I found this thread really helpful.

ADD REPLY

Login before adding your answer.

Traffic: 3436 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6