How many PCs should be considered for downstream analyses?
2.3 years ago

Hi All,

I have two groups WT and KO.

As per the JackStraw plot, 'significant' PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line).

1. How should I interpret the JackStraw plot? How come even a PC with a p-value of 1 is above the dashed line?

2. PC 5 has a p-value of 1. Should I only keep the PCs with p-value < 0.05 (PCs 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15) for the downstream analyses?

As per the Elbow plot, the standard deviation flattens out and stays roughly constant from about PC 34 onward.

So how many PCs should I use for the downstream analyses (FindNeighbors, FindClusters and RunUMAP)?

cond_integrated <- FindNeighbors(object = cond_integrated, dims = ?)
cond_integrated <- FindClusters(object = cond_integrated)
cond_integrated <- RunUMAP(cond_integrated, reduction = "pca", dims = ?)


Each time I change the number of dimensions, I get a different UMAP clustering.

merged_cond <- merge(x = WT_seurat_obj, y = KO_seurat_obj, add.cell.id = c("WT","KO"))

# filtered merged_cond based on mitochondrial content, etc.
filtered_cond_seurat

# split seurat object by condition from filtered_cond_seurat

# NB: calls sketched in below; CellCycleScoring needs the S/G2M gene lists
# (the cc.genes list ships with Seurat)
for (i in 1:length(split_cond)) {
  split_cond[[i]] <- NormalizeData(split_cond[[i]])
  split_cond[[i]] <- CellCycleScoring(split_cond[[i]],
                                      s.features = cc.genes$s.genes,
                                      g2m.features = cc.genes$g2m.genes)
  split_cond[[i]] <- SCTransform(split_cond[[i]])
}
Obtained integ_features using SelectIntegrationFeatures on the split_cond objects
Prepared the objects with PrepSCTIntegration using those anchor features
Obtained integ_anchors using FindIntegrationAnchors with the SCT normalization method
Obtained the cond_integrated Seurat object using IntegrateData
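In code, those integration steps were roughly the following (a sketch from memory, not my exact script; the function names are the real Seurat API, but argument values such as nfeatures = 3000 are placeholders):

```r
# Select shared variable features and prepare the SCT objects for integration
integ_features <- SelectIntegrationFeatures(object.list = split_cond, nfeatures = 3000)
split_cond     <- PrepSCTIntegration(object.list = split_cond,
                                     anchor.features = integ_features)

# Find anchors across conditions using the SCT normalization method
integ_anchors <- FindIntegrationAnchors(object.list = split_cond,
                                        normalization.method = "SCT",
                                        anchor.features = integ_features)

# Create the integrated object
cond_integrated <- IntegrateData(anchorset = integ_anchors,
                                 normalization.method = "SCT")
```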

cond_integrated <- RunPCA(object = cond_integrated)

DimHeatmap(cond_integrated, dims = 1:15, cells = 500, balanced = TRUE)

cond_integrated <- JackStraw(cond_integrated, num.replicate = 100, dims=50)

cond_integrated <- ScoreJackStraw(cond_integrated, dims = 1:50)

JackStrawPlot(cond_integrated, dims = 1:50)

ElbowPlot(object = cond_integrated, ndims = 50)


scRNAseq PCA UMAP Clustering RNAseq • 5.6k views

I'd like to revive this thread. I have a situation where a lower number of PCs seems to give me more "biologically relevant" results; does that justify using fewer PCs?

My setup: several time points of a cell differentiation protocol, each prepared as a separate library. I know this is far from an ideal design, but it was forced by a complicated wet-lab protocol; on the other hand, it should not prevent me from analysing each time point individually first and then connecting time points using prior biological knowledge.

I'm performing UMAP dimensionality reduction on a subset of my data to see the overall structure. I've noticed that a low number of PCs (5) provides better time point-to-time point separation than a higher number (15), probably because the batch effect gets amplified as more PCs are included. Would it be meaningful in that case to use a low number of PCs, and perhaps additionally cluster each individual time point with a higher number of PCs later on?

Best, Eugene

[Attached: elbow plot, and UMAPs computed with 5 PCs and with 15 PCs]


I'm performing UMAP dimreduction on a subset of my data, to see the overall structure. I've noticed that a low number of PCs (5) provide better time point-to-time point clusterization than a higher number of PCs (15).

Not sure what you are looking at to come up with this assessment, but the eyeball test says that clustering is much better with 15 PCs. Not only are clusters better separated globally, but red and blue are better separated locally, as are cyan and magenta groups.


The point here is that I have some prior knowledge of what these cells are, and since these populations lie along differentiation trajectories, it is reasonable to assume that consecutive days should be closer to one another than more distant time points. That is exactly what I see with 5 PCs. With 15, on the other hand, all time points are just scattered across the UMAP components.


And my point is that they are not supposed to be arranged in any kind of trajectory that mimics their differentiation pattern. They are supposed to be well separated, which they are with 15 PCs. You are expecting too much from dimensionality reduction if you think that it is going to recapitulate the differentiation pattern.

There aren't 6 expected clusters with 5 PCs - there are 4 at most. If you didn't know the colors ahead of time, there is no way you'd come up with the correct number of clusters. On the other hand, 15 PCs is much more informative regarding the real clusters, though I wouldn't necessarily guess 6 either if all dots were uniformly colored.

2.3 years ago

When using SCTransform, this matters somewhat less, as it tends to be more robust and to handle noise better. As such, you can include a lot of PCs without introducing undue variation. I generally start with 30, but have gone up to 50 and noticed little difference. The authors recommend using more PCs than in the standard workflow, for reasons outlined in the SCTransform vignette:

Why can we choose more PCs when using sctransform?

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we've found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

In addition, sctransform returns 3,000 variable features by default, instead of 2,000. The rationale is similar, the additional variable features are less likely to be driven by technical differences across cells, and instead may represent more subtle biological fluctuations. In general, we find that results produced with sctransform are less dependent on these parameters (indeed, we achieve nearly identical results when using all genes in the transcriptome, though this does reduce computational efficiency).

In short, use more than you think you need and try not to overthink it.
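If you want something more quantitative than eyeballing the elbow, one common rule of thumb (used, for example, in the Harvard Chan Bioinformatics Core scRNA-seq training materials) picks the PC where cumulative variation passes ~90% or where successive PCs stop differing meaningfully. A sketch, assuming your object has a reduction named "pca"; the 90%, 5% and 0.1 cutoffs are conventional, not magic:

```r
# Percent of total variation associated with each PC
pct  <- cond_integrated[["pca"]]@stdev / sum(cond_integrated[["pca"]]@stdev) * 100
cumu <- cumsum(pct)

# First PC where cumulative variation > 90% and the PC itself adds < 5%
co1 <- which(cumu > 90 & pct < 5)[1]

# Last PC where the drop to the next PC still exceeds 0.1 percentage points
co2 <- sort(which(diff(pct) < -0.1), decreasing = TRUE)[1] + 1

pcs <- min(co1, co2, na.rm = TRUE)  # then use dims = 1:pcs downstream
```

Either way, the conclusion is the same: with SCTransform, erring on the high side of this estimate is cheap.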


Can you please look into this post of mine related to this analysis? Downsampling one of the sample on the UMAP clustering to match the number of cells of the other group

I was playing around with the number of PCs, from 15 up to 40, and have attached the results below for reference. The overall clustering pattern didn't change drastically, but there is some change in the orientation of the clusters in each UMAP.


However, there is some change in the orientation of the clustering in each of the UMAP.

The orientation change is meaningless as the embedding is stochastic. The overall distribution of clusters is what matters, and that seems to be very similar in all cases.
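If you want the orientation itself to be reproducible across runs, note that RunUMAP takes a seed.use argument (default 42); with the same seed, the same dims and the same input, the layout comes out identical. A minimal sketch (dims = 1:30 is just an example value, not a recommendation):

```r
# Fixing the seed makes the embedding deterministic for a given input;
# it does not change which cells cluster together, only the layout.
cond_integrated <- RunUMAP(cond_integrated, reduction = "pca",
                           dims = 1:30, seed.use = 42)
```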


Thanks, Mensur, for that point. Can you please look into this post of mine related to this analysis? Downsampling one of the sample on the UMAP clustering to match the number of cells of the other group