As per the title: why?
Does the preliminary PCA step merely increase the chances that tSNE / UMAP will later identify the clusters?
In tutorials, people seem to choose the number of PCs fed to tSNE / UMAP somewhat blindly, without checking how much variation those PCs actually explain. For example, 20 PCs may account for only 30% of the overall variation.
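Checking the cumulative explained variance takes only a couple of lines; a minimal sketch with scikit-learn on toy data (the matrix sizes are illustrative stand-ins for a log-normalised expression matrix):

```python
# Sketch: check how much variance the first 20 PCs explain before
# feeding them to tSNE / UMAP. Sizes are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))  # toy: 300 cells x 2000 genes

pca = PCA(n_components=50).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(f"Variance explained by first 20 PCs: {cumvar[19]:.1%}")
```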
Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be:
- normalisation of raw counts
- transformation of normalised counts
- PCA-transformation (may include an additional pre-scaling step)
- tSNE/UMAP
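The steps above can be sketched end-to-end with numpy and scikit-learn on toy counts (the normalisation target, number of PCs, and perplexity are illustrative choices, not recommendations):

```python
# Minimal sketch of the pipeline above on toy data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(200, 500)).astype(float)  # cells x genes

# 1. Normalisation of raw counts (library-size scaling to a common total)
libsize = counts.sum(axis=1, keepdims=True)
norm = counts / libsize * 1e4

# 2. Transformation of normalised counts (log1p)
logged = np.log1p(norm)

# 3. PCA-transformation, with per-gene scaling as the optional pre-scaling step
scaled = (logged - logged.mean(axis=0)) / (logged.std(axis=0) + 1e-8)
pcs = PCA(n_components=20, random_state=0).fit_transform(scaled)

# 4. tSNE on the PC scores (UMAP would slot in the same way)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(pcs)
print(emb.shape)  # (200, 2)
```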
Kevin
You can get some computational speedup for UMAP by using PCA, but alternatively you can use a sparse matrix representation and get decent performance without an intermediate PCA step.
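To illustrate the sparse route: scRNA-seq count matrices are mostly zeros, so a CSR matrix is far smaller than the equivalent dense array, and umap-learn accepts scipy sparse input directly. A sketch with illustrative sizes and sparsity:

```python
# Sketch: sparse representation of a mostly-zero count matrix.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.poisson(0.05, size=(1000, 2000)).astype(np.float32)  # ~95% zeros
X = sparse.csr_matrix(dense)

sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"density: {X.nnz / dense.size:.1%}")
print(f"dense bytes: {dense.nbytes}, sparse bytes: {sparse_bytes}")
# umap.UMAP().fit_transform(X) accepts this CSR matrix without densifying
```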
Thanks; indeed, the paper states:
So, even when it is computationally feasible to skip it, is it still good to do PCA because it suppresses noise?
Yes. I presume that the reference to 'noise' means that PCA improves the precision of cluster identification, i.e., by removing variables that otherwise provide no information to the algorithm.
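The noise-suppression intuition is easy to demonstrate on synthetic data: build a low-rank "signal" matrix, add noise, and compare the raw noisy matrix with its truncation to the top PCs (ranks and noise levels here are arbitrary illustrations):

```python
# Sketch: PCA truncation of noisy low-rank data recovers the underlying
# signal better than the raw noisy matrix does.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, k = 300, 100, 5
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) * 2.0  # rank-5 signal
noisy = signal + rng.normal(size=(n, p))                          # plus noise

pca = PCA(n_components=k).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))  # keep top 5 PCs only

err_raw = np.linalg.norm(noisy - signal)
err_pca = np.linalg.norm(denoised - signal)
print(err_pca < err_raw)  # True: truncation strips most of the noise
```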
There is also an important message in the sctransform adaptation of the Seurat tutorial:
Still, it would be nice to see more clearly how much this single parameter (the number of PCs) influences downstream analysis; that is probably easier to explore on synthetic data.
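A minimal version of that synthetic experiment, assuming k-means as a stand-in for the clustering step and the adjusted Rand index as the recovery score (all parameter choices are illustrative):

```python
# Sketch: vary the number of PCs and score how well clustering on the
# PC scores recovers the known synthetic cluster labels.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=300, n_features=50, centers=4, random_state=0)
for n_pcs in (2, 10, 30):
    pcs = PCA(n_components=n_pcs, random_state=0).fit_transform(X)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)
    print(n_pcs, round(adjusted_rand_score(y, labels), 2))
```

On well-separated blobs the score saturates quickly; the interesting regime is noisier data, where too few PCs discard signal and too many re-admit noise.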