As per the title: why?
Does the preliminary PCA step merely maximise the chances of identifying the clusters that tSNE / UMAP later identify?
In tutorials, people seem to choose the number of PCs passed to tSNE / UMAP blindly, without checking how much variation those PCs explain. For example, 20 PCs may account for only 30% of the overall variation.
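To make the check I have in mind concrete, here is a minimal sketch of how one could look at the cumulative variance explained before fixing the number of PCs. It uses plain NumPy (PCA via SVD of the centred matrix) rather than any particular scRNA-seq toolkit; the matrix `X` is random placeholder data standing in for a transformed expression matrix, and the function name `cumulative_explained_variance` is my own, not from any library.

```python
import numpy as np

def cumulative_explained_variance(X, n_components):
    """Cumulative fraction of total variance captured by the leading PCs."""
    Xc = X - X.mean(axis=0)                    # centre each feature (gene)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2                               # proportional to per-PC variance
    ratio = var / var.sum()                    # explained variance ratio per PC
    return np.cumsum(ratio)[:n_components]

# Hypothetical placeholder data: 200 cells x 50 genes (NOT real expression data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

cum = cumulative_explained_variance(X, 20)
print(f"First 20 PCs explain {cum[-1]:.1%} of the total variance")
```

With scikit-learn, the same per-PC ratios are available as the `explained_variance_ratio_` attribute of a fitted `PCA` object.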
Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be:
- normalisation of raw counts
- transformation of normalised counts
- PCA-transformation (may include an additional pre-scaling step)