Question: Why do PCA+tSNE/viSNE or PCA+UMAP, and not just tSNE/viSNE or UMAP on their own?
Kevin Blighe (University College London) wrote, 13 months ago:

As per the title: why?

Does the preliminary PCA step merely maximise the chances of identifying the clusters that tSNE / UMAP later identify?

In tutorials, people seem to choose a number of PCs as input to tSNE / UMAP without checking how much variation those PCs actually explain. For example, 20 PCs may account for only 30% of the overall variation.
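Checking this takes only a few lines. A minimal NumPy sketch on synthetic data (the matrix `X` is a placeholder for your transformed counts, and the 20-PC cutoff is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a transformed expression matrix: 200 cells x 1000 genes
X = rng.normal(size=(200, 1000))

# Centre each gene, then SVD; squared singular values give per-PC variance
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

n_pcs = 20  # illustrative choice, not a recommendation
print(f"First {n_pcs} PCs explain {100 * cumulative[n_pcs - 1]:.1f}% of the variance")
```

Plotting `cumulative` (or `explained`, for a scree plot) makes it obvious whether the chosen number of PCs is reasonable for a given dataset.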

Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be:

  1. normalisation of raw counts
  2. transformation of normalised counts
  3. PCA-transformation (may include an additional pre-scaling step)
  4. tSNE/UMAP
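The four steps above can be sketched end-to-end with NumPy and scikit-learn. This is a toy illustration on random counts, not a recommended workflow; the normalisation target (counts per 10k), the 20 PCs, and the perplexity are all placeholder choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Placeholder raw counts: 60 cells x 500 genes (your real matrix goes here)
counts = rng.poisson(lam=2.0, size=(60, 500)).astype(float)

# 1. normalisation of raw counts (counts-per-10k per cell)
norm = counts / counts.sum(axis=1, keepdims=True) * 1e4
# 2. transformation of normalised counts (log1p)
logn = np.log1p(norm)
# 3. PCA-transformation, with an additional pre-scaling step per gene
scaled = (logn - logn.mean(axis=0)) / (logn.std(axis=0) + 1e-8)
pcs = PCA(n_components=20, random_state=0).fit_transform(scaled)
# 4. tSNE on the PC scores rather than the full gene space
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(pcs)
print(emb.shape)  # (60, 2)
```

Swapping step 4 for UMAP (`umap.UMAP().fit_transform(pcs)` from the umap-learn package) leaves the rest of the pipeline unchanged.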


modified 13 months ago by chris86 • written 13 months ago by Kevin Blighe
Devon Ryan (Freiburg, Germany) wrote, 13 months ago:

The primary reason given in the tSNE literature is computational efficiency. See, for example, van der Maaten & Hinton (2008). In essence, tSNE requires pairwise comparisons between datapoints, so it can be incredibly computationally taxing on scRNA-seq datasets unless the dimensionality undergoes an initial reduction.
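To put rough numbers on that: the pairwise-distance step costs on the order of n^2 * d operations, so shrinking d from the full gene space to a few dozen PCs cuts the dominant term by orders of magnitude. The figures below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope cost of the pairwise-distance step in tSNE.
# All figures are assumed for illustration.
n_cells = 10_000
d_genes = 20_000   # raw dimensionality (typical gene count)
d_pcs = 30         # after an initial PCA

flops_raw = n_cells**2 * d_genes   # ~ n^2 * d multiply-adds
flops_pca = n_cells**2 * d_pcs

speedup = flops_raw / flops_pca
print(f"Pairwise-distance work shrinks roughly {speedup:.0f}x")
```

The PCA itself is comparatively cheap (roughly n * d * d_pcs for a truncated decomposition), so it does not eat the savings.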

I can't speak to UMAP, as I'm not familiar enough with its inner workings, but I presume the initial PCA is done for similar reasons.

written 13 months ago by Devon Ryan

You can get some computational speedup for UMAP by using PCA, but alternatively you can use a sparse matrix representation and get decent performance without having to use an intermediate PCA step.

written 13 months ago by leland.mcinnes

Thanks; indeed, the paper states:

In all of our experiments, we start by using PCA to reduce the dimensionality of the data to 30. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances.
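That claim about not "severely distorting the interpoint distances" is easy to verify numerically. A NumPy sketch on synthetic data with low-rank structure (the rank, noise level, and 30-PC cutoff are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with rank-10 signal plus small noise: 150 cells x 2000 "genes"
X = (rng.normal(size=(150, 10)) @ rng.normal(size=(10, 2000))
     + 0.05 * rng.normal(size=(150, 2000)))

# Project onto the top 30 PCs via SVD, as in the tSNE paper's preprocessing
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:30].T

def pairwise_dists(A):
    # Condensed pairwise Euclidean distances (upper triangle only)
    sq = np.sum(A**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0)
    iu = np.triu_indices(len(A), k=1)
    return np.sqrt(d2)[iu]

r = np.corrcoef(pairwise_dists(Xc), pairwise_dists(pcs))[0, 1]
print(f"Correlation of interpoint distances before vs after PCA: {r:.4f}")
```

When most variance sits in a few directions, as here, the correlation is near 1: the distances that tSNE consumes are essentially unchanged, while the noise confined to the discarded dimensions is gone.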

written 13 months ago by Kevin Blighe

So, even when tSNE is computationally feasible on the full data, is it still good to do PCA because it suppresses noise?

written 13 months ago by aln

Yes, I presume that the reference to 'noise' implies that it improves the precision of cluster identification, i.e., by removing variables that otherwise provide no information to the algorithm.

written 13 months ago by Kevin Blighe

There is also an important message in the sctransform adaptation of the Seurat tutorial:

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

Still, it would be nice to see more clearly how much this single parameter (the number of PCs) influences downstream analysis, though that is probably easier to do on synthetic data.

written 13 months ago by aln
chris86 (London, United Kingdom) wrote, 13 months ago:

At least from a clustering perspective, I'd probably try it both ways to be on the safe side, i.e. with PCA (keeping the top N PCs) and without. I'm a bit skeptical of reducing to N PCs for clustering because there is inevitable information loss, and the same applies to t-SNE, UMAP, etc. I'd prefer to use the most variable genes instead.

That said, I think it is less of a problem when just visualising the data than when defining new cell types via cluster analysis, where we might want to be a bit more cautious.
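The most-variable-genes alternative mentioned above is straightforward in NumPy. A sketch on synthetic data; the matrix and the cutoff of 500 genes are placeholder assumptions (in practice, cutoffs of roughly 500 to 2000 genes are common):

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder log-normalised matrix: 100 cells x 2000 genes
logn = rng.gamma(shape=2.0, scale=1.0, size=(100, 2000))

n_top = 500  # hypothetical cutoff
gene_var = logn.var(axis=0)
top_idx = np.argsort(gene_var)[::-1][:n_top]  # indices of the most variable genes
hvg = logn[:, top_idx]                        # feed this to tSNE/UMAP directly

print(hvg.shape)  # (100, 500)
```

Unlike PCA, this keeps the retained features interpretable as individual genes, at the cost of discarding any signal spread across many low-variance genes.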

modified 13 months ago • written 13 months ago by chris86

Thanks Chris for the input - makes sense!

written 13 months ago by Kevin Blighe

