Question: Why do PCA+tSNE/viSNE or PCA+UMAP, and not just tSNE/viSNE or UMAP on their own?
Kevin Blighe (University College London) wrote, 13 months ago:

As per the title: why?

Does the preliminary PCA step merely maximise the chances of identifying the clusters that tSNE / UMAP later identify?

In tutorials, people seem to choose a number of PCs as input to tSNE / UMAP without checking how much variation those PCs actually explain. For example, 20 PCs may account for only 30% of the overall variation.
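Checking this takes only a few lines. A minimal NumPy sketch on synthetic data (the matrix `X` is a placeholder for your transformed counts, and the 20-PC cutoff is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a transformed expression matrix: 200 cells x 1000 genes
X = rng.normal(size=(200, 1000))

# Centre each gene, then SVD; squared singular values give per-PC variance
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

n_pcs = 20  # illustrative choice, not a recommendation
print(f"First {n_pcs} PCs explain {100 * cumulative[n_pcs - 1]:.1f}% of the variance")
```

Plotting `cumulative` (or `explained`, for a scree plot) makes it obvious whether the chosen number of PCs is reasonable for a given dataset.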

Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be:

  1. normalisation of raw counts
  2. transformation of normalised counts
  3. PCA-transformation (may include an additional pre-scaling step)
  4. tSNE/UMAP
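The four steps above can be sketched end-to-end with NumPy and scikit-learn. This is a toy illustration on random counts, not a recommended workflow; the normalisation target (counts per 10k), the 20 PCs, and the perplexity are all placeholder choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Placeholder raw counts: 60 cells x 500 genes (your real matrix goes here)
counts = rng.poisson(lam=2.0, size=(60, 500)).astype(float)

# 1. normalisation of raw counts (counts-per-10k per cell)
norm = counts / counts.sum(axis=1, keepdims=True) * 1e4
# 2. transformation of normalised counts (log1p)
logn = np.log1p(norm)
# 3. PCA-transformation, with an additional pre-scaling step per gene
scaled = (logn - logn.mean(axis=0)) / (logn.std(axis=0) + 1e-8)
pcs = PCA(n_components=20, random_state=0).fit_transform(scaled)
# 4. tSNE on the PC scores rather than the full gene space
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(pcs)
print(emb.shape)  # (60, 2)
```

Swapping step 4 for UMAP (`umap.UMAP().fit_transform(pcs)` from the umap-learn package) leaves the rest of the pipeline unchanged.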


modified 13 months ago by chris86 • written 13 months ago by Kevin Blighe
Devon Ryan (Freiburg, Germany) wrote, 13 months ago:

The primary reason given in the tSNE literature is computational efficiency. See, for example, van der Maaten & Hinton (2008). In essence, tSNE requires pairwise comparisons between datapoints, so it can be incredibly computationally taxing on scRNA-seq datasets unless the dimensionality undergoes an initial reduction.
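To put rough numbers on that: the pairwise-distance step costs on the order of n^2 * d operations, so shrinking d from the full gene space to a few dozen PCs cuts the dominant term by orders of magnitude. The figures below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope cost of the pairwise-distance step in tSNE.
# All figures are assumed for illustration.
n_cells = 10_000
d_genes = 20_000   # raw dimensionality (typical gene count)
d_pcs = 30         # after an initial PCA

flops_raw = n_cells**2 * d_genes   # ~ n^2 * d multiply-adds
flops_pca = n_cells**2 * d_pcs

speedup = flops_raw / flops_pca
print(f"Pairwise-distance work shrinks roughly {speedup:.0f}x")
```

The PCA itself is comparatively cheap (roughly n * d * d_pcs for a truncated decomposition), so it does not eat the savings.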

I can't speak to UMAP, as I'm not familiar enough with its inner workings, but I presume the initial PCA is done for similar reasons.

written 13 months ago by Devon Ryan

You can get some computational speedup for UMAP by using PCA, but alternatively you can use a sparse matrix representation and get decent performance without having to use an intermediate PCA step.

written 13 months ago by leland.mcinnes

Thanks; indeed, the paper states:

In all of our experiments, we start by using PCA to reduce the dimensionality of the data to 30. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances.
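That claim about not "severely distorting the interpoint distances" is easy to verify numerically. A NumPy sketch on synthetic data with low-rank structure (the rank, noise level, and 30-PC cutoff are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with rank-10 signal plus small noise: 150 cells x 2000 "genes"
X = (rng.normal(size=(150, 10)) @ rng.normal(size=(10, 2000))
     + 0.05 * rng.normal(size=(150, 2000)))

# Project onto the top 30 PCs via SVD, as in the tSNE paper's preprocessing
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:30].T

def pairwise_dists(A):
    # Condensed pairwise Euclidean distances (upper triangle only)
    sq = np.sum(A**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0)
    iu = np.triu_indices(len(A), k=1)
    return np.sqrt(d2)[iu]

r = np.corrcoef(pairwise_dists(Xc), pairwise_dists(pcs))[0, 1]
print(f"Correlation of interpoint distances before vs after PCA: {r:.4f}")
```

When most variance sits in a few directions, as here, the correlation is near 1: the distances that tSNE consumes are essentially unchanged, while the noise confined to the discarded dimensions is gone.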

written 13 months ago by Kevin Blighe

So, even when tSNE is computationally feasible on the full data, is it still good to do PCA because it suppresses noise?

written 13 months ago by aln

Yes, I presume that the reference to 'noise' implies that it improves the precision of cluster identification, i.e., by removing variables that otherwise provide no information to the algorithm.

written 13 months ago by Kevin Blighe

There is also an important message in the sctransform adaptation of the Seurat tutorial:

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

Still, it would be nice to see more clearly how much this single parameter (the number of PCs) influences downstream analysis, though that is probably easier to do on synthetic data.

written 13 months ago by aln
chris86 (London, United Kingdom) wrote, 13 months ago:

At least from a clustering perspective, I'd probably try it both ways to be on the safe side, i.e. with PCA (keeping the top N PCs) and without. I'm a bit skeptical of reducing to N PCs for clustering because there is inevitable information loss, and the same applies to t-SNE, UMAP, etc. I'd prefer to use the most variable genes instead.

That said, I think it is less of a problem when just visualising the data than when defining new cell types via cluster analysis, where we might want to be a bit more cautious.
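The most-variable-genes alternative mentioned above is straightforward in NumPy. A sketch on synthetic data; the matrix and the cutoff of 500 genes are placeholder assumptions (in practice, cutoffs of roughly 500 to 2000 genes are common):

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder log-normalised matrix: 100 cells x 2000 genes
logn = rng.gamma(shape=2.0, scale=1.0, size=(100, 2000))

n_top = 500  # hypothetical cutoff
gene_var = logn.var(axis=0)
top_idx = np.argsort(gene_var)[::-1][:n_top]  # indices of the most variable genes
hvg = logn[:, top_idx]                        # feed this to tSNE/UMAP directly

print(hvg.shape)  # (100, 500)
```

Unlike PCA, this keeps the retained features interpretable as individual genes, at the cost of discarding any signal spread across many low-variance genes.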

modified 13 months ago • written 13 months ago by chris86

Thanks Chris for the input - makes sense!

written 13 months ago by Kevin Blighe

