Question: Why do PCA+tSNE/viSNE or PCA+UMAP, and not just tSNE/viSNE or UMAP on their own?
Kevin Blighe wrote (11 weeks ago):

As per the title: why?

Does the preliminary PCA step merely maximise the chances of identifying the clusters that tSNE / UMAP later identify?

In tutorials, it seems that people blindly choose a number of PCs as input to tSNE / UMAP, without checking the amount of variation explained by those PCs. For example, 20 PCs may account for only 30% of the overall variation.
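A quick way to check this is to look at the cumulative explained variance of the chosen PCs. A minimal scikit-learn sketch; the matrix here is random noise standing in for a log-normalised cells-by-genes matrix, so the printed figure is illustrative only:

```python
# Sketch: inspect how much variation the chosen PCs actually capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))  # placeholder for a log-normalised cells x genes matrix

pca = PCA(n_components=20).fit(X)
cum_var = pca.explained_variance_ratio_.cumsum()
print(f"20 PCs explain {cum_var[-1]:.1%} of the total variance")
```

If that final figure is low, the PCs fed into tSNE/UMAP are discarding most of the variation in the data.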

Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be:

  1. normalisation of raw counts
  2. transformation of normalised counts
  3. PCA-transformation (may include an additional pre-scaling step)
  4. tSNE/UMAP
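The four steps above could be sketched roughly as follows with scikit-learn. Synthetic Poisson counts stand in for real data, and counts-per-10k / log1p are just one possible choice for steps 1-2:

```python
# Sketch of the four-step pipeline on synthetic counts; in practice each
# step has scRNA-seq-specific choices (size factors, variance stabilisation).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(200, 1000)).astype(float)  # cells x genes

# 1. normalisation of raw counts (counts-per-10k as a stand-in)
norm = counts / counts.sum(axis=1, keepdims=True) * 1e4
# 2. transformation of normalised counts
logged = np.log1p(norm)
# 3. PCA-transformation (sklearn's PCA centres the data itself)
pcs = PCA(n_components=20).fit_transform(logged)
# 4. tSNE on the PC scores
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(pcs)
print(embedding.shape)  # (200, 2)
```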

Kevin

Devon Ryan wrote (11 weeks ago):

The primary reason given in the tSNE literature is computational efficiency. See, for example, van der Maaten & Hinton (2008). In essence, tSNE requires pairwise comparison of datapoints, so it can be incredibly computationally taxing on scRNA-seq datasets unless the dimensionality undergoes an initial reduction.
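To make the cost argument concrete: the n x n distance matrix is the same size either way, but each entry is a sum over d dimensions, so reducing d from thousands of genes to ~50 PCs cuts the per-entry work proportionally. A rough sketch on random data, shapes only:

```python
# Sketch: tSNE's pairwise-distance matrix costs O(n^2 * d) to fill,
# so shrinking d with PCA reduces the work proportionally.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))             # cells x genes
X50 = PCA(n_components=50).fit_transform(X)  # cells x PCs

D_full = pairwise_distances(X)    # each entry sums over 2000 dimensions
D_pca = pairwise_distances(X50)   # each entry sums over only 50 dimensions
print(D_full.shape, D_pca.shape)  # same (500, 500) matrix either way
```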

I can't speak to UMAP, as I'm not familiar enough with its inner workings, but I presume the initial PCA is done for similar reasons.

leland.mcinnes replied (10 weeks ago):

You can get some computational speedup for UMAP by using PCA, but alternatively you can use a sparse matrix representation and get decent performance without having to use an intermediate PCA step.

Kevin Blighe replied (11 weeks ago):

Thanks; indeed, the paper states:

In all of our experiments, we start by using PCA to reduce the dimensionality of the data to 30. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances.


aln replied (11 weeks ago):

So, even if it is computationally feasible, is it still good to do PCA because it suppresses noise?


Kevin Blighe replied (11 weeks ago):

Yes, I presume that the reference to 'noise' implies improved precision in identifying clusters, i.e., the PCA removes variables that otherwise provide no information to the algorithm.

aln replied (11 weeks ago):

There is also an important message in the sctransform vignette of the Seurat tutorial:

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

Still, it would be nice to see more clearly how much this single parameter (the number of PCs) influences downstream analysis, though that is probably easier to assess on synthetic data.

chris86 wrote (11 weeks ago):

At least from a clustering perspective, I'd probably try it both ways to be on the safe side, i.e. with PCA to get the top N PCs and without. I'm a bit skeptical of reducing to N PCs for clustering because there is inevitable information loss. The same will apply for t-SNE, UMAP, etc. I'd prefer to use the most variable genes instead.
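A minimal sketch of the most-variable-genes route; random data stands in for a log-normalised matrix, and the choice of 500 genes is arbitrary:

```python
# Sketch: keep the top-N highest-variance genes instead of projecting onto PCs.
import numpy as np

rng = np.random.default_rng(0)
logged = rng.normal(size=(200, 2000))  # stand-in for log-normalised counts (cells x genes)

n_top = 500
gene_var = logged.var(axis=0)
top = np.argsort(gene_var)[-n_top:]    # indices of the 500 most variable genes
hvg_matrix = logged[:, top]            # feed this to tSNE / UMAP / clustering
print(hvg_matrix.shape)  # (200, 500)
```

Unlike PCA, this keeps the original gene axes, so the retained features remain directly interpretable.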

That said, I think this is less of a problem when just visualising the data than when defining new cell types via cluster analysis, where we might want to be a bit more cautious.

Kevin Blighe replied (11 weeks ago):

Thanks Chris for the input - makes sense!
