Question: Why do PCA+tSNE/viSNE or PCA+UMAP, and not just tSNE/viSNE or UMAP on their own?
Kevin Blighe wrote:

As per the title: why?

Does the preliminary PCA step merely maximise the chances of identifying the clusters that tSNE / UMAP later identify?

In tutorials, it seems that people blindly choose a number of PCs as input to tSNE / UMAP without checking the amount of variation explained by those PCs. For example, 20 PCs may only account for 30% of the overall variation.
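Checking this is cheap; here is a minimal sketch with scikit-learn (the matrix is random placeholder data, purely for illustration):

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder for a cells x genes matrix of transformed counts
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2000))

    pca = PCA(n_components=50).fit(X)
    # Cumulative fraction of variance captured by the first 20 PCs
    print(pca.explained_variance_ratio_[:20].sum())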

Assuming one follows a standard procedure for, e.g., scRNA-seq, is this one step too many? A pipeline could be (sketched in code after the list):

  1. normalisation of raw counts
  2. transformation of normalised counts
  3. PCA-transformation (may include an additional pre-scaling step)
  4. tSNE/UMAP
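For concreteness, a rough sketch of those four steps in Python with Scanpy, on placeholder data and with arbitrary parameter choices (not a recommendation):

    import numpy as np
    import scanpy as sc
    from anndata import AnnData

    # Placeholder counts matrix (cells x genes); real data would be read from a file
    counts = np.random.default_rng(0).poisson(1.0, size=(1000, 2000)).astype(np.float32)
    adata = AnnData(counts)

    sc.pp.normalize_total(adata, target_sum=1e4)   # 1. normalisation of raw counts
    sc.pp.log1p(adata)                             # 2. transformation of normalised counts
    sc.pp.scale(adata, max_value=10)               #    optional pre-scaling
    sc.tl.pca(adata, n_comps=30)                   # 3. PCA-transformation
    sc.pp.neighbors(adata, n_pcs=30)               # 4. neighbour graph on the PCs...
    sc.tl.umap(adata)                              #    ...then UMAP (or sc.tl.tsne)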

Kevin

Devon Ryan wrote:

The primary reason given in the tSNE literature is computational efficiency; see, for example, van der Maaten & Hinton (2008). In essence, tSNE requires pairwise comparisons between data points, so it can be incredibly computationally taxing on scRNA-seq datasets unless the dimensionality undergoes an initial reduction.

I can't speak to UMAP, as I'm not familiar enough with its inner workings, but I presume the initial PCA is done for similar reasons.
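For illustration, the usual two-step arrangement looks like this in scikit-learn (random placeholder data; 50 components is just a common default, not a recommendation):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Placeholder cells x genes matrix; pairwise distances over 2,000 raw dimensions
    # are far more expensive than over 50 PC scores
    X = np.random.default_rng(0).normal(size=(1000, 2000))

    X_pca = PCA(n_components=50).fit_transform(X)                       # coarse linear reduction first
    X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_pca)   # then t-SNE on the PC scores
    print(X_tsne.shape)  # (1000, 2)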

leland.mcinnes replied:

You can get some computational speedup for UMAP by using PCA, but alternatively you can use a sparse matrix representation and get decent performance without having to use an intermediate PCA step.
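For example, something like this works (a sketch assuming the umap-learn package; the sparse matrix is simulated):

    import scipy.sparse as sp
    import umap  # umap-learn package

    # Simulated sparse cells x genes matrix in CSR format
    X = sp.random(2000, 10000, density=0.05, format="csr", random_state=0)

    # UMAP accepts the sparse matrix directly, so no intermediate PCA step is needed
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
    print(embedding.shape)  # (2000, 2)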

Kevin Blighe replied:

Thanks. Indeed, the paper states:

In all of our experiments, we start by using PCA to reduce the dimensionality of the data to 30. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances.


aln replied:

So, even if it is computationally feasible, it is still good to do PCA because it suppresses noise?


Kevin Blighe replied:

Yes, I presume that the reference to 'noise' implies that it improves the precision of identifying clusters, i.e., by removing variables that otherwise provide no information to the algorithm.

aln replied:

There is also an important message in the sctransform adaptation of the Seurat tutorial:

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

Still, it would be nice to see more clearly how much this single parameter (the number of PCs) influences downstream analysis, though that is probably easier to do on synthetic data.
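One naive way to probe it (my own sketch, not from the tutorial; synthetic data and k-means stand in for a real dataset and clustering method) would be to cluster on increasing numbers of PCs and compare the labellings:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2000))   # placeholder for a transformed expression matrix

    pcs = PCA(n_components=50).fit_transform(X)
    labels = {n: KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(pcs[:, :n])
              for n in (10, 20, 30, 50)}

    # How much does the clustering change as more PCs are included?
    for n in (20, 30, 50):
        print(n, adjusted_rand_score(labels[10], labels[n]))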

chris86 wrote:

At least from a clustering perspective, I'd probably try it both ways to be on the safe side, i.e. with PCA to get the top N PCs and without. I'm a bit skeptical of reducing to N PCs for clustering because there is inevitable information loss. The same will apply for t-SNE, UMAP, etc. I'd prefer to use the most variable genes instead.

I think it is less of a problem when just visualising the data than when defining new cell types via cluster analysis, where we might want to be a bit more cautious.
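For reference, selecting the most variable genes instead of PC scores can be as simple as this sketch (random placeholder matrix; 500 genes is an arbitrary cut-off):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2000))   # placeholder cells x genes matrix of transformed counts

    # Keep the 500 genes with the highest variance, and feed that matrix
    # (rather than PC scores) to t-SNE/UMAP or to the clustering step
    top = np.argsort(X.var(axis=0))[::-1][:500]
    X_hvg = X[:, top]
    print(X_hvg.shape)  # (1000, 500)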

Kevin Blighe replied:

Thanks Chris for the input - makes sense!
