Question

Slingshot: how do you choose which type of dimensionality reduction to use it with?

5


4.7 years ago

Friederike 8.9k

A typical analysis in many scRNA-seq workflows is to order the cells along a "pseudotime" trajectory and infer lineages, branching points, etc. Slingshot has consistently been shown to perform well at that task (e.g. here). It determines the lineages either from the entire expression matrix (not recommended) or from a lower-dimensional space, e.g. PCA. The slingshot paper makes a point of the fact that slingshot can be fed cell coordinates from any dimensionality reduction approach, including PCA, ICA, UMAP, diffusion maps, and so on. So, what do you use? And why?

scRNA-seq RNA-Seq pseudotime slingshot • 5.7k views

updated 3.3 years ago by ATpoint 82k • written 4.7 years ago by Friederike 8.9k

score 4 · Answer 1 · 2019-07-31

I use dimensionality reduction a lot, though not for scRNA-seq analysis. My suggestions will be general, though I hope they are still useful to you.

Right off the bat, I'd like to dispense with the notion that there is a right or wrong method for dimensionality reduction. All methods will work fine for a dataset that has two well-defined groups of data points, while probably none will work with a dataset that has hundreds of poorly separable groups of data points. What I use in a particular case depends on: 1) available time; 2) the point I need to make in terms of visualization; 3) whether I need to repeat the same analysis later on new data.

To keep this post to a reasonable length, I will illustrate this with t-SNE vs PCA. Visualization argument first: below are two reduced-dimensionality visualizations of the MNIST dataset, which has 10 clusters, one for each of the digits 0-9.

[Image: t-SNE and PCA embeddings of the MNIST dataset]

Most people will agree that t-SNE does a better job than PCA, and this is not the only dataset where that argument can be made. So why don't we replace PCA with t-SNE? There are several reasons, and I will only cover the most important ones.

Reproducibility: at the risk of stating the obvious, t-SNE is non-deterministic (the S in the acronym stands for stochastic). That means that two independent t-SNE runs will not yield the same solution unless the random seed is fixed. They will be very similar for most datasets, but not identical. PCA is deterministic, and therefore reproducible.

Speed: PCA typically takes a minute or less even for very large datasets. t-SNE becomes very difficult for datasets larger than 50-100K points, and would take many days for datasets of more than 1-10 million points.

Parameters: PCA has essentially nothing to tune. Give it the number of dimensions (principal components) you want, and off it goes. One can even specify the target percentage of explained variance rather than the number of dimensions. t-SNE-generated embeddings, on the other hand, depend quite a bit on a parameter called perplexity.

New data: this is the kicker for most people who prefer PCA over t-SNE. PCA is parametric, and once we build a model it can be used on unseen data. In its original and still most widely used implementation, t-SNE is non-parametric. That means that model parameters can't be saved and later used to embed new data.
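To make the PCA side of that comparison concrete, here is a minimal sketch written from scratch with NumPy (not any particular library's API). The helper names `fit_pca` and `transform_pca`, the toy data, and the 0.95 variance target are all assumptions chosen for illustration. It shows three of the properties mentioned above: choosing the number of components by explained variance, getting the same model when fitting twice, and reusing the fitted model on unseen data.

```python
import numpy as np

def fit_pca(X, target_variance=0.95):
    """Fit PCA via SVD, keeping the fewest components that reach target_variance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # First index at which the cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), target_variance)) + 1
    k = min(k, len(s))
    return {"mean": mean, "components": Vt[:k], "explained": explained[:k]}

def transform_pca(model, X):
    """Project data (seen or unseen) with an already fitted model."""
    return (X - model["mean"]) @ model["components"].T

# Toy data (an assumption): 3 latent factors mixed into 10 observed features.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 10))
train = rng.normal(size=(200, 3)) @ mixing + 0.01 * rng.normal(size=(200, 10))
new = rng.normal(size=(5, 3)) @ mixing

model = fit_pca(train, target_variance=0.95)
model_again = fit_pca(train, target_variance=0.95)  # same input, same model
embedded_new = transform_pca(model, new)            # unseen data reuses the fit

print(model["components"].shape, embedded_new.shape)
```

The fitted model is just a mean vector and a matrix of axes, which is why it can be saved and applied later. Standard t-SNE has no equivalent object to store, since it optimizes the embedding coordinates of the given points directly.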

The larger point is that some methods may produce more visually appealing embeddings, while others are faster or more general. I suggest you try several methods and see which one works best for your particular needs. There is a long list of available implementations here, and I also want to mention UMAP because it is similar to t-SNE but faster and more scalable to large datasets.

Lastly, it is worth mentioning that neural networks, specifically autoencoders, can do a spectacular job when it comes to non-linear dimensionality reduction. See here how an autoencoder compares to PCA on MNIST data.

score 3 · Answer 2 · 2021-01-01

3


3.3 years ago

ATpoint 82k

Late to the party, but I wanted to add a link to this video, where the slingshot author was asked about exactly this issue: how to choose the dimensionality reduction for trajectories. His response (see link) was basically that he tries PCA first if possible, and if that does not work, UMAP is usually a good choice. In general, the advice is to use a technique that preserves the global structure of the data (UMAP) rather than one that favours local neighbourhoods (tSNE).
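One crude way to check how well a given embedding preserves global structure is to correlate pairwise distances before and after the reduction. The sketch below is my own illustration, not something taken from the video or from slingshot. The function names, the toy trajectory, and the shuffled "bad" embedding are assumptions, and the shuffled embedding is a deliberate strawman rather than a stand-in for tSNE.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows in X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def global_distance_correlation(high_dim, embedding):
    """Pearson correlation of pairwise distances before and after embedding."""
    upper = np.triu_indices(len(high_dim), k=1)  # count each pair once
    d_high = pairwise_distances(high_dim)[upper]
    d_low = pairwise_distances(embedding)[upper]
    return float(np.corrcoef(d_high, d_low)[0, 1])

# Toy "trajectory" (an assumption): 150 cells along one direction in 20-D.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 1, size=150))
direction = rng.normal(size=(1, 20))
cells = 10 * t[:, None] * direction + 0.1 * rng.normal(size=(150, 20))

# Embedding A: projection onto the top two principal axes.
centered = cells - cells.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
emb_pca = centered @ Vt[:2].T

# Embedding B: same coordinates assigned to the wrong cells (structure destroyed).
emb_shuffled = emb_pca[rng.permutation(len(emb_pca))]

score_pca = global_distance_correlation(cells, emb_pca)
score_shuffled = global_distance_correlation(cells, emb_shuffled)
print(round(score_pca, 3), round(score_shuffled, 3))
```

A score near 1 means that cells far apart in expression space stay far apart in the embedding, which is the property a trajectory method relies on. The same function can be applied to any coordinates you plan to hand to slingshot.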