Should I Run Clustering on PCA or t-SNE Components?
3
2.8 years ago
jrleary ▴ 190

I've gotten conflicting opinions on this question from several researchers I work with. For context, the question concerns single-cell RNA-seq data specifically. Some say to run clustering on the principal components; others say to run clustering on the t-SNE components (which were themselves computed on top of the PCA components). A post on Cross Validated argues that since t-SNE doesn't preserve density/distance relationships between points, clustering on the t-SNE components isn't correct. This was corroborated by a colleague who described t-SNE (and UMAP) as "just visualization techniques," though I don't know whether that's an accurate characterization. However, another colleague prefers to cluster on the t-SNE components, and empirically I seem to get better results visually when I cluster that way. Is there a definitive answer to this question, and what is considered "best practice" in your work?

sc-rnaseq clustering • 5.0k views
4
2.8 years ago
Mensur Dlakic ★ 20k

As you point out, and contrary to what many people believe, PCA and t-SNE are dimensionality-reduction rather than clustering techniques. The fact that we can often "see" clusters in 2D or 3D representations from PCA and t-SNE means that there is internal structure in the data, but it doesn't automatically translate into a clustering. In that sense, both are primarily visualization tools.

If your intent is to rigorously cluster the data, especially based on distances, it should be done either on the original data or on data from which non-informative features have been eliminated. Sometimes it helps to discretize the data before clustering, for example by minimum description length (MDL) binning; that technique also has the effect of removing non-informative features. We always have to keep in mind that dimensionality reduction removes both information and noise from the original dataset.
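A rough illustration of that preprocessing idea. scikit-learn has no MDL binner, so the quantile-based KBinsDiscretizer stands in here, and a low-variance filter is one simple way to drop non-informative features; all data is synthetic:

```python
# Sketch: drop near-constant (non-informative) features, then discretize.
# KBinsDiscretizer is a stand-in for MDL binning; data is synthetic.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, :3] = 0.01 * rng.normal(size=(200, 3))  # three near-constant columns

# Remove features with almost no variance
X_informative = VarianceThreshold(threshold=0.1).fit_transform(X)

# Discretize each remaining feature into 5 ordinal bins
X_binned = KBinsDiscretizer(
    n_bins=5, encode="ordinal", strategy="quantile"
).fit_transform(X_informative)
```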

All that said, people cluster quite effectively on both PCA and t-SNE embeddings of the data. Since PCA is a linear transformation, one can use distance-based clustering afterward, but as already pointed out you may need more than the first 2 components (and sometimes many components) to capture all the meaningful variance in the data.

t-SNE is non-linear and therefore doesn't preserve distances, but in my experience it does preserve density in most cases, provided the perplexity parameter is chosen at least somewhat correctly. In the example you cited above (with which I disagree), the argument is that a simple 2-cluster dataset can't be resolved when one chooses a perplexity of 20 or 40. Roughly speaking, perplexity is the minimal number of neighbors each data point is expected to have. For a dataset of 1000 total points, as in that example, the upper bound on the number of clusters for t-SNE to consider is roughly 50 (p=20), 25 (p=40), or 13 (p=80). The closer our guess is to the actual number of clusters (2), the more "clusterable" our embedding will be. Obviously, 50 potential clusters with p=20 is quite a bit off from the actual 2, while 13 potential clusters with p=80 is much closer. The visual would be even better (and easier to cluster) with p=100, but the original poster did not attempt that.

To me, choosing a correct perplexity is similar to choosing K in K-means clustering: if your choice is far off from the actual number, you may end up with wrong clusters. We would likewise get a wrong K-means solution in the example you referenced above if we chose K=10.
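A minimal sketch of the PCA route on synthetic data (two well-separated "populations"; the number of components and clusters here are illustrative, not a recommendation):

```python
# PCA is linear, so distance-based clustering on its scores is justified.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic "cell populations" in a 50-dimensional feature space
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 50)),
    rng.normal(3.0, 1.0, size=(200, 50)),
])

# Keep enough PCs to cover the meaningful variance, not just the first two
pcs = PCA(n_components=10).fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pcs)
```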

From a practical point of view, if you ever decide to cluster based on t-SNE embeddings, the clustering should be density-based rather than distance-based. As to whether it works: it has been shown many times by many people to work, though one needs to exercise caution. Having data with little or no noise helps as well. I will demonstrate this with a clustering done on a t-SNE embedding of tetranucleotide frequencies (# features = 256) from a metagenomic dataset, i.e. mixed DNA from many species where we don't know the exact origin. This is what the clusters look like.

To independently test whether this clustering is reliable, we have spiked in DNAs from 3 known species, and this is where they fall in that plot.

Hopefully it is clear that all 3 known genomes are resolved well within the existing structure. That doesn't mean that all clusters have been assigned with 100% accuracy, but we know from other experiments that many of them are.

In short: there is stronger mathematical justification for clustering from a PCA embedding than from a t-SNE embedding, especially if one can reliably find the number of PCs to use (this is not automatic). Still, one can get just as good or better clustering from a t-SNE embedding if one can find a good approximation for the perplexity (this is not automatic either).
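A minimal sketch of the density-based route, using DBSCAN as a stand-in for any density-based method on a t-SNE embedding of synthetic two-population data (eps, min_samples, and perplexity are illustrative, not tuned values, and not those used for the metagenomic example):

```python
# Density-based (not distance-based) clustering on a t-SNE embedding.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 50)),
    rng.normal(3.0, 1.0, size=(200, 50)),
])

# Perplexity is roughly the expected neighborhood size; n_points / perplexity
# loosely bounds how many clusters the embedding can articulate
emb = TSNE(n_components=2, perplexity=40, init="pca",
           random_state=0).fit_transform(X)

labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(emb)  # -1 marks noise
```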

0

Thank you so much for taking the time to write out such a detailed response. I'll definitely be clustering on PCs first from here on out.

2
2.8 years ago

Clustering should be performed on the PCA components, as you lose a great deal of sensitivity if you only use two components to cluster cell types (as you would with t-SNE or UMAP). The number of components appropriate for your dataset may vary, but viewing the PCA components in a heatmap, or using an elbow plot / JackStraw plot, will help you decide how many should be included to account for most of the variance in your dataset.
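One way to sketch the elbow heuristic on synthetic low-rank data (the 90% cutoff is illustrative, not a universal rule):

```python
# Choose a PC count from the cumulative variance-explained curve.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Rank-5 structure plus a little noise in 100-dimensional data
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 100)) + 0.1 * rng.normal(size=(300, 100))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of PCs reaching 90% cumulative variance explained
n_pcs = int(np.searchsorted(cumvar, 0.90) + 1)
```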

1
2.8 years ago

Why are you trying to cluster the data? Your goals will direct the best method.

Initially I'd agree with what you've found already: t-SNE and UMAP are for visualization and will distort the sample-to-sample distances. Clustering done after that will be biased toward something that looks clean, which is not necessarily the most correct result from a mathematical distance perspective. All of these techniques will give you clustered datasets, and their 'correctness' depends on what you're trying to achieve.

1

The goal is to cluster the cells by cell type, and then confirm the clustering results using marker genes. Since the datasets are so large / sparse, it's pretty much a necessity to reduce dimensionality before clustering. I'd agree at this point that running clustering on top of t-SNE / UMAP isn't the best way to go about it.

0

Plain PCA on the whole dataset gives the broadest picture, but it is vulnerable to scaling problems: if there are two very distinct major clusters, like dead cells vs. live cells, you won't be able to see nuanced differences between live cell types A and B. That's where a rescaler like t-SNE or UMAP can be helpful. Personally, I'd iterate: do a simple PCA to look for gross outliers, remove those, then repeat on the smaller dataset.
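The iterative idea might look like this (synthetic data; the 3-sigma cut is just one simple, illustrative outlier rule):

```python
# Pass 1: PCA to spot gross outliers; pass 2: re-fit on the cleaned data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 30))
X[:5] += 25.0  # a few gross outliers, e.g. a distinct dead-cell population

pcs = PCA(n_components=2).fit_transform(X)
dist = np.linalg.norm(pcs - pcs.mean(axis=0), axis=1)
keep = dist < dist.mean() + 3.0 * dist.std()  # simple 3-sigma cut

# Re-fit PCA on the cleaned subset so subtler structure shows up
pcs_clean = PCA(n_components=2).fit_transform(X[keep])
```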

An alternative technique is to look at gene subsets instead of the whole genome: your cell types may be defined by just a handful of genes, and you may want to ignore the metabolic cycle, which could dominate the genomic signature. That means subsetting the data to 'relevant' genes before doing your analysis. My work in this area is captured by the Bioconductor package 'rgsepd', where I use GO terms to define interesting gene subsets to focus on.
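A sketch of the subsetting step (the gene names and marker list are invented for illustration; rgsepd itself derives subsets from GO terms rather than a hand-written list):

```python
# Restrict analysis to a hand-picked gene subset before PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
genes = [f"gene{i}" for i in range(100)]
expr = pd.DataFrame(rng.normal(size=(50, 100)), columns=genes)

# Hypothetical "relevant" genes, e.g. collected from a GO term of interest
markers = ["gene3", "gene7", "gene42"]
subset = expr[markers]

pcs = PCA(n_components=2).fit_transform(subset)
```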

0

And don't forget to look at PCA dimensions 3 & 4; there's so much beyond dimensions 1 & 2.

0

I typically use a procedure to choose the optimal number of PCs based on shuffling the columns of the original dataframe then comparing the % variance explained of the original PCA vs. that of PCA on the shuffled data. I choose the number of PCs based on where the % variance explained of the original falls below that of % variance explained of the shuffled data. I use those PCs to cluster / visualize. I'm also a fan of making interactive 3D scatterplots using plotly so that the third PC can be incorporated as well (plus they look cool).
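That shuffling procedure (akin to parallel analysis) can be sketched as follows on synthetic data; the column-wise permutation is the key step, since it breaks gene-gene correlations while preserving each gene's marginal distribution:

```python
# Pick the number of PCs by comparing against a column-shuffled baseline.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Four real factors buried in 60-dimensional noisy data
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 60)) + 0.5 * rng.normal(size=(300, 60))

real = PCA().fit(X).explained_variance_ratio_

# Shuffle each column independently to destroy the correlation structure
X_shuf = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)
shuf = PCA().fit(X_shuf).explained_variance_ratio_

# Keep PCs up to where the real curve first drops below the shuffled one
n_pcs = int(np.argmax(real < shuf))
```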