Question

Clustering resolution for scRNA-seq

3.8 years ago

i.am.filippov ▴ 10

This is an open-ended question, but maybe you could share your heuristics. What's your approach when clustering and annotating clusters in scRNA-seq data? Do you prefer to start with a low number of clusters (e.g. CD4, CD8, monocytes) and then re-cluster, or do you look for low-level cell types from the get-go (e.g. Treg, Th, etc.)? I've seen both approaches in the literature. Are there any pros and cons?

clustering single-cell • 5.3k views

updated 3.8 years ago by firestar ★ 1.7k • written 3.8 years ago by i.am.filippov ▴ 10

score 8 · Answer 1 · 2021-09-14

I think this is pretty subjective and data-dependent. I can share my strategy for data where very little prior biological knowledge is available. I run the clustering over a range of resolutions, for example from 0.01 to 1. The range depends on the expected number of cell types in the dataset and the visual complexity I see on the UMAP. Then I plot resolution against the number of clusters (k) to get a rough idea of cluster stability. I don't take this too seriously as it's probably not very reliable, but it gives some sense of it.
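A minimal sketch of such a sweep, in plain Python. Here `cluster_fn` is a hypothetical callable standing in for whatever clustering call you actually use (for instance a wrapper around Seurat's `FindClusters` or scanpy's `leiden` at a given resolution), and `toy_cluster_fn` is made-up data purely so the example runs.

```python
def sweep_resolutions(cluster_fn, resolutions):
    """Run cluster_fn once per resolution; return {resolution: labels}."""
    return {r: list(cluster_fn(r)) for r in resolutions}


def clusters_per_resolution(labels_by_res):
    """Number of distinct clusters (k) found at each resolution."""
    return {r: len(set(labels)) for r, labels in labels_by_res.items()}


# Toy stand-in for a real clustering call (assumption, not real data):
# 12 cells, and the number of clusters grows with the resolution.
def toy_cluster_fn(resolution):
    k = 1 + int(resolution * 4)
    return [cell % k for cell in range(12)]


resolutions = [0.01, 0.25, 0.5, 0.75, 1.0]
labels_by_res = sweep_resolutions(toy_cluster_fn, resolutions)
k_by_res = clusters_per_resolution(labels_by_res)

for res, k in k_by_res.items():
    print(f"resolution={res:<5} k={k}")
```

The `k_by_res` mapping is what goes into the resolution-versus-k plot.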

Then I take the mean or median resolution for each k and plot UMAPs, so I can see how the clusters split and change.
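Picking one representative resolution per k could look roughly like this. The `k_by_res` values are invented for illustration; in practice they would come from your own resolution sweep.

```python
from collections import defaultdict
from statistics import median


def representative_resolution(k_by_res):
    """Group resolutions by the k they produced and take the median of each group."""
    by_k = defaultdict(list)
    for res, k in k_by_res.items():
        by_k[k].append(res)
    return {k: median(rs) for k, rs in sorted(by_k.items())}


# Made-up sweep result: resolution -> number of clusters.
k_by_res = {
    0.1: 2, 0.2: 2, 0.3: 2,
    0.4: 3,
    0.5: 4, 0.6: 4, 0.7: 4, 0.8: 4, 0.9: 4,
}
rep = representative_resolution(k_by_res)
print(rep)
```

With an even number of resolutions for some k, `median` falls between two values that were actually run; `statistics.median_low` is one way to get a resolution that already has a clustering attached.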

Then I run clustree to visualise how clusters split as k increases: which clusters feed into which, and how they change. This is a really useful diagnostic.
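The bookkeeping behind such a tree can be sketched without the package. This is not clustree's code, only an illustration of the idea: for two adjacent resolutions, count how many cells move from each lower-resolution cluster to each higher-resolution cluster, and what fraction of the higher-resolution cluster that represents. The label vectors are toy data.

```python
from collections import Counter


def cluster_transitions(lower, upper):
    """For each (lower cluster, upper cluster) pair, return
    (shared cell count, fraction of the upper cluster coming from the lower one)."""
    if len(lower) != len(upper):
        raise ValueError("label vectors must cover the same cells")
    pair_counts = Counter(zip(lower, upper))
    upper_sizes = Counter(upper)
    return {
        (a, b): (n, n / upper_sizes[b])
        for (a, b), n in pair_counts.items()
    }


# Toy labels for the same 8 cells at a lower and a higher resolution.
lower = [0, 0, 0, 0, 1, 1, 1, 1]
upper = [0, 0, 1, 1, 2, 2, 2, 1]
edges = cluster_transitions(lower, upper)

for (a, b), (n, frac) in sorted(edges.items()):
    print(f"lower {a} -> upper {b}: {n} cells, {frac:.2f} of upper cluster")
```

An upper cluster that draws substantial fractions from several lower clusters (cluster 1 here) is the kind of unstable split worth inspecting.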

And then probably the most important thing is whether the clusters make biological sense. So I create heatmaps/dotplots of the top genes for each k.
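As a rough illustration of "top genes per cluster", the sketch below ranks genes by the difference between their mean expression inside a cluster and outside it. That is a deliberately naive score, not the differential expression test a real toolkit would run, and the expression values are invented.

```python
def top_genes_per_cluster(expr, genes, labels, n_top=1):
    """Rank genes per cluster by (mean inside cluster) - (mean outside cluster)."""
    result = {}
    for cluster in sorted(set(labels)):
        inside = [row for row, lab in zip(expr, labels) if lab == cluster]
        outside = [row for row, lab in zip(expr, labels) if lab != cluster]
        scores = []
        for j, gene in enumerate(genes):
            mean_in = sum(row[j] for row in inside) / len(inside)
            mean_out = sum(row[j] for row in outside) / len(outside) if outside else 0.0
            scores.append((mean_in - mean_out, gene))
        scores.sort(key=lambda s: s[0], reverse=True)
        result[cluster] = [gene for _, gene in scores[:n_top]]
    return result


# Toy cells x genes matrix (made-up counts) with three clusters.
genes = ["CD3E", "CD14", "MS4A1"]
expr = [
    [5, 0, 0], [4, 1, 0],   # cluster 0
    [0, 6, 0], [1, 5, 0],   # cluster 1
    [0, 0, 7], [0, 1, 6],   # cluster 2
]
labels = [0, 0, 1, 1, 2, 2]
top = top_genes_per_cluster(expr, genes, labels, n_top=1)
print(top)
```

Running this once per k and comparing the lists shows whether a newly split cluster gains its own distinct genes or just shares its parent's.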

And lastly, feature plots (gene expression on UMAP), violin plots, etc. of known marker genes grouped by clusters at various k are very useful. If your dataset is composed of well-studied tissues (blood etc.) with well-known markers, then this is probably all you need.
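The numbers behind a dot plot of known markers are simple to compute by hand: per cluster and marker, the fraction of cells expressing the gene and the mean expression. A small sketch with invented counts follows; the function name `dotplot_summary` is my own, not a library call.

```python
def dotplot_summary(expr, genes, labels, markers):
    """Return {(cluster, marker): (fraction of cells with expression > 0, mean expression)}."""
    col = {gene: j for j, gene in enumerate(genes)}
    summary = {}
    for cluster in sorted(set(labels)):
        rows = [row for row, lab in zip(expr, labels) if lab == cluster]
        for marker in markers:
            values = [row[col[marker]] for row in rows]
            frac = sum(v > 0 for v in values) / len(values)
            mean = sum(values) / len(values)
            summary[(cluster, marker)] = (frac, mean)
    return summary


# Toy cells x genes matrix (made-up counts) with two clusters.
genes = ["CD3E", "CD14", "MS4A1"]
expr = [
    [5, 0, 0], [4, 1, 0],   # cluster 0
    [0, 6, 0], [1, 5, 0],   # cluster 1
]
labels = [0, 0, 1, 1]
summary = dotplot_summary(expr, genes, labels, markers=["CD3E", "CD14"])

for (cluster, marker), (frac, mean) in summary.items():
    print(f"cluster {cluster} {marker}: {frac:.0%} expressing, mean {mean}")
```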

So far, this is only "tuning" the resolution parameter. One could also tinker with the number of neighbours used to build the neighbourhood graph, and this changes everything dramatically because the graph itself has changed, not just the number of clusters emerging from one graph. And then you tinker with more parameters for a few more weeks and you start to wonder if anything is reliable or stable. It's easy to get lost if there is no strong biology to guide your decision making.
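To make the point about neighbours concrete, here is a toy sketch on six 1-D points (made-up data, plain Python, not a real scRNA-seq graph). Going from 2 to 3 neighbours connects two groups that were previously separate, so any clustering run afterwards starts from a different graph.

```python
def knn_edges(points, n_neighbors):
    """Undirected k-nearest-neighbour edges for 1-D points."""
    edges = set()
    for i, p in enumerate(points):
        # Sort the other points by distance; ties are broken by index.
        others = sorted((abs(p - q), j) for j, q in enumerate(points) if j != i)
        for _, j in others[:n_neighbors]:
            edges.add((min(i, j), max(i, j)))
    return edges


def n_components(n_nodes, edges):
    """Count connected components with a small union-find."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_nodes)})


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
comps = {k: n_components(len(points), knn_edges(points, k)) for k in (2, 3)}
print(comps)
```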