Question

Impact of the number of PCs on the clustering in scRNA-seq

2

6 months ago

npont ▴ 20

Hello all,

I am working with a scRNA-seq dataset. I apply a PCA and select two different numbers of PCs (10 and 20). Then I apply Louvain clustering on the reduced space with a fixed resolution (0.3) and compare the two clusterings. I get more clusters when I select 10 PCs than when I select 20 PCs.

I wonder why that is, and I would appreciate any hints!

Thank you very much :)

(I attach the scree plot showing the percentage of variance explained by each PC.) [image: scree plot]

Along with the two UMAPs showing the clustering:

[images: UMAP with 20 PCs; UMAP with 10 PCs]

clustering pca scrna-seq • 663 views

updated 6 months ago by rfran010 ★ 1.6k • written 6 months ago by npont ▴ 20

0

I'm not too knowledgeable on the technical points, but I think it makes sense to get more clusters with the top 10 PCs than with the top 20. The top 10 PCs capture most of the variance, whereas PCs 11 to 20 each add relatively little, so in a sense you are diluting the signal with the additional PCs, making cells seem more similar.
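To make the dilution idea concrete, here is a toy sketch using only the Python standard library. It is not the asker's data or pipeline: two made-up groups of cells differ along a single axis, and every extra dimension is pure noise with the same variance. That assumption exaggerates the effect, since real PCs have decaying variance. The function name `separation_ratio` and all parameter values are hypothetical.

```python
import math
import random
from itertools import combinations, product
from statistics import mean


def separation_ratio(n_dims, n_cells=100, shift=4.0, seed=0):
    """Mean between-group distance divided by mean within-group distance."""
    rng = random.Random(seed)

    def cell(offset):
        # Only the first coordinate carries the group difference;
        # every other coordinate is unit-variance noise (toy assumption).
        return [offset + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(n_dims - 1)]

    group_a = [cell(0.0) for _ in range(n_cells)]
    group_b = [cell(shift) for _ in range(n_cells)]

    within = mean(
        math.dist(p, q)
        for group in (group_a, group_b)
        for p, q in combinations(group, 2)
    )
    between = mean(math.dist(p, q) for p, q in product(group_a, group_b))
    return between / within


# The ratio moves towards 1 as uninformative dimensions are added,
# i.e. the two groups look less distinct in distance terms.
for n_dims in (2, 10, 20):
    print(n_dims, round(separation_ratio(n_dims), 2))
```

Since graph-based clustering starts from nearest-neighbour distances, a ratio closer to 1 would make it harder to split the two groups at a fixed resolution, which is one plausible reading of the 10 vs 20 PC result.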

In my mind, the most important part is to make semi-objective decisions. For example, Top 10 makes sense as you should capture most of the variance. What would be the reason to take top 20? Additionally, as yura mentions, if you have domain knowledge, you can look at top enriched genes and see if they make more sense with more clusters, etc.

The way I see it, you have a frog on a bed of lily pads on the left, and some sort of crustacean or beetle, maybe a crab, on the right. So do you like frogs or crabs better? (joking of course).

6 months ago by rfran010 ★ 1.6k

score 1 · Answer 1 · 2025-03-06

As much as it might feel like an unsatisfactory answer, this totally depends on your downstream needs and what meaningful biology you can attach to the clustering. Some of these clusters will be based on technical variation, some will be QC-associated, some will associate with cell cycle while others will be driven by some effect based on a set of differential genes. Find out what markers are driving the clusters and you'll get a much better understanding of the clustering 'success' yourself.

When gauging the success of graph-based clustering, the first question to ask is how meaningful the clusters are. You answer that by looking at the marker genes for each cluster and seeing whether you can make sense of them.
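As a rough illustration of what "looking at the marker genes" means, the sketch below ranks genes by the difference in mean log-expression between one cluster and all other cells. The expression values, gene names, and the helper `top_markers` are all made up for the example. If you work in Scanpy, `sc.tl.rank_genes_groups` is the usual route and adds proper statistical tests.

```python
from statistics import mean


def top_markers(expr, labels, genes, cluster, n_top=3):
    """Rank genes by mean expression in `cluster` minus mean in all other cells.

    expr   : list of rows, one per cell, log-normalised values (toy data here)
    labels : cluster label for each cell, same order as expr
    genes  : gene name for each column of expr
    """
    inside = [row for row, lab in zip(expr, labels) if lab == cluster]
    outside = [row for row, lab in zip(expr, labels) if lab != cluster]
    if not inside or not outside:
        raise ValueError("need cells both inside and outside the cluster")

    scores = {
        gene: mean(row[j] for row in inside) - mean(row[j] for row in outside)
        for j, gene in enumerate(genes)
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Hypothetical toy matrix: 4 cells x 3 genes.
genes = ["geneA", "geneB", "geneC"]
expr = [
    [3.0, 0.1, 1.0],
    [2.8, 0.0, 1.2],
    [0.2, 2.5, 1.1],
    [0.1, 2.9, 0.9],
]
labels = ["0", "0", "1", "1"]

for cluster in ("0", "1"):
    print(cluster, top_markers(expr, labels, genes, cluster, n_top=1))
```

A cluster whose top genes are mitochondrial, ribosomal, or cell-cycle genes points to QC or technical variation rather than a distinct cell type.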

For your larger UMAP (bottom right), for example, I'd be quite interested in finding out the set of genes that would resolve cluster 2 (green) and cluster 8 (blue). I would equally be interested in understanding the stripe running down the middle of cluster 1 (orange), cluster 0 (dark blue) and cluster 5 (brown).

Selecting the optimal number of PCs is not a simple question and typically the most practical solution is to iterate a few times and see if you get better resolution with a larger input set.
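When iterating over PC counts, it helps to quantify how much two runs actually differ. The sketch below cross-tabulates two label vectors and computes the adjusted Rand index using only the standard library. The label lists are invented, and the function names are hypothetical. scikit-learn's `adjusted_rand_score` computes the same index if you prefer a library call.

```python
from collections import Counter
from math import comb


def contingency(labels_a, labels_b):
    """Count cells for every (cluster in run A, cluster in run B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings: 1.0 means identical partitions."""
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")
    table = contingency(labels_a, labels_b)
    sum_cells = sum(comb(c, 2) for c in table.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both runs
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Made-up labels for six cells: clusters "0" and "1" of the 10-PC run
# end up merged into cluster "0" of the 20-PC run.
labels_10pc = ["0", "0", "1", "1", "2", "2"]
labels_20pc = ["0", "0", "0", "0", "1", "1"]

for (a, b), count in sorted(contingency(labels_10pc, labels_20pc).items()):
    print(f"10-PC cluster {a} -> 20-PC cluster {b}: {count} cells")
print("ARI:", adjusted_rand_index(labels_10pc, labels_20pc))
```

The contingency counts show which clusters merge or split between runs, which tells you where to look at marker genes first.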

Since you're already on Python, you can look into tools like cNMF, which allow you to optimise feature selection more readily than standard graph-based clustering approaches. However, I would strongly recommend exploring your data a bit more before moving on to extra technical tools. Once you've convinced yourself you understand what is going on, you'll be in a much better position to judge these kinds of things yourself.