Need advice regarding k-means clustering and PCA
6.0 years ago
brs120c • 0

I'm doing PCA on a single-cell RNA-seq dataset to determine cell types based on their transcriptomic profiles. Being a newbie in bioinformatics, I have a few questions that I'm hoping to find answers to here:

1) Is centering and scaling necessary if you are working with log2-transformed expression values? I'm using prcomp in R.

2) I'm seeing some interesting sub-clusters emerge when I start with k = 2 (k-means clustering), then take one of those 2 groups and cluster it again using k between 2 and 4. When I instead start with a large k, hoping to reveal all sub-clusters in one go, the clusters overlap a lot and don't look like convincing sub-clusters. Is there a drawback to this iterative approach, where I take the samples that fall in one cluster, cluster them again, and repeat until I see no convincing separation?

3) My PC1 and PC2 generally explain only about 6% and 4% of the total variance. This sounds really low, but given the noise level in single-cell RNA-seq data, is this to be expected? Btw, my dataset has ~10,000 genes and 70 samples.
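
To put the percentages in (3) in context, here is a rough Python sketch (not a formal test). It assumes the data are centered, so 70 samples can yield at most 69 components with non-zero variance. The helper names `variance_explained` and `uniform_baseline` are made up for illustration, and the "even spread" baseline is only a crude yardstick: pure noise would still produce somewhat uneven components.

```python
def variance_explained(sdev):
    """Proportion of total variance per component, given the
    component standard deviations (e.g. the sdev values a PCA reports)."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]


def uniform_baseline(n_samples):
    """Share each PC would get if the variance were spread evenly over
    the n_samples - 1 components available after centering (assumption)."""
    return 1.0 / (n_samples - 1)


# With 70 samples, an even spread gives each PC about 1.45% of the variance.
baseline = uniform_baseline(70)
print(f"{baseline:.4f}")  # prints 0.0145
```

Against that yardstick, 6% for PC1 is about four times the even-spread share, so it is low in absolute terms but not negligible.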



RNA-seq PCA RNA-Seq • 2.0k views

Yes, I would both scale and center log2-expression values. I've forgotten if prcomp does that by default. Given (3) (only 4-6% of variance explained per component), I would be very cautious in interpreting any sub-clusters you are observing in (2). Are you using all genes in the PCA and k-means, or have you filtered out genes with low variance, or genes that show no significant variation across your experimental or technical groups? Noise and normalization are certainly concerns for single-cell data.
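
For reference, R's `prcomp` centers by default (`center = TRUE`) but does not scale (`scale. = FALSE`), so scaling has to be requested explicitly. The two preprocessing steps mentioned above (dropping low-variance genes, then centering and scaling each gene) can be sketched in plain Python as follows. This is only an illustration under assumptions: the matrix is laid out as samples x genes, the function names and the toy values are made up, and the number of genes to keep is arbitrary.

```python
import statistics


def most_variable_genes(expr, n_keep):
    """Return the column indices of the n_keep genes with the highest
    sample variance in a samples x genes matrix."""
    n_genes = len(expr[0])
    variances = [
        statistics.variance([row[j] for row in expr]) for j in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_keep])


def standardize_genes(expr):
    """Center each gene (column) to mean 0 and scale it to sample SD 1."""
    scaled_columns = []
    for col in zip(*expr):  # iterate over genes
        mean = statistics.fmean(col)
        sd = statistics.stdev(col)  # sample SD (n - 1 denominator)
        if sd == 0:
            raise ValueError("constant gene; filter it out before scaling")
        scaled_columns.append([(x - mean) / sd for x in col])
    # Transpose back to samples x genes.
    return [list(row) for row in zip(*scaled_columns)]


# Toy log2-expression values: 4 samples x 3 genes (hypothetical data).
expr = [
    [1.0, 5.0, 2.0],
    [3.0, 5.0, 2.1],
    [5.0, 5.0, 1.9],
    [7.0, 5.0, 2.0],
]

keep = most_variable_genes(expr, n_keep=2)  # drops the constant gene
filtered = [[row[j] for j in keep] for row in expr]
scaled = standardize_genes(filtered)
```

Filtering before scaling matters: a constant gene has zero standard deviation and cannot be scaled, and near-constant genes get their noise inflated to unit variance.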

