Need advice regarding k-means clustering and PCA
6.0 years ago
brs120c • 0

I'm doing PCA on a single-cell RNA-seq dataset to determine cell types based on their transcriptomic profiles. Being a newbie in bioinformatics, I have a few questions that I'm hoping to find answers to here:

1) Is centering and scaling necessary if you are working with log2-transformed expression values? I'm using prcomp in R.

2) I'm seeing some interesting sub-clusters emerge when I start with k = 2 (k-means clustering), then take one of those 2 groups and cluster it again using k between 2 and 4. When I instead start with a large k, hoping to reveal all sub-clusters in one go, the clusters overlap a lot and don't look like convincing sub-clusters. Is there a drawback to this approach, where I take the samples that fall in one cluster, cluster them again, and repeat until I see no convincing separation?

3) My PC1 and PC2 generally explain only about 6% and 4% of the total variance, respectively. This sounds really low, but given the noise level in single-cell RNA-seq data, is this to be expected? Btw, my dataset has ~10000 genes and 70 samples.

Thanks!

 

RNA-seq PCA RNA-Seq • 2.0k views

Yes, I would both center and scale log2-expression values. Note that prcomp centers by default (center = TRUE) but does not scale unless you set scale. = TRUE. Given (3) (only 4-6% of the variance explained by PC1 and PC2), I would be very cautious in interpreting any sub-clusters you are observing in (2). Are you using all genes in the PCA and k-means, or have you filtered out genes with low variance, or genes that show no significant variation across your experimental or technical groups? Noise and normalization are certainly concerns for single-cell data.
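
For illustration, here is a rough sketch of what that could look like in R. It is not a recommended pipeline: the matrix `expr` is simulated random data standing in for the real log2 expression values, and the cutoff of 500 genes, the 10 PCs, and k = 2 are all arbitrary assumptions chosen only to make the example run.

```r
set.seed(1)

# Hypothetical stand-in for the real data: 70 samples (rows) x 2000 genes
# (columns) of log2 expression values. prcomp expects samples in rows, so a
# genes x samples matrix would need t() first.
n_samples <- 70
n_genes <- 2000
expr <- matrix(rnorm(n_samples * n_genes, mean = 5, sd = 1),
               nrow = n_samples, ncol = n_genes)

# Keep only the most variable genes; the cutoff of 500 is arbitrary.
gene_var <- apply(expr, 2, var)
top_genes <- order(gene_var, decreasing = TRUE)[1:500]
expr_top <- expr[, top_genes]

# prcomp centers by default (center = TRUE) but does not scale
# (scale. = FALSE), so scaling is requested explicitly here.
# Note: scale. = TRUE fails if any gene has zero variance.
pca <- prcomp(expr_top, center = TRUE, scale. = TRUE)

# Fraction of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 3))

# k-means on the first few PCs rather than on all genes; k = 2 and the
# choice of 10 PCs are illustrative only.
km <- kmeans(pca$x[, 1:10], centers = 2, nstart = 25)
print(table(km$cluster))
```

Since the simulated data here is pure noise, any clusters k-means returns are meaningless, which is a useful reminder: k-means always returns k groups, so comparing the variance explained before and after gene filtering is a quick sanity check before trusting sub-clusters.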
