I'm doing PCA on single-cell RNA-seq data to determine cell types based on their transcriptomic profiles. Being a newbie in bioinformatics, I have a few questions that I'm hoping to find answers to here:
- Are centering and scaling necessary if you are working with log2-transformed expression values? I'm using prcomp in R.
- I'm seeing some interesting sub-clusters emerge when I start with k = 2 (k-means clustering), then take one of those 2 groups and cluster it again using k values between 2 and 4. When I instead start with a large k, hoping to reveal all sub-clusters in one go, the clusters overlap a lot and don't look like convincing sub-clusters. Is there a drawback to my approach of taking the samples that fall in one cluster, clustering them again, and repeating this until I see no convincing separation?
- My PC1 and PC2 generally explain only about 6% and 4% of the total variance, respectively. This sounds really low, but given the noise level in single-cell RNA-seq data, is this to be expected? Btw, my dataset has ~10000 genes and 70 samples.
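
To make the questions concrete, here is a minimal R sketch of the workflow on simulated data. It is not my actual pipeline: the pure-noise matrix, its reduced size (1000 genes instead of ~10000), the choice to cluster on the first two PCs, and all variable names are assumptions for illustration only.

```r
# Simulated stand-in for the real data: 70 cells (rows) x 1000 genes (columns),
# pure noise on a log2-like scale. This is NOT the actual dataset.
set.seed(1)
n_cells <- 70
n_genes <- 1000
expr <- matrix(rnorm(n_cells * n_genes, mean = 5, sd = 1),
               nrow = n_cells, ncol = n_genes)

# Question 1: prcomp() centers by default (center = TRUE) but does not
# scale (scale. = FALSE), so these two calls differ only in per-gene scaling.
pca_centered <- prcomp(expr, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(expr, center = TRUE, scale. = TRUE)

# Question 3: fraction of total variance explained by each PC.
var_explained <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(var_explained[1:2], 3))

# Question 2: k-means with k = 2, then re-cluster the larger group.
# Assumption: clustering is done on the first two PC scores.
scores <- pca_scaled$x[, 1:2]
top <- kmeans(scores, centers = 2, nstart = 25)
big <- which.max(tabulate(top$cluster))          # label of the larger cluster
sub_scores <- scores[top$cluster == big, , drop = FALSE]
sub <- kmeans(sub_scores, centers = 2, nstart = 25)

print(table(top$cluster))
print(table(sub$cluster))
```

Since k-means always returns k groups, even on pure noise like this, getting clusters back at each step is not by itself evidence of real sub-structure, which is part of what I'm unsure about.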