-----------------

Question

K-means for RNA seq gene clustering

2.7 years ago

curiousmind007 ▴ 30

Hi all,

I have used this tutorial (https://2-bitbio.com/2017/10/clustering-rnaseq-data-using-k-means.html) for unsupervised clustering of an RNA-seq time-course dataset. The input in this tutorial is the raw count table, so I'm wondering how I could use the DEGs instead. I have 4 different timepoints and I use DESeq2 for the DEG analysis. What would be preferred as input, the raw counts or the DEGs?

Furthermore, how could I make a dotplot of the genes and the clusters, similar to the dotplot in this thread: "How to make k-means clustering plot for relative expression?"

Thank you!

kmeans • 4.3k views

updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by curiousmind007 ▴ 30

Try filtering your dataset for DEGs, then use z-score-scaled, rlog-transformed counts as input for k-means clustering.

2.7 years ago by ponganta ▴ 590

Thanks for your fast reply! After I run DESeq2, how do I filter the dataset? I'm a relative newbie, so any additional information/guidance would be highly appreciated. Thanks!

2.7 years ago by curiousmind007 ▴ 30

score 0 · Answer 1 · 2021-07-28

What would be preferred as input, the raw counts or the DEGs?

The input to that tutorial is raw counts, which then undergo normalisation. All clustering algorithms that are then applied are based on the Z-transformed (by row/gene) CPM values, as per these lines:

z <- cpm(y, normalized.lib.sizes=TRUE) # Counts per million, using the normalised library sizes.

scaledata <- t(scale(t(z))) # Centers and scales each gene (row).

scaledata is then used for clustering.
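As a minimal sketch of that last step, here is the same row-wise scaling followed by k-means in base R. The simulated matrix `z`, the gene names, and the choice of `centers = 3` are illustrative assumptions, not taken from the tutorial:

```r
set.seed(1)  # kmeans() uses random starts; fix the seed for reproducibility

# Hypothetical stand-in for the CPM matrix: 60 genes x 4 timepoints.
z <- matrix(rexp(60 * 4, rate = 0.01), nrow = 60,
            dimnames = list(paste0("gene", 1:60), paste0("T", 1:4)))

# Row-wise Z-transformation: scale() works on columns,
# so transpose, scale, and transpose back.
scaledata <- t(scale(t(z)))

# Genes with zero variance become NaN rows after scaling; drop them first.
scaledata <- scaledata[complete.cases(scaledata), , drop = FALSE]

# k = 3 is an arbitrary choice for illustration only.
km <- kmeans(scaledata, centers = 3, nstart = 25)

table(km$cluster)  # number of genes assigned to each cluster
```

Each row of scaledata has mean 0 and standard deviation 1, so genes are grouped by the shape of their profile across timepoints rather than by absolute expression level.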

If you then want to use the DEGs, just filter the scaledata object so that it contains only the DEGs, and then redo the clustering. For example:

degs <- c('ATM','ERBB2','ERBB3','BRCC3')

scaledata.filt <- scaledata[degs,]
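If the DEG list comes from DESeq2, one way to build `degs` is to threshold the results table. The sketch below is only a rough guide: `res` is simulated here as a stand-in for `as.data.frame(results(dds))`, `scaledata` is likewise simulated, and the cutoffs (adjusted p < 0.05, |log2FC| > 1) are illustrative assumptions rather than recommendations:

```r
set.seed(1)

# Hypothetical stand-in for as.data.frame(results(dds)) from DESeq2; only the
# two columns used below are simulated here.
genes <- paste0("gene", 1:100)
res <- data.frame(
  log2FoldChange = rnorm(100, sd = 2),
  padj = c(runif(95), rep(NA, 5)),  # padj can be NA for filtered genes
  row.names = genes
)

# Hypothetical stand-in for the scaled matrix from the tutorial.
scaledata <- t(scale(t(matrix(rnorm(100 * 4), nrow = 100,
                              dimnames = list(genes, paste0("T", 1:4))))))

# Illustrative thresholds only: adjusted p < 0.05 and |log2FC| > 1.
keep <- !is.na(res$padj) & res$padj < 0.05 & abs(res$log2FoldChange) > 1
degs <- rownames(res)[keep]

# intersect() guards against DEGs that are missing from scaledata, which
# would otherwise trigger a "subscript out of bounds" error.
degs <- intersect(degs, rownames(scaledata))
scaledata.filt <- scaledata[degs, , drop = FALSE]

dim(scaledata.filt)
```

The `drop = FALSE` keeps scaledata.filt a matrix even if only one gene passes the filter.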

-----------------

Furthermore, how could I make a dotplot of the genes and the clusters, similar to the dotplot in this thread: "How to make k-means clustering plot for relative expression?"

It may help if you clarify specifically what you want to visualise. While those figures may look colourful and 'nice', what they actually show is what matters to most non-sensationalistic journals. Is it:

a plot of a single gene's expression per cluster?
a plot of a summarised 'score' per cluster?
a plot of a summarised score per gene per cluster (k-means center or PAM medoid)?

...what do you want to show?
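If it is the second option, a rough base-R sketch could plot the k-means centers (mean Z-score per cluster) across timepoints. The data and the three temporal patterns below are made up purely for illustration, and `centers = 3` is an assumption:

```r
set.seed(1)

# Hypothetical scaled expression matrix: 90 genes x 4 timepoints, built from
# three made-up temporal patterns plus noise.
patterns <- rbind(up = c(-1, -0.5, 0.5, 1), down = c(1, 0.5, -0.5, -1),
                  peak = c(-1, 1, 1, -1))
expr <- patterns[rep(1:3, each = 30), ] + rnorm(90 * 4, sd = 0.3)
dimnames(expr) <- list(paste0("gene", 1:90), paste0("T", 1:4))
scaledata <- t(scale(t(expr)))

km <- kmeans(scaledata, centers = 3, nstart = 25)

# km$centers is a clusters x timepoints matrix of mean Z-scores; matplot()
# draws one line per column, hence the transpose.
matplot(t(km$centers), type = "b", pch = 19, lty = 1, col = 1:3,
        xaxt = "n", xlab = "Timepoint", ylab = "Mean Z-score (cluster center)")
axis(1, at = seq_len(ncol(scaledata)), labels = colnames(scaledata))
legend("topright", legend = paste("Cluster", 1:3), col = 1:3, lty = 1, pch = 19)
```

Per-gene profiles could be overlaid in the same way by subsetting scaledata on km$cluster.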

Kevin