What would be preferred as input, the raw counts or the DEGs?
The input to that tutorial is raw counts, which then undergo normalisation. All clustering algorithms that are then applied are based on the Z-transformed (by row/gene) CPM+0.25 values, as per these lines:
z <- cpm(y, normalized.lib.sizes=TRUE)
scaledata <- t(scale(t(z))) # Centers and scales data.
scaledata is then used for clustering.
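To make the row-wise Z-transform concrete, here is a minimal base-R sketch. The matrix z below is made-up toy data standing in for the CPM matrix (the gene and sample names are hypothetical), so it runs without edgeR:

```r
# Toy CPM-like matrix: 4 hypothetical genes (rows) x 5 hypothetical samples (columns).
z <- matrix(
  c(10, 12, 11, 30, 32,
     5,  4,  6,  1,  2,
   100, 90, 95, 98, 97,
     8, 20,  9, 22, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:5))
)

# scale() centers and scales columns, so transpose to put genes in columns,
# scale, then transpose back to a genes-by-samples matrix.
scaledata <- t(scale(t(z)))

# Each gene (row) should now have mean 0 and standard deviation 1.
row_means <- rowMeans(scaledata)
row_sds <- apply(scaledata, 1, sd)
print(round(row_means, 10))
print(round(row_sds, 10))
```

Because every gene ends up on the same scale, clustering then groups genes by the shape of their expression profile across samples rather than by absolute abundance.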
If you want to then use the DEGs, please just filter the scaledata
object to only comprise the DEGs, and then re-do clustering. For example:
degs <- c('ATM','ERBB2','ERBB3','BRCC3')
scaledata.filt <- scaledata[degs,]
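A slightly more defensive version of the same idea, as a sketch only: the matrix is random toy data, "NOTAGENE" is a deliberately missing hypothetical name, and k = 2 is an arbitrary choice for illustration. It guards against DEG names that are absent from the matrix and then re-runs k-means on the filtered object:

```r
set.seed(42)  # reproducible toy data and k-means starts

# Hypothetical stand-in for the Z-transformed matrix: 8 genes x 6 samples.
genes <- c("ATM", "ERBB2", "ERBB3", "BRCC3", "TP53", "MYC", "EGFR", "KRAS")
raw <- matrix(rnorm(8 * 6, mean = 10, sd = 2), nrow = 8,
              dimnames = list(genes, paste0("sample", 1:6)))
scaledata <- t(scale(t(raw)))

# DEG list; "NOTAGENE" is deliberately absent to show the guard.
degs <- c("ATM", "ERBB2", "ERBB3", "BRCC3", "NOTAGENE")

# Keep only DEGs that are present; scaledata[degs, ] would stop with
# "subscript out of bounds" for the missing name.
present <- intersect(degs, rownames(scaledata))
scaledata.filt <- scaledata[present, , drop = FALSE]

# Re-do the clustering on the filtered matrix (k = 2 is arbitrary here).
km <- kmeans(scaledata.filt, centers = 2, nstart = 25)
print(km$cluster)
```

drop = FALSE keeps the result a matrix even if only one DEG survives the filter.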
-----------------
Furthermore, how could I make a dotplot of the genes and the clusters,
similar to the dotplot in this thread: How to make k-means clustering
plot for relative expression?
It may help if you clarify specifically what you are trying to visualise. While those figures may look colourful and 'nice', what they actually show is what matters to most non-sensationalistic journals. Is it:
- plot of a single gene's expression per cluster?
- plot of a summarised 'score' per cluster?
- plot of a summarised score per gene per cluster (k-means center or PAM medoid)?
...what do you want to show?
Kevin
Try filtering your dataset for DEGs, then use z-score-scaled, rlog-transformed counts as input for k-means clustering.
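One way this suggestion could look in practice is sketched below, assuming DESeq2 is installed. The dataset is simulated with makeExampleDESeqDataSet(), and the cut-offs (padj < 0.05, |log2FC| > 1) and k = 2 are illustrative assumptions, not recommendations from this thread:

```r
library(DESeq2)

set.seed(1)  # reproducible simulation and k-means starts

# Simulated counts: 1000 genes x 12 samples, two conditions (toy data only).
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds)

# Pull the results table and pick DEGs; which() drops genes with NA padj.
res <- results(dds)
degs <- rownames(res)[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1)]

# rlog-transform, then keep only the DEG rows.
rld <- rlog(dds, blind = FALSE)
mat <- assay(rld)[degs, , drop = FALSE]

# Z-score each gene (row), as in the tutorial's scaledata step.
scaled <- t(scale(t(mat)))

# k-means on the scaled DEG matrix (k = 2 chosen arbitrarily).
km <- kmeans(scaled, centers = 2, nstart = 25)
table(km$cluster)
```

With real data, dds would come from DESeqDataSetFromMatrix() on your own counts and sample table, and the number of clusters should be chosen from the data rather than fixed in advance.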
Thanks for your fast reply! After I run DESeq, how will I filter the dataset? I'm relatively new to this, so any additional information/guidance would be highly appreciated. Thanks!