Usually, yes, raw counts are normalised, logged, and then some other transformation can sometimes occur prior to clustering. I'd be interested to see your data distribution after your step 2). You can plot the distribution with hist()
The actual differential expression statistical tests would have been performed on the normalised un-logged counts.
The type of data that is used for hierarchical clustering should match the chosen distance metric when calculating the distance matrix using dist() (you have not shown how you used dist()). The default of Euclidean Distance assumes that your data follows a normal distribution; thus, if using the defaults and using RNA-seq data, you should be plotting logged or Z-scaled counts, or any other data that has a normal distribution. If you're using other counts, such as the negative binomial normalised counts prior to log transformation, then you could use 1 minus Pearson / Spearman correlation as your distance metric.
You have not mentioned heatmaps, but the heatmap functions from gplots package in R will calculate the row and column dendrograms from whatever data you supply to the function, but they then usually (default) perform a transformation that involves centering and then dividing by the standard deviation for the purposes of representing values in the heatmap itself. Here is the actual code from heatmap.2:
if (scale == "row") {
retval$rowMeans <- rm <- rowMeans(x, na.rm = na.rm)
x <- sweep(x, 1, rm)
retval$rowSDs <- sx <- apply(x, 1, sd, na.rm = na.rm)
x <- sweep(x, 1, sx, "/")
}
else if (scale == "column") {
retval$colMeans <- rm <- colMeans(x, na.rm = na.rm)
x <- sweep(x, 2, rm)
retval$colSDs <- sx <- apply(x, 2, sd, na.rm = na.rm)
x <- sweep(x, 2, sx, "/")
}
In fact, I have just seen a similar answer by my colleague Michael from all of 6 years ago: Scale Data Before Drawing Heatmap Or Using Heatmap(..., Scale="Columan") In R?
Kevin
Hi Kevin. I'm new to this. Sorry if this question is a bit basic. When referring to "normal distribution", it means "all genes' expression in one sample" or "one gene's expression in all samples"?
Hello, by 'normal distribution', I mean this:
[souce: https://www.mathsisfun.com/data/standard-normal-distribution.html]
Logged and/or Z-scaled RNA-seq data should follow this distribution (but not always)
-------------------------------------
RNA-seq raw and nomalised data, however, follow the negative binomial distribution:
Thank you! I'm looking over your answers about PCA. They are very helpful, thanks again!