I have an RNA-seq expression table in TPM (Transcripts Per Million) and I want to perform a hierarchical clustering analysis.
My first intuition was to apply a log2(TPM+1) transformation and scale each gene (subtract the mean and divide by the standard deviation) before computing Euclidean distances and performing complete-linkage clustering.
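To make that first pipeline concrete, here is a minimal sketch in Python with NumPy and SciPy. It assumes a genes x samples matrix and per-gene scaling; the function name `cluster_scaled_log_tpm` and the toy matrix are made up for illustration and are not my real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def cluster_scaled_log_tpm(tpm):
    """Cluster samples from a genes x samples TPM matrix (hypothetical helper)."""
    log_tpm = np.log2(np.asarray(tpm, dtype=float) + 1.0)

    # Scale each gene (row): subtract its mean, divide by its standard deviation.
    mean = log_tpm.mean(axis=1, keepdims=True)
    sd = log_tpm.std(axis=1, ddof=1, keepdims=True)

    # Genes with zero variance cannot be scaled, so they are dropped here.
    keep = sd[:, 0] > 0
    z = (log_tpm[keep] - mean[keep]) / sd[keep]

    # Samples are columns, so transpose to get one row per sample.
    dist = pdist(z.T, metric="euclidean")
    return linkage(dist, method="complete")


# Toy data only: 200 genes x 6 samples of random positive values.
rng = np.random.default_rng(0)
toy_tpm = rng.gamma(shape=2.0, scale=50.0, size=(200, 6))
Z = cluster_scaled_log_tpm(toy_tpm)
print(Z.shape)  # one row per merge, i.e. (n_samples - 1, 4)
```

The returned linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.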
However, since all values are in the same unit (gene expression in TPM), there is no need to scale in order to standardize different scales or units. I usually scale, even for heatmaps, because it produces a nicely balanced visualization that highlights the samples in which each gene is more or less expressed. Here, however, the aim is different: I want to see whether replicates cluster together.
Therefore, I think that computing Euclidean distances on the log2(TPM+1)-transformed matrix, without scaling, would be the best approach. This is somewhat similar to what the edgeR vignette suggests in section 2.16, "Clustering, heatmaps etc", where logCPM values are used. However, I am not 100% sure whether there is a statistical reason to apply only this transformation or whether I should scale too.
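As a rough way to test the "do replicates cluster together" question, the sketch below cuts the tree into as many groups as there are conditions and checks that each condition lands in its own group. The function name, the `scale` flag, and the simulated two-condition data are all assumptions for illustration; nothing here comes from edgeR.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def replicates_cluster_together(tpm, conditions, scale=False):
    """True if cutting the tree into one group per condition recovers the conditions."""
    x = np.log2(np.asarray(tpm, dtype=float) + 1.0)
    if scale:
        # Optional per-gene z-score; zero-variance genes are dropped first.
        x = x[x.std(axis=1, ddof=1) > 0]
        x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)

    # Euclidean distance between samples (columns), complete linkage.
    tree = linkage(pdist(x.T, metric="euclidean"), method="complete")

    k = len(set(conditions))
    labels = fcluster(tree, t=k, criterion="maxclust").tolist()

    # Each condition must map to exactly one cluster, and all k clusters must be used.
    pairs = set(zip(conditions, labels))
    return len(pairs) == k and len(set(labels)) == k


# Simulated data (assumption): 300 genes, 2 conditions x 3 replicates,
# with the first 50 genes 8-fold higher in condition B plus mild noise.
rng = np.random.default_rng(1)
base = rng.gamma(shape=2.0, scale=50.0, size=(300, 1))
fold = np.ones((300, 6))
fold[:50, 3:] = 8.0
sim_tpm = base * fold * rng.lognormal(mean=0.0, sigma=0.1, size=(300, 6))
sim_conditions = ["A", "A", "A", "B", "B", "B"]

unscaled_ok = replicates_cluster_together(sim_tpm, sim_conditions, scale=False)
scaled_ok = replicates_cluster_together(sim_tpm, sim_conditions, scale=True)
print(unscaled_ok, scaled_ok)
```

Running it with `scale=False` and `scale=True` on the same matrix shows directly whether scaling changes the replicate grouping for a given dataset.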
Any advice on which is the proper transformation for hierarchical clustering:
the raw TPM matrix only;
log2(TPM+1) without scaling;
log2(TPM+1) followed by scaling.
Thank you for any help or advice. I know there are similar posts on Biostars, but I did not find any that asks this particular question. If there is one and you could point me to it, I would be glad.