Question

Which is the proper RNA-seq count table transformation to perform hierarchical clustering analysis?

3.7 years ago

antonioggsousa 3.2k

Hi,

I have an RNA-seq count table in TPM (Transcripts Per Million). Now I want to perform a hierarchical clustering analysis.

My first intuition was to log2(TPM+1) transform the data and scale each gene (subtract the mean and divide by the standard deviation) before computing the Euclidean distance and performing complete-linkage clustering.
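To make the steps concrete, here is a minimal sketch of that pipeline in plain Python (standard library only). The toy TPM matrix, the sample names and the helper function names are all hypothetical, and the clustering loop is a naive illustration rather than what any particular package does internally.

```python
import math
from statistics import mean, stdev

# Hypothetical toy TPM matrix: rows are genes, columns are samples.
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm = {
    "geneA": [1000.0, 1100.0, 400.0, 420.0],
    "geneB": [5.0, 6.0, 20.0, 18.0],
    "geneC": [50.0, 48.0, 52.0, 51.0],
}

def log_transform(matrix):
    """Apply log2(TPM + 1) to every value."""
    return {g: [math.log2(v + 1.0) for v in row] for g, row in matrix.items()}

def scale_genes(matrix):
    """Z-score each gene across samples (subtract mean, divide by sample SD)."""
    scaled = {}
    for g, row in matrix.items():
        m, s = mean(row), stdev(row)
        if s == 0:
            continue  # a constant gene cannot be scaled and adds no distance
        scaled[g] = [(v - m) / s for v in row]
    return scaled

def sample_distances(matrix, n_samples):
    """Euclidean distance between every pair of sample columns."""
    cols = [[row[j] for row in matrix.values()] for j in range(n_samples)]
    return [[math.dist(a, b) for b in cols] for a in cols]

def complete_linkage(dist, labels):
    """Naive agglomerative clustering; returns the merges in order."""
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((tuple(labels[i] for i in clusters[x]),
                       tuple(labels[i] for i in clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

scaled = scale_genes(log_transform(tpm))
merges = complete_linkage(sample_distances(scaled, len(samples)), samples)
for left, right, height in merges:
    print(left, "+", right, f"at height {height:.2f}")
```

Skipping the `scale_genes` call gives the unscaled log2(TPM+1) variant, so the same sketch can be used to compare the two.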

However, since all values share the same unit (gene expression in TPM), there is no need to scale in order to reconcile different scales or units. I usually scale anyway, even for heatmaps, because it produces a nicely balanced visualization that highlights the samples in which each gene is more or less expressed. The aim here is different, though: I want to see whether replicates cluster together.

Therefore, I think that computing the Euclidean distance on the log2(TPM+1)-transformed matrix would be the best approach (somewhat similar to the edgeR vignette, section 2.16 "Clustering, heatmaps etc", which suggests using logCPM values). However, I am not 100% sure whether there is a statistical reason to apply only this transformation or whether I should scale too.

Any advice on which is the proper transformation for hierarchical clustering?

1. only the raw TPM matrix;
2. transformed log2(TPM+1);
3. scaled, transformed log2(TPM+1).

Thank you for any help or advice. I know there are similar posts on Biostars, but I did not find any that asks this particular question. If there is one and you could point me to it, I would be glad.

António

RNA-Seq stats clustering • 2.6k views

ADD COMMENT • link updated 3.7 years ago by benformatics 3.9k • written 3.7 years ago by antonioggsousa 3.2k

score 3 · Accepted Answer · 2020-08-18

3.7 years ago

benformatics 3.9k

Any of those would be acceptable.

However, I think only the latter two options would give you nice results.

See past discussions:

ADD COMMENT • link 3.7 years ago by benformatics 3.9k

I vote for the third option: read counts, even on the log2 scale, vary greatly between genes regardless of their biological "importance", so transforming to the Z-scale compensates for this. I would go for a more sophisticated normalization though, either calcNormFactors followed by cpm() in edgeR, or the DESeq2 implementations of vst or fpkm, all of which correct for both library size and composition (and some, like fpkm, additionally for gene length). I believe edgeR also has an rpkm function, which uses the TMM size factors.
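One way to examine this point is to check how much each gene contributes to the squared Euclidean distance between two samples under each of the three options. The sketch below is plain Python with made-up TPM values; the gene names and the `contributions` helper are hypothetical and only serve the illustration.

```python
import math
from statistics import mean, stdev

# Hypothetical TPM values for three genes across four samples.
tpm = {
    "high_gene": [2000.0, 2600.0, 1500.0, 2200.0],
    "mid_gene": [40.0, 90.0, 35.0, 80.0],
    "low_gene": [2.0, 9.0, 1.0, 8.0],
}

def log2p1(row):
    """log2(TPM + 1) for one gene."""
    return [math.log2(v + 1.0) for v in row]

def zscore(row):
    """Centre and scale one gene across samples."""
    m, s = mean(row), stdev(row)
    return [(v - m) / s for v in row]

def contributions(matrix, i, j):
    """Per-gene share of the squared Euclidean distance between samples i and j."""
    sq = {g: (row[i] - row[j]) ** 2 for g, row in matrix.items()}
    total = sum(sq.values())
    return {g: v / total for g, v in sq.items()}

logged = {g: log2p1(r) for g, r in tpm.items()}
scaled = {g: zscore(r) for g, r in logged.items()}

# Compare the first two samples under each transformation.
shares = {
    "raw TPM": contributions(tpm, 0, 1),
    "log2(TPM+1)": contributions(logged, 0, 1),
    "scaled log2(TPM+1)": contributions(scaled, 0, 1),
}
for name, per_gene in shares.items():
    print(name, {g: round(s, 3) for g, s in per_gene.items()})
```

Note that subtracting the gene mean cancels out in the difference between two samples, so it is only the division by the standard deviation that changes the distances: after scaling, every gene has the same variance across samples and none of them can dominate purely because of its spread.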

ADD REPLY • link 3.7 years ago by ATpoint 81k

Thank you both for your prompt answers.

So, I'll scale the log2(TPM+1)-transformed values.

I understand your point regarding normalization, but I still need to stick with the TPM matrix. I think it is not as good as the alternatives you mentioned, but it still accounts for library size and gene length.

António

ADD REPLY • link 3.7 years ago by antonioggsousa 3.2k