Question: Which matrix should be used to draw a heatmap for RNA-seq data?
15 months ago by
United States
pbigbig210 wrote:

Hi everyone,

I am confused about which type of expression matrix I should use for heatmap visualization of RNA-seq data. There are three options, listed below (the raw count matrix was obtained from featureCounts, and TMM cross-sample normalization was performed with edgeR):

  1. TMM-normalized raw count

  2. TPM value calculated from raw count

  3. TPM value calculated from raw count, then TMM-normalized

Additionally, should I apply a log2(x+1) transformation before row-scaling when drawing the heatmap? I ask because in some cases I saw that row-scaling alone was enough to bring out the differences. I am new to this, so any detailed explanation is highly appreciated.

Thank you very much in advance!

written 15 months ago by pbigbig210
15 months ago by
Sheffield, UK
i.sudbery9.1k wrote:

In what follows, I'm assuming that you want to have genes in rows and samples in columns. The answer might be different if this is not the case, hopefully for reasons that will make sense after reading the rest.

Both the TMM and TPM procedures include a step to normalise for differences between samples, in an attempt to make the measurement for a given gene comparable between samples. However, TMM does a better job of this.

TPM also includes steps that attempt to normalize expression values such that they are comparable between two different genes WITHIN one sample (e.g. is Gene A or Gene B more highly expressed).
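As a rough illustration of the within-sample step, here is a minimal sketch of a TPM calculation for a single sample, assuming gene lengths in kilobases are available. The function name `tpm` and the toy numbers are made up for illustration; this is not the code of any particular package.

```python
def tpm(counts, lengths_kb):
    """Convert one sample's raw counts to TPM, given gene lengths in kilobases."""
    # Step 1: divide each count by gene length (reads per kilobase).
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    # Step 2: rescale so that the sample sums to one million.
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical sample with three genes of different lengths.
counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 3.0]
sample_tpm = tpm(counts, lengths_kb)
```

In this toy sample the three genes have different raw counts but the same number of reads per kilobase, so they end up with the same TPM. That is the sense in which TPM makes genes comparable within one sample.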

Normally the recommendation, if you have to choose between counts and TPM, is to choose TPM (or TPM calculated from TMM-normalised counts). But if you plan to do row normalisation, then this will undo the between-gene part of the TPM transformation anyway.
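To see why row normalisation cancels the gene-length step, here is a small sketch, assuming row scaling means a per-gene z-score (which is what most heatmap tools do when asked to scale by row). The helper `zscore_row` and the numbers are hypothetical.

```python
from statistics import mean, pstdev

def zscore_row(row):
    """Centre a gene's values on zero and scale them to unit standard deviation."""
    m = mean(row)
    s = pstdev(row)
    return [(x - m) / s for x in row]

# One gene across four samples (hypothetical, already normalised between samples).
row_counts = [10.0, 20.0, 40.0, 30.0]

# Dividing by gene length multiplies the whole row by one constant.
gene_length_kb = 2.5
row_per_kb = [x / gene_length_kb for x in row_counts]

z_counts = zscore_row(row_counts)
z_per_kb = zscore_row(row_per_kb)
```

Both z-scored rows are identical, because any per-gene constant drops out of a z-score. Note that this only covers the gene-length factor: per-sample scaling acts on columns and is not removed by row scaling, so between-sample normalisation still matters.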

However, as hinted at in your final question, there is another transformation that needs to be considered: variance stabilisation. Log2 is often used as a variance-stabilising transform in many fields, but because RNA-seq count data contain a lot of zeros (and log2(0) is undefined), it is often not suitable on its own. One solution is to add a pseudo-count: this both further stabilises the variance and deals with the zeros problem, but the choice of +1 is pretty arbitrary. Luckily, there are more sophisticated alternatives, the most common being the regularized log (rlog) and the variance stabilising transformation (vst), both provided by DESeq2. These transforms also normalise the raw counts in a manner similar to the TMM normalization of edgeR.
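The pseudo-count approach can be sketched as follows. This is only the simple log2(x + pseudo-count) transform, not rlog or vst, and the helper name `log2_pseudo` and the example values are invented for illustration.

```python
import math

def log2_pseudo(row, pseudo=1.0):
    """Apply log2(x + pseudo) so that zero counts stay finite."""
    return [math.log2(x + pseudo) for x in row]

# Hypothetical normalised counts for one gene, including a zero.
row = [0, 1, 3, 1023]

logged_plus_one = log2_pseudo(row, pseudo=1.0)
logged_plus_half = log2_pseudo(row, pseudo=0.5)

# The apparent gap between a count of 0 and a count of 1 depends on the pseudo-count.
gap_plus_one = logged_plus_one[1] - logged_plus_one[0]
gap_plus_half = logged_plus_half[1] - logged_plus_half[0]
```

With a pseudo-count of 1 the step from 0 to 1 is exactly one log2 unit, while with 0.5 it is log2(3), about 1.58. This is why the choice of pseudo-count mostly affects how low-count genes look on the heatmap.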

A final alternative, if you wish to stay in the edgeR universe, is voom from the limma package, which will take an edgeR object and apply transforms so that its variance is somewhat stabilised, but I know less about that.

written 15 months ago by i.sudbery9.1k

Thank you very much for your comprehensive answer!

So as I understand it, in terms of the expression matrix, the purpose of TPM is comparison within a column and the purpose of TMM is comparison within a row. I think scaling by row will only benefit those who are only interested in clusters of highly/lowly expressed genes in a relative sense (high/low compared to the same gene in other samples). Clustering on a matrix without row scaling may give more informative clusters, I suppose.

written 15 months ago by pbigbig210

Depends on your distance metric. Euclidean distance on a row-scaled matrix is roughly equivalent to Pearson (correlation) distance on a non-scaled matrix.
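A quick numerical check of that equivalence, assuming rows are z-scored with the population standard deviation: the squared Euclidean distance between two scaled rows then equals 2n(1 - r), where r is the Pearson correlation of the unscaled rows and n is the number of samples. The gene values and helper names below are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscore_row(row):
    """Z-score a gene's values using the population standard deviation."""
    m, s = mean(row), pstdev(row)
    return [(x - m) / s for x in row]

def pearson(x, y):
    """Pearson correlation between two equal-length rows."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)

def sq_euclidean(x, y):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Two hypothetical genes measured in four samples, on very different scales.
gene_a = [1.0, 2.0, 3.0, 5.0]
gene_b = [10.0, 30.0, 20.0, 60.0]
n = len(gene_a)

d2_scaled = sq_euclidean(zscore_row(gene_a), zscore_row(gene_b))
d2_from_r = 2 * n * (1 - pearson(gene_a, gene_b))
```

The two quantities agree up to floating-point error, so clustering scaled rows by Euclidean distance orders gene pairs the same way as clustering unscaled rows by 1 - r.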

written 15 months ago by i.sudbery9.1k
15 months ago by
predeus1.4k wrote:

It depends on why you need the visualization, right? Visualization can be done to explore the data, or to make a point in a publication, etc.

If you want to explore the data, you can also try to specify for yourself what exactly you want to find out: specific genes? Certain pathways? All of this matters a lot.

Various normalizations and transformations can be quite useful, but they also distort the original data.

written 15 months ago by predeus1.4k
