Question

Data for drawing Heatmaps (RNA-seq)

2

6.5 years ago

sd.gamboa.t ▴ 50

Hello,

I'd like some advice, please.

I performed a de novo assembly of the transcriptome of my target organism from RNA-seq reads using Trinity. Next, I followed the Trinity pipeline and scripts to get the following expression matrices for the assembled genes:

FPKM
TPM
TMM

My question is: which of these values (FPKM, TPM or TMM) should I use to perform a hierarchical clustering of the genes and draw a heatmap?

I'd like to use TMM because it is normalized across samples (and the Trinity scripts use TMM for clustering and heatmaps). However, I've seen that some papers use FPKM values instead.

Also, which kind of normalization is better for drawing a heatmap? z-score or centered log2 transformation?

Thanks in advance.

Samuel

RNA-Seq Heatmap FPKM TPM TMM • 9.2k views

updated 6.5 years ago by Corentin ▴ 600 • written 6.5 years ago by sd.gamboa.t ▴ 50

0

I think VST (variance-stabilizing transformation) values from DESeq2 might be a good choice for heatmaps and MDS, since they correct for sequencing depth and composition bias. However, I think VST does not account for gene length, and I am not sure whether it is possible to get length-normalized VST values.

5.6 years ago by firestar ★ 1.6k

score 0 · Answer 1 · 2017-10-12

Hi,

The normalization should be performed by the tool you are using (the most popular being edgeR, DESeq2 and limma). Each of them normalizes the data in a different way, but if your data is robust (one of the important things is having enough replicates), they should give similar results.

If you are using Trinity, there is a script called "run_DE_analysis.pl" which will perform the normalization (using edgeR, DESeq2 or limma, as you choose) and pairwise comparisons among your samples. To learn how to run it, you can follow this Trinity tutorial: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression. As you can read on that page, the script asks for a "matrix of raw read counts (not normalized!)". The tutorial explains every step (including drawing heatmaps).

Now, if you want information on how FPKM, RPKM and TPM work, I find this video useful (and by the way, all the videos from StatQuest are good): https://www.youtube.com/watch?time_continue=608&v=TTUrtCY2k-w. Basically, FPKM, RPKM and TPM normalize by library size (sequencing depth) and transcript length, which should be enough if all your samples come from the same tissue.
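As a rough illustration of what "normalize by library size and transcript length" means, here is a minimal sketch of both calculations for a single sample. The counts and lengths are made-up toy numbers, and the function names are my own, not taken from Trinity or any other package.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts)
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return [c * 1e9 / (total_fragments * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Toy sample (hypothetical values): three transcripts
counts = [500, 500, 1000]        # fragments assigned to each transcript
lengths_bp = [1000, 2000, 4000]  # transcript lengths in bases

sample_fpkm = fpkm(counts, lengths_bp)
sample_tpm = tpm(counts, lengths_bp)

for name, f, t in zip(["t1", "t2", "t3"], sample_fpkm, sample_tpm):
    print(f"{name}\tFPKM={f:.1f}\tTPM={t:.1f}")
```

Within one sample, TPM is just FPKM rescaled so that the values sum to one million, which is why the two rank transcripts identically inside a sample.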

I do not know a lot about TMM, but as I understand it, it also adjusts for library composition. This means it is useful if you want to compare different tissues: if a gene is heavily expressed in one tissue and not in the other, it will "absorb" most of the reads and the other genes will seem less expressed. Here is a video explaining how DESeq2 normalizes data:
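The composition effect described above can also be reproduced with a few lines of arithmetic. The sketch below uses toy counts and a median-of-ratios size factor, which is the idea behind DESeq2's normalization. It is not the TMM algorithm from edgeR, which uses a trimmed mean of log ratios, and it is not the code of either package. All names and numbers are assumptions for illustration.

```python
import math
from statistics import median


def size_factors(samples):
    """Median-of-ratios size factors; `samples` is a list of per-sample count lists."""
    n_genes = len(samples[0])
    # Only genes with non-zero counts in every sample have a finite log geometric mean
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in usable
    }
    return [
        math.exp(median([math.log(s[g]) - log_geo_mean[g] for g in usable]))
        for s in samples
    ]


# Two toy samples with the same depth (400 reads each). Genes 1-3 are assumed
# to be truly unchanged; gene 4 is strongly induced in tissue B and "absorbs" reads.
tissue_a = [100, 100, 100, 100]
tissue_b = [20, 20, 20, 340]

# Depth-only scaling (counts per million) makes genes 1-3 look 5x lower in B
cpm_a = [c * 1e6 / sum(tissue_a) for c in tissue_a]
cpm_b = [c * 1e6 / sum(tissue_b) for c in tissue_b]

# Median-of-ratios scaling recovers equal values for the unchanged genes
sf_a, sf_b = size_factors([tissue_a, tissue_b])
norm_a = [c / sf_a for c in tissue_a]
norm_b = [c / sf_b for c in tissue_b]

print("CPM gene 1:", cpm_a[0], "vs", cpm_b[0])
print("Size-factor normalized gene 1:", round(norm_a[0], 2), "vs", round(norm_b[0], 2))
```

With depth-only scaling, the three unchanged genes appear five times lower in tissue B. After dividing by the size factors they match again, and only gene 4 stands out.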

So in the end it depends on your experiment / data type.
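For the heatmap itself, the two row transformations mentioned in the question (centered log2 and z-score) can be sketched as below. This is only an illustration on one made-up row of normalized expression values. The pseudocount of 1 and the use of the sample standard deviation are my assumptions, not something prescribed by Trinity.

```python
import math
from statistics import mean, stdev


def log2_center(row, pseudocount=1.0):
    """log2(x + pseudocount), then subtract the row mean (keeps fold-change scale)."""
    logged = [math.log2(x + pseudocount) for x in row]
    row_mean = mean(logged)
    return [v - row_mean for v in logged]


def zscore(row, pseudocount=1.0):
    """Centered log2 values divided by the row standard deviation (unit variance)."""
    centered = log2_center(row, pseudocount)
    sd = stdev(centered)
    if sd == 0:
        return [0.0] * len(row)  # flat row: nothing to scale
    return [v / sd for v in centered]


# One gene across four samples (hypothetical normalized values)
row = [0, 3, 15, 63]

centered_row = log2_center(row)
z_row = zscore(row)

print("centered log2:", centered_row)
print("z-score:", [round(v, 3) for v in z_row])
```

Centered log2 values keep the magnitude of the change, so a gene with a large fold change gets stronger colors than one with a small change. Z-scores give every gene the same spread, which emphasizes the pattern across samples over its size.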

Corentin