Question

Which data are more suitable for a heatmap?

4.8 years ago

star ▴ 350

I have RNA-seq data (a count table). I would like to draw a heatmap of the samples and cluster them based on RPKM values. Which RPKM output should I use as input for the heatmap?

data = count table --> output of HTSeq

1) only normalized data like below:

data_list <- DGEList(counts = data, genes = data[1:3])
data_norm <- calcNormFactors(data_list)
RPKM <- rpkm(data_norm, gene.length = data_norm$genes$gene_length)

2) normalized data after removing low expressed genes like below:

data_list <- DGEList(counts = data, genes = data[1:3])
# keep genes with CPM > 0.5 in at least 2 samples (computed on the DGEList, not the raw table)
data_filter <- rowSums(cpm(data_list) > 0.5) >= 2
data_keep <- data_list[data_filter, , keep.lib.sizes = FALSE]
data_keep_norm <- calcNormFactors(data_keep)
RPKM <- rpkm(data_keep_norm, gene.length = data_keep_norm$genes$gene_length)

Also, I would like to know which scaling is more suitable for a heatmap:

3) log2(RPKM + 0.1)

4) Z-score(RPKM) using the zFPKM package

5) Row Z-score (using the option scale = 'row' in heatmap.2)

RNA-Seq heatmap edgeR next-gen • 3.3k views

ADD COMMENT • link updated 4.8 years ago by ATpoint 81k • written 4.8 years ago by star ▴ 350

There is no standard. Use whichever. My preference would be zFPKM output, and to then switch off additional scaling in the heatmap function.
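As a minimal sketch of what "switching off additional scaling" can look like: base R's `heatmap()` scales rows by default (`scale = "row"`), whereas `heatmap.2()` from gplots defaults to `scale = "none"`. The matrix `z` below is simulated and only stands in for values that are already standardized; it is not real zFPKM output.

```r
set.seed(1)

# Hypothetical stand-in for already-standardized values (10 genes x 4 samples)
z <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))

# Draw to a null device here; drop the pdf()/dev.off() lines to see the plot
pdf(NULL)
# scale = "none" stops heatmap() from Z-scoring the rows a second time
hm <- heatmap(z, scale = "none")
dev.off()

# heatmap() invisibly returns the row and column order used for the clustering
hm$rowInd
hm$colInd
```

With `heatmap.2()` or ComplexHeatmap the idea is the same: pass the standardized matrix as is and make sure no further row or column scaling is requested.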

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

How does zFPKM compare with TPM and UQ-normalized raw counts in terms of inter-sample comparability on a heatmap?

Also, OP, use ComplexHeatmap instead of heatmap.2 if you have the freedom to do that. It's much easier to add features going forward.

ADD REPLY • link 4.8 years ago by Ram 43k

I was recently in several conversations with the main developer of zFPKM, and there is a lot of science behind the method, which gave me confidence in using it when all I have is FPKM or RPKM. It calculates Z-scores from R/FPKM based on empirical evidence derived from this study: https://www.ncbi.nlm.nih.gov/pubmed/24215113

I don't know how it fares against TPM and FPKM-UQ, though.

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

score 3 · Answer 1 · 2019-06-17

I prefer Z-scored normalized counts, where the normalized counts are the CPM output of edgeR. The advantage is that each row shows the deviation from the row mean, so it does not matter whether a gene has 10,000 reads or 50 reads: the "scaling range" will be the same. That way you avoid the heatmap being dominated by highly expressed genes, which can happen even when using log2(CPM).
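To illustrate the point about the scaling range, here is a small sketch with made-up CPM values for one highly expressed and one lowly expressed gene that follow the same relative pattern across samples. The numbers, the names `cpm_mat`, `log_cpm` and `row_z`, and the pseudocount of 1 are all assumptions for the example.

```r
# Hypothetical CPM values: same relative pattern, very different magnitude
cpm_mat <- rbind(high_gene = c(8000, 10000, 12000, 10000),
                 low_gene  = c(40, 50, 60, 50))
colnames(cpm_mat) <- paste0("sample", 1:4)

# log2 transform; the pseudocount of 1 is an arbitrary choice for this sketch
log_cpm <- log2(cpm_mat + 1)

# Row-wise Z-scores: centre and scale each gene across samples
row_z <- t(scale(t(cpm_mat)))

# On the log2 scale the two genes still sit at very different levels,
# so a shared colour scale mostly reflects expression level
rowMeans(log_cpm)

# After row Z-scoring both genes span the same range
apply(row_z, 1, range)
```

On a heatmap of `log_cpm` the colour contrast would mainly separate the two genes from each other, while on `row_z` the between-sample pattern of each gene is what shows.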

When you have a matrix or data.frame with normalized counts, t(scale(t(your.data))) will get you the row-wise Z-scores. Be sure to limit the genes you plot to interesting ones, e.g. differentially expressed genes or a subset chosen by some clustering approach.
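A minimal sketch of that recipe, wrapped in a helper and applied to a gene subset. The helper name `row_zscore`, the simulated `norm_counts` matrix and the `genes_of_interest` vector are hypothetical; in practice the subset would come from your own differential expression or clustering results. Setting the Z-scores of constant rows to 0 is an assumption made here so that clustering does not fail on NaN values.

```r
# Row-wise Z-scores for a matrix or data.frame of normalized counts
row_zscore <- function(x) {
  x <- as.matrix(x)
  z <- t(scale(t(x)))   # scale() works on columns, hence the double transpose
  # Rows with zero variance give NaN (0/0); treat them as "no deviation".
  # Note that this also overwrites any NA already present in the input.
  z[is.na(z)] <- 0
  z
}

set.seed(42)
# Hypothetical normalized counts: 6 genes x 4 samples
norm_counts <- matrix(rpois(24, lambda = 100), nrow = 6,
                      dimnames = list(paste0("gene", 1:6),
                                      paste0("sample", 1:4)))
norm_counts["gene6", ] <- 5   # a constant gene, to show the NaN handling

# Hypothetical subset, e.g. differentially expressed genes
genes_of_interest <- c("gene1", "gene3", "gene6")

z <- row_zscore(norm_counts[genes_of_interest, ])
z
```

The resulting `z` can be passed to the heatmap function of your choice with its own scaling switched off, since the rows are already standardized.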