Which data are more suitable for a heatmap?
4.8 years ago
star ▴ 350

I have RNA-seq data (a count table). I would like to draw a heatmap and cluster the samples based on RPKM values. Which kind of RPKM output should I use as input for the heatmap?

data = count table --> output of HTSeq

1) Normalized data only, as below:

data_list <- DGEList(counts = data, genes = data[1:3])
data_norm <- calcNormFactors(data_list)
RPKM <- rpkm(data_norm, gene.length = data_norm$genes$gene_length)

2) Normalized data after removing lowly expressed genes, as below:

data_list <- DGEList(counts = data, genes = data[1:3])
# Compute CPM on the DGEList; keep genes with CPM > 0.5 in at least 2 samples
data_filter <- rowSums(cpm(data_list) > 0.5) >= 2
data_keep <- data_list[data_filter, , keep.lib.sizes = FALSE]
data_keep_norm <- calcNormFactors(data_keep)
RPKM <- rpkm(data_keep_norm, gene.length = data_keep_norm$genes$gene_length)

Also, I would like to know which scaling is more suitable for the heatmap:

3) log2(RPKM + 0.1)

4) Z-score(RPKM) using the zFPKM package

5) Row Z-score (using the option scale = 'row' in heatmap.2)

There is no standard. Use whichever. My preference would be zFPKM output, and to then switch off additional scaling in the heatmap function.
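
As a minimal sketch of "switching off additional scaling": base R's heatmap() scales rows by default (scale = "row"), so values that are already standardized would be passed with scale = "none". The matrix z below is simulated and only stands in for zFPKM output; it is not produced by the zFPKM package.

```r
# Hypothetical stand-in for already-standardized values (e.g. zFPKM output);
# the numbers are simulated purely for illustration.
set.seed(1)
z <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))

# stats::heatmap() defaults to scale = "row"; "none" plots the values as given.
pdf(NULL)  # null device, so the sketch also runs non-interactively
hm <- heatmap(z, scale = "none")
invisible(dev.off())

# heatmap() invisibly returns the row/column order chosen by the clustering.
str(hm[c("rowInd", "colInd")])
```

Other heatmap functions have their own scaling argument and default, so it is worth checking the documentation of whichever one you use.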

How does zFPKM compare with TPM and UQ-normalized raw counts in terms of inter-sample comparability on a heatmap?

Also, OP, use ComplexHeatmap instead of heatmap.2 if you have the freedom to do that. It's much easier to add features going forward.

I recently had several conversations with the main developer of zFPKM, and there is solid science behind the method, which gives me confidence in using it when only FPKM or RPKM values are available. It calculates Z-scores from RPKM/FPKM based on empirical evidence from this study: https://www.ncbi.nlm.nih.gov/pubmed/24215113

I don't know how it fares against TPM and FPKM-UQ, though.

4.8 years ago
ATpoint 81k

I prefer Z-scored normalized counts, where the normalized counts are the CPM output of edgeR. The advantage is that each row shows the deviation from the row mean in units of that row's standard deviation, so it does not matter whether a gene has 10,000 reads or 50 reads; the "scaling range" will be the same. That way you avoid the heatmap being dominated by highly expressed genes, which can happen even when using log2(CPM).
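
A small worked example may make the point concrete. The two genes below are made up: they follow the same pattern across samples but differ roughly 200-fold in magnitude. On the log2 scale the rows stay far apart, while their row-wise Z-scores come out identical.

```r
# Two made-up genes with the same expression pattern at very different depths.
cpm_like <- rbind(
  high_gene = c(8000, 10000, 12000, 10000),
  low_gene  = c(40, 50, 60, 50)
)
colnames(cpm_like) <- paste0("sample", 1:4)

# On the log2 scale the two rows still differ a lot in absolute value.
log_vals <- log2(cpm_like + 1)

# Row-wise Z-score: subtract each row's mean, divide by its standard deviation.
# (A length-nrow vector recycles down the columns, so this works row by row.)
row_z <- (cpm_like - rowMeans(cpm_like)) / apply(cpm_like, 1, sd)

print(round(log_vals, 2))
print(round(row_z, 2))
```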

When you have a matrix or data.frame of normalized counts, t(scale(t(your.data))) will give you the row-wise Z-scores. Be sure to limit the genes you plot to interesting ones, e.g. differentially expressed genes or a subset chosen by some kind of clustering approach.
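
A rough sketch of that workflow on simulated data follows. Since no differential expression results exist for this toy matrix, the 50 most variable genes are used as a stand-in for "interesting" genes; that selection is an assumption of this sketch, not a recommendation. Constant rows are dropped first because their standard deviation is 0 and their Z-score would be NaN.

```r
# Simulated normalized expression matrix (genes x samples); values are made up.
set.seed(42)
norm_counts <- matrix(rexp(200 * 6, rate = 0.01), nrow = 200,
                      dimnames = list(paste0("gene", 1:200),
                                      paste0("sample", 1:6)))
norm_counts[1, ] <- 5  # a constant gene, to show why filtering is needed

# Constant rows have sd = 0, so their Z-score would be NaN; drop them first.
row_sd <- apply(norm_counts, 1, sd)
expressed <- norm_counts[row_sd > 0, , drop = FALSE]

# Stand-in for "interesting" genes: the 50 rows with the highest variance.
# (Assumption for this sketch; a differential-expression result would replace it.)
top_idx <- order(apply(expressed, 1, var), decreasing = TRUE)[1:50]
subset_mat <- expressed[top_idx, , drop = FALSE]

# Row-wise Z-scores: scale() works on columns, hence the double transpose.
z_mat <- t(scale(t(subset_mat)))

# Each row now has mean 0 and standard deviation 1.
summary(rowMeans(z_mat))
```

The resulting z_mat can go straight into a heatmap function with its own row scaling switched off.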
