Question

Which counts to use for RNA-seq heatmap and PCA?

0

4.4 years ago

Lucy ▴ 140

Hi,

I have RNA-seq data that I would like to visualise with a PCA plot and a heatmap. I am wondering whether I should use normalised or log transformed normalised counts for this.

I have generated TMM-normalised counts per million in edgeR as follows:

y <- calcNormFactors(y)
tmm <- edgeR::cpm(y)

I have also generated log2 transformed normalised TMM CPM:

tmm_log <- edgeR::cpm(y, log = TRUE, prior.count = 1)

I am wondering whether it is best to use just the normalised CPMs, or the log-transformed normalised CPMs for a PCA plot and heatmap. I find that the plots look better when I use log-transformed normalised counts, but I am not sure whether this is the correct approach.

Could someone please explain why you would/would not want to use log counts?

Many thanks,

Lucy

RNA-seq heatmap EdgeR PCA • 4.7k views

ADD COMMENT • link updated 2.7 years ago by ATpoint 82k • written 4.4 years ago by Lucy ▴ 140

0

Thank you, I am currently scaling by row using the heatmap.2 function from the gplots package. Is this an acceptable way to do the scaling?

ADD REPLY • link 4.4 years ago by Lucy ▴ 140

0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment should go under @ATpoint's answer.

SUBMIT ANSWER is for new answers to the original question.

ADD REPLY • link 4.4 years ago by GenoMax 141k

0

Without code I cannot comment.

ADD REPLY • link 4.4 years ago by ATpoint 82k

0

heatmap.2(tmm_log, trace = "none", col = bluered(20), scale = "row")
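
One thing worth checking with a call like this: as far as I am aware, heatmap.2 computes the dendrograms before it applies scale = "row", so the row scaling changes the colours but not the clustering. If the clustering should also be based on the scaled values, one option is to scale first and pass the result in with scale = "none". Below is a minimal sketch of that idea using base R's heatmap(), which takes the same Rowv, Colv and scale arguments. The matrix is simulated and only stands in for tmm_log; the sizes, the seed and the colour palette are assumptions made for illustration.

```r
# Simulated stand-in for tmm_log: 50 genes x 6 samples (hypothetical data)
set.seed(42)
tmm_log <- matrix(rnorm(50 * 6, mean = 5, sd = 2), nrow = 50,
                  dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))
# Give the first 10 genes higher values in samples 4-6 so there is some structure
tmm_log[1:10, 4:6] <- tmm_log[1:10, 4:6] + 3

# Row-wise Z-score: scale() works on columns, hence the two transposes
z <- t(scale(t(tmm_log)))
# Genes with zero variance would become NaN after scaling, so drop them
z <- z[complete.cases(z), , drop = FALSE]

# Cluster genes and samples on the scaled values
row_hc <- hclust(dist(z))
col_hc <- hclust(dist(t(z)))

# Plot with scaling switched off, because z is already scaled
if (interactive()) {
  heatmap(z,
          Rowv = as.dendrogram(row_hc),
          Colv = as.dendrogram(col_hc),
          scale = "none",
          col = colorRampPalette(c("blue", "white", "red"))(20))
}
```

Whether clustering on scaled or unscaled values is preferable depends on the question being asked; the sketch only makes the choice explicit.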

ADD REPLY • link 4.4 years ago by Lucy ▴ 140

score 4 · Answer 1 · 2019-12-10

4

4.4 years ago

ATpoint 82k

For PCA one typically uses log2-normalized counts, so in this case tmm_log. For heatmaps one is typically interested in the relative differences between samples, so it makes sense to Z-transform tmm_log per gene, e.g. with t(scale(t(tmm_log))). Each value is then the deviation of that sample from the gene's mean across all samples, in units of the gene's standard deviation. While it is technically possible to use tmm_log directly in a heatmap, it is typically not a good choice: counts differ widely between genes, both because of endogenous expression levels and because of differences in gene length, so a few highly expressed genes would dominate the heatmap. That is why Z-transformation is a good choice.
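
The two steps above can be sketched in base R as follows. This is only an illustration: the matrix is simulated and stands in for tmm_log, and prcomp() is one common way to run the PCA rather than the only one. The plot labels and the seed are assumptions.

```r
# Simulated stand-in for tmm_log: 200 genes x 6 samples (hypothetical data)
set.seed(1)
tmm_log <- matrix(rnorm(200 * 6, mean = 5, sd = 2), nrow = 200,
                  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6)))

# Row-wise Z-score for the heatmap: scale() centres and scales columns,
# so transpose, scale, and transpose back
z <- t(scale(t(tmm_log)))

# After scaling, every gene has mean 0 and standard deviation 1 across samples
print(range(rowMeans(z)))
print(range(apply(z, 1, sd)))

# PCA on the log-scale values; prcomp() expects samples in rows
pca <- prcomp(t(tmm_log))
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

if (interactive()) {
  plot(pca$x[, 1], pca$x[, 2],
       xlab = sprintf("PC1 (%.1f%%)", pct_var[1]),
       ylab = sprintf("PC2 (%.1f%%)", pct_var[2]))
  text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
}
```

With real data, the PCA is often restricted to the most variable genes, but that is a separate choice not covered here.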

ADD COMMENT • link 4.4 years ago by ATpoint 82k

0

Why is it preferable to use scaled log-normalised counts for heatmaps rather than just scaled normalised counts (i.e. not log transformed)?

ADD REPLY • link 2.7 years ago by Lucy ▴ 140

1

To be honest, I think that if you standardize (scale) the data it does not matter too much. But since one uses log-scale data for pretty much everything else (because it largely removes the dependency of the variance on the mean), I usually keep it consistent and use the logged data for scaling.
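
Both points can be checked on simulated data. The sketch below draws negative binomial counts (an assumption, with an arbitrary dispersion), uses log2(count + 1) as a simple stand-in for log-CPM, and compares how strongly the per-gene standard deviation tracks the per-gene mean before and after the log. It also shows that Z-scores computed from raw and from logged values are similar but not identical, because the log is not a linear transformation.

```r
# Simulated counts: 500 genes x 8 samples, negative binomial (hypothetical data)
set.seed(7)
n_genes <- 500
n_samples <- 8
mu <- 2^runif(n_genes, min = 2, max = 12)  # per-gene mean expression
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 10),
                 nrow = n_genes)

# Simple log transform standing in for log-CPM
log_counts <- log2(counts + 1)

# Keep genes that vary across samples so that Z-scores are defined
keep <- apply(counts, 1, sd) > 0
counts <- counts[keep, ]
log_counts <- log_counts[keep, ]

# Rank correlation between per-gene mean and per-gene standard deviation
cor_raw <- cor(rowMeans(counts), apply(counts, 1, sd), method = "spearman")
cor_log <- cor(rowMeans(log_counts), apply(log_counts, 1, sd), method = "spearman")
print(c(raw = cor_raw, log = cor_log))

# Row-wise Z-scores from raw and from logged values
z_raw <- t(scale(t(counts)))
z_log <- t(scale(t(log_counts)))

# Largest absolute difference and overall correlation between the two
print(max(abs(z_raw - z_log)))
print(cor(as.vector(z_raw), as.vector(z_log)))
```

Because the two sets of Z-scores differ slightly, distance-based clustering on them can also differ, which fits the observation that the heatmap changes between the two inputs.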

ADD REPLY • link 2.7 years ago by ATpoint 82k

0

Ok thank you, I saw that it made a difference to how my heatmap looked, so I wasn't sure what to do.

ADD REPLY • link 2.7 years ago by Lucy ▴ 140

0

Sure, any small difference can make the clustering look different. I would decide on one and then stick with it. I think just log counts is fine, since one uses them everywhere else, so it keeps things consistent.

ADD REPLY • link 2.7 years ago by ATpoint 82k