Which counts to use for RNA-seq heatmap and PCA?
5.0 years ago
Lucy ▴ 150

Hi,

I have RNA-seq data that I would like to visualise with a PCA plot and a heatmap. I am wondering whether I should use normalised counts or log-transformed normalised counts for this.

I have generated TMM-normalised counts per million (CPM) in edgeR as follows:

y <- calcNormFactors(y)
tmm <- edgeR::cpm(y)

I have also generated log2-transformed TMM-normalised CPM:

tmm_log <- edgeR::cpm(y, log = TRUE, prior.count = 1)

I am wondering whether it is best to use just the normalised CPMs, or the log-transformed normalised CPMs for a PCA plot and heatmap. I find that the plots look better when I use log-transformed normalised counts, but I am not sure whether this is the correct approach.

Could someone please explain why you would/would not want to use log counts?

Many thanks,

Lucy

RNA-seq heatmap EdgeR PCA • 5.3k views

Thank you, I am currently scaling by row using the heatmap.2 function from the gplots package. Is this an acceptable way to do the scaling?


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment should go under @ATPoint's answer.

SUBMIT ANSWER is for new answers to original question.


Without code I cannot comment.


heatmap.2(tmm_log, trace = "none", col = bluered(20), scale = "row")
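As far as I can tell from the gplots code, scale = "row" only rescales the values used for the colours, and it does so after the dendrograms have been computed, so the clustering itself still runs on the unscaled tmm_log. If the clustering should reflect the scaled values, one option is to scale first and pass the result with scale = "none". A minimal base-R sketch, using a simulated matrix as a stand-in for tmm_log (the data and the name row_dend are made up for illustration):

```r
set.seed(1)

# Hypothetical stand-in for tmm_log: 50 genes (rows) x 6 samples (columns)
tmm_log <- matrix(rnorm(300, mean = 5, sd = 2), nrow = 50,
                  dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))

# Row-wise Z-score: centre each gene on its mean and divide by its SD
z <- t(scale(t(tmm_log)))

# Cluster genes on the scaled values so dendrogram and colours agree
row_dend <- as.dendrogram(hclust(dist(z)))

# With gplots installed, the pre-scaled matrix could then be plotted with:
# gplots::heatmap.2(z, Rowv = row_dend, trace = "none",
#                   col = gplots::bluered(20), scale = "none")
```

Whether to cluster on scaled or unscaled values is a choice; the point is only to make it deliberately rather than by default.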

5.0 years ago
ATpoint 85k

For PCA one typically uses log2-transformed normalized counts, so in this case tmm_log. For heatmaps one is typically interested in the relative differences between samples, so it makes sense to Z-transform tmm_log per gene, e.g. with t(scale(t(tmm_log))). For each gene, this gives the deviation of each sample from that gene's mean across all samples, in units of that gene's standard deviation. While it is technically possible to use tmm_log directly in a heatmap, it is typically not a good choice: expression values differ widely between genes because of endogenous expression levels and differences in gene length, so a few highly expressed genes would dominate the colour scale. That is why Z-transformation is a good choice.
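For the PCA part, a minimal sketch with base R's prcomp() might look like the following. The matrix is simulated and only stands in for tmm_log; the group labels, the number of genes and the size of the group effect are all made up for illustration.

```r
set.seed(42)
n_genes <- 200

# Hypothetical stand-in for tmm_log: genes in rows, samples in columns
tmm_log <- matrix(rnorm(n_genes * 6, mean = 5, sd = 1), nrow = n_genes,
                  dimnames = list(paste0("gene", seq_len(n_genes)),
                                  paste0(rep(c("ctrl", "trt"), each = 3), 1:3)))

# Add an artificial group effect to the first 40 genes in the treated samples
tmm_log[1:40, 4:6] <- tmm_log[1:40, 4:6] + 4

# prcomp() expects samples in rows, so the matrix is transposed
pca <- prcomp(t(tmm_log), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components, e.g. for plotting:
# plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
pc_scores <- pca$x[, 1:2]
```

On real data it is common to restrict the PCA to the most variable genes first, but that is a separate choice and not something the code above does.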


Why is it preferable to use scaled log-normalised counts for heatmaps rather than just scaled normalised counts (i.e. not log transformed)?


To be honest, I think that if you standardize (scale) the data it does not matter too much. But since one uses log-scale data for pretty much everything else (because it greatly reduces the dependency of the variance on the mean), I usually keep it consistent and use the logged data for scaling.
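The point about the variance depending on the mean can be checked on simulated data. The sketch below draws negative binomial counts with a fixed dispersion (an assumption made purely for illustration) and uses a plain log2(count + 1) rather than edgeR's cpm(log = TRUE), so it only approximates what happens with tmm_log.

```r
set.seed(7)
n_genes   <- 1000
n_samples <- 20

# Hypothetical per-gene mean counts, spread from 50 to 5000 on a log scale
mu <- exp(seq(log(50), log(5000), length.out = n_genes))

# Negative binomial counts; size = 10 corresponds to a dispersion of 0.1.
# The matrix fills column-wise, so row i always uses mu[i].
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 10),
                 nrow = n_genes)

# Simple log transform with a pseudocount of 1 (not edgeR's exact method)
log_counts <- log2(counts + 1)

# Per-gene variance on each scale
raw_var <- apply(counts, 1, var)
log_var <- apply(log_counts, 1, var)

# Rank correlation between the mean and the variance on each scale
cor_raw <- cor(mu, raw_var, method = "spearman")
cor_log <- cor(mu, log_var, method = "spearman")
```

In this setup cor_raw should come out strongly positive and cor_log much closer to zero. For very low counts the log scale still shows a mean-variance trend, which is why the dependency is reduced rather than removed.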


Ok thank you, I saw that it made a difference to how my heatmap looked, so I wasn't sure what to do.


Sure, any small difference can make the clustering look different. I would pick one and stick with it. I think log counts are fine since one uses them everywhere, so I would go with those for consistency.
