heatmaps of RNA-seq data: what are accepted normalizations?
1
1
Entering edit mode
2.3 years ago
bwio ▴ 30

Hi everyone,

I want to draw a heatmap of my RNA-seq data and was wondering, what kind of normalizations or transformations of the data are 'acceptable'.

The data is from an RNA-seq experiment with multiple treatments. I am using R, RPKM data from DEseq and pheatmap in this case, but the question is agnostic from this. The log2 data from the example plot is below.

1) Of course, plotting unscaled RPKM data is not satisfactory because of the highly different expression of genes.

2) People frequently use log-transformed RPKM data which mostly overcomes this problem, while still preserving information about different expression strength between genes and changes across conditions. This already gives already quite good results! However, depending on the color sheme, the attention is still drawn to the highest and lowest expressed genes, while the changes of medium-expressed genes don't stand out and tend to be overlooked.

Example: The expression changes in gene 9 and 10 are equally strong as for gene 5, but barely visible! Also, Gene 3 has the strongest expression changes, still appears just moderate.

I tried a few color shemes, but none could really avoid this problem, since the range of values spans from ~ -3 to ~ 12

3) The pheatmap package has a scale="column" option, that normalizes all columns (here: expression per gene) to Mean = 0 and SD = 1 by a simple transformation:

function (x)
{
m = apply(x, 1, mean, na.rm = T)
s = apply(x, 1, sd, na.rm = T)
return((x - m)/s)
}


I think, this transformation cannot be applied in this context, since the changes per gene are scaled and you cannot tell anymore, if the changes between treatments are small or large.

4) I figured that just centering the means (mean = 0 by substracting column means) WITHOUT applying the scaling (dividing by SD) might be a good transformation, since the strength of expression changes is preserved but the scale range is the same for every gene unlike in the 2nd example.

I really like the 4th approach, but I am not sure how this would be considered by statisticans or reviewers. Any thoughts, concerns and alternative approached on this? I am open to any comments!

Of course, an alternative approach would be plotting (log2-) fold changes relative to the control condition, which is then usally not shown, because it is = 1 for all cases. Personally I don't find this intuitive, but I will opt into it if all else fails.

log2 transformed RPKM data
|        | condition 1| condition 2| condition 3| condition 4|
|:-------|-----------:|-----------:|-----------:|-----------:|
|Gene 1  |   7,8581343|   7,7760309|   5,9845672|   5,4520139|
|Gene 2  |  -0,5718784|  -0,2093000|  -3,6001834|  -3,2168730|
|Gene 3  |  -0,8589436|  -0,8816851|   4,6821235|   3,6949570|
|Gene 4  |   0,3470576|   0,6363893|   0,9525574|   2,9178951|
|Gene 5  |  10,4977433|  12,6749883|   9,3837503|  10,0322095|
|Gene 6  |  -2,7455632|  -0,7356815|  -0,1536280|   1,5129004|
|Gene 7  |   0,4665848|  -1,8998796|  -0,3328291|   2,0449151|
|Gene 8  |  -1,4584591|  -3,0859498|  -0,8403647|   0,2276394|
|Gene 9  |   4,3642826|   2,1160312|   4,0913918|   4,4801707|
|Gene 10 |   6,6822088|   6,9493600|   4,8194632|   4,6275799|


RNA-Seq R heatmap normalization • 2.9k views
4
Entering edit mode
2.3 years ago

Can you give some indication of your overall pipeline? I am aware that the DESeq2 functions include one that outputs RPKM/FPKM; however, neither of these expression units is suitable for differential expression comparisons. Usually, we input raw counts to DESeq2, normalise these (using DESeq2), and then perform statistical comparisons based on the DESeq2-normalised counts (using DESeq2 functions).

For plotting data, like, e.g., a heatmap, you can feasibly log the RPKM/FPKM expression levels because it is just for visualisation; however, again, statistical comparisons should not be performed using these units.

On the issue of scaling: the scaling functions of both pheatmap() and heatmap.2() (gplots) are both the same as just doing t(scale(t(x))) - here is my proof: A: cannot replicate the pheatmap scale function

Another issue of which to be wary regarding the manual scaling of your data: if you activate scaling in pheatmap(), it will scale your data and then perform clustering on the scaled data, and also generate the heatmap based on this. See here: C: Clustering differences between heatmap.2 and pheatmap (however, as you are not even activating clustering, this part is not quite relevant for you)

Now, after all of that, my ultimate answer: I feel that you are 'over worrying' about this. The most important part is how you selected your genes - the heatmap is purely for visualisation. I really do not believe it will be a 'crunch' issue how you portray the genes in the heatmap. My preference, if I only had RPKM/FPKM expression levels, would be to scale these to standardised Z scores using zFPKM, and then plot these. If not that, just go with whatever best represents your result, keeping in mind that it is purely for visualisation.

Kevin

1
Entering edit mode

Hello Kevin, thank you for this really fast and comprehensive answer!

The whole sequencing and computational pipeline was run by a collaboration partner, so I have to consider this a black box. I am just reading the RPKM, l2fc and p values from the generated reports. Thus, i just intend to find a good representation of the data, without doing any testing.

Yeah, I tend to over-worry a lot ;-)! I will look at the zFPKM function!

Also thank you for pointing out the clustering issue! I tried clustering, but left it out for this minimal example.

Heinrich