Question: Heatmap : why scaling ?
9 months ago by
Francois0
INRA
Francois0 wrote:

Hi, I have to make a heatmap from fold change data. The values vary widely (for example, a few are between 500 and 1200, but the vast majority of them are below 2). From what I have read, I suppose I have to "scale" my heatmap, but I can't understand why or how it is done. I couldn't find any clear information about it. Can somebody explain? Thanks

heatmap scale • 1.2k views
ADD COMMENTlink modified 9 months ago • written 9 months ago by Francois0

To rephrase my question: I don't understand why we have to "scale" the data. Is a log2 transformation not enough?

ADD REPLYlink written 9 months ago by Francois0

Hi francois.piumi,

welcome to Biostars. Please elaborate on your question. What kind of data is this, and how did you obtain the fold-changes? Reporting a log2-FC appears reasonable given you have a large data range, but this always depends on the context. It would also help to link the sources where you read about the scaling.
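To make the log2 point concrete, here is a minimal sketch in plain Python. The fold-change values are made up to mimic the range described in the question, and it assumes the values are plain ratios greater than zero (log2 is undefined otherwise).

```python
import math

# Hypothetical fold changes: a few very large values, most close to 1.
fold_changes = [1200.0, 500.0, 2.0, 1.0, 0.5]

# log2 makes up- and down-regulation symmetric around 0:
# FC = 2 becomes +1, FC = 0.5 becomes -1, FC = 1 becomes 0.
log2_fc = [math.log2(fc) for fc in fold_changes]

for fc, lfc in zip(fold_changes, log2_fc):
    print(f"FC = {fc:>7}  log2FC = {lfc:6.2f}")
```

The raw values span more than three orders of magnitude, while the log2 values only span about -1 to 10. That is a big compression, but a few genes can still dominate the colour range, which may be why the log2 heatmap still looks mostly one colour.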

ADD REPLYlink modified 9 months ago • written 9 months ago by ATpoint16k

Thank you very much for your warm welcome and for all your very kind and interesting answers.

Sorry for not having given more details about my data. They are coming from PCR array experiments (3 cell types infected or not by viruses). The fold change values are an output of the Qiagen's "RT2 Profiler PCR Array Data Analysis Webportal".

Here are my main sources : http://www.opiniomics.org/you-probably-dont-understand-heatmaps/ https://github.com/slowkow/slowkow.com/blob/master/_rmd/2017-02-16-heatmap-tutorial.R https://bioramble.wordpress.com/2015/07/30/heatmaps-part-2-how-to-create-a-heatmap-with-r-in-an-ideal-world/

A simple "pheatmap" command with my values and a "greenred" color code gave me an almost entirely light green heatmap with a few red lines. I wanted to send you an image but I couldn't find a way to do it. A new heatmap with log2FC-transformed values did help me to see a few more red and black lines, but the heatmap still contains a lot of light green lines. So I think I will have to scale my data to get a better visualisation. But is there really a biological meaning behind this scaling? How can I relate a scaled heatmap to a non-scaled one? It's not totally clear to me.

My distribution is absolutely not normal. So if I understood well, I should use correlation instead of Euclidean. Is it correct ?

In my former lab, I was also recommended to use a "Ward" linkage. What do you think about it?

Thanks for your answers.

Francois

ADD REPLYlink written 9 months ago by Francois0

My distribution is absolutely not normal. So if I understood well, I should use correlation instead of Euclidean. Is it correct ?

As per the comments, you can use Euclidean distance. However, if you try to publish it and a reviewer is someone like me, then you would receive a yellow card for using Euclidean on a non-normal distribution - nothing major, though. For me, correlation distance with Spearman's rho coefficient would be more appropriate.

You can also use Ward's linkage. However, be sure to select 'ward.D2' and not 'ward' (called 'ward.D' in current versions of R). 'ward' is not the true Ward's linkage method... a sort of bug that has been identified but left in the R hclust() function.

ADD REPLYlink written 9 months ago by Kevin Blighe42k

Thank you very much Kevin!

ADD REPLYlink written 9 months ago by Francois0

Please do not add answers unless you're answering the top level question. This belongs as a comment-reply to Kevin's comment, which you can add using the Add Comment or Add Reply options.

ADD REPLYlink written 9 months ago by RamRS21k

I understand better now what is the purpose of the "scaling".

Nevertheless, I still don't understand its effect on my data values. For example, the gene that has the highest FC value in one condition, and so appears in red in a heatmap without scaling ('redgreen' heatmap color range), now appears in dark red in the heatmap with scaling. It seems that the scaling modified this value, lowering it to a value close to 0.5 on a scale ranging from 1 to -1 (?).

It's the same for genes with a lower FC value, which after scaling get values close to the maximum (which is 1). On the scaled heatmap those genes appear in a very bright red, suggesting higher gene expression, whereas their FCs are lower.

I don't quite understand how to interpret a "scaled" heatmap whose values do not follow the FC values, or the meaning of the range from 1 to -1. Could you please help me interpret them?

ADD REPLYlink written 9 months ago by Francois0

Sorry to ask my question again, but could someone help me to understand data scaling? Why couldn't I find a standard protocol to draw a heatmap?

ADD REPLYlink written 9 months ago by Francois0

Please read Kevin's comment here: C: Heatmap : why scaling ?

There is no standard protocol to draw a heatmap, because a heatmap is a visualization tool that helps you visualize what you wish to show. You need to be clear what you wish to show and the way to create a heatmap for that will surface when you search with that clarity.

Scaling a data set with center=TRUE, scale=TRUE is essentially transforming values to the Z-scale, smoothing out differences. It makes the data look good at the cost of losing information on how extreme an outlier is. But then, you're going to map colors to the scale anyway, so how extreme an outlier is will not matter much.
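A small worked example may help with interpreting scaled values. This is only a sketch in plain Python, assuming the scaling is applied per gene (row), as with scale="row" in pheatmap; the two genes and their log2 fold changes are made up for illustration.

```python
from statistics import mean, stdev

def zscore_rows(matrix):
    """Center each row on its own mean and divide by its own sample SD."""
    scaled = []
    for row in matrix:
        m = mean(row)
        s = stdev(row)  # sample SD (n - 1 denominator), as R's scale() uses
        scaled.append([(x - m) / s for x in row])
    return scaled

# Hypothetical log2 fold changes: rows are genes, columns are conditions.
log2fc = [
    [9.0, 9.5, 10.0],  # gene A: very large values, small spread
    [-1.0, 0.0, 1.0],  # gene B: small values, same up-going pattern
]

z = zscore_rows(log2fc)
for name, row in zip(["gene A", "gene B"], z):
    print(name, [round(v, 2) for v in row])
```

Both genes end up with the same scaled row (-1, 0, 1), even though gene A's fold changes are far larger. A scaled value says where a condition sits relative to that gene's own mean, in units of that gene's own standard deviation. It no longer says how large the fold change is, so colours can only be compared within a row, not between genes.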

ADD REPLYlink modified 9 months ago • written 9 months ago by RamRS21k
9 months ago by
Kevin Blighe42k
Kevin Blighe42k wrote:

To follow on from ATpoint: plotting log2 fold changes is fine, or even just plotting normalised values on the negative binomial scale. There are no standards.

In the case where your data is not normally-distributed ('bell curve'), though, you should be using something like Spearman correlation distances instead of Euclidean distances, i.e., for the purposes of hierarchical clustering.
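As a rough illustration of a Spearman correlation distance, here is a sketch in plain Python. It assumes the common convention distance = 1 - rho (other conventions exist), and the three gene profiles are invented. It will fail with a division by zero on a row whose values are all identical.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_distance(x, y):
    """1 - Spearman's rho, where rho is Pearson's r computed on the ranks."""
    return 1.0 - pearson(ranks(x), ranks(y))

# Hypothetical fold changes across four conditions.
gene_a = [1.1, 1.5, 2.0, 1200.0]  # increasing, with one extreme value
gene_b = [0.2, 0.4, 0.9, 1.6]     # increasing, modest values
gene_c = [1.6, 0.9, 0.4, 0.2]     # decreasing

print(spearman_distance(gene_a, gene_b))  # same ordering, so distance 0
print(spearman_distance(gene_a, gene_c))  # opposite ordering, so distance 2
```

Because only the ranks are used, the extreme value of 1200 has no more influence than any other "largest" value. Genes A and B are treated as identical in shape, whereas a Euclidean distance between them would be dominated by that single value.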

Scaling the data (to the Z-scale) just helps to 'even out' any creases that may still exist in the data, which helps for visualisation. It's strictly not a necessary procedure and there are no standards.

Also realise the distinction: we can cluster the data on one distribution (e.g. log2 expression values) for generating row and column dendrograms, and then use a different distribution (e.g. Z scores) for display in the heatmap.

ADD COMMENTlink modified 6 months ago • written 9 months ago by Kevin Blighe42k

Scaling the data (to the Z-scale) just helps to 'even out' any creases that may still exist in the data, which helps for visualisation. It's strictly not a necessary procedure and there are no standards.

This is what I was thinking also. It reduces the number of outliers if the expression varies wildly.

ADD REPLYlink written 9 months ago by RamRS21k

Yep, like ironing out those creases in a shirt

ADD REPLYlink written 9 months ago by Kevin Blighe42k

I don't think the second point is correct.

You can use hierarchical clustering, as it doesn't assume a distribution, and neither does Euclidean distance. Euclidean distance is more general than Pearson's correlation, as Pearson's does assume normality. Euclidean can even be used to cluster binary data. A z-score does assume normality; we normally compute it for features after they have been log2-transformed, so maybe this is what you were thinking of? I think this assumption may be violated a bit, but it isn't much of a problem.

It's also not clear whether you're talking about the distribution of the ith feature across all samples or of the jth sample across all features. This is important too. For instance, the feature distribution, assuming we have only biological replicates, should be negative binomial when talking about CPM.

ADD REPLYlink modified 9 months ago • written 9 months ago by chris86250

I don't think the second point is correct.

Euclidean distance works 'better' when the data are normally distributed (i.e. Gaussian). However, it's not a life or death situation to which we are referring here.

Please read this: CLASSIFICATION AND CLUSTERING OF SEQUENCING DATA USING A POISSON MODEL

"Consequently, analytic tools that assume a Gaussian distribution (such as classification methods based on linear discriminant analysis and clustering methods that use Euclidean distance) may not perform as well for sequencing data as methods that are based upon a more appropriate distribution."

In relation to correlation distance, let us clarify so that people do not mis-interpret your words: Pearson correlation is parametric; Spearman correlation is non-parametric. If your data is not normally-distributed, use Spearman correlation distances.

ADD REPLYlink modified 6 months ago • written 9 months ago by Kevin Blighe42k
