Clustering Then Scaling Or Scaling Then Clustering
10.3 years ago
jth ▴ 190

Dear All,

I have a fundamental question about microarray data analysis. By default, almost every heat map function in R performs the hierarchical clustering first, then scales the rows, and then displays the image. But as you can imagine, scaling first and clustering second significantly changes both the clustering and the appearance of the heat map.

So my main question is: are both approaches fundamentally acceptable, or is only one of them correct? If only one, why not the other?

Thanks a lot!

microarray heatmap analysis r • 5.7k views
10.3 years ago

I will answer your question with another question: what would you expect scaling to do to your clustering? Note that there is no correct answer here without knowing more about what "clustering" means in your case. As a concrete example, see the image below representing three genes (different colors) in 10 samples. If one uses a correlation-based distance, the black line and red line have a distance of zero while the red and green lines are actually quite distant from one another. However, if one uses a Euclidean-like distance, the black and red lines are actually quite close to each other while the green and red lines are now very far from each other. How will scaling (which centers the data around zero) affect the distances in this example? How will scaling affect your clustering? As you can see, the answer depends on the distance metric you decide to use.
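The interaction between scaling and the distance metric can be checked numerically. The sketch below is only an illustration: the three profiles `g1`, `g2`, `g3` are made-up values (not the data behind the figure), "scaling" is assumed to mean subtracting each gene's mean and dividing by its sample standard deviation, and the helper names are hypothetical.

```python
import math
from statistics import mean, stdev

def scale_row(row):
    """Center one gene at zero and divide by its sample standard deviation."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]

def euclidean(a, b):
    """Plain Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation of two expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return 1 - cov / math.sqrt(ss_a * ss_b)

# Hypothetical genes measured in 10 samples (assumed values for illustration).
g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # rising, low baseline
g2 = [x + 10 for x in g1]              # same shape as g1, higher baseline
g3 = list(reversed(g2))                # same level as g2, opposite trend

s1, s2, s3 = scale_row(g1), scale_row(g2), scale_row(g3)

for label, a, b, sa, sb in [("g1 vs g2", g1, g2, s1, s2),
                            ("g2 vs g3", g2, g3, s2, s3)]:
    print(label,
          "| Euclidean raw:", round(euclidean(a, b), 2),
          "| Euclidean scaled:", round(euclidean(sa, sb), 2),
          "| correlation raw:", round(correlation_distance(a, b), 2),
          "| correlation scaled:", round(correlation_distance(sa, sb), 2))
```

With these made-up values, the correlation distance is the same before and after scaling, while the Euclidean ranking flips: on raw data `g2` is closer to `g3` (similar level), and on scaled data `g2` is closer to `g1` (same shape).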


I see your point. Thank you very much for explaining with a figure; it makes this so much easier to understand. I used Euclidean distance, so scaling each gene reduces the distances between genes, because all expression levels end up centered around zero.

From what you have said, I gather that scaling before clustering will significantly affect the clustering. But it's not clear to me whether this difference is biologically significant.

For instance, assume there is a gene A whose expression levels range from 21 to 25 across groups, and a gene B whose expression levels range from 11 to 15, with standard deviations such that, hypothetically, both genes fall in the range of -1 to 1 after scaling. If I cluster these two genes with Euclidean distance first, they will be very far from each other, whereas if I cluster after scaling they will be much closer.
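The gene A / gene B example can be written out as a short derivation. It rests on assumptions that are not stated in the thread: the two genes have exactly the same profile across n samples and differ only by a constant offset of 10, and scaling means subtracting the row mean and dividing by the row's sample standard deviation. The symbols a_i, b_i, z^a_i, z^b_i, and r_{ab} are introduced here for illustration.

```latex
% Raw data: b_i = a_i - 10 for every sample i = 1, ..., n
d_{\text{raw}}(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} = \sqrt{n \cdot 10^2} = 10\sqrt{n}

% Row scaling: z^a_i = (a_i - \bar{a}) / s_a. A constant offset cancels, so z^a_i = z^b_i
d_{\text{scaled}}(a,b) = \sqrt{\sum_{i=1}^{n} (z^a_i - z^b_i)^2} = 0

% General case: \sum_i (z^a_i)^2 = \sum_i (z^b_i)^2 = n - 1 and \sum_i z^a_i z^b_i = (n-1)\, r_{ab}
d_{\text{scaled}}(a,b)^2 = \sum_{i=1}^{n} (z^a_i)^2 + \sum_{i=1}^{n} (z^b_i)^2 - 2\sum_{i=1}^{n} z^a_i z^b_i = 2(n-1)(1 - r_{ab})
```

Under these assumptions, the Euclidean distance between row-scaled genes depends only on their Pearson correlation r_{ab}, so scaling first groups genes by profile shape and discards the difference in baseline level.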

So, if I am thinking correctly, scaling first and then clustering will help me visualize the overall trends in expression when comparing groups, but I will essentially lose the distances between absolute expression levels. The question then becomes: is this loss biologically relevant? I mean, is knowing that gene A is expressed at 10 units and gene B at 1 unit biologically meaningful at all when comparing groups? To be honest, my intuition says no, since we want to compare groups and are not interested in the baseline expression level of each gene. But it is not clear how the microarray analyses were done in many papers: how they scaled and, most importantly, whether they scaled before or after clustering, and with which distance measure.

Sorry for rambling, but for a beginner like me even a small detail like this becomes frustrating, since I am not sure I can trust my intuition yet and the literature is not very clear on this point.


You are asking the right questions. A fairly common approach is to cluster with a Euclidean distance (we would need to talk about linkage functions here, but I'll leave that for you to read up on) and then scale the genes for display. That is not to say there is a standard, however. In practice, try different combinations and see what you get.
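A rough way to try both orders side by side is sketched below. Everything in it is an assumption made for illustration: the four genes and their values are invented, average linkage is picked arbitrarily (it is not necessarily what any particular R heat map function uses), and `leaf_order` is a hypothetical helper, not a library call.

```python
import math
from statistics import mean, stdev

def euclidean(a, b):
    """Plain Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale_row(row):
    """Center one gene at zero and divide by its sample standard deviation."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]

def leaf_order(rows):
    """Average-linkage agglomerative clustering on a dict of gene -> values.

    Returns the gene names in the order in which the merges leave them.
    """
    clusters = [[name] for name in rows]

    def avg_dist(c1, c2):
        # Average pairwise Euclidean distance between members of two clusters.
        return mean(euclidean(rows[a], rows[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: avg_dist(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]

# Hypothetical genes in 5 samples (assumed values for illustration).
raw = {
    "geneA": [21, 23, 25, 22, 24],  # high baseline
    "geneB": [11, 13, 15, 12, 14],  # same pattern as geneA, low baseline
    "geneC": [25, 23, 21, 24, 22],  # high baseline, opposite pattern
    "geneD": [15, 12, 11, 14, 13],  # low baseline, roughly opposite pattern
}
scaled = {name: scale_row(values) for name, values in raw.items()}

# Option 1: cluster on raw values, scale only for display.
order_cluster_first = leaf_order(raw)
# Option 2: scale first, then cluster on the scaled values.
order_scale_first = leaf_order(scaled)

print("cluster then scale:", order_cluster_first)
print("scale then cluster:", order_scale_first)
for name in order_cluster_first:
    # Rows as they would be displayed under option 1.
    print(name, [round(v, 2) for v in scaled[name]])
```

With these invented values, clustering on raw data pairs genes by baseline level (geneB with geneD, geneA with geneC), while scaling first pairs them by pattern (geneA with geneB, geneC with geneD). The displayed values are identical in both cases; only the row order and the tree differ.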