Question: Clustering Then Scaling Or Scaling Then Clustering
5.8 years ago by
jth200
Turkey

Dear All,

I have a fundamental question about microarray data analysis. By default, almost every heat map function in R performs the hierarchical clustering first, then scales the rows, and then displays the image. But as you can imagine, scaling first and clustering second significantly changes the appearance of the heat map as well as the clustering.

So my main question is: in essence, are both approaches acceptable, or is only one of the strategies correct? If so, why not the other one?

Thanks a lot!

R analysis microarray heatmap • 3.9k views
modified 2.9 years ago by Biostar ♦♦ 20 • written 5.8 years ago by jth200
5.8 years ago by
Sean Davis
National Institutes of Health, Bethesda, MD

I will answer your question with another question: what would you expect scaling to do to your clustering? Note that there is no correct answer here without knowing more about what "clustering" means in your case. As a concrete example, see the image below, which shows three genes (different colors) across 10 samples. If one uses a correlation-based distance, the black and red lines have a distance of zero, while the red and green lines are quite distant from one another. However, if one uses a Euclidean-like distance, the black and red lines are quite far from each other, while the green and red lines are now very close. How will scaling (which centers each gene around zero) affect the distances in this example? How will scaling affect your clustering? As you can see, the answer depends on the distance metric you decide to use.
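To make the contrast between the two distance measures concrete, here is a minimal Python sketch. The three profiles are made-up numbers chosen to mimic the situation described (black is red shifted upward, green is a mirror image of red at the same level); they are not the data behind the figure, and the helper names are only illustrative.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical shapes, 2 for mirrored shapes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical profiles across 10 samples (not taken from the figure).
red = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
black = [v + 10 for v in red]            # same shape as red, shifted up by 10
green = [5, 4, 3, 2, 1, 1, 2, 3, 4, 5]   # opposite shape, same level as red

for name, dist in [("correlation", correlation_distance), ("euclidean", euclidean)]:
    print(name,
          "red-black:", round(dist(red, black), 3),
          "red-green:", round(dist(red, green), 3))
```

With these numbers, red and black are indistinguishable under the correlation-based distance but far apart under Euclidean distance, while red and green behave the opposite way. Row centering removes the offset between red and black, which is why it matters so much for Euclidean distance.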

I see your point. Thank you very much for explaining with a figure; it makes this so much easier to understand. I used Euclidean distance, so scaling by gene reduces the distances between genes, since the expression levels will be centered around zero.

From what you have said, I gather that scaling before clustering will significantly affect the clustering. But it's not clear to me whether this difference is biologically significant.

For instance, assume there is a gene A, whose expression levels range from 21 to 25 across different groups, and a gene B, whose expression levels range from 11 to 15, and that their standard deviations are such that when we scale the two genes they both, hypothetically, fall into the range of -1 to 1. For these two genes, if I cluster with Euclidean distance first, they will be very far from each other, while if I cluster after scaling, they will be much closer.
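The gene A / gene B scenario can be checked numerically with a small Python sketch. The five values per gene are assumptions chosen to span the stated ranges, and the z-score here uses the sample standard deviation (n - 1 denominator); with these particular values the scaled range comes out slightly wider than -1 to 1.

```python
import math
import statistics

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def zscore(values):
    """Center a gene on zero and divide by its sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return [(v - mean) / sd for v in values]

# Hypothetical expression values spanning the ranges in the example.
gene_a = [21, 22, 23, 24, 25]
gene_b = [11, 12, 13, 14, 15]

before = euclidean(gene_a, gene_b)                 # distance on raw values
after = euclidean(zscore(gene_a), zscore(gene_b))  # distance on scaled rows

print("before scaling:", round(before, 3))
print("after scaling: ", round(after, 3))
```

On the raw values the two genes are separated purely by their offset of 10 units in every sample. After row scaling that offset is gone and only the shape of each profile remains, so two genes with the same pattern collapse onto each other.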

So, if I am thinking correctly, scaling first and then clustering will help me visualize the overall trends in expression levels when comparing different groups, but I will essentially lose the distances between absolute expression levels. The question then becomes: is this loss biologically relevant or significant? I mean, is knowing that gene A is expressed at 10 units and gene B at 1 unit biologically significant at all in a comparison context? To be honest, my intuition says no, since we want to compare different groups and are not interested in the baseline expression level of each gene. But it's not clear how the microarray analyses are done in many of the papers: how they did the scaling and, most importantly, when they did the scaling relative to clustering, and with which distance measure.

Sorry for rambling, but for a beginner like me, even such a small detail becomes frustrating, since I am not sure I can trust my intuition yet and the literature is not very clear on this point.