Question: When Using Heatmap.2 From R To Make A Heatmap Of Microarray Data, How Are The Genes Clustered?
6
Niek De Klein2.5k wrote:

I am using heatmap.2 from R to cluster microarry data. The figure below was made with

``````heatmap.2(x, col=redgreen(75), scale="row", ColSideColors=col,  key=TRUE, symkey=FALSE,                             density.info="none",cexRow=1,cexCol=1,margins=c(6,11),  trace="none",srtCol=45)
`````` Looking at the expression of this figure I would think that the lowest three genes would be clustered closer to the upper 6 genes, as both groups down regulated in condition 1 (red) and upregulated in condition 2 (blue). Why are the first 6 genes clustered with the middle four, even though they show opposite patterns? The manpage of heatmap.2 says that it orders columns and rows on mean expression, so do the groups of genes that are clustered together have a similar average expression, and the pattern doesn't matter?

microarray • 27k views
modified 6.5 years ago by Devon Ryan96k • written 6.5 years ago by Niek De Klein2.5k
17
Devon Ryan96k wrote:

As you can see from the other replies, there are a couple ways of reading your question. The simpler version (answered partly by Mitch) could be rewritten, "Regardless of the fact that the top 6 and bottom 3 genes cluster together, why aren't the dendrogram branchpoints rotated such that they're neighboring each other?" The answer to that is found in `help(hclust)`, quoting:

In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. ... The algorithm used in ‘hclust’ is to order the subtree so that the tighter cluster is on the left (the last, i.e., most recent, merge of the left subtree is at a lower value than the last merge of the right subtree). Single observations are the tightest clusters possible, and merges involving two observations place them in order by their observation sequence number.

So that's why the tree is ordered why it is.

The other understanding of your question (addressed in a shorter version by Irsan) is, "Since the top 6 and bottom 3 genes change in the same direction, why don't they cluster together?" The answer to that is what Irsan wrote, namely that things are (by default) clustered with the "complete" method according to their euclidean distance. But of course, what does that actually mean? Probably the most important part of that is the euclidean distance. The easiest way to visualize that is graphically, so I'll make some test data that will more-or-less act like yours:

``````library(gplots)
d <- matrix(rnorm(78,c(rep(c(5,5,5,3,3,3),6), rep(c(2,2,2,3,3,3),4), rep(c(1,1,1,0,0,0),3)), 0.2), ncol=6, byrow=T)
``````

The `d` matrix will then be something (the numbers are random) this:

``````           [,1]     [,2]      [,3]       [,4]        [,5]        [,6]
[1,] 5.1712874 4.946509 4.7496928 2.63230256  2.63394566  3.08378751
[2,] 4.6592866 4.806637 4.9270387 3.00397576  2.96095830  3.05562337
[3,] 5.1117932 4.982021 4.9432372 2.79170612  2.68319462  3.28125643
[4,] 5.1939668 5.103573 4.9274765 2.75459710  2.92578504  3.17743957
[5,] 4.8172162 4.869808 4.6627615 2.69654061  2.78337580  3.19926029
[6,] 5.0833024 5.280635 5.0151975 2.79977773  3.18152508  3.04505210
[7,] 2.2524162 2.079043 2.1589240 2.98382139  3.23926676  3.01663328
[8,] 1.9869665 1.861576 2.1608973 2.94434318  2.50710978  2.75273211
[9,] 1.8246908 2.119846 2.5934129 3.02386462  2.98645345  2.40102654
[10,] 2.2040333 2.248246 1.9449646 2.84486854  2.85842004  3.10360764
[11,] 1.1539225 1.005955 0.8953501 0.07245758 -0.03334591 -0.14624306
[12,] 0.8860263 0.752870 0.7780058 0.11035305 -0.22448713  0.07403528
[13,] 1.1470910 1.088321 1.0455616 0.15681927  0.06387492 -0.13126294
``````

and a heatmap like this: The Euclidean distance between two rows (`row_a` and `row_b`) is defined as `sqrt(sum(row_b-row_a)^2)`, using the vector functions in R (I only mention that because if you're used to thinking in, say, C, then this wouldn't make sense since these function only work with single values). You can get a nice visual of this by just plotting a subset:

``````matplot(t(d), xlab="Sample", ylab="Expression Value")
``````

This will look something like the following, with the rows points labeled given their row number (1-C instead of 1-13). As you can see, the top 6 rows in the matrix are really far away from the bottom 3 on this graph. In fact, the Euclidean distance between them is around `21`. The distance between the top 6 and middle 4 is only around `9` and the distance between the middle 4 and bottom 3 around `12`. These numbers match how closely things look by eye-ball. So, it's not so much that things are ordered by average expression but how closely they are to each other as measured by Euclidean distance.

The actual clustering procedure isn't so important in this example. The default is the `complete` method in `hclust`. Here, the distance between two clusters `A` and `B` is the maximum of the distances between the elements of `A` and `B`. One could instead use the `single` method, which takes the minimum distance between the elements of `A` and `B`, among many others. Each of these has its pros and cons, such as the `single` method being more prone to merging clusters if one has an outlier (in many cases, but see below). N.B., if you ever need to calculate distance matrices in R and use the `Ward` method, remember to `hclust(dist(x)^2, method="ward")` rather than `hclust(dist(x), method="ward")` to get the right numbers (most people don't do this and almost all tutorials that use the `ward` method are wrong!).

Thought experiment: Suppose you did `d[8,6] <- 1`. Given what I just wrote regarding the `complete` and `single` methods, how do you think the results could differ had you used the `single` method instead? This can be checked by comparing `plot(hclust(dist(d)))` and `plot(hclust(dist(d), method="single"))`.

Second though experiment: Suppose you instead did `d[1,1] <- 6`. Which of the `single` and `complete` methods produce the results that make intuitive sense to you? Now you know why relying on `max` and `min` distances to compute distances between clusters can be problematic.

Not directly related to your question, but if you ever want to make similar heatmaps for RNAseq data, or any other data with a mean-variance relationship, remember that using Euclidian distance is a really really bad idea. For the reasons why, have a look at this blog post by Lior Pachter (auther of cufflinks, among other things).

I didn't mention it, but if you read Lior's blog post that I linked to you'll get a good glimpse at why even using the Euclidean distance is often a bad idea.

I had the same question as this post. So what clustering method and distance matrix to use in order to cluster the upregulated genes together and the downregulated genes together?

Thank you.

So...ignorant question, buteuh, why do you have to ^2 the matrix when using the Ward method?

Apparently you don't have to square things anymore, at least if you use `method="ward.D2"`. See `help(hclust)` and the papers referenced therein for details.

1
Irsan7.2k wrote:

For both rows and columns the dendrogram is made from complete linkage (hclust()) and euclidean distances (dist()) . Many times I find pearson distances more useful. Try that.

0
Mitch Bekritsky1.2k wrote:

I could be wrong, but I think the cluster you were expecting to see is identical to the one you're seeing here. At the branchpoint separating the upper 6 genes from the middle 4 genes, you can flip the branchpoint so that the upper 6 are now in the middle and the middle 4 are now at the top, and it would have the exact same meaning as your current heatmap. Here's a decent description.