#Create random data
data <- replicate(20, rnorm(50))
rownames(data) <- paste("Gene", c(1:nrow(data)))
colnames(data) <- paste("Sample", c(1:ncol(data)))
out <- pheatmap(data,
show_rownames=T, cluster_cols=T, cluster_rows=T, scale="row",
cex=1, clustering_distance_rows="euclidean", cex=1,
clustering_distance_cols="euclidean", clustering_method="complete", border_color=FALSE)

#Re-order original data (genes) to match ordering in heatmap (top-to-bottom)
rownames(data[out$tree_row[["order"]],])
[1] "Gene 2" "Gene 12" "Gene 9" "Gene 33" "Gene 29" "Gene 43" "Gene 24"
[8] "Gene 44" "Gene 3" "Gene 21" "Gene 46" "Gene 26" "Gene 20" "Gene 45"
[15] "Gene 18" "Gene 38" "Gene 22" "Gene 1" "Gene 17" "Gene 7" "Gene 6"
[22] "Gene 41" "Gene 30" "Gene 31" "Gene 13" "Gene 16" "Gene 11" "Gene 50"
[29] "Gene 34" "Gene 37" "Gene 15" "Gene 25" "Gene 27" "Gene 39" "Gene 19"
[36] "Gene 35" "Gene 4" "Gene 49" "Gene 10" "Gene 28" "Gene 8" "Gene 14"
[43] "Gene 32" "Gene 42" "Gene 36" "Gene 48" "Gene 23" "Gene 47" "Gene 5"
[50] "Gene 40"
#Re-order original data (samples) to match ordering in heatmap (left-to-right)
colnames(data[,out$tree_col[["order"]]])
[1] "Sample 3" "Sample 6" "Sample 1" "Sample 17" "Sample 8" "Sample 10"
[7] "Sample 15" "Sample 4" "Sample 18" "Sample 2" "Sample 11" "Sample 9"
[13] "Sample 13" "Sample 12" "Sample 7" "Sample 19" "Sample 14" "Sample 16"
[19] "Sample 5" "Sample 20"
-------------------------
If you want something like gene-to-cluster assignment, you can 'cut' your row dendrogram into a pre-selected number of groups as follows:
#2 groups
sort(cutree(out$tree_row, k=2))
Gene 1 Gene 2 Gene 3 Gene 9 Gene 12 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22
1 1 1 1 1 1 1 1 1 1
Gene 24 Gene 26 Gene 29 Gene 33 Gene 38 Gene 43 Gene 44 Gene 45 Gene 46 Gene 4
1 1 1 1 1 1 1 1 1 2
Gene 5 Gene 6 Gene 7 Gene 8 Gene 10 Gene 11 Gene 13 Gene 14 Gene 15 Gene 16
2 2 2 2 2 2 2 2 2 2
Gene 19 Gene 23 Gene 25 Gene 27 Gene 28 Gene 30 Gene 31 Gene 32 Gene 34 Gene 35
2 2 2 2 2 2 2 2 2 2
Gene 36 Gene 37 Gene 39 Gene 40 Gene 41 Gene 42 Gene 47 Gene 48 Gene 49 Gene 50
2 2 2 2 2 2 2 2 2 2
#5 groups
sort(cutree(out$tree_row, k=5))
Gene 1 Gene 3 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22 Gene 24 Gene 26 Gene 38
1 1 1 1 1 1 1 1 1 1
Gene 44 Gene 45 Gene 46 Gene 2 Gene 9 Gene 12 Gene 29 Gene 33 Gene 43 Gene 4
1 1 1 2 2 2 2 2 2 3
Gene 19 Gene 35 Gene 49 Gene 5 Gene 8 Gene 10 Gene 14 Gene 23 Gene 28 Gene 32
3 3 3 4 4 4 4 4 4 4
Gene 36 Gene 40 Gene 42 Gene 47 Gene 48 Gene 6 Gene 7 Gene 11 Gene 13 Gene 15
4 4 4 4 4 5 5 5 5 5
...
You can also cut the tree at a pre-defined tree height, and extract the gene-to-cluster assignments at that height:
plot(out$tree_row)
abline(h=7, col="red", lty=2, lwd=2)

#Cut the row (gene) dendrogram at a Euclidean distance dis-similarity of 8
sort(cutree(out$tree_row, h=7))
Gene 1 Gene 3 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22 Gene 24 Gene 26 Gene 38
1 1 1 1 1 1 1 1 1 1
Gene 44 Gene 45 Gene 46 Gene 2 Gene 9 Gene 12 Gene 29 Gene 33 Gene 43 Gene 4
1 1 1 2 2 2 2 2 2 3
Gene 19 Gene 35 Gene 49 Gene 5 Gene 8 Gene 10 Gene 14 Gene 23 Gene 28 Gene 32
3 3 3 4 4 4 4 4 4 4
Gene 36 Gene 40 Gene 42 Gene 47 Gene 48 Gene 6 Gene 7 Gene 13 Gene 16 Gene 30
4 4 4 4 4 5 5 5 5 5
Gene 31 Gene 41 Gene 11 Gene 15 Gene 25 Gene 27 Gene 34 Gene 37 Gene 39 Gene 50
5 5 6 6 6 6 6 6 6 6
wonderful let me do and i will post it if i have some issues
hi kevin its me again so i have used your complex heatmap code and its working wonderful but i wanted to ask you in context of gplot heatmap.2 function this is my code
and this is the summary of the out object which is
how do i take it out from this object the gene list in the same manner as depicted in heatmap...
the issue want to take out genes which are strikingly different in a one cluster as compared to other..
this
rownames(dat[out$rowInd,])
the order of the gene which are in rows & thiscolnames(dat[out$colInd,])
the column list ,how to i back trace back and find the same what i see it in my list, i could do this with your given complex heat map code which gave me link but since i already did most of them with gplot so im doing with this as of now..Hello friend. I am not sure exactly what you need. If you want the row and column names as per the heatmap, then you can do:
hi kevin ,yes that i can get it but lets say the first two sample Gran1 and Gran1 expression is different from the rest of the sample if you see so there are genes in that region as
"rownames(dat)[out$rowInd]"
gives me the total order of genes but is it possible only to find out genes of that particular regionOh, I see. You can do this with
cutree()
, as follows:That should give your samples as ordered in dat; however, it will assign to each a cluster membership (here, 1 or 2). You can choose the number of clusters by changing the second argument, i.e., for example, if you want 4 clusters then do:
--------------------
For better selection, you can do this:
excellent , the object class is integer how can i pass that into a data frame of course i can copy that from my terminal but is there a way to pass it on to some readable form for downstream purpose
clusDesignation[clusDesignation==2]
I believe that
names(clusDesignation[clusDesignation==2])
should give you the gene / sample names for each cluster ?thanks a ton i have to cross these mundane yet little things to make my life easier ..as always thanks for helping me out
No problem. Apologies for the delay in replying. I was traveling to the other side of the Globe....!
how do you decide what should be the number of k is it visual inspection after heatmap or how can i decide ? as for example I have like 741 genes so it too dense then how to decide the optimum number of k and what tree height to cut ...?
lol - you possibly do not realise the entire area of clustering research that you've opened up with just this single question. There is no correct or incorrect answer.
You can sometimes justify just a visual inspection of the dendrogram and then decide a point to cut using
cutree
withk
orh
, preferablyh
because then you have a cut-off threshold based on the distance metric that you used (usually Euclidean distance or Pearson correlation distance.Note that I recently published a study on this topic but on metabolomics data: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y
Various methods exist, such as:
Also note that the ComplexHeatmap can incorporate PAM or k-means clustering in order to split your heatmap, or you can define the split yourself, like this: C: how to cluster genes in heatmap
To see how to do this, see also here: Split heatmap by rows
It may be worth opening up new question, as some of my colleagues also work on clustering.
wow looking at your answer it just opened up what are the things are can be done with data ..I am looking into the post first i will apply into my data ..
Yes, with data, the possibilities are endless, but always proceed with caution about results that are seemingly too good to be true (usually they are not true).
Hi Kevin,
This is such a useful tutorial, thank you! I'm using this on my data now but I'm wondering if there's an easy way to figure out what cluster is what. So, let's say I've made a heatmap and then used your code to split it into 4 clusters:
If I have my heatmap split into 4 clusters as well using the argument
How can I tell which of the four clusters in my heatmap are the four clusters in my "clusters" object? If the clusters are vastly different sizes it's pretty easy to figure it out, but if two or more are very similar, I can't think of an easy way to figure it out without labeling some sentinel genes on the heatmap visualization.
For that, I suppose that you could add a row colour bar if using pheatmap. You could also [visually] separate the heatmap based on cluster, which is also possible with pheatmap, I believe.
Another option is ComplexHeatmap, for which I have a tutorial here (using PAM clustering): https://github.com/kevinblighe/E-MTAB-6141
Does the gene group numbers ordering match with the column ordering from heatmap i.e.
colnames(data[,out$tree_col[["order"]]])
? for example, Gene 1 belongs to group number "1", so does Gene 1 belongs to "Sample 3"?How can we get which cluster 1-6 correspond to which branch of the dendogramm?