extract dendrogram cluster from pheatmap
2
10
Entering edit mode
3.9 years ago
krushnach80 ▴ 950

This is a post from stackoverflow here they show how to extract dedrogram such in form of respective cluster but this is with heatmap.2 function , i m trying to do the same with pheatmap , let's say a group of genes are forming clusters then i want to extract them

Im doing something like this in my code but its giving me all of them togethter not cluster wise as seen in the heatmap

out = pheatmap(data,
color=macolor,

show_rownames = F,cluster_cols=T,cluster_rows=T, scale="row",
cex=1,clustering_distance_rows = "euclidean", cex=1,
clustering_distance_cols = "euclidean", clustering_method = "complete",border_color = FALSE)

res <- data[c(out$tree_row[["order"]]),out$tree_col[["order"]]]


any suggestion or help would be highly appreciated.

R • 28k views
27
Entering edit mode
3.9 years ago
#Create random data
data <- replicate(20, rnorm(50))
rownames(data) <- paste("Gene", c(1:nrow(data)))
colnames(data) <- paste("Sample", c(1:ncol(data)))

out <- pheatmap(data,
show_rownames=T, cluster_cols=T, cluster_rows=T, scale="row",
cex=1, clustering_distance_rows="euclidean", cex=1,
clustering_distance_cols="euclidean", clustering_method="complete", border_color=FALSE)


#Re-order original data (genes) to match ordering in heatmap (top-to-bottom)
rownames(data[out$tree_row[["order"]],]) [1] "Gene 2" "Gene 12" "Gene 9" "Gene 33" "Gene 29" "Gene 43" "Gene 24" [8] "Gene 44" "Gene 3" "Gene 21" "Gene 46" "Gene 26" "Gene 20" "Gene 45" [15] "Gene 18" "Gene 38" "Gene 22" "Gene 1" "Gene 17" "Gene 7" "Gene 6" [22] "Gene 41" "Gene 30" "Gene 31" "Gene 13" "Gene 16" "Gene 11" "Gene 50" [29] "Gene 34" "Gene 37" "Gene 15" "Gene 25" "Gene 27" "Gene 39" "Gene 19" [36] "Gene 35" "Gene 4" "Gene 49" "Gene 10" "Gene 28" "Gene 8" "Gene 14" [43] "Gene 32" "Gene 42" "Gene 36" "Gene 48" "Gene 23" "Gene 47" "Gene 5" [50] "Gene 40" #Re-order original data (samples) to match ordering in heatmap (left-to-right) colnames(data[,out$tree_col[["order"]]])
[1] "Sample 3"  "Sample 6"  "Sample 1"  "Sample 17" "Sample 8"  "Sample 10"
[7] "Sample 15" "Sample 4"  "Sample 18" "Sample 2"  "Sample 11" "Sample 9"
[13] "Sample 13" "Sample 12" "Sample 7"  "Sample 19" "Sample 14" "Sample 16"
[19] "Sample 5"  "Sample 20"


## -------------------------

If you want something like gene-to-cluster assignment, you can 'cut' your row dendrogram into a pre-selected number of groups as follows:

#2 groups
sort(cutree(out$tree_row, k=2)) Gene 1 Gene 2 Gene 3 Gene 9 Gene 12 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22 1 1 1 1 1 1 1 1 1 1 Gene 24 Gene 26 Gene 29 Gene 33 Gene 38 Gene 43 Gene 44 Gene 45 Gene 46 Gene 4 1 1 1 1 1 1 1 1 1 2 Gene 5 Gene 6 Gene 7 Gene 8 Gene 10 Gene 11 Gene 13 Gene 14 Gene 15 Gene 16 2 2 2 2 2 2 2 2 2 2 Gene 19 Gene 23 Gene 25 Gene 27 Gene 28 Gene 30 Gene 31 Gene 32 Gene 34 Gene 35 2 2 2 2 2 2 2 2 2 2 Gene 36 Gene 37 Gene 39 Gene 40 Gene 41 Gene 42 Gene 47 Gene 48 Gene 49 Gene 50 2 2 2 2 2 2 2 2 2 2 #5 groups sort(cutree(out$tree_row, k=5))
Gene 1  Gene 3 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22 Gene 24 Gene 26 Gene 38
1       1       1       1       1       1       1       1       1       1
Gene 44 Gene 45 Gene 46  Gene 2  Gene 9 Gene 12 Gene 29 Gene 33 Gene 43  Gene 4
1       1       1       2       2       2       2       2       2       3
Gene 19 Gene 35 Gene 49  Gene 5  Gene 8 Gene 10 Gene 14 Gene 23 Gene 28 Gene 32
3       3       3       4       4       4       4       4       4       4
Gene 36 Gene 40 Gene 42 Gene 47 Gene 48  Gene 6  Gene 7 Gene 11 Gene 13 Gene 15
4       4       4       4       4       5       5       5       5       5
...


You can also cut the tree at a pre-defined tree height, and extract the gene-to-cluster assignments at that height:

plot(out$tree_row) abline(h=7, col="red", lty=2, lwd=2)  #Cut the row (gene) dendrogram at a Euclidean distance dis-similarity of 8 sort(cutree(out$tree_row, h=7))
Gene 1  Gene 3 Gene 17 Gene 18 Gene 20 Gene 21 Gene 22 Gene 24 Gene 26 Gene 38
1       1       1       1       1       1       1       1       1       1
Gene 44 Gene 45 Gene 46  Gene 2  Gene 9 Gene 12 Gene 29 Gene 33 Gene 43  Gene 4
1       1       1       2       2       2       2       2       2       3
Gene 19 Gene 35 Gene 49  Gene 5  Gene 8 Gene 10 Gene 14 Gene 23 Gene 28 Gene 32
3       3       3       4       4       4       4       4       4       4
Gene 36 Gene 40 Gene 42 Gene 47 Gene 48  Gene 6  Gene 7 Gene 13 Gene 16 Gene 30
4       4       4       4       4       5       5       5       5       5
Gene 31 Gene 41 Gene 11 Gene 15 Gene 25 Gene 27 Gene 34 Gene 37 Gene 39 Gene 50
5       5       6       6       6       6       6       6       6       6

1
Entering edit mode

wonderful let me do and i will post it if i have some issues

1
Entering edit mode

hi kevin its me again so i have used your complex heatmap code and its working wonderful but i wanted to ask you in context of gplot heatmap.2 function this is my code

out <-heatmap.2dat.tn,Rowv=as.dendrogram(c2),Colv = as.dendrogram(c1),col =macolor
,scale="row", margin=c(6, 4),trace='none',labRow = FALSE,
symkey=FALSE,symbreaks=FALSE,dendrogram="col",
density.info='density', denscol="red",lhei=c(1,3),cexRow = .5,cexCol = 1,
lwid=c(.9,3), keysize=0.1, key.par = list(cex=0.5))


and this is the summary of the out object which is

summary(out)
Length Class      Mode
rowInd          389  -none-     numeric
colInd           34  -none-     numeric
call             20  -none-     call
rowMeans        389  -none-     numeric
rowSDs          389  -none-     numeric
carpet        13226  -none-     numeric
rowDendrogram     2  dendrogram list
colDendrogram     2  dendrogram list
breaks          101  -none-     numeric
col             100  -none-     character
colorTable        3  data.frame list
layout            3  -none-     list


how do i take it out from this object the gene list in the same manner as depicted in heatmap...

the issue want to take out genes which are strikingly different in a one cluster as compared to other..

this rownames(dat[out$rowInd,]) the order of the gene which are in rows & this colnames(dat[out$colInd,]) the column list ,how to i back trace back and find the same what i see it in my list, i could do this with your given complex heat map code which gave me link but since i already did most of them with gplot so im doing with this as of now..

1
Entering edit mode

Hello friend. I am not sure exactly what you need. If you want the row and column names as per the heatmap, then you can do:

colnames(dat)[out$colInd] rownames(dat)[out$rowInd]

0
Entering edit mode

hi kevin ,yes that i can get it but lets say the first two sample Gran1 and Gran1 expression is different from the rest of the sample if you see so there are genes in that region as "rownames(dat)[out$rowInd]" gives me the total order of genes but is it possible only to find out genes of that particular region ADD REPLY 1 Entering edit mode Oh, I see. You can do this with cutree(), as follows: cutree(as.hclust(out$colDendrogram), 2)


That should give your samples as ordered in dat; however, it will assign to each a cluster membership (here, 1 or 2). You can choose the number of clusters by changing the second argument, i.e., for example, if you want 4 clusters then do:

cutree(as.hclust(out$colDendrogram), 4)  ## -------------------- For better selection, you can do this: clusDesignation <- cutree(as.hclust(out$colDendrogram), 2)
clusDesignation[clusDesignation==1]
clusDesignation[clusDesignation==2]

0
Entering edit mode

excellent , the object class is integer how can i pass that into a data frame of course i can copy that from my terminal but is there a way to pass it on to some readable form for downstream purpose clusDesignation[clusDesignation==2]

1
Entering edit mode

I believe that names(clusDesignation[clusDesignation==2]) should give you the gene / sample names for each cluster ?

0
Entering edit mode

thanks a ton i have to cross these mundane yet little things to make my life easier ..as always thanks for helping me out

1
Entering edit mode

No problem. Apologies for the delay in replying. I was traveling to the other side of the Globe....!

0
Entering edit mode

how do you decide what should be the number of k is it visual inspection after heatmap or how can i decide ? as for example I have like 741 genes so it too dense then how to decide the optimum number of k and what tree height to cut ...?

2
Entering edit mode

how do you decide what should be the number of k is it visual inspection after heatmap or how can i decide ?

lol - you possibly do not realise the entire area of clustering research that you've opened up with just this single question. There is no correct or incorrect answer.

You can sometimes justify just a visual inspection of the dendrogram and then decide a point to cut using cutree with k or h, preferably h because then you have a cut-off threshold based on the distance metric that you used (usually Euclidean distance or Pearson correlation distance.

Note that I recently published a study on this topic but on metabolomics data: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y

Various methods exist, such as:

• Gap statistic
• Elbow method
• Silhouette method
• Consensus clustering

Also note that the ComplexHeatmap can incorporate PAM or k-means clustering in order to split your heatmap, or you can define the split yourself, like this: C: how to cluster genes in heatmap

To see how to do this, see also here: Split heatmap by rows

It may be worth opening up new question, as some of my colleagues also work on clustering.

0
Entering edit mode

wow looking at your answer it just opened up what are the things are can be done with data ..I am looking into the post first i will apply into my data ..

1
Entering edit mode

Yes, with data, the possibilities are endless, but always proceed with caution about results that are seemingly too good to be true (usually they are not true).

0
Entering edit mode

Hi Kevin,

This is such a useful tutorial, thank you! I'm using this on my data now but I'm wondering if there's an easy way to figure out what cluster is what. So, let's say I've made a heatmap and then used your code to split it into 4 clusters:

clusters <- rownames(subsetDat[heat$tree_row[["order"]],]) clusters <- sort(cutree(heat$tree_row,k=4))


If I have my heatmap split into 4 clusters as well using the argument

cutree_rows = 4


How can I tell which of the four clusters in my heatmap are the four clusters in my "clusters" object? If the clusters are vastly different sizes it's pretty easy to figure it out, but if two or more are very similar, I can't think of an easy way to figure it out without labeling some sentinel genes on the heatmap visualization.

1
Entering edit mode

For that, I suppose that you could add a row colour bar if using pheatmap. You could also [visually] separate the heatmap based on cluster, which is also possible with pheatmap, I believe.

Another option is ComplexHeatmap, for which I have a tutorial here (using PAM clustering): https://github.com/kevinblighe/E-MTAB-6141

0
Entering edit mode

Does the gene group numbers ordering match with the column ordering from heatmap i.e. colnames(data[,out\$tree_col[["order"]]])? for example, Gene 1 belongs to group number "1", so does Gene 1 belongs to "Sample 3"?

2
Entering edit mode
16 months ago

If you add an argument for "kmeans_k" in your pheatmap command, then you can use a very simple option, below:

out = pheatmap(data, color=macolor, kmeans_k = 10, show_rownames = F,cluster_cols=T,cluster_rows=T, scale="row", cex=1,clustering_distance_rows = "euclidean", cex=1, clustering_distance_cols = "euclidean", clustering_method = "complete",border_color = FALSE)

geneclusters <- out[["kmeans"]][["cluster"]]

geneclusters[geneclusters==5]

geneclusters[geneclusters==1]


etc.

0
Entering edit mode

Thank you for the answer; however, can you please elaborate and provide some context? Your solution will only work when kmeans_k is activated, correct?

1
Entering edit mode

Yes, edited above. Thank you.