Question: Retrieving individual clusters from RNA-seq hierarchical clustering in R
gravatar for dtatarak
2.1 years ago by
dtatarak10 wrote:

I have a dataset which consists of 3 groups of replicates the 1st and 2nd with 3 replicates each and the 3rd group with 4 replicates. I performed DE gene expression analysis using edgeR, and obtained a list of statistically significant genes in the form of a table of gene names and TPMs. I created a data matrix and log2-transformed it with the following code:

z <- data.frame(read.table("/Volumes/David_seq/RNAseq_data/5-1_2018_lmo7aMO aggregates bulk RNA-seq/DE_data/Sorted_NCCs_top800deGenes.txt", header = TRUE, sep = "\t"))

    row.names(z) <- z$Gene

    z <- z[,2:11]

z_matrix <- data.matrix(z)

z_log2matrix <- log2(z_matrix)

Next I performed hierarchical clustering with centroid linkeage and generated a heatmap.

    cor_t <- 1 - cor(t(z_log2matrix))

    distancet <- as.dist(cor_t)

    hclust_centroid <- hclust(distancet, method = "centroid")

    dendcentroid <- as.dendrogram(hclust_centroid)

    heatmap.2(z_log2normMatrix, Rowv = dendcomplete, Colv = TRUE, scale = "row", col = redblue(256), trace = "none", 

    dendrogram = "both", cexCol = 0.5, = "none", labRow = NA)

I know there are many ways to perform clustering, and I'm still a newbie. However, I was pleased with the initial results in that it provided me with the information I was looking for. In short, I'm looking for clusters of genes that are upregulated in the 1st group of replicates as compared to the 3rd group (WT), but not upregulated in the 2nd group as compared to the 3rd group (WT).

I don't know where to go from here. How do I pull out those clusters and put them into their own matrices?

edit: I'm not sure how to insert images, so I've included a URL to the heatmap so you can see what I'm talking about.

clustering rna-seq hierarchical R • 1.7k views
ADD COMMENTlink modified 2.1 years ago by Kevin Blighe61k • written 2.1 years ago by dtatarak10
gravatar for Kevin Blighe
2.1 years ago by
Kevin Blighe61k
Kevin Blighe61k wrote:

If you can convince yourself to switch to using pheatmap(), then there's a ready made solution for you here: A: extract dendrogram cluster from pheatmap

As you are supplying your own dendrogram to to the heatmap.2() function, however, then it may not matter. All that you need to do is use the cutree() function on your dendrogram, and it will break it up into clusters for you, similar to what I have done in the pheatmap question/solution.

More information on cutree:


ADD COMMENTlink written 2.1 years ago by Kevin Blighe61k

Hi Kevin,

Thanks for the advice! I am looking into cutree() as you suggested. I am wondering: what kind of output does it give you? What I ultimately want is a table or matrix with gene names and TPMs for each cluster. Also, it seems you can specify manually how you want it to divide up the clusters, but I don't understand how I would decide those location values. What exactly is meant by "height"?



ADD REPLYlink written 2.1 years ago by dtatarak10

The tree height gives an indication of 'distance' / 'dis-similarity' between the units being clustered in a dendrogram. Here's the image from the other question:

upload pictures

The tree heights (at left) are measured as Euclidean distances in this case. Other possible metrics include Canberra and Manhattan Distance, correlation distance, et cetera. The max distance is roughly 8, in this case, meaning that the 2 most dis-similar genes in this dendrogram are dis-similar by a Euclidean distance of 8. If we cut the tree at a height of Euclidean distance 7.0 (red line), then we are segregating the genes into clusters that are dis-similar at a minimum Euclidean distance of 7.0 - so, quite well segregated. If we cut the tree at lower heights (i.e. lower levels of dis-similarity), then our clusters become less dis-similar and more similar.

In your example, you appear to be using 1 minus Pearson correlation as your distance metric, and then centroid (UPGMC) as your agglomeration function. Your resulting dendrogram clearly has 3 different groups of genes. If you applied cutree to that, you may not get exactly what you want. I would encourage you to retry your hclust function with ward.D2 as the agglomeration function, and then see if that gives a more evenly-distributed dendrogram, on which you could then apply cutree.

Of course, you don't have to use cutree here. It really depends on what exactly you want to do with the information presented to you in the heatmap. In other studies, for example, I have done pathway analysis on the different clusters of genes.

ADD REPLYlink written 2.1 years ago by Kevin Blighe61k

Hi Kevin,

Thanks again for the explanation. I tried what you said, redoing the clustering using ward.D2 as the agglomeration and did get a smoother dendrogram. As you say, there are clearly 3 groups, but within those are some more subtle clusters that represent the groups I am interested in. Specifically, if you look at the heatmap, you'll see groups at the very top and bottom that are up in the first set of replicates only and down in the first set of replicates only, respectively. these are the clusters I want to pull out.

Pathway analysis is actually exactly what I want to do once I've extracted those clusters :) would you be able to point me towards a pipeline for doing this as well? Thanks so much!

enter image description here

ADD REPLYlink written 2.1 years ago by dtatarak10

Yes, that's generally the result of using Ward's linkage, but there's not exactly a right or wrong in terms of which distance metric and linkage method to use. Some are more suitable for certain types of data, though.

In terms of pathway analysis itself, there is no real standard way to do it. Some people do gene enrichment and pathway analysis via DAVID Functional Annotation, which you may have used? Another option for KEGG-specific pathways includes KEGGprofile.

Usually one would also have clinical data against which you could compare your clusters, too, but that's usually for the samples and not the genes, i.e., this cluster of samples exhibits a high inflammatory response, whereas this cluster is immuno-suppressed, based on blood cytokine and other markers like CRP and ESR.

Back when I was doing something similar, a colleague ad a licence for 'Ingenuity', a commercial tool, but it is costly.

Another tool that would be worth exploring is GSVA, but I find it's implementation a bit cumbersome.

ADD REPLYlink written 2.1 years ago by Kevin Blighe61k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 574 users visited in the last hour