Tutorial: Perform Hierarchical Clustering in R with ggdendro Visualization

22 hours ago

Jeremy ▴ 930

Hierarchical clustering can be an easy way to visualize the similarity between groups, whether those groups are species, genes, or something else.

# Load packages.
library(stats)
library(ggdendro)

# Read in data.
data1 <- read.csv('data.csv')

# Create distance matrix.
data_dist <- dist(data1, method = 'euclidean')

# Perform hierarchical clustering analysis (hclust() uses complete linkage by default).
data_clust <- hclust(data_dist)

# Add labels for plotting. The keys 1:11 assume data.csv has 11 rows with default row names.
data.gg <- dendro_data(data_clust)
dict <- setNames(c('Cow', 'Mouse', 'Rabbit', 'Human', 'Manatee', 'Horse', 'Little Brown Bat', 'Big Brown Bat', 'Rat', 'Dog', 'Goat'), 1:11)
data.gg$labels$label <- sapply(data.gg$labels$label, function(x) dict[[as.character(x)]])

# Plot clusters with ggdendro.
plot1 <- ggdendrogram(data.gg, size = 2, rotate = TRUE)
plot1

[Figure: Hierarchical Clustering Plot]

dist() offers six methods for making a distance matrix: 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', and 'minkowski'. You can try all six and compare and contrast the resulting plots.
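As a rough sketch of that comparison, the loop below builds one tree per distance method and reports the cophenetic correlation of each tree with its own distance matrix. The built-in USArrests dataset is only a stand-in for data.csv (whose contents are not shown here), and scaling the columns and using p = 3 for 'minkowski' are assumptions made for illustration. 'binary' is left out because it treats every non-zero value as "on", which is only meaningful for presence/absence data.

```r
# Stand-in data (assumption): USArrests replaces the tutorial's data.csv.
toy <- scale(USArrests)

# Distance methods that make sense for continuous data.
# 'minkowski' with the default p = 2 is identical to 'euclidean', so p = 3 is used.
dist_methods <- c("euclidean", "maximum", "manhattan", "canberra", "minkowski")

# One distance matrix and one tree per method (p is ignored unless method is 'minkowski').
dists <- lapply(dist_methods, function(m) dist(toy, method = m, p = 3))
names(dists) <- dist_methods
trees <- lapply(dists, hclust)

# Cophenetic correlation: how well each tree preserves its own input distances.
coph <- sapply(dist_methods, function(m) cor(dists[[m]], cophenetic(trees[[m]])))
print(round(coph, 3))
```

Any element of trees can be passed to ggdendrogram() in the same way as data_clust above.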

r statistics ggdendro hierarchical_clustering • 1.1k views

ADD COMMENT • link updated 7 hours ago by Bastien Hervé 6.4k • written 22 hours ago by Jeremy ▴ 930

"Read in data": please add an example of what is in data.csv.

ADD REPLY • link 21 hours ago by GenoMax 153k

As the whole presentation relies on a distance matrix and on aggregating elements by hierarchical clustering, it would be helpful to explain in which situations a specific distance method is preferable, and also which aggregation method would best fit my data (UPGMA, WPGMA, average, complete...).

I have done some cell cluster aggregation in scRNA-seq, based on the expression of top marker genes, in order to merge some of my clusters. The choice of hierarchical clustering method changed the dendrogram quite a lot.
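One way to see how much the aggregation method matters is to run every linkage that hclust() offers on the same distance matrix and correlate the resulting cophenetic distances. The sketch below does this on the built-in USArrests data, which is only a stand-in and not the scRNA-seq data discussed here. In hclust(), "average" corresponds to UPGMA and "mcquitty" to WPGMA.

```r
# Stand-in data (assumption): scaled USArrests with Euclidean distances.
toy <- scale(USArrests)
d <- dist(toy, method = "euclidean")

# All agglomeration methods accepted by hclust().
linkages <- c("ward.D", "ward.D2", "single", "complete",
              "average", "mcquitty", "median", "centroid")

trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# One column of cophenetic distances per linkage (same pair order in every column).
coph <- sapply(trees, function(tr) as.vector(cophenetic(tr)))

# Pairwise correlation between trees: values near 1 mean similar dendrograms.
tree_similarity <- cor(coph)
print(round(tree_similarity, 2))
```

Note that "median" and "centroid" can produce non-monotone merge heights, so their dendrograms may show inversions.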

ADD REPLY • link 12 hours ago by Bastien Hervé 6.4k

"ward.D2" -- top of the pops in terms of interpretable clusters from gene expression data in my hands, the rest is...often cryptic.

ADD REPLY • link 10 hours ago by ATpoint 89k

For what it's worth, in a previous project I tested different hierarchical clustering methods for single-cell RNA-seq. I had 27 sub-clusters for a population that I tried to aggregate by hierarchical clustering, stopping the aggregation at the point where two specific sub-clusters that we knew were different would have been merged. I used the top 20 markers of each sub-cluster and ran all the hierarchical clustering methods.

In order to discard clustering methods that generate clusters not found by the other methods, I created a technical robustness score for each cluster of each method, based on the following principles:

A cluster is considered robust if it is found identically across hierarchical clustering methods.
In the best-case scenario, each method will contain one and only one cluster shared across all methods.
For a cluster in a given method, its best match in each of the other methods is considered its best homologue (Jaccard similarity score of 1 in the ideal case).
All other potential matches in each of the other methods are considered artifacts and are subtracted from the best match score.

For each cluster, the mean score across methods is computed. These rules give a technical robustness score to each cluster in each method, where 1 is a perfect score, presented below:

[Figure: technical robustness]

Most of the methods have high scores overall, but some, like "median" and "centroid", have a few outliers with low scores (which can also be seen in the Jaccard similarity matrix). Some methods yield fairly similar scores because their underlying calculations behave similarly on this dataset (average/mcquitty).
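The scoring rules described above can be sketched roughly as follows. This is one possible reading of those rules, not the code used for the original analysis: the helper names (jaccard, score_against), the USArrests stand-in data, the subset of four linkages, and the fixed cut at k = 4 clusters are all assumptions made for illustration, whereas the original analysis stopped aggregation at a biologically defined point.

```r
# Jaccard similarity between two sets of item indices.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Score one cluster against all clusters of another method:
# best Jaccard match minus the sum of all remaining (partial) matches.
score_against <- function(cluster, other_membership) {
  others <- split(seq_along(other_membership), other_membership)
  j <- sapply(others, function(o) jaccard(cluster, o))
  max(j) - (sum(j) - max(j))
}

# Stand-in data and settings (assumptions, see the text above).
d <- dist(scale(USArrests))
linkages <- c("ward.D2", "complete", "average", "single")
k <- 4

# Cluster membership of every item under each linkage.
memberships <- lapply(linkages, function(m) cutree(hclust(d, method = m), k = k))
names(memberships) <- linkages

# For each cluster of each method, average its score over all other methods.
robustness <- lapply(linkages, function(m) {
  clusters <- split(seq_along(memberships[[m]]), memberships[[m]])
  sapply(clusters, function(cl) {
    mean(sapply(setdiff(linkages, m),
                function(o) score_against(cl, memberships[[o]])))
  })
})
names(robustness) <- linkages
print(lapply(robustness, round, 2))
```

Under this reading, a cluster reproduced exactly by every other method scores 1, and a cluster that is split into many pieces elsewhere can score below 0.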

I also looked at biological robustness by comparing differential gene expression between all the original clusters inside each aggregated cluster. I was interested in assessing the variation within each aggregated cluster. The total number of differentially expressed genes (DEGs) within each aggregated cluster, for each method, is presented here.

[Figure: biological robustness]

Overall, I used these two metrics to select the most suitable hierarchical clustering method for our dataset. All the methods except “single” would have been acceptable.

ADD REPLY • link 7 hours ago by Bastien Hervé 6.4k