22 hours ago
Jeremy
Hierarchical clustering can be an easy way to visualize the similarity between groups, whether those groups are species, genes, or something else.
# Load packages (stats is attached by default; it provides dist() and hclust()).
library(stats)
library(ggdendro)
# Read in data.
data1 <- read.csv('data.csv')
# Create the distance matrix.
data_dist <- dist(data1, method = 'euclidean')
# Perform hierarchical clustering (hclust() defaults to complete linkage).
data_clust <- hclust(data_dist)
# Extract the dendrogram data and replace the row numbers with species names.
# This mapping assumes data.csv has 11 rows, in the order listed here.
data.gg <- dendro_data(data_clust)
dict <- setNames(c('Cow', 'Mouse', 'Rabbit', 'Human', 'Manatee', 'Horse', 'Little Brown Bat', 'Big Brown Bat', 'Rat', 'Dog', 'Goat'), 1:11)
data.gg$labels$label <- sapply(data.gg$labels$label, function(x) dict[[as.character(x)]])
# Plot clusters with ggdendro.
plot1 <- ggdendrogram(data.gg, size = 2, rotate = TRUE)
plot1
dist() offers six methods for building a distance matrix: 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', and 'minkowski'. You can try all six and compare the resulting plots.
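A minimal sketch of that comparison, using base R only. The simulated count matrix is a stand-in for data.csv (its real contents are not shown in the post), and the names `dist_methods` and `clusts` are introduced here for illustration.

```r
# Simulated stand-in for data.csv: 11 items x 8 count features (assumption).
set.seed(1)
data1 <- matrix(rpois(11 * 8, lambda = 2), nrow = 11)
rownames(data1) <- 1:11

# The six methods accepted by dist().
dist_methods <- c("euclidean", "maximum", "manhattan",
                  "canberra", "binary", "minkowski")

# One clustering per distance method (hclust() default linkage: complete).
# Note: "minkowski" with its default p = 2 is identical to "euclidean",
# and "binary" only looks at which entries are zero vs. non-zero.
clusts <- lapply(dist_methods, function(m) hclust(dist(data1, method = m)))
names(clusts) <- dist_methods

# Draw the six dendrograms in a 2 x 3 grid for a visual comparison.
old_par <- par(mfrow = c(2, 3), mar = c(2, 2, 2, 1))
for (m in dist_methods) {
  plot(clusts[[m]], main = m, xlab = "", sub = "")
}
par(old_par)
```

To use it on real data, replace the simulated matrix with the numeric columns read from your own file.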
Add an example of what is in data.csv.
As the whole presentation relies on a distance matrix and on aggregating elements by hierarchical clustering, it would be helpful to better understand in which situations a specific distance method is preferable, and also which aggregation method (UPGMA, WPGMA, average, complete...) would best fit my data.
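One way to explore the aggregation question is to run every linkage method that hclust() offers on the same distance matrix and compare how well each tree preserves the original distances. The sketch below uses the cophenetic correlation for that, which is only one possible criterion and does not by itself say which method fits a given dataset best. The data are simulated, and `link_methods` and `coph_cor` are names introduced here. In hclust(), "average" corresponds to UPGMA and "mcquitty" to WPGMA.

```r
# Simulated stand-in data: 11 items x 8 numeric features (assumption).
set.seed(1)
data1 <- matrix(rnorm(11 * 8), nrow = 11)
data_dist <- dist(data1, method = "euclidean")

# All agglomeration methods accepted by hclust().
link_methods <- c("ward.D", "ward.D2", "single", "complete",
                  "average", "mcquitty", "median", "centroid")

# Cophenetic correlation: correlation between the original distances and
# the heights at which each pair of items is first merged in the tree.
coph_cor <- sapply(link_methods, function(m) {
  cl <- hclust(data_dist, method = m)
  cor(data_dist, cophenetic(cl))
})

# Higher values mean the dendrogram distorts the input distances less.
print(round(sort(coph_cor, decreasing = TRUE), 3))
```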
I have aggregated cell clusters in scRNA-seq data, based on the expression of top marker genes, in order to merge some of my clusters. The dendrogram changed quite a lot depending on the hierarchical clustering method used.
"ward.D2" -- top of the pops in terms of interpretable clusters from gene expression data in my hands; the rest is... often cryptic.
For what it's worth, in a previous project I tested different hierarchical clustering methods for single-cell RNA-seq. I had 27 sub-clusters from one population, which I tried to aggregate by hierarchical clustering, stopping the aggregation at the point where two specific sub-clusters that we knew were different would be merged. I used the top 20 markers of each sub-cluster and ran all the hierarchical clustering methods.
To discard clustering methods that generate clusters not found by the other methods, I created a technical robustness score for each cluster of each method, based on the following principles:
For each cluster, the mean score across methods is computed. These rules give each cluster in each method a technical robustness score, where 1 is a perfect score, presented below:
Most of the methods have high scores overall, but some, such as "median" and "centroid", have a few low-scoring outliers (which can also be seen in the Jaccard similarity matrix). Some methods yield similar scores because their underlying mathematics is closely related on this dataset (average/mcquitty).
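The exact scoring rules are not reproduced in this thread, so the following is only a rough sketch of one way a Jaccard-based robustness score of this kind could be computed, not the method described above. It assumes simulated data (27 items by 20 features), a hypothetical fixed number of aggregated clusters `k`, and a score defined as the mean, across the other methods, of the best Jaccard match for each cluster.

```r
# Simulated stand-in: 27 sub-clusters x 20 marker features (assumption).
set.seed(1)
x <- matrix(rnorm(27 * 20), nrow = 27)
d <- dist(x)

link_methods <- c("ward.D2", "complete", "average", "mcquitty", "single")
k <- 5  # hypothetical number of aggregated clusters

# Cluster membership per method: one column per linkage method.
memb <- sapply(link_methods, function(m) cutree(hclust(d, method = m), k = k))

# Jaccard index between two sets of item indices.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# For each cluster of method m: take its best Jaccard match among the
# clusters of every other method, then average over those methods.
robustness <- function(m) {
  others <- setdiff(link_methods, m)
  sapply(seq_len(k), function(i) {
    a <- which(memb[, m] == i)
    mean(sapply(others, function(o) {
      max(sapply(seq_len(k), function(j) jaccard(a, which(memb[, o] == j))))
    }))
  })
}

# Rows are clusters, columns are methods; 1 means every other method
# recovered exactly the same cluster.
scores <- sapply(link_methods, robustness)
print(round(scores, 2))
```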
I also looked at biological robustness by comparing differential gene expression between all the original clusters inside each aggregated cluster, since I was interested in assessing the variation within aggregated clusters. Presented here is the total number of differentially expressed genes (DEGs) within each aggregated cluster for each method.
Overall, I used these two metrics to select the most suitable hierarchical clustering method for our dataset. All the methods except "single" would have been acceptable.