Distance matrix as input in R clustering functions?
6 weeks ago

Hello, I want to find out whether there are genetic clusters in a time series of 93 samples. I used Mash (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) to generate a 93x93 distance matrix that looks like the following one:

    A   B   C   D   E   F   G   H   I   J   K   L
A   0  20  20  20  40  60  60  60 100 120 120 120
B  20   0  20  20  60  80  80  80 120 140 140 140
C  20  20   0  20  60  80  80  80 120 140 140 140
D  20  20  20   0  60  80  80  80 120 140 140 140
E  40  60  60  60   0  20  20  20  60  80  80  80
F  60  80  80  80  20   0  20  20  40  60  60  60
G  60  80  80  80  20  20   0  20  60  80  80  80
H  60  80  80  80  20  20  20   0  60  80  80  80
I 100 120 120 120  60  40  60  60   0  20  20  20
J 120 140 140 140  80  60  80  80  20   0  20  20
K 120 140 140 140  80  60  80  80  20  20   0  20
L 120 140 140 140  80  60  80  80  20  20  20   0


How can I feed this matrix to a clustering algorithm? I used the kmeans function in R and got clusters, but it might not be a good idea to give that function a distance matrix as input (kmeans computes its own distances from the input rows instead of using precomputed ones).

In other words, is there any clustering function in R that supports a distance matrix as input?

I was looking at this one as a feasible clustering algorithm for my dataset: https://onlinelibrary.wiley.com/doi/10.1002/9780470316801.ch3 (k-medoids for large datasets), but I don't know whether it accepts a distance matrix as input.

k-medioids-genetic-distances-mash R Clustering • 271 views
5 weeks ago
basuanubhav ▴ 80

Hey, to my knowledge the R function hclust can cluster directly from a distance matrix, such as the one produced by the dist function in R.

Let me know if it helps, Cheers!


You can convert a matrix of distances M into a dist object with as.dist(M). Using hclust is generally a good idea to start exploring cluster structure, e.g.

hclust(as.dist(M))
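
To make the one-liner above concrete, here is a minimal sketch using the 12 x 12 toy matrix from the question. The average linkage, the choice of k = 3, and the extra call to pam from the cluster package (a k-medoids implementation that also accepts a dist object) are my own assumptions for illustration, not something the thread prescribes.

```r
# Toy matrix copied from the question; the real Mash matrix would be 93 x 93.
# The header row has one field fewer than the data rows, so read.table()
# takes the first column as row names.
M <- as.matrix(read.table(header = TRUE, text = "
    A   B   C   D   E   F   G   H   I   J   K   L
A   0  20  20  20  40  60  60  60 100 120 120 120
B  20   0  20  20  60  80  80  80 120 140 140 140
C  20  20   0  20  60  80  80  80 120 140 140 140
D  20  20  20   0  60  80  80  80 120 140 140 140
E  40  60  60  60   0  20  20  20  60  80  80  80
F  60  80  80  80  20   0  20  20  40  60  60  60
G  60  80  80  80  20  20   0  20  60  80  80  80
H  60  80  80  80  20  20  20   0  60  80  80  80
I 100 120 120 120  60  40  60  60   0  20  20  20
J 120 140 140 140  80  60  80  80  20   0  20  20
K 120 140 140 140  80  60  80  80  20  20   0  20
L 120 140 140 140  80  60  80  80  20  20  20   0
"))

# as.dist() keeps only the lower triangle, so check symmetry first.
stopifnot(all(M == t(M)))
d <- as.dist(M)

# Hierarchical clustering on the precomputed distances
# (average linkage is an arbitrary choice here).
hc <- hclust(d, method = "average")
plot(hc)                      # dendrogram

# Cut the tree into a fixed number of groups (k = 3 is assumed).
groups <- cutree(hc, k = 3)
print(groups)

# k-medoids on the same dist object; pam() treats a dist input
# as dissimilarities rather than as raw observations.
pam_fit <- cluster::pam(d, k = 3)
print(pam_fit$clustering)
```

With a real matrix, replace the read.table() call with however the Mash output is loaded, as long as the result is a square, symmetric numeric matrix with sample names as row and column names.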


Yeah, it worked! Thanks. Now I'm trying to figure out whether there is a method for estimating the proper number of clusters to look for in a dendrogram. Do you know any?


What counts as a cluster is in the eye of the beholder. There is no good answer to this question. It often depends on the granularity we want to have, e.g. do we want one cluster of all blood cells, or do we want to separate red from white blood cells?
You can try the dynamicTreeCut package to find clusters in a dendrogram. If no obvious structure is visible in the dendrogram, you may want to explore the underlying feature space a bit more, for example with dimensionality reduction methods: where do the samples fall when plotted on the first two principal components? Does UMAP/t-SNE reveal any meaningful clustering? Note that you can also do clustering in a reduced-dimensionality space.
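
As a rough sketch of how one might explore this in R when only a distance matrix is available: the snippet below compares average silhouette widths across a few candidate cluster counts and plots the samples with classical multidimensional scaling (cmdscale), which works directly on distances and so can stand in for PCA here. Both techniques are my own suggestions and are not mentioned above; the toy matrix is the one from the question, and the range 2 to 6 and the average linkage are arbitrary assumptions. Treat the silhouette result as a heuristic, not as the answer.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

# Toy matrix from the question (first column becomes the row names).
M <- as.matrix(read.table(header = TRUE, text = "
    A   B   C   D   E   F   G   H   I   J   K   L
A   0  20  20  20  40  60  60  60 100 120 120 120
B  20   0  20  20  60  80  80  80 120 140 140 140
C  20  20   0  20  60  80  80  80 120 140 140 140
D  20  20  20   0  60  80  80  80 120 140 140 140
E  40  60  60  60   0  20  20  20  60  80  80  80
F  60  80  80  80  20   0  20  20  40  60  60  60
G  60  80  80  80  20  20   0  20  60  80  80  80
H  60  80  80  80  20  20  20   0  60  80  80  80
I 100 120 120 120  60  40  60  60   0  20  20  20
J 120 140 140 140  80  60  80  80  20   0  20  20
K 120 140 140 140  80  60  80  80  20  20   0  20
L 120 140 140 140  80  60  80  80  20  20  20   0
"))
d  <- as.dist(M)
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters.
# Higher means tighter, better separated clusters.
ks <- 2:6                                   # assumed search range
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]

# Classical MDS: 2D coordinates whose Euclidean distances
# approximate the entries of the distance matrix.
coords <- cmdscale(d, k = 2)
plot(coords, col = cutree(hc, k = best_k), pch = 19,
     xlab = "MDS 1", ylab = "MDS 2")
text(coords, labels = rownames(coords), pos = 3)
```

If the MDS plot shows no visible grouping and the silhouette widths are low for every k, that is a hint that the data may not have a strong cluster structure at all.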