Question

Clustering using codon usage similarity

7.5 years ago

Saad Khan ▴ 440

Hi,

I have a codon usage similarity matrix that I obtained from another source. Most clustering algorithms expect data shaped like the iris dataset: n rows (observations) and x columns (features). Most R packages do not start from a distance matrix directly; instead they apply their own distance function to the data ("euclidean", "minkowski", etc.). Since I am starting directly with a distance matrix, I was wondering if someone could explain how to cluster the matrix in the first place, and then how to determine the optimal number of clusters from the data. Almost all the methods (R packages) described here (http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters) do not accept a distance matrix as input. R packages like dbscan (http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning) do accept one, but then you have the problem of defining "eps: Reachability maximum distance" and "MinPts: Reachability minimum number of points" beforehand. Can anyone who has run into similar issues provide examples and/or a workaround for my problem?

codon usage clustering • 2.4k views

ADD COMMENT • link updated 7.4 years ago by Biostar 20 • written 7.5 years ago by Saad Khan ▴ 440

Hi, I am not sure I understand correctly, but if you already have your distances (the similarity matrix) and want to cluster directly on them (instead of calculating Euclidean distances), I think you can use as.dist.

e.g.,

# dist_mat is a placeholder name for your square, symmetric distance matrix
HC <- hclust(as.dist(dist_mat))
plot(HC)
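
For the second half of the question (choosing the number of clusters), one possible approach is to cut the tree at several values of k and compare the average silhouette width, which only needs the distances. The sketch below is an illustration rather than a recommendation: the iris-based matrix is just a stand-in for a precomputed distance matrix, the names `dist_mat`, `ks`, `avg_sil` and `best_k` are made up, and the range 2 to 6 and average linkage are arbitrary choices.

```r
library(cluster)  # provides silhouette(); ships with standard R installations

# Stand-in for a precomputed distance matrix (assumption: yours is square and symmetric)
dist_mat <- as.matrix(dist(iris[, 1:4]))

d  <- as.dist(dist_mat)               # coerce the matrix to a dist object
hc <- hclust(d, method = "average")   # linkage method is an arbitrary choice here

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]      # k with the highest average silhouette width
groups <- cutree(hc, k = best_k)      # cluster label for each observation
```

A higher average silhouette width suggests tighter, better separated clusters, but it is a heuristic and worth checking against the dendrogram.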

ADD REPLY • link 7.5 years ago by Benn 8.3k

Note that as.dist only coerces the matrix into a dist object. The content doesn't 'magically' become interpretable as a distance. If you have a matrix of similarities, you first need to convert it to distances (i.e. dissimilarities). Using a similarity matrix when a distance matrix is expected will usually produce the wrong result because a high distance value means a low similarity and vice versa. There are various ways of converting a similarity into a distance, one is simply D(i,j)=max(S)-S(i,j).
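
A minimal sketch of that conversion, assuming a small made-up similarity matrix `S` with values in [0, 1] (the item names and numbers are invented for illustration):

```r
# Toy similarity matrix: items a,b are similar to each other, as are c,d
S <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(letters[1:4], letters[1:4]))

# D(i,j) = max(S) - S(i,j): high similarity becomes small distance
D <- max(S) - S

# Only now is it meaningful to coerce to a dist object and cluster
hc     <- hclust(as.dist(D))
groups <- cutree(hc, k = 2)
```

Passing `S` itself to `as.dist` would run without error, but the most similar items would then be treated as the farthest apart.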

ADD REPLY • link 7.5 years ago by Jean-Karim Heriche 27k

How about other, more robust methods (R packages) such as k-means, PAM (k-medoids), mclust, etc.?
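
As a rough pointer on this: `pam` from the cluster package can take a dist object directly, whereas `kmeans` expects coordinates, so for k-means the distances would first have to be embedded, for example with classical multidimensional scaling (`cmdscale`). The sketch below assumes an iris-based dist object as a stand-in for real codon usage distances; the names `ks`, `avg_sil`, `best_k`, `fit`, `coords` and `km`, the range of k and the two embedding dimensions are all illustrative choices.

```r
library(cluster)  # provides pam()

# Stand-in for precomputed distances (assumption: replace with as.dist(your_matrix))
d <- dist(iris[, 1:4])

# PAM (k-medoids) on the distances, scored by average silhouette width
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
best_k  <- ks[which.max(avg_sil)]
fit     <- pam(d, k = best_k, diss = TRUE)

# k-means needs coordinates: embed the distances with classical MDS first
coords <- cmdscale(d, k = 2)                        # 2 dimensions chosen arbitrarily
km     <- kmeans(coords, centers = best_k, nstart = 10)
```

The MDS step loses information when the distances are not well represented in a few dimensions, so the k-means result is only as good as that embedding.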