Clustering using codon usage similarity
5.1 years ago

Hi,

I have a codon usage similarity matrix that I obtained elsewhere. Most clustering algorithms start from data shaped like the iris dataset: n rows (observations) and x columns (features). Most R packages do not start from a distance matrix directly; they apply their own distance function ("euclidean", "minkowski", etc.) to the raw data. Since I am starting directly with a distance matrix, I was wondering if someone could give me some insight into how to cluster the matrix in the first place, and then how to determine the optimal number of clusters from the data.

Almost none of the methods (R packages) described here (http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters) accept a distance matrix as input. R packages like dbscan (http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning) do accept one, but then you have the problem of defining "eps: Reachability maximum distance" and "MinPts: Reachability minimum number of points" beforehand. Can anyone who has run into similar issues provide examples and/or a workaround for my problem?

codon usage clustering • 1.7k views

Hi, I am not sure I understand correctly, but if you already have your distances (the similarity matrix) and want to cluster directly with these (instead of calculating Euclidean distances), I think you can use as.dist.

e.g.,

# 'mat' is your square matrix of pairwise distances
HC <- hclust(as.dist(mat))
plot(HC)

Note that as.dist only coerces the matrix into a dist object. The content doesn't 'magically' become interpretable as a distance. If you have a matrix of similarities, you first need to convert it to distances (i.e. dissimilarities). Using a similarity matrix when a distance matrix is expected will usually produce the wrong result because a high distance value means a low similarity and vice versa. There are various ways of converting a similarity into a distance, one is simply D(i,j)=max(S)-S(i,j).
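
To make the conversion concrete, here is a minimal sketch using a made-up 3 x 3 similarity matrix (the values and the labels a, b, c are purely illustrative, not real codon usage data). It applies D(i,j)=max(S)-S(i,j), wraps the result with as.dist, and clusters it with hclust; the "average" linkage is just one possible choice.

```r
# Toy symmetric similarity matrix (hypothetical values, 1 = identical)
S <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.3,
              0.2, 0.3, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# D(i,j) = max(S) - S(i,j): the most similar pair gets the smallest distance
D <- max(S) - S

# as.dist() keeps the lower triangle and wraps it as a 'dist' object
d <- as.dist(D)

# Hierarchical clustering on the converted distances
hc <- hclust(d, method = "average")

# Cut the tree into two groups: 'a' and 'b' should end up together
groups <- cutree(hc, k = 2)
groups
```

Calling plot(hc) afterwards draws the dendrogram, as in the earlier reply.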

What about other, more robust methods (R packages) such as k-means, PAM (k-medoids), mclust, etc.?
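
Of these, PAM is the one that can work from distances directly: pam() in the cluster package accepts a dist object when called with diss = TRUE. As far as I know, kmeans() and mclust expect a table of coordinates rather than a dist object. The sketch below is one possible way to combine PAM with the average silhouette width to pick the number of clusters; the toy data, the range of k and the variable names are assumptions for illustration only.

```r
library(cluster)  # provides pam(); a recommended package in standard R installs

# Toy distances: two tight groups of three points on a line
# (hypothetical data standing in for the codon usage distances)
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
d <- dist(x)

# Fit PAM for several k and record the average silhouette width of each fit
ks <- 2:4
avg_sil <- sapply(ks, function(k) {
  fit <- pam(d, k = k, diss = TRUE)
  fit$silinfo$avg.width
})

# Keep the k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
best_fit <- pam(d, k = best_k, diss = TRUE)
best_fit$clustering
```

With your own data you would replace d with as.dist() of your distance matrix and widen the range of k as needed. The silhouette criterion is only one of several heuristics, so it is worth checking the result against a dendrogram.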

I agree with the first comment, but we should respect the distance object format.
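
One way to read "respecting the format" is to check that the matrix really looks like a distance matrix before coercing it, since as.dist only reads the lower triangle and will not complain about an asymmetric input. Below is a rough sketch of such a check; check_distance_matrix is a hypothetical helper written for this example, not a function from any package, and the tolerance is an arbitrary choice.

```r
# Hypothetical helper (not from any package): basic sanity checks
# before handing a square matrix to as.dist()
check_distance_matrix <- function(m, tol = 1e-8) {
  stopifnot(is.matrix(m), is.numeric(m), nrow(m) == ncol(m))
  c(symmetric     = isSymmetric(unname(m), tol = tol),
    zero_diagonal = all(abs(diag(m)) < tol),
    non_negative  = all(m >= 0))
}

# A valid distance matrix built from three points on a line
good <- as.matrix(dist(c(1, 2, 4)))

# The same matrix with one entry changed, which breaks symmetry
bad <- good
bad[1, 2] <- 10

check_distance_matrix(good)
check_distance_matrix(bad)
```

If any of the checks comes back FALSE, the matrix is probably a similarity matrix or otherwise needs converting before clustering.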