Question

The number of clusters in k-means clustering

1

7.8 years ago

Shamim Sarhadi ▴ 220

I have a dataset of 150 genes and 100 samples. I want to cluster my genes with k-means clustering, but I don't know how many clusters to use. How can I select the best number?

statistics • 3.5k views

ADD COMMENT • link updated 7.2 years ago by Ketil 4.1k • written 7.8 years ago by Shamim Sarhadi ▴ 220

score 1 · Answer 1 · 2016-06-19

1

7.8 years ago

Ar ★ 1.1k

Some of the ways to find "k" in k-means are:

Here is an implementation of k-means and the elbow method in R
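As a rough illustration of the elbow idea, here is a minimal sketch in plain Python rather than R. It is a toy under stated assumptions, not the linked implementation: the helper names (`sq_dist`, `farthest_first_seeds`, `kmeans_wss`) and the three-blob data are made up for the example, and the seeding is a deterministic farthest-first rule chosen only so the output is reproducible (real k-means implementations usually use random restarts). The idea is to compute the within-cluster sum of squares (WSS) for several values of k and look for the k after which the drop flattens out.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_seeds(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from the seeds so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return seeds


def kmeans_wss(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = [list(s) for s in farthest_first_seeds(points, k)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data: three tight blobs of five points each (hypothetical, not gene data).
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

# WSS for k = 1..4; the "elbow" is where the decrease levels off.
wss = {k: kmeans_wss(points, k) for k in range(1, 5)}
for k in sorted(wss):
    print(k, round(wss[k], 4))
```

On this toy data the WSS falls steeply up to k = 3 and barely changes afterwards, which is the elbow. On real expression data the bend is usually much less sharp.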

ADD COMMENT • link 7.8 years ago by Ar ★ 1.1k

0

The silhouette approach is implemented in R in the package cluster.
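To make the silhouette idea concrete, here is a minimal plain-Python sketch (not the `cluster` package code; the function name `mean_silhouette` and the toy data are assumptions for illustration). For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette width is (b - a) / max(a, b). The labelling with the highest average width is preferred.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width of a labelling (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters get width 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes 0 to the sum, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for name, other in clusters.items()
            if name != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: three tight blobs of five points each (hypothetical, not gene data).
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

labels_k3 = [0] * 5 + [1] * 5 + [2] * 5  # one label per blob
labels_k2 = [0] * 5 + [1] * 5 + [0] * 5  # first and third blob merged

s3 = mean_silhouette(points, labels_k3)
s2 = mean_silhouette(points, labels_k2)
print("k=3:", round(s3, 3), "k=2:", round(s2, 3))
```

Here the three-cluster labelling scores higher than the merged two-cluster one. In practice you would compute this for the labels returned by k-means at each candidate k.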

ADD REPLY • link 7.8 years ago by Jean-Karim Heriche 27k

0

Yes, and it suggests two clusters, but the Calinski approach suggests 11.

ADD REPLY • link 7.8 years ago by Shamim Sarhadi ▴ 220

0

Thank you, Ar. I tried all the methods in this link http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters , but I got more confused because each method proposed a different number, like 2, 5 or 11! Dear Ar, I can't upvote your answer. I think you should comment on my post so that I can upvote your answer.

ADD REPLY • link 7.8 years ago by Shamim Sarhadi ▴ 220

1

You get a different number of clusters with different methods because they look at different things. However, in the case of very well defined clusters, they would tend to give the same number but this is rather rare because real data is noisy.

ADD REPLY • link 7.8 years ago by Jean-Karim Heriche 27k

1

Perhaps the clustering makes sense at different levels. Maybe at k=2, the hypothetical genes split up by "healthy" and "cancer" labels, say. At k=5, the genes split into two "healthy" subgroups and three "cancer" subgroups. Etc. And these groupings might be biologically relevant in different respects. Further exploration of the clusterings is useful to see how and why things are falling out.

ADD REPLY • link 7.2 years ago by Alex Reynolds 35k

score 0 · Answer 2 · 2016-06-19

0

7.8 years ago

Jean-Karim Heriche 27k

I usually first build a couple of dendrograms with hierarchical clustering using different methods e.g. complete linkage and Ward's linkage to get an idea of the structures present in the data. The problem is that k-means will give you the clusters you requested no matter whether there's structure in the data or not. Once you know there's structure you can either cut the tree or use k-means with the number of clusters found in the dendrogram. Also with 100-dimensional vectors, you should probably not use Euclidean distance if the data is noisy. Finally, there are also a few clustering algorithms that don't require the number of clusters as input e.g. DBSCAN (dbscan package in R) although they often require setting other parameters which are not necessarily easier to estimate.
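The post does not name an alternative to Euclidean distance. One common choice for expression profiles is correlation distance (1 minus the Pearson correlation), which compares the shape of two profiles and ignores their overall level and scale. The sketch below assumes that choice; the function name `correlation_distance` and the three toy "genes" are made up for illustration.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / (norm_x * norm_y)


# Hypothetical profiles across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # same trend as gene_a, ten times the scale
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]       # opposite trend to gene_a

d_ab = correlation_distance(gene_a, gene_b)
d_ac = correlation_distance(gene_a, gene_c)
euclid_ab = math.dist(gene_a, gene_b)

print("correlation distance a-b:", round(d_ab, 6))
print("correlation distance a-c:", round(d_ac, 6))
print("Euclidean distance a-b:", round(euclid_ab, 3))
```

Genes a and b are far apart in Euclidean terms but essentially identical under correlation distance, so which measure you pick changes which genes end up in the same branch of the dendrogram.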

ADD COMMENT • link 7.8 years ago by Jean-Karim Heriche 27k

0

Thank you, Jean-Karim Heriche. So, in your opinion, it is better to first build a dendrogram with hierarchical clustering and then select the number of clusters for k-means clustering? I don't know the mathematics behind these methods very well. In your opinion, is it better to use hierarchical clustering instead of k-means?

ADD REPLY • link 7.8 years ago by Shamim Sarhadi ▴ 220

1

It is always good to start with some visual exploration of the data before clustering. The goals are first to find out whether there are some detectable structures and second to try to get an idea of their shapes. This last bit is important because k-means works well only for roughly spherical clusters. Hierarchical clustering is a quick and easy way to go about looking for structures. You could also try various kinds of plots, e.g. PCA or MDS. If you can validate the clusters a posteriori, then you could try different similarity/distance measures with different clustering approaches and select what gives you the best results.

ADD REPLY • link 7.8 years ago by Jean-Karim Heriche 27k

0

Thank you for your helpful explanation

ADD REPLY • link 7.8 years ago by Shamim Sarhadi ▴ 220

score 0 · Answer 3 · 2017-02-08

There's an interesting modification to k-means where instead of setting the number of clusters explicitly, you minimize the expression

  $\sum_i \| x_i - \mu_i \|^2 + \sum_{i<j} \| \mu_i - \mu_j \|$

(IIRC). The $\mu$s represent cluster centroids, and the minimization forces them to be as few as possible, while minimizing the distance from the data points $x_i$ to the corresponding centroid. I can see if I can find a reference, if it's of interest.
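To see how such an objective trades fit against the number of distinct centroids, here is a small sketch. It assumes the expression is read as one centroid per data point plus a pairwise penalty, which resembles what is sometimes called convex or sum-of-norms clustering. The weight `lam` on the penalty term is an assumption that does not appear in the expression as written, and the function names and toy points are hypothetical. The code only evaluates the objective for three hand-picked candidate solutions; it does not perform the minimization.

```python
import math


def fusion_objective(points, centroids, lam=1.0):
    """Fit term plus lam times the sum of pairwise centroid distances.

    centroids[i] is the centroid attached to points[i]; points that share a
    cluster simply have identical centroids. lam is an assumed tuning weight.
    """
    fit = sum(math.dist(x, mu) ** 2 for x, mu in zip(points, centroids))
    n = len(centroids)
    fusion = sum(
        math.dist(centroids[i], centroids[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return fit + lam * fusion


# Toy data: two pairs of nearby points (hypothetical).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Three hand-picked candidate solutions, not outputs of a solver.
candidates = {
    "one centroid per point": list(points),
    "two clusters": [(0.0, 0.5), (0.0, 0.5), (10.0, 0.5), (10.0, 0.5)],
    "one cluster": [(5.0, 0.5)] * 4,
}


def best_candidate(lam):
    """Name of the candidate with the lowest objective for a given lam."""
    return min(candidates, key=lambda name: fusion_objective(points, candidates[name], lam))


for lam in (0.1, 1.0, 5.0):
    print(lam, best_candidate(lam))
```

With a small weight the penalty hardly matters and every point keeps its own centroid; as the weight grows, merging centroids becomes cheaper than keeping them apart, so the number of clusters is controlled through `lam` rather than set directly.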