Question: The number of clusters in k-means clustering
2.4 years ago, Shamim Sarhadi210 (IRAN) wrote:

I have a dataset of 150 genes and 100 samples. I want to cluster my genes with k-means clustering, but I don't know the number of clusters. How can I select the best number?

modified 21 months ago by Ketil3.9k • written 2.4 years ago by Shamim Sarhadi210
2.4 years ago, Ar790 (United States) wrote:

Some of the ways to find "k" in k-means are the elbow method, the average silhouette width, and the Calinski-Harabasz index.

Here is an implementation of k-means and the elbow method in R.
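The linked R code isn't reproduced here, but the elbow idea can be sketched in plain Python (a bare-bones Lloyd's algorithm on toy data, not the linked implementation): run k-means for increasing k, record the within-cluster sum of squares (WCSS), and pick the k where the curve bends.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns centroids and the total
    within-cluster sum of squares (WCSS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    wcss = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)
    return centroids, wcss

# two well-separated toy "expression profiles"
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
for k in (1, 2, 3):
    print(k, round(kmeans(data, k)[1], 3))
# WCSS drops sharply from k=1 to k=2 and then flattens: the elbow is at k=2
```

With real expression data you would run this over a range of k (with several random starts per k) and plot WCSS against k.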

modified 2.4 years ago • written 2.4 years ago by Ar790

The silhouette approach is implemented in R in the package cluster.
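For illustration only (in practice you would use `cluster::silhouette` in R), the computation can be sketched in Python: for each point, compare its mean distance a to the rest of its own cluster with its mean distance b to the nearest other cluster.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width: for each point, a = mean distance to the rest
    of its own cluster, b = mean distance to the nearest other cluster,
    and the point's score is (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:                 # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(math.dist(p, q) for j, q in enumerate(points)
                    if labels[j] == c) / labels.count(c)
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(round(silhouette(pts, [0, 0, 0, 1, 1, 1]), 3))  # compact, separated: close to 1
print(round(silhouette(pts, [0, 1, 0, 1, 0, 1]), 3))  # shuffled labels: far lower
```

To pick k, compute the mean silhouette width for each candidate clustering and take the k that maximizes it.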

written 2.4 years ago by Jean-Karim Heriche16k

Yes, and it suggests two clusters, but the Calinski approach suggests 11.

modified 2.4 years ago • written 2.4 years ago by Shamim Sarhadi210

Thank you Ar, I tried all the methods in this link http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters but I got more confused, because each method proposed a different number: 2, 5, 11! Dear Ar, I can't upvote your answer; I think you should comment on my post, then I can upvote your answer.

written 2.4 years ago by Shamim Sarhadi210

You get a different number of clusters with different methods because they look at different things. However, with very well-defined clusters they would all tend to give the same number; this is rather rare because real data are noisy.

written 2.4 years ago by Jean-Karim Heriche16k

Perhaps clustering makes sense at different levels. Maybe at k=2 the hypothetical genes split by "healthy" and "cancer" labels, say; at k=5 they split into two "healthy" subgroups and three "cancer" subgroups, etc. These groupings might be biologically relevant in different respects, so further exploration of the clusterings is useful to see how and why things fall out.

modified 21 months ago • written 21 months ago by Alex Reynolds26k
2.4 years ago, Jean-Karim Heriche16k (EMBL Heidelberg, Germany) wrote:

I usually first build a couple of dendrograms with hierarchical clustering using different methods e.g. complete linkage and Ward's linkage to get an idea of the structures present in the data. The problem is that k-means will give you the clusters you requested no matter whether there's structure in the data or not. Once you know there's structure you can either cut the tree or use k-means with the number of clusters found in the dendrogram. Also with 100-dimensional vectors, you should probably not use Euclidean distance if the data is noisy. Finally, there are also a few clustering algorithms that don't require the number of clusters as input e.g. DBSCAN (dbscan package in R) although they often require setting other parameters which are not necessarily easier to estimate.
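The dendrogram-first idea can be sketched with a naive single-linkage agglomeration in Python (toy data; in practice you would use R's `hclust`): a big jump between successive merge heights is the visual cue for where to cut the tree.

```python
import math

def single_linkage_heights(points):
    """Naive agglomerative clustering with single linkage. Returns the
    distance at which each successive merge happens; a big jump between
    consecutive heights suggests where to cut the tree."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)  # j > i, so index i is safe
        heights.append(d)
    return heights

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print([round(h, 2) for h in single_linkage_heights(pts)])
# four merges at height 1.0, then one at ~13.45: cut before the jump -> 2 clusters
```

The number of clusters left just below the jump is then a sensible k to feed to k-means.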

written 2.4 years ago by Jean-Karim Heriche16k

Thank you, Jean-Karim Heriche. So, in your opinion, is it better to first build a dendrogram with hierarchical clustering and then select the number of clusters for k-means? I don't know the mathematics behind these methods very well; in your opinion, is it better to use hierarchical clustering instead of k-means?

written 2.4 years ago by Shamim Sarhadi210

It is always good to start with some visual exploration of the data before clustering. The goals are first to find out whether there are some detectable structures and second to try to get an idea of their shapes. This last bit is important because k-means can only find spherical-shaped clusters. Hierarchical clustering is a quick and easy way to go about looking for structures. You could also try various kinds of plots e.g. PCA, MDS. If you can validate the clusters a posteriori then you could try different similarity/distance measures with different clustering approaches and select what gives you the best results.
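As a sketch of that visual-exploration step (plain Python with no libraries; in practice `prcomp` in R does this): project the data onto the first principal component, found here by power iteration, and look at the spread of the projections.

```python
def pca_first_component(data, iters=200):
    """Centre the data, build the covariance matrix, and extract the
    leading eigenvector by power iteration; returns the component and the
    projection (score) of each row onto it."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]
    return v, scores

# an elongated toy cloud: almost all variance lies along the x axis
v, scores = pca_first_component([(0, 0), (1, 0.1), (2, -0.1), (3, 0.05), (4, 0)])
print([round(x, 2) for x in v])  # aligns (up to sign) with the x axis
```

Plotting the first two components against each other is the usual quick check for whether clusters are visible and roughly spherical.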

written 2.4 years ago by Jean-Karim Heriche16k

Thank you for your helpful explanation

written 2.4 years ago by Shamim Sarhadi210
21 months ago, Ketil3.9k (Germany) wrote:

There's an interesting modification to k-means where, instead of setting the number of clusters explicitly, you minimize the expression

  $\sum_i \| x_i - \mu_i \|^2 + \sum_{i<j} \| \mu_i - \mu_j \|$


(IIRC). The $\mu$s represent cluster centroids, and the minimization forces them to be as few as possible while also minimizing the distance from each data point $x_i$ to its corresponding centroid. I can see if I can find a reference, if it's of interest.
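For what it's worth, that objective looks like "sum-of-norms" (convex) clustering: every point gets its own centroid, and the pairwise penalty fuses centroids together so the number of distinct centroids emerges from the optimization. A Python sketch that just evaluates the objective (the weight `lam` is my addition; the formula above omits it):

```python
import math

def sum_of_norms_objective(points, centroids, lam=1.0):
    """Objective of sum-of-norms (convex) clustering: every point x_i has
    its own centroid mu_i; the pairwise penalty pulls centroids together so
    that, at the optimum, many mu_i coincide and the number of *distinct*
    centroids (i.e. clusters) is controlled by lam rather than fixed in
    advance. lam is an assumed regularization weight, not in the original."""
    fit = sum(math.dist(x, m) ** 2 for x, m in zip(points, centroids))
    fuse = sum(math.dist(centroids[i], centroids[j])
               for i in range(len(centroids))
               for j in range(i + 1, len(centroids)))
    return fit + lam * fuse

pts = [(0.0, 0.0), (0.1, 0.0)]
print(sum_of_norms_objective(pts, [(0.05, 0.0), (0.05, 0.0)]))  # fused centroids
print(sum_of_norms_objective(pts, [(0.0, 0.0), (0.1, 0.0)]))    # one centroid per point
# for these nearby points the fused solution scores lower: the penalty merges clusters
```

Actually minimizing this over the centroids takes a convex solver, but evaluating it shows why the penalty trades cluster count against fit.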

written 21 months ago by Ketil3.9k
Powered by Biostar version 2.3.0