**190**wrote:

I am not sure whether this would be more appropriate on a stats forum, but I feel some here may have knowledge of this.

It is said that K-means clustering "does not work well with non-globular clusters."

However, is this a hard-and-fast rule - or is it that it does not often work?

I have a 2-d data set (specifically depth of coverage and breadth of coverage of genome sequencing reads across different genomic regions). In short, I am expecting two clear groups from this dataset (with notably different depth of coverage and breadth of coverage) and by defining the two groups I can avoid having to make an arbitrary cut-off between them.

The procedure appears to successfully identify the two expected groupings, however the clusters are clearly not globular. Is this a valid application? I am not sure whether I am violating any assumptions (if there are any?), or whether it is just that k-means often does not work with non-spherical data clusters.

If the question being asked is, is there a depth and breadth of coverage associated with each group which means the data can be partitioned such that the means of the members of the groups are closer for the two parameters to members within the same group than between groups, then the answer appears to be yes. But is it valid? Or is it simply, if it works, then it's ok?

(Apologies, I am very much a stats novice.)

Edit: below is a visual of the clusters. The breadth of coverage is 0 to 100 % of the region being considered. The depth is 0 to essentially infinity (I have log transformed this parameter as some regions of the genome are repetitive, so reads from other areas of the genome may map to it resulting in very high depth - again, please correct me if this is not the way to go in a statistical sense prior to clustering).

As you can see the red cluster is now reasonably compact thanks to the log transform, however the yellow (gold?) cluster is not. Despite this, without going into detail the two groups make biological sense (both given their resulting members and the fact that you would expect two distinct groups prior to the test), so given that the result of clustering maximizes the between group variance, surely this is the best place to make the cut-off between those tending towards zero coverage (will never be exactly zero due to incorrect mapping of reads?) and those with distinctly higher breadth/depth of coverage. The algorithm converges very quickly <10 iterations.

Thanks

**3.4k**• written 5.5 years ago by jomaco •

**190**