Question

Optimal K-means cluster # for large data (R)

0

4.0 years ago

crcarroll ▴ 90

Background: I have multiple data matrices, each with 2,000 columns and ~21,0000 rows. I am performing a K-means analysis and then producing a heatmap of the cluster data. I am working in R.

Problem: Rather than pre-selecting a K-means cluster number and choosing by trial and error which plot "looks best", I would like to use a tool that applies something like the elbow or silhouette method to determine the optimal number of clusters. I have tried nclust (before running nclust, I used the amap package to calculate the distance matrix). My problem is that it has not finished after ~5 hours of running, and I receive no errors or warnings. I am working on a remote server and eventually lose the connection, so waiting many hours is not practical.

Question: Is there a practical solution or tool that can handle a large matrix for determining optimal cluster # for a k-means analysis?

R kmeans • 2.5k views

ADD COMMENT • link 4.0 years ago by crcarroll ▴ 90

score 0 · Answer 1 · 2020-05-05

0

4.0 years ago

piyushjo ▴ 700

What about the gap statistic? See the clusGap() function in the cluster package. I think that with single-cell data you first run PCA and then use the reduced dimensions for k-means clustering. Have a look at the link below; it is written for single-cell data, but I am sure it can be applied to your problem.

https://osca.bioconductor.org/clustering.html
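
A minimal sketch of the PCA-then-gap-statistic idea, assuming a numeric matrix with observations in rows. The simulated `mat`, the choice of 10 principal components, `K.max = 8` and `B = 20` are illustrative assumptions, not values taken from this thread; on a matrix as large as the one in the question, a random subset of rows may be needed to keep the runtime manageable.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)

# Simulated stand-in for the real data: 3 groups of 200 rows, 50 columns.
# The vector is recycled down the columns, so each row is shifted as a whole.
group_shift <- rep(c(-3, 0, 3), each = 200)
mat <- matrix(rnorm(600 * 50), nrow = 600) + group_shift

# Reduce dimensionality first; keeping 10 PCs is an arbitrary choice here.
pcs <- prcomp(mat, center = TRUE)$x[, 1:10]

# Gap statistic for k = 1..8 with k-means as the clustering function.
# nstart and iter.max are passed through to kmeans().
gap <- clusGap(pcs, FUNcluster = kmeans, K.max = 8, B = 20,
               verbose = FALSE, nstart = 10, iter.max = 50)

# Pick k using the standard-error rule of Tibshirani et al. (2001).
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
                method = "Tibs2001SEmax")

print(gap$Tab)
print(best_k)
```

Increasing `B` gives a more stable estimate at the cost of runtime, since every bootstrap replicate reruns k-means for each value of k.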

ADD COMMENT • link 4.0 years ago by piyushjo ▴ 700

0

Thank you, I will look into this.

ADD REPLY • link 4.0 years ago by crcarroll ▴ 90

1

Coincidentally, there was a recent similar question on Bioconductor: Question: Optimal cluster number identification using buildSNNgraph and igraph clusters

Regarding cluster::clusGap(), it can be terribly slow - I have enabled it for parallel processing:

https://github.com/kevinblighe/clusGapKB
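
If the gap statistic is still too slow, the elbow heuristic mentioned in the question is much cheaper, because it needs only one k-means fit per candidate k and no bootstrap. The following is a rough base-R sketch under the same assumptions as before (simulated `mat`, 10 principal components, k from 1 to 8); the subsample size of 300 rows is also just an illustration.

```r
set.seed(1)

# Simulated stand-in for the real data: 3 groups of 200 rows, 50 columns.
group_shift <- rep(c(-3, 0, 3), each = 200)
mat <- matrix(rnorm(600 * 50), nrow = 600) + group_shift

# Work on a random subset of rows to bound the runtime on large inputs.
idx <- sample(nrow(mat), 300)
pcs <- prcomp(mat[idx, ], center = TRUE)$x[, 1:10]

# Total within-cluster sum of squares for each candidate k.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(pcs, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply.
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

Reading the elbow off the plot is a judgment call, so it is worth repeating the run with a different subsample to see whether the bend stays at the same k.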

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

0

Actually, that is my question on Bioconductor. I wanted to ask Aaron about this as it relates to his scran package. The OP here was just quicker than me; however, this question is about k-means, whereas I was more interested in optimizing graph-based clustering. Sorry if it felt like a double post!

ADD REPLY • link 4.0 years ago by piyushjo ▴ 700

1

No, no problem. The question in this thread here is from crcarroll. As I mentioned in my own answer on Bioconductor, there's no right or wrong answer. Aaron's knowledge definitely supersedes mine in this area though.

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k