Question: K-means With Many Clusters: Quick-TRANSfer steps exceeded
4 months ago by
Ark70
US
Ark70 wrote:

Hello!

I am attempting to run R's default kmeans function (in RStudio) on TPM-normalized and feature-scaled RNA-seq data. I am varying the number of centers I pass to kmeans (increasing 'k' logarithmically from 1 to 10,000), but I have hit a snag.

I am calling the kmeans function in the following way:

clust <- kmeans(my_data, k, iter.max = 40)

And I have found that unless I specify a particular seed value I receive the warning:

Quick-TRANSfer stage steps exceeded maximum (= 729900)

This also happens if I increase the number of random starts (the nstart argument) from the default of 1.

I believe that I know why this is happening, but I am not entirely sure, and even if my hunch is correct I still don't know how to fix this.

What I think is happening:

I believe this warning appears because there are too many points that are too similar in value, so kmeans has difficulty placing those points in one particular cluster. Basically, I think that some points are being assigned back and forth between clusters without ever "settling" on one cluster in particular.
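
A quick way to probe part of this hunch is to count how many distinct rows the data contains and compare that with k. The sketch below uses simulated data with deliberately repeated rows as a stand-in for my_data; the names base_rows and n_distinct are only illustrative.

```r
# Stand-in data: 200 rows drawn (with repetition) from only 50 distinct rows.
set.seed(7)
base_rows <- matrix(rnorm(50 * 3), nrow = 50)
my_data <- base_rows[sample(50, 200, replace = TRUE), ]

# unique() on a matrix drops duplicated rows, so this counts distinct points.
n_distinct <- nrow(unique(my_data))
n_distinct

# If k approaches n_distinct, many centers compete for identical points.
```

This only detects exact duplicates; rows that are merely very close would need a distance-based check instead.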

What I have tried to fix the problem:

  • I have tried different seed values and they seem to produce the warning randomly
  • I have tried to vary the iter.max value (from the default of 10 up to a max of 80) without any luck
  • I have tried calling the garbage collector (gc()) before the kmeans function as some users had reported the warning disappearing after clearing memory, but this did not work for me
  • I have tried using a different algorithm (Lloyd); however, even with iter.max set to 80, it still failed to converge. On top of that, I would really prefer to use Hartigan-Wong if at all possible, as I am analyzing the way kmeans is generally used and therefore need to stay close to the default settings
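
For reference, the seed and iter.max attempts above can be sketched roughly as follows. This is an illustration on simulated stand-in data, not the actual RNA-seq matrix; try_kmeans, grid, and results are hypothetical names, and the grid of seeds and iter.max values is arbitrary.

```r
# Stand-in data; replace with the real matrix.
set.seed(42)
my_data <- matrix(rnorm(300 * 4), nrow = 300)
k <- 5

# Run kmeans once and record whether it raised any warning.
try_kmeans <- function(x, k, seed, iter.max, algorithm = "Hartigan-Wong") {
  set.seed(seed)
  warned <- FALSE
  fit <- withCallingHandlers(
    kmeans(x, centers = k, iter.max = iter.max, algorithm = algorithm),
    warning = function(w) {
      warned <<- TRUE                 # remember that a warning occurred
      invokeRestart("muffleWarning")  # keep going without printing it
    }
  )
  data.frame(seed = seed, iter.max = iter.max, algorithm = algorithm,
             warned = warned, tot.withinss = fit$tot.withinss)
}

# Try every combination of a few seeds and iter.max values.
grid <- expand.grid(seed = 1:3, iter.max = c(10, 40, 80))
results <- do.call(rbind, Map(function(s, im) try_kmeans(my_data, k, s, im),
                              grid$seed, grid$iter.max))
results
```

The warned column shows which seed and iter.max combinations triggered a warning, which makes the "random" pattern easier to see at a glance.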

I am not sure what else I can try to resolve the issue. Any suggestions would be appreciated!

Thank you!

rna-seq kmeans R • 427 views
modified 4 months ago by Chirag Parsania1.4k • written 4 months ago by Ark70
4 months ago by
Ark70
US
Ark70 wrote:

I have been working with this some more and have come to the conclusion that this particular warning is practically unavoidable when I push the number of clusters as high as I am. From what I have read, with an extreme number of clusters and very similar values among the data (many practically equivalent), the algorithm will have trouble converging in a reasonable amount of time. I think my initial hunch in the original post was correct.

For anyone with the same issue: my solution was simply to run kmeans several times for each desired number of clusters (I did 10 runs per k value) and to disregard every run that returns an "ifault" value of 4. That value marks a run in which the Quick-TRANSfer stage exceeded its step limit, i.e. the algorithm couldn't converge in what it considers a reasonable amount of time. Admittedly, kmeans takes longer and longer to run as k increases, and compounding that with repeated runs is not ideal. However, I have not found another way around this particular issue in the extreme cases where very large numbers of clusters need to be used. Using another algorithm (Lloyd, MacQueen, etc.) may help, but in my case I really needed to use the Hartigan-Wong algorithm.
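
A minimal sketch of that workaround, assuming simulated stand-in data in place of my_data; run_kmeans_filtered and kept are hypothetical names, and the defaults simply mirror the values mentioned in this thread.

```r
# Stand-in data; replace with the real matrix.
set.seed(1)
my_data <- matrix(rnorm(200 * 5), nrow = 200)

# Repeat kmeans n_runs times for one k and drop runs flagged with ifault 4.
run_kmeans_filtered <- function(x, k, n_runs = 10, iter.max = 40) {
  fits <- lapply(seq_len(n_runs), function(i) {
    # The warning is silenced here; the ifault field still records it.
    suppressWarnings(
      kmeans(x, centers = k, iter.max = iter.max,
             algorithm = "Hartigan-Wong")
    )
  })
  # ifault == 4: the Quick-TRANSfer stage exceeded its step limit.
  Filter(function(fit) fit$ifault != 4, fits)
}

kept <- run_kmeans_filtered(my_data, k = 5)
length(kept)  # number of runs that survived the filter
```

If every run for a given k is discarded, kept is an empty list, so downstream code should check length(kept) before using the results.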

Thanks to anyone who read this! I'm sure I'll have more questions for you all soon!

modified 4 months ago • written 4 months ago by Ark70

See also https://stackoverflow.com/questions/21382681/kmeans-quick-transfer-stage-steps-exceeded-maximum

written 4 months ago by Santosh Anand4.6k

Yes, thank you. That was basically what I assumed my issue was. The solution they proposed, using a different algorithm, was not applicable in my case. Also, I do actually want an extreme number of clusters, as I am running some tests that need both very low and very high numbers of clusters.

Thank you for your reply!

written 4 months ago by Ark70
4 months ago by
Chirag Parsania1.4k
University of Macau
Chirag Parsania1.4k wrote:

Hi,

You can try one of the recently published clustering methods, "Clust". Here is the paper. In this method, the user does not need to define the number of clusters. The method itself detects the number of clusters and also removes observations that do not contribute to the variability across the samples. The method also has an online version: the user just needs to upload the matrix as a .txt file.

written 4 months ago by Chirag Parsania1.4k

Thanks for the link! This looks interesting and I will definitely try it out on my data!

written 4 months ago by Ark70