Question

deepTools/Galaxy Optimize K-Means Clusters - Elbow Plot

0

4.3 years ago

Ready2Rapture ▴ 20

Hello,

I'm trying to optimize my deeptools computeMatrix output for k-means clustering, but cannot properly generate an elbow plot.

I've tried loading the matrix as described in this post and then attempting to plot wss in R via:

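# Skip the first line of the computeMatrix output, which holds metadata rather than data.
m = read.delim("computeMatrixOperations.mat.gz", skip=1, header=FALSE)
# Drop the first six columns (region annotation), keeping only the numeric signal values.
m = as.matrix(m[,-c(1:6)])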

set.seed(123)
# Compute and plot wss for k = 1 to k = 15.
k.max <- 15
wss <- sapply(1:k.max, function(k){kmeans(m, k, nstart=50,iter.max = 15 )$tot.withinss})

plot(1:k.max, wss, type="b", pch = 19, frame = FALSE, xlab="Number of clusters K",ylab="Total within-clusters sum of squares")

But this is too computationally heavy for a ~180,000 x 720 matrix (even on a c5n.18xlarge with 72 vCPUs and 192 GiB of memory, running for a few hours), and it may not even be the correct approach. I have some more ideas on how this might be computed (e.g. with the .tab output), but any help would be appreciated, since testing is computationally expensive and time-consuming.

I've also been experimenting with profileplyr, which is a nice library but is not designed specifically for optimizing k-means clusters.

ChIP-Seq deeptools galaxy sequencing • 1.6k views

1

4.3 years ago

Mensur Dlakic ★ 27k

If you don't have your heart set on k-means clustering (not sure why you would), MCL will easily handle this matrix even on a regular computer. It is available for R here. Instead of guessing/finding the number of clusters, you try different inflation values and see which clustering solution looks best.

score 2 · Accepted Answer · 2020-01-01

2

4.3 years ago

Devon Ryan 104k

The scipy k-means clustering algorithm (what I use in deepTools) is single-threaded, so save some cash and use a smaller node. K-means clustering as an algorithm becomes slower with increasing numbers of rows, so you could try just taking a random subset of 10 or 20 thousand rows. That will be much quicker. You can't actually visualize 180,000 rows in a plot anyway, since your monitor doesn't have that many pixels in any direction (graphics packages end up smoothing over points in such cases).
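The subsampling idea could be sketched in R roughly as follows, reusing the elbow-plot approach from the question. This is only an illustration under assumptions: the simulated matrix `m` is a hypothetical stand-in for the real computeMatrix data, and the names `n.sub` and `m.sub`, as well as the `nstart` and `iter.max` values, are arbitrary choices rather than anything prescribed by deepTools.

```r
set.seed(123)

# Hypothetical stand-in for the real data; replace with the matrix
# loaded from the computeMatrix output.
m <- matrix(rnorm(2000 * 20), nrow = 2000, ncol = 20)

# kmeans() stops on NA/NaN/Inf values, so keep only complete rows.
m <- m[complete.cases(m), , drop = FALSE]

# Take a random subset of rows (10,000 here, capped at the number of
# rows available so the toy matrix above also works).
n.sub <- min(10000, nrow(m))
m.sub <- m[sample(nrow(m), n.sub), , drop = FALSE]

# Total within-cluster sum of squares for k = 1 to k.max on the subset.
# nstart and iter.max are illustrative values, not recommendations.
k.max <- 15
wss <- sapply(1:k.max, function(k) {
  kmeans(m.sub, centers = k, nstart = 5, iter.max = 30)$tot.withinss
})

plot(1:k.max, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```

Since the subset is random, it may be worth repeating this with a few different seeds to check that the elbow lands in roughly the same place before settling on a value of k.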