deepTools/Galaxy Optimize K-Means Clusters - Elbow Plot
4.3 years ago

Hello,

I'm trying to find the optimal number of clusters for k-means clustering of my deepTools computeMatrix output, but cannot properly generate an elbow plot.

I've tried loading the matrix as described in this post and then attempting to plot wss in R via:

m = read.delim("computeMatrixOperations.mat.gz", skip=1, header=F)
m = as.matrix(m[,-c(1:6)])

set.seed(123)
# Compute and plot wss for k = 1 to k = 15.
k.max <- 15
wss <- sapply(1:k.max, function(k){kmeans(m, k, nstart=50,iter.max = 15 )$tot.withinss})

plot(1:k.max, wss, type="b", pch = 19, frame = FALSE, xlab="Number of clusters K",ylab="Total within-clusters sum of squares")

But this is too computationally heavy for a ~180,000 x 720 matrix (even using c5n.18xlarge: 72 vCPUs + 192 GiB memory for a few hours) and perhaps incorrect. I have some more ideas on how this might be computed (e.g. with the .tab output) but ANY help would be appreciated since testing is rather computationally and time intensive.

I've also been experimenting with profileplyr, which is a nice library but is not designed for optimizing k-means clusters.

ChIP-Seq deeptools galaxy sequencing • 1.6k views
4.3 years ago

The scipy k-means clustering algorithm (what I use in deepTools) is single-threaded, so save some cash and use a smaller node. K-means clustering as an algorithm becomes slower with increasing numbers of rows, so you could try just taking a random subset of 10 or 20 thousand rows. That will be much quicker. You can't actually visualize 180,000 rows in a plot anyway, since your monitor doesn't have that many pixels in any direction (graphics packages end up smoothing over points in such cases).
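
A minimal sketch of the subsampling idea, written in R to match the code in the question. The helper name `subsample_wss` and its defaults (10,000 rows, `nstart = 10`, `iter.max = 50`) are assumptions chosen for illustration, not something deepTools does. Rows with missing values are dropped first because `kmeans()` rejects them.

```r
# Hypothetical helper: total within-cluster sum of squares (WSS) for
# k = 1..k_max, computed on a random subset of rows instead of the full matrix.
subsample_wss <- function(m, n_rows = 10000, k_max = 15, seed = 123) {
  set.seed(seed)
  # kmeans() fails on NA/NaN, so keep only complete rows.
  m <- m[stats::complete.cases(m), , drop = FALSE]
  # Sample rows without replacement (or take all rows if the matrix is small).
  idx <- sample(nrow(m), min(n_rows, nrow(m)))
  sub <- m[idx, , drop = FALSE]
  sapply(1:k_max, function(k) {
    kmeans(sub, centers = k, nstart = 10, iter.max = 50)$tot.withinss
  })
}

# Toy stand-in for the computeMatrix output: two well-separated groups.
set.seed(1)
demo <- rbind(matrix(rnorm(200 * 5, mean = 0), ncol = 5),
              matrix(rnorm(200 * 5, mean = 5), ncol = 5))
wss <- subsample_wss(demo, n_rows = 300, k_max = 6)
plot(seq_along(wss), wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```

For the real data, `m` would be the matrix loaded in the question. Repeating the call with a few different `seed` values is a cheap way to check that the elbow does not depend on which rows were drawn.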

This worked well, thank you Devon.

4.3 years ago
Mensur Dlakic ★ 27k

If you don't have your heart set on k-means clustering (not sure why you would), MCL will easily handle this matrix even on a regular computer. It is available for R here. Instead of guessing/finding the number of clusters, you try different inflation values and see which clustering solution looks best.
