Question: Can I Perform K-Means Or Hierarchical Clustering Via R Across A Sun Grid Engine Computation Cluster?
9.8 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Is there an easy way to perform supervised and unsupervised clustering via R, where the calculation and memory requirements are shared across multiple nodes of a Sun Grid Engine computation cluster?

R clustering • 4.6k views

This may also be helpful: Which R Packages, If Any, Are Best For Parallel Computing?

written 9.8 years ago by Istvan Albert

Can you describe your data and use cases? That would help to understand why serial clustering is not sufficient.

written 9.8 years ago by Michael Dondrup
9.8 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

This is possibly more complex than it looks at first glance, and I don't have practical experience with such implementations, so this answer is only theoretical.

The problem breaks down into two steps:

  • parallelizing the sequential clustering algorithm, or designing a novel parallel algorithm.
  • having the parallel clustering algorithm run on a grid, for example using the snow package in R or MPI.
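For the second step, a worker list has to come from the scheduler. The following is only a minimal sketch of how that could look in R: it assumes the job runs inside an SGE parallel environment, that SGE exposes the granted hosts in the file named by the `PE_HOSTFILE` environment variable, and that each line of that file starts with a host name followed by a slot count. The function name `sge_worker_hosts` is made up for this illustration.

```r
# Sketch: turn the host file written by Sun Grid Engine for a parallel
# environment job into a vector of worker host names.
# Assumption: each line reads "hostname slots queue processor-range".
sge_worker_hosts <- function(hostfile = Sys.getenv("PE_HOSTFILE")) {
  if (!nzchar(hostfile) || !file.exists(hostfile)) {
    return("localhost")  # not running under SGE: fall back to one local worker
  }
  fields <- read.table(hostfile, header = FALSE, stringsAsFactors = FALSE)
  # Repeat each host name once per slot granted on that host.
  rep(fields[[1]], times = as.integer(fields[[2]]))
}

hosts <- sge_worker_hosts()

# The host vector could then be handed to a socket cluster
# (this needs password-less ssh between the nodes):
# library(parallel)
# cl <- makePSOCKcluster(hosts)
# ... parLapply(cl, ...) ...
# stopCluster(cl)
```

Outside of an SGE job the function simply returns "localhost", so the same script can be tried on a desktop first.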

The first step is essential. In particular, k-means should be easier to parallelize than hierarchical-clustering. k-means could be parallelized by simple data parallelism in each step. Agglomerative hierarchical clustering on the other hand needs access to the full distance matrix in each step and the distance matrix needs to be shared and updated between all compute nodes.
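To make the data-parallel idea for k-means concrete, here is a rough sketch using the `parallel` package that ships with R. It is not a tuned implementation: each worker assigns its own chunk of rows to the nearest center and returns per-cluster sums and counts, and the master combines them into new centers. The function names `chunk_stats` and `parallel_kmeans`, the fixed iteration count, and the toy data are all assumptions for illustration, and the sketch assumes there are at least as many rows as workers.

```r
library(parallel)

# Work done on one chunk of rows: assign each point to its nearest center,
# then return per-cluster coordinate sums and point counts.
chunk_stats <- function(chunk, centers) {
  k <- nrow(centers)
  # Squared Euclidean distance from every point to every center (n x k).
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(chunk, 2, centers[j, ])^2))
  d2 <- matrix(d2, nrow = nrow(chunk))
  label <- max.col(-d2, ties.method = "first")
  sums <- matrix(0, k, ncol(chunk))
  for (j in seq_len(k)) {
    sums[j, ] <- colSums(chunk[label == j, , drop = FALSE])
  }
  list(sums = sums, counts = tabulate(label, nbins = k))
}

parallel_kmeans <- function(cl, x, centers, iterations = 10) {
  # Split the rows into one consecutive chunk per worker.
  chunks <- lapply(clusterSplit(cl, seq_len(nrow(x))),
                   function(i) x[i, , drop = FALSE])
  for (it in seq_len(iterations)) {
    stats <- parLapply(cl, chunks, chunk_stats, centers = centers)
    sums <- Reduce(`+`, lapply(stats, `[[`, "sums"))
    counts <- Reduce(`+`, lapply(stats, `[[`, "counts"))
    keep <- counts > 0  # empty clusters keep their previous center
    centers[keep, ] <- sums[keep, , drop = FALSE] / counts[keep]
  }
  centers
}

# Toy data: two well-separated groups of 100 points in 2 dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

cl <- makeCluster(2)  # two local workers; on a grid this would be a host list
fit <- parallel_kmeans(cl, x, centers = x[c(1, 101), ])
stopCluster(cl)
fit
```

Only the small sums and counts travel back to the master in each iteration, which is why k-means parallelizes so much more easily than agglomerative clustering. Note that in this sketch the chunks are re-sent to the workers on every iteration; a real implementation would keep them resident on the workers.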

The second step may be implementable using the snow package and MPI. Sun Grid Engine should support MPI. The pvclust package is related to hierarchical clustering and uses the snow package. As far as I understand, the clustering itself is not carried out in parallel; instead, sequential hierarchical clustering is carried out 1000 times in parallel for bootstrapping.
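The "many sequential clusterings in parallel" pattern can be sketched with base R alone. This is not pvclust's actual code, just an illustration of the idea under simple assumptions: each replicate resamples the features with replacement, runs ordinary `hclust`, and cuts the tree into `k` groups. The function name `boot_hclust`, the toy data, and the choice of 200 replicates are made up for the example.

```r
library(parallel)

# One bootstrap replicate: resample the columns (features) with replacement,
# run sequential hierarchical clustering on the rows, and cut into k groups.
boot_hclust <- function(seed, x, k) {
  set.seed(seed)
  cols <- sample(ncol(x), replace = TRUE)
  tree <- hclust(dist(x[, cols, drop = FALSE]), method = "average")
  cutree(tree, k = k)
}

# Toy data: 100 items in two well-separated groups, 10 features each.
set.seed(42)
x <- rbind(matrix(rnorm(50 * 10, mean = 0), ncol = 10),
           matrix(rnorm(50 * 10, mean = 4), ncol = 10))

n_boot <- 200
cl <- makeCluster(2)  # two local workers; a grid would supply a host list
labels <- parSapply(cl, seq_len(n_boot), boot_hclust, x = x, k = 2)
stopCluster(cl)

# labels has one row per item and one column per replicate.
# Fraction of replicates in which items 1 and 2 share a cluster:
mean(labels[1, ] == labels[2, ])
```

Each replicate still needs the full data and its own distance matrix, so this approach speeds up the bootstrap but does nothing for the memory footprint of a single clustering run.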

With "revolution R" (see the link in Istvan's comment) you could benefit a bit from a multi-threaded math library, but that does not mean that the clustering functions are necessarily implemented as parallel algorithms.

Edit: In conclusion (AFAIK):

  • There is no out-of-the-box solution yet that combines parallel clustering, R and Sun Grid Engine.
  • The effort for programming and testing may not justify the expected gain in speed or memory efficiency, except for extremely large datasets.
  • There is no guarantee that a parallel implementation will be more efficient in memory or computation.
  • I wouldn't invest too much time into this without knowing the real use cases.

Edit: The Rgpu package provides implementations of some statistical algorithms using CUDA on the GPU (I know, not exactly what you were looking for, but it could provide a significant speedup if you have an Nvidia graphics card). It provides functions like gpuDist and gpuHclust. I will give this package a try on a Mac. This option could be limited by the available graphics memory; I guess the distance matrix has to fit into it.
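To get a feeling for that memory limit, a back-of-the-envelope calculation helps. The sketch below assumes only the lower triangle of the distance matrix is stored, with 8 bytes per value as for R's own `dist` objects; a GPU implementation may well use 4-byte floats, in which case pass `bytes_per_value = 4`. The helper name `dist_bytes` is made up for this example.

```r
# Bytes needed for the lower triangle of an n x n distance matrix.
# Assumption: one stored value per pair of items, 8-byte doubles by default.
dist_bytes <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value
}

n <- c(1e4, 5e4, 1e5)
data.frame(n = n, GiB = round(dist_bytes(n) / 2^30, 2))
```

The requirement grows quadratically with the number of items, so doubling the dataset roughly quadruples the memory needed for the distance matrix.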

High-Performance computing with R: http://cran.r-project.org/web/views/HighPerformanceComputing.html

Here are some links to some papers I found for "parallel clustering":

About parallel hierarchical clustering:

A parallel k-means implementation : http://www.eecs.northwestern.edu/~wkliao/Kmeans/index.html

9.8 years ago by
Marcin Cieslik520 wrote:

I would not restrict myself to R; I would go for Mahout.

Your technology stack could be:

SGE + Hadoop (map-reduce) + Mahout (parallel machine learning)

Hadoop on SGE: http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge http://blogs.sun.com/templedf/entry/welcome_sun_grid_engine_6

You would not have to implement any parallel algorithms, but rather stitch the components together and configure them. This would give you the flexibility of trying different algorithms on your data.


Can you give some code examples for how to do cluster analysis with Mahout? It looks interesting because of its connection to Lucene, but there seems to be almost no documentation. It's probably too early, and too difficult, to use this solution for this use case.

written 9.8 years ago by Michael Dondrup
9.6 years ago by
D. Puthier320
France/Marseille/Inserm
D. Puthier320 wrote:

Did you try the amap package (available on CRAN)? It is a very simple solution for hierarchical clustering.

library(amap)
# x: numeric matrix with one row per item to cluster
nb <- 20  # number of parallel workers passed to nbproc
h <- hcluster(x, method = "pearson", nbproc = nb)

Regards


Yep, and it works conveniently for much larger datasets than hclust.

written 9.6 years ago by Michael Dondrup