Cluster Big Data In R And Is Sampling Relevant?

I'm new to data science and have a problem finding clusters in a data set with 200,000 rows and 50 columns in R.

Since the data have both numeric and nominal variables, methods like k-means, which rely on the Euclidean distance, don't seem like an appropriate choice. So I turned to PAM, agnes and hclust, which accept a distance matrix as input.
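To make the setup concrete, here is a toy example of the pipeline I have in mind (the data frame and column names are made up; only the cluster package is assumed). daisy() with Gower dissimilarity plus pam() handle mixed types fine, so the problem is purely one of scale:

```r
## Toy illustration with made-up mixed-type data: Gower dissimilarity
## via daisy(), then PAM on the resulting dissimilarity object.
library(cluster)

set.seed(1)
toy <- data.frame(
  age    = rnorm(100, 40, 10),                                # numeric
  income = rlnorm(100, 10, 0.5),                              # numeric
  region = factor(sample(c("N", "S", "E", "W"), 100, TRUE)),  # nominal
  owner  = factor(sample(c("yes", "no"), 100, TRUE))          # nominal
)

d   <- daisy(toy, metric = "gower")  # handles mixed numeric/nominal columns
fit <- pam(d, k = 3, diss = TRUE)    # PAM directly on the dissimilarities
table(fit$clustering)
```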

The daisy function can handle mixed-type data (it uses Gower's coefficient), but the distance matrix is just too big: 200,000 × 200,000 entries is far more than 2^31-1, the vector length limit before R 3.0.0.

The new R 3.0.0, released yesterday, supports long vectors with more than 2^31-1 elements. But a 200,000 × 200,000 matrix of doubles needs around 300 GB of contiguous RAM, far more than the 16 GB on my machine.
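A quick back-of-the-envelope check of the sizes involved (8 bytes per double):

```r
n <- 200000
n^2 * 8 / 2^30             # ~298 GiB for the full 200,000 x 200,000 matrix
n * (n - 1) / 2 * 8 / 2^30 # ~149 GiB even for just the lower triangle (a dist object)
```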

I have also read about parallel computing and the bigmemory package, but I am not sure they would help: daisy() generates the whole matrix in memory, so it would not fit anyway.

I also read this post: heatmaps in R with huge data

In that post Chris Miller mentioned that sampling should be considered for really large data.

So in my case, does it make sense to sample the data set, cluster the sample, and then infer the structure of the whole data set from that sample?
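Here is a rough sketch of what I am picturing (assuming only the cluster package; df, k = 5, the sample size of 2,000 and the chunk size of 5,000 are placeholders for my real data and settings): run PAM on the Gower dissimilarities of a random sample, then assign every remaining row to its nearest medoid.

```r
## Rough sketch of the sampling idea; 'df' stands for the real
## 200,000 x 50 mixed-type data frame.
library(cluster)

set.seed(42)
k     <- 5                                   # hypothetical number of clusters
idx   <- sample(nrow(df), 2000)              # cluster only a random sample
d.sub <- daisy(df[idx, ], metric = "gower")  # Gower handles mixed types
fit   <- pam(d.sub, k = k, diss = TRUE)

medoid.rows <- idx[fit$id.med]               # medoids as rows of the full data

## Assign each remaining row to its nearest medoid, working in chunks so
## every dissimilarity matrix stays small. (Gower ranges are recomputed
## per chunk here, so the assignment is only approximate.)
assign.chunk <- function(rows) {
  d <- as.matrix(daisy(df[c(medoid.rows, rows), ], metric = "gower"))
  apply(d[-(1:k), 1:k, drop = FALSE], 1, which.min)
}

rest     <- setdiff(seq_len(nrow(df)), idx)
chunks   <- split(rest, ceiling(seq_along(rest) / 5000))
clusters <- integer(nrow(df))
clusters[idx]  <- fit$clustering
clusters[rest] <- unlist(lapply(chunks, assign.chunk))
```

As far as I can tell, this is roughly what clara() does for purely numeric data (sample, cluster, assign the rest), but clara() only supports Euclidean/Manhattan distances, hence the manual version above.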

Could you please give me some suggestions? Thank you!

About my machine:

R version 3.0.0 (2013-04-03)

Platform: x86_64-w64-mingw32/x64 (64-bit)

OS: Windows 7 64bit

RAM: 16.0GB
