I'm new to data science and am having trouble finding clusters in a data set with 200,000 rows and 50 columns in R.
Since the data have both numeric and nominal variables, methods like k-means, which rely on the Euclidean distance, don't seem like an appropriate choice. So I turned to PAM, agnes, and hclust, which accept a distance matrix as input.
The daisy function can work on mixed-type data, but its distance matrix is just too big: even stored as a lower triangle, 200,000 × 199,999 / 2 ≈ 2 × 10^10 entries, which is far beyond 2^31 - 1 (the vector length limit before R 3.0.0).
The new R 3.0.0, released yesterday, supports long vectors with length greater than 2^31 - 1. But a full 200,000 × 200,000 double matrix requires roughly 300 GB of contiguous RAM (about 150 GB even for just the lower triangle), which is not possible on my 16 GB machine.
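For reference, the back-of-the-envelope arithmetic in plain base R (`n` is just the row count):

```r
n <- 2e5  # 200,000 rows

# Full n x n matrix of doubles (8 bytes each), in GiB:
n * n * 8 / 2^30             # ~298 GiB

# Lower triangle only, as stored by dist()/daisy():
n * (n - 1) / 2 * 8 / 2^30   # ~149 GiB

# Entry count of the lower triangle vs. the old vector length limit:
n * (n - 1) / 2 > 2^31 - 1   # TRUE
```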
I have read about parallel computing and the bigmemory package, but I'm not sure they would help: if I use daisy, it will still generate one big matrix that cannot fit in memory anyway.
I also read this post: heatmaps in R with huge data
In that post, Chris Miller mentions that with really large data we should consider sampling.
So in my case, would it make sense to draw a sample of the data set, cluster the sample, and then infer the structure of the whole data set from it?
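Something like the following sketch is what I have in mind, on a tiny fake data frame standing in for my real 200,000 × 50 data (`df`, its columns, and k = 3 are all made up for illustration):

```r
library(cluster)  # "recommended" package, ships with R; provides daisy() and pam()

## Toy stand-in for the real mixed numeric/nominal data.
set.seed(42)
df <- data.frame(
  num1 = rnorm(1000),
  num2 = runif(1000),
  nom  = factor(sample(letters[1:4], 1000, replace = TRUE))
)

## Step 1: cluster a random sample that fits in memory.
idx <- sample(nrow(df), 200)
d   <- daisy(df[idx, ], metric = "gower")  # Gower handles mixed types
fit <- pam(d, k = 3)

## Step 2: assign every row to the nearest medoid found on the sample.
medoids <- df[idx[fit$id.med], ]
nearest_medoid <- function(row) {
  ## NOTE: Gower rescales numeric columns by the range of the data it is
  ## given, so this per-row call is only an approximation of the
  ## dissimilarity used during fitting.
  which.min(as.matrix(daisy(rbind(row, medoids), metric = "gower"))[1, -1])
}
labels <- vapply(seq_len(nrow(df)),
                 function(i) nearest_medoid(df[i, ]),
                 integer(1))
```

But I'm unsure whether labels inferred this way are trustworthy for the rows that were never in the sample.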
Can you please give me some suggestions? Thank you!
About my machine:
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
OS: Windows 7 64bit
RAM: 16.0GB