
I'm new to data science and have a problem finding clusters in a data set with 200,000 rows and 50 columns in R.

Since the data have both numeric and nominal variables, methods like K-means, which use a Euclidean distance measure, don't seem like an appropriate choice. So I turned to PAM, agnes and hclust, which accept a distance matrix as input.

The daisy function can handle mixed-type data, but the distance matrix is just too big: 200,000 times 200,000 is much larger than 2^31-1 (the vector length limit before R 3.0.0).

R 3.0.0, released yesterday, supports long vectors with length greater than 2^31-1. But a 200,000 by 200,000 matrix of doubles requires contiguous RAM far larger than 16 GB, which is not possible on my machine.
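To put rough numbers on that, here is a back-of-the-envelope sketch in R. It assumes 8 bytes per double and ignores any overhead R adds on top; the variable names are just for illustration.

```r
n <- 200000              # number of rows to cluster
bytes_per_double <- 8    # assumption: distances are stored as doubles

# Full n x n matrix of doubles
full_bytes <- n^2 * bytes_per_double

# A dist object only stores the lower triangle: n * (n - 1) / 2 values
dist_bytes <- n * (n - 1) / 2 * bytes_per_double

# Convert to GiB (2^30 bytes)
full_gib <- full_bytes / 2^30
dist_gib <- dist_bytes / 2^30

# Both element counts exceed the pre-3.0.0 vector length limit
exceeds_limit <- c(full = n^2, lower_triangle = n * (n - 1) / 2) > .Machine$integer.max

print(round(c(full_gib = full_gib, dist_gib = dist_gib), 1))
print(exceeds_limit)
```

Either way the result is in the hundreds of GiB for the full matrix and well over a hundred GiB for the lower triangle alone, so long vectors do not help on a 16 GB machine.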

I read about parallel computing and the bigmemory package, but I'm not sure they are going to help: if I use daisy, it will generate a big matrix that cannot fit in memory anyway.

I also read this post: heatmaps in R with huge data

In that post, Chris Miller mentioned that we should consider sampling with really large data.

So in my case, does it make sense to take a sample of the data set, cluster the sample, and then infer the structure of the whole data set?
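A minimal sketch of what that sample-then-assign idea might look like in R, using daisy and pam from the cluster package. Everything about the data here is an assumption: the `customers` table, its columns and the choice of k = 2 are made up for illustration, and `gower_to_medoid` is a hypothetical helper that approximates the Gower distance for numeric and nominal columns only (no missing values, no ordinal variables).

```r
library(cluster)

set.seed(1)

# Toy mixed-type data standing in for the real 200,000 x 50 table (hypothetical).
n <- 2000
customers <- data.frame(
  age   = c(rnorm(n / 2, 30, 5),    rnorm(n / 2, 60, 5)),
  spend = c(rnorm(n / 2, 100, 20),  rnorm(n / 2, 400, 50)),
  plan  = factor(rep(c("basic", "premium"), each = n / 2))
)

# 1. Cluster a random sample whose distance matrix fits in memory.
idx     <- sample(n, 300)
smp     <- customers[idx, ]
d       <- daisy(smp, metric = "gower")
fit     <- pam(d, k = 2, diss = TRUE)
medoids <- smp[fit$id.med, ]          # one row per cluster medoid

# 2. Assign every row to its nearest medoid without building an n x n matrix.
# Hypothetical helper: Gower-style distance from each row of `data` to one medoid.
# Numeric columns use |x - m| / range, nominal columns use a 0/1 mismatch.
gower_to_medoid <- function(data, medoid, ranges) {
  total <- numeric(nrow(data))
  for (col in names(data)) {
    if (is.numeric(data[[col]])) {
      total <- total + abs(data[[col]] - medoid[[col]]) / ranges[[col]]
    } else {
      total <- total + (as.character(data[[col]]) != as.character(medoid[[col]]))
    }
  }
  total / ncol(data)
}

# Ranges come from the full data, so they differ slightly from the
# sample-based ranges daisy used in step 1 (an approximation).
num_cols <- names(customers)[sapply(customers, is.numeric)]
ranges   <- sapply(customers[num_cols], function(x) diff(range(x)))

# n x k matrix of distances, then pick the closest medoid for each row.
dist_to_med <- sapply(seq_len(nrow(medoids)), function(j) {
  gower_to_medoid(customers, medoids[j, ], ranges)
})
assignment <- apply(dist_to_med, 1, which.min)

print(table(assignment))
```

Memory use in step 2 is n times k rather than n times n, which is what makes this feasible. Whether the sample is representative enough is a separate question; repeating the procedure with several samples and comparing the medoids would be one way to check. The clara function in the same package follows a similar sampling strategy, but as far as I know it does not accept mixed-type data directly.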

Can you please give me some suggestions? Thank you!

About my machine:

R version 3.0.0 (2013-04-03)

Platform: x86_64-w64-mingw32/x64 (64-bit)

OS: Windows 7 64bit

RAM: 16.0GB

**25k**• written 6.0 years ago by jingz8804 •


Having so many rows often indicates that something is wrong with your approach, unless you are running some kind of customer analysis or working for Google or Amazon ;) You should give us more information about your setup so that we can give more specific guidance. As Sean wrote, the number of rows may be due to a missing filtering step. So what is the biological entity that gives you 200,000 rows? I can only speculate that it is SNPs, in which case clustering might not make sense at all, but an association test would. Also, according to my calculations you would need about 150 GB of RAM to hold the distance matrix (a formidable size, but such servers can be bought). There are possibly better alternatives than R for big data analysis, though. I remember an algorithm called BIRCH that is specifically designed for big data. Maybe you can find an implementation of it.

Ok, found a BIRCH package for R: http://cran.r-project.org/web/packages/birch/index.html (I haven't tried it).

Hi Michael, thanks for your reply. In fact, I am doing a customer analysis... BIRCH seems to support only numeric values, but I'll try to make some modifications if I can. Thanks again.

Hi jingz, thank you for stating the aim of your study honestly. Unfortunately, this means that I have to close this post as off topic, because this site is for questions on bioinformatics and computational biology, as stated in the FAQ, and your question has no relation to that. I hope you can get a good answer elsewhere, e.g. on http://stats.stackexchange.com/

Edit: someone was quicker than me with closing...
