Question: (Closed) Cluster Big Data In R And Is Sampling Relevant?
7.4 years ago, jingz8804 wrote:

I'm new to data science and have a problem finding clusters in a data set with 200,000 rows and 50 columns in R.

Since the data contain both numeric and nominal variables, methods like K-means, which rely on a Euclidean distance measure, don't seem like an appropriate choice. So I turned to PAM, agnes and hclust, which accept a distance matrix as input.

The daisy function can work on mixed-type data, but the distance matrix is just too big: 200,000 times 200,000 is much larger than 2^31-1 (the vector length limit before R 3.0.0).

The new R 3.0.0, released yesterday, supports long vectors with length greater than 2^31-1. But a double matrix of 200,000 by 200,000 requires a contiguous block of RAM far larger than 16 GB, which is not possible on my machine.
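To make the size argument concrete, here is a minimal arithmetic sketch. It is written in Python purely for illustration (the numbers are the same in R), and the helper name `distance_matrix_bytes` is hypothetical. It assumes 8-byte doubles and compares a full square matrix with a lower-triangle layout such as the one R's `dist` objects use.

```python
def distance_matrix_bytes(n_rows, bytes_per_value=8, full=True):
    """Bytes needed to store pairwise distances between n_rows observations."""
    if full:
        # Full square matrix: n * n entries.
        n_values = n_rows * n_rows
    else:
        # Lower triangle without the diagonal: n * (n - 1) / 2 entries.
        n_values = n_rows * (n_rows - 1) // 2
    return n_values * bytes_per_value


n = 200_000
full_bytes = distance_matrix_bytes(n, full=True)
tri_bytes = distance_matrix_bytes(n, full=False)

# The number of entries also exceeds the pre-3.0.0 vector length limit.
exceeds_old_limit = n * n > 2**31 - 1

print(f"full matrix:    {full_bytes / 1e9:.0f} GB")
print(f"lower triangle: {tri_bytes / 1e9:.0f} GB ({tri_bytes / 2**30:.0f} GiB)")
print(f"exceeds 2^31-1 entries: {exceeds_old_limit}")
```

Under these assumptions the full matrix needs about 320 GB and the lower triangle about 160 GB (roughly 149 GiB), so neither fits in 16 GB of RAM.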

I read about parallel computing and the bigmemory package, but I am not sure they are going to help: if I use daisy, it will generate a big matrix that cannot fit in memory anyway.

I also read this post: heatmaps in R with huge data

In that post, Chris Miller mentioned that we should consider sampling when the data are really large.

So in my case, is it relevant to use sampling on the data set, cluster on the sample and then infer the structure of the whole data set?
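One way to picture the sample-then-infer idea is sketched below. This is only an illustration under stated assumptions, not an established recipe. It is written in plain Python rather than R. The Gower-style dissimilarity is a simplified stand-in for what daisy computes on mixed data. The k-medoids routine is a basic alternating scheme with farthest-first initialisation, not the full PAM algorithm. All function names and the toy data are hypothetical.

```python
import random


def gower(a, b, ranges):
    """Gower-style dissimilarity between two mixed-type rows.

    ranges maps a column index to (max - min) for numeric columns;
    all other columns are treated as nominal (0 if equal, else 1).
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in ranges:
            total += abs(x - y) / ranges[j] if ranges[j] > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def k_medoids(rows, k, dist, n_iter=20):
    """Basic alternating k-medoids on a small sample (not full PAM)."""
    # Farthest-first initialisation: start at row 0, then repeatedly add
    # the row that is farthest from all medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        far = max(range(len(rows)),
                  key=lambda i: min(dist(rows[i], rows[m]) for m in medoids))
        medoids.append(far)
    for _ in range(n_iter):
        # Assignment step: each row joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, row in enumerate(rows):
            nearest = min(medoids, key=lambda m: dist(row, rows[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total distance in its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep a medoid whose cluster is empty
                continue
            best = min(members,
                       key=lambda c: sum(dist(rows[c], rows[i]) for i in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [rows[m] for m in medoids]


def assign(rows, medoid_rows, dist):
    """Label every row with the index of its nearest medoid."""
    return [min(range(len(medoid_rows)), key=lambda c: dist(row, medoid_rows[c]))
            for row in rows]


# Toy mixed-type data (hypothetical): one numeric and one nominal column.
rnd = random.Random(42)
data = ([(rnd.gauss(0, 1), "web") for _ in range(500)]
        + [(rnd.gauss(10, 1), "store") for _ in range(500)])

# Numeric ranges come from one pass over the full data; no n-by-n matrix needed.
values = [row[0] for row in data]
ranges = {0: max(values) - min(values)}


def dist(a, b):
    return gower(a, b, ranges)


sample = rnd.sample(data, 100)          # cluster only a 10% sample
medoid_rows = k_medoids(sample, 2, dist)
labels = assign(data, medoid_rows, dist)  # extend to all rows: n * k distances
print(labels[:3], labels[-3:])
```

The point of the design is that the expensive pairwise step only touches the sample, while labelling the full data set costs one distance per row per medoid. Whether the sample's structure carries over to the whole data set still has to be checked, for example by repeating the procedure on several independent samples.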

Can you please give me some suggestions? Thank you!

About my machine:

R version 3.0.0 (2013-04-03)

Platform: x86_64-w64-mingw32/x64 (64-bit)

OS: Windows 7 64bit

RAM: 16.0GB

R clustering • 3.7k views

Having so many rows often indicates that something is wrong with your approach (unless you are running some kind of customer analysis). You should give us more information about your setup so that we can give more specific guidance; as Sean wrote, the number of rows may be due to a missing filtering step. IMHO, in most cases where you seem to have this many items for cluster analysis, either there is something wrong with your approach or you are working for Google or Amazon ;) So what is the biological entity that gives you 200,000 rows? I can only speculate that it is SNPs, in which case clustering might not make sense at all, but an association test would. Also, according to my calculations you would need 150 GB of RAM (a formidable size, but such servers can be bought) to hold the distance matrix. There are possibly better alternatives than R for big data analysis, though. I remember an algorithm called BIRCH, which is specifically designed for big data. Maybe you can find an implementation of it.

modified 7.4 years ago • written 7.4 years ago by Michael Dondrup

OK, I found a BIRCH package for R; I haven't tried it.

written 7.4 years ago by Michael Dondrup

Hi Michael, thanks for your reply. In fact, I am doing a customer analysis... BIRCH seems to support only numeric values, but I'll try to make some modifications if I can. Thanks again.

written 7.4 years ago by jingz8804

Hi jingz, thank you for stating the aim of your study honestly. Unfortunately, this means that I have to close this post as off topic, because this site is for questions on bioinformatics and computational biology, as stated in the FAQ, and your question has no relation to that. I hope you can get a good answer elsewhere, e.g. on edit: someone was quicker than me with closing...

modified 7.4 years ago • written 7.4 years ago by Michael Dondrup
7.4 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

You are asking a question on Biostars, but there is not really a description of the biological problem you are working with. You will want to be clear about what the goals of the clustering are. It may be that the feature-dimension clusters are not so important and that the sample-dimension clustering is what you are really after. It may also be that an unsupervised approach (clustering) is not that relevant.

To move forward with your question, I would suggest reducing the number of input features (your 200k rows) to something more intelligible and manageable. Often, this can be done without apparent loss of information, since many biological assays (copy number, gene expression, methylation, etc.) have the property that a proportion (sometimes quite large) of the measured features show no variability; you can probably safely remove those invariant features. Also, there are typically strong correlations among features, so you may be able to subsample the data and still retain the global features of the clustering. Moving further along, you could consider performing dimensionality reduction (PCA, NMF, etc.) first and then clustering on the resulting "pseudo-features". You could then use the features within the pseudo-features (the loadings in the PCA case, for example) to guide your biological understanding.
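The first suggestion, dropping invariant features before clustering, can be sketched as follows. This is a minimal illustration in Python rather than R. The function name, the threshold and the toy values are all hypothetical, and a sensible cutoff depends on the assay and its noise level. The subsampling and PCA steps are not shown.

```python
from statistics import pvariance


def drop_invariant_features(features, min_variance=1e-8):
    """Keep only features whose variance across samples exceeds min_variance.

    features maps a feature name to its list of measurements, one per sample.
    """
    return {name: values
            for name, values in features.items()
            if pvariance(values) > min_variance}


# Toy data (hypothetical): three features measured on four samples.
features = {
    "f1": [1.0, 1.0, 1.0, 1.0],   # constant, carries no clustering signal
    "f2": [0.1, 2.3, 1.7, 0.4],   # varies across samples
    "f3": [5.0, 5.0, 5.0, 5.0],   # constant
}

kept = drop_invariant_features(features)
print(sorted(kept))
```

Only `f2` survives here. On real data the same filter can shrink the feature set considerably before any distance matrix is computed.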

The thread is closed. No new answers may be added.

