Question: Heatmaps In R With Huge Data
Tonig wrote, 7.9 years ago:

Dear list,

I'm trying to cluster data and then plot a heatmap using heatmap.2 from ggplot. The script works perfectly with matrices of up to 30,000 rows; the problem is that I'm now using matrices of up to 500,000 rows (data_sel), and when I try to cluster I get this error:

heatmap.2(as.matrix(data_sel), col=greenred(10), trace="none", cexRow=0.3, cexCol=0.3, ColSideColors=fenot.colour, margins=c(20,1), labCol="", labRow="", distfun=function(x) dist(x, method="manhattan"))
Error in vector("double", length) : vector size specified is too large

Is there any approach in R to plotting heatmaps with data this big?

Thanks in advance

Tags: R, heatmap, illumina

What is the biological system (SNP, copy number data, or exon arrays)? What questions is a heatmap that you cannot read likely to answer?

written 7.9 years ago by Sean Davis

Are you using 32-bit or 64-bit R? I'm not sure whether this is the cause, but 64-bit R gives you a larger memory address space and should therefore allow analysis of larger datasets.

written 7.9 years ago by Steve Moss

In addition, I would say that attempting to brute-force cluster this number of data points shows there is room for improvement in pre-processing. You simply don't want to do it like this (and even if you did, you'll find you can't).

written 7.9 years ago by Michael Dondrup
Chris Miller (Washington University in St. Louis, MO) wrote, 7.9 years ago:

With really large data, even a big-ass server may not be enough. The best advice I can offer is to try reducing the size of your data set. Doing this will require thinking intelligently about what it is you're trying to represent. (Related: are you really going to be able to pick out 500k individual points on a little graph the size of your computer screen?)

If, for example, you're plotting expression data from an exon array, maybe you could merge the data and plot per gene instead of per exon. If you're looking for patterns of differential expression, maybe you could plot just the top 1,000 most differentially expressed (see the sketch below), and so on.
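
For instance, a minimal sketch of the second idea (untested; it assumes data_sel from the question is a numeric matrix and uses per-row variance as a rough stand-in for "most differentially expressed"):

# rank rows by variance across samples and keep the top 1,000
row_var <- apply(as.matrix(data_sel), 1, var)
top <- order(row_var, decreasing = TRUE)[1:1000]
heatmap.2(as.matrix(data_sel)[top, ], col = greenred(10), trace = "none",
          distfun = function(x) dist(x, method = "manhattan"))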

Another hilarious calculation showing Chris is correct: if you wanted to print such a heatmap to visualize it and assigned only 2 mm per data row, the printout would be 1 km long (500,000 rows x 2 mm = 1,000,000 mm = 1 km)!

written 7.9 years ago by Michael Dondrup

Pre-processing and sampling parts of large data sets are the key to so-called "big data". People often seem to think that there must be a clever software solution, or that their software somehow has failings, when the simple answer is: if it won't fit easily in RAM, make it smaller.
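
For instance, a rough down-sampling sketch in R (untested; assumes data_sel from the question, with 10,000 rows as an arbitrary target size):

set.seed(1)                                # make the sample reproducible
keep <- sample(nrow(data_sel), 10000)      # 10,000 random row indices
data_small <- as.matrix(data_sel)[keep, ]  # a matrix that fits comfortably in RAM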

written 7.9 years ago by Neilfws

I am sometimes guilty of trying to analyse everything, wasting time in the process, whereas a random sample would suffice. However, sometimes everything is required - any chance of analysing the data on a chromosome by chromosome basis?
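
If a per-chromosome split is feasible, a minimal sketch (chrom here is a hypothetical vector giving the chromosome of each row of data_sel; it is not in the original post):

rows_by_chrom <- split(seq_len(nrow(data_sel)), chrom)  # row indices per chromosome
for (chr in names(rows_by_chrom)) {
  rows <- rows_by_chrom[[chr]]
  heatmap.2(as.matrix(data_sel)[rows, ], main = chr, col = greenred(10),
            trace = "none", distfun = function(x) dist(x, method = "manhattan"))
}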

written 7.9 years ago by Ian
Michael Dondrup (Bergen, Norway) wrote, 7.9 years ago:

It simply doesn't work unless you have around ~930 GB plus overhead (almost 1 terabyte!) of main (free addressable) RAM to hold your distance matrix! Maybe twice or half that much depending on storage mode.
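
The back-of-the-envelope arithmetic, for anyone who wants to check it (assuming a dense, double-precision dist object):

n <- 500000
n * (n - 1) / 2             # ~1.25e11 pairwise distances
n * (n - 1) / 2 * 8 / 2^30  # times 8 bytes per double: roughly 931 GiB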

Steve Lianoglou (US) wrote, 7.9 years ago:

As an alternative to the "you can't do it" responses you've gotten, I'd like to point out how you can -- not with R, though.

The folks creating GraphLab are working on ways to do machine learning on big data. The current incarnation is focused on running their algorithms on a "big-ass-server" type of environment. They've implemented several clustering algorithms in their graphlab clustering library.

You'll have to dump your data matrix into a text format that graphlab can read, like the Matrix Market format. You can then try any number of their clustering algorithms to see what's cooking.
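
A minimal sketch of the export step in R (assumes the Matrix package; whether GraphLab accepts this exact layout is something to check against their documentation):

library(Matrix)                                  # provides writeMM()
m <- Matrix(as.matrix(data_sel), sparse = TRUE)  # coerce to a sparse Matrix
writeMM(m, file = "data_sel.mtx")                # MatrixMarket text file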

So ... you can try that, though as other folks are suggesting, you might not get anything useful out of it; I guess that's for you to decide.


Many thanks Steve, I'll try it!

written 7.9 years ago by Tonig
Ryan Dale (Bethesda, MD) wrote, 7.9 years ago:

A possible alternative: mini-batch k-means as implemented in the scikits.learn package (Python, now known as scikit-learn) is surprisingly fast for large datasets. This doesn't solve memory limitations and you still have to choose k, but it could be useful for getting a feel for the structure of your data and to help inform the reduction approaches suggested by @Chris Miller.

Jeremy Leipzig (Philadelphia, PA) wrote, 7.9 years ago:

heatmap.2 is part of the gplots package, not ggplot2. That raises the question: if you want to try rendering this giant heatmap by hand in ggplot2, it might be trivial if you are using raw data as opposed to some ExpressionSet object. ggplot2 wants a "melted" data frame to render heatmaps.
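
A minimal sketch of that route (untested; assumes data_sel has already been reduced to something plottable, and uses the reshape2 and ggplot2 packages):

library(reshape2)
library(ggplot2)
melted <- melt(as.matrix(data_sel))  # columns: Var1 (row), Var2 (column), value
ggplot(melted, aes(Var2, Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "green", mid = "black", high = "red")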

Google tells me this is a memory limitation, so another alternative is to try running this on a Big-Ass Server™ or a large Bioconductor-branded Amazon EC2 instance.


Sorry, I meant gplots.

written 7.9 years ago by Tonig
User 6762 wrote, 7.9 years ago:

You can generate one using the corrplot package.

Steve Moss (United Kingdom) wrote, 7.9 years ago:

I think this may be a memory issue? Are you using 32-bit or 64-bit R and on what platform (Windows or Linux)? How much memory do you have?

See the R documentation on memory limits (help("Memory-limits") in R).

I would try running the same analyses on a 64-bit build of R on a 64-bit Linux system (if you aren't already). Consider using ulimit to limit other applications, shut down and start from a fresh system, and kill any unneeded apps.
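
A quick way to check which build you are running from inside R (standard base-R values, nothing specific to this problem):

.Machine$sizeof.pointer  # 8 on a 64-bit build of R, 4 on 32-bit
R.version$arch           # e.g. "x86_64" vs "i686"
sessionInfo()            # platform, OS and package details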

It may be that you are just reaching the limits of the vector size and have to repartition your data and try again, as Chris suggests!


I am on Linux with 32-bit R; I have 16 nodes and 64 GB of RAM.

written 7.9 years ago by Tonig

The maximum addressable space on a 32-bit system is 2^32, or 4 GB of RAM! On a 64-bit system this increases to 2^64, or about 16 exabytes! What is your Linux build (uname -a)? I'd certainly install 64-bit R if you're playing with big data sets; if you run 32-bit you won't be able to access all your memory!

written 7.9 years ago by Steve Moss