Question

Heatmaps In R With Huge Data

14

12.3 years ago

Tonig ▴ 440

Dear list,

I'm trying to cluster data and then plot a heatmap using heatmap.2 from ggplot. The script works perfectly with matrices of up to 30,000 rows. The problem is that I'm using matrices of up to 500,000 rows (data_sel), and when I try to cluster I get this error:

heatmap.2(as.matrix(data_sel),col=greenred(10), trace="none",cexRow=0.3, cexCol=0.3,  ColSideColors=fenot.colour, margins=c(20,1), labCol="", labRow="",distfun=function(x) dist(x,method="manhattan"))
Error in vector("double", length) : vector size specified is too large

Is there any approximate approach in R for plotting heatmaps with data this big?

Thanks in advance

r heatmap illumina • 35k views

ADD COMMENT • link updated 2.5 years ago by Zhilong Jia ★ 2.2k • written 12.3 years ago by Tonig ▴ 440

0

What is the biological system (SNP, copy number data, or exon arrays)? What questions is a heatmap that you cannot read likely to answer?

ADD REPLY • link 12.3 years ago by Sean Davis 26k

0

Are you using 32-bit or 64-bit R? I'm not sure if this will have anything to do with it, but 64-bit R should give you a larger memory address space, and therefore allow analysis of larger datasets.

ADD REPLY • link 12.3 years ago by Steve Moss 2.3k

0

In addition, I would say that attempting brute-force clustering of this number of data points shows there is room for improvement in pre-processing. You simply don't want to do it like this (or if you do, you will see that you can't).

ADD REPLY • link 12.3 years ago by Michael 54k

5

12.3 years ago

Steve Lianoglou 5.2k

As an alternative to the "you can't do it" responses you've gotten, I'd like to point out how you can -- not with R, though.

The folks creating GraphLab are working on ways to do machine learning on big data. The current incarnation is focused on running their algorithms on a "big-ass-server" type of environment. They've implemented several clustering algorithms in their graphlab clustering library.

You'll have to dump your data matrix into a text format that graphlab can read, like the Matrix Market format. You can then try any number of their clustering algorithms to see what's cooking.
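As a rough illustration of the "dump your matrix to text" step, here is a minimal sketch that writes a small dense matrix in the Matrix Market "array" layout (header line, dimensions line, then one value per line in column-major order). The function name and toy matrix are made up for the example, and I have not checked which Matrix Market variant the GraphLab tools actually expect (many tools want the sparse "coordinate" form instead), so treat this as a starting point only.

```python
import io


def write_matrix_market_array(matrix, handle):
    """Write a dense matrix (list of equal-length rows) in Matrix Market array format.

    Hypothetical helper for illustration; values are written column by column,
    which is the ordering the dense "array" format uses.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    handle.write("%%MatrixMarket matrix array real general\n")
    handle.write(f"{n_rows} {n_cols}\n")
    for j in range(n_cols):          # column-major: outer loop over columns
        for i in range(n_rows):
            handle.write(f"{matrix[i][j]!r}\n")


# Toy 2 x 3 matrix standing in for data_sel (assumption: plain numeric values).
toy = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]

buffer = io.StringIO()
write_matrix_market_array(toy, buffer)
mm_lines = buffer.getvalue().splitlines()
print("\n".join(mm_lines))
```

For a real 500,000-row matrix you would pass an open file handle instead of the in-memory buffer.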

So ... you can try that, but like other folks are suggesting, you might not get anything useful out of it, but I guess this is for you to decide.

ADD COMMENT • link 12.3 years ago by Steve Lianoglou 5.2k

0

many thanks Steve, i'll try it!

ADD REPLY • link 12.3 years ago by Tonig ▴ 440

3

12.3 years ago

Ryan Dale 5.0k

A possible alternative: mini-batch K-means as implemented in the scikits.learn package (Python) is surprisingly fast for large datasets. This doesn't solve memory limitations and you still have to choose k, but it could be useful for getting a feel for the structure of your data and for informing the reduction approaches suggested by @Chris Miller.

ADD COMMENT • link 12.3 years ago by Ryan Dale 5.0k

0

12.3 years ago

Jeremy Leipzig 22k

heatmap.2 is part of the gplots package, not ggplot2. That raises the question of whether you want to try rendering this giant heatmap by hand in ggplot2. It might be trivial if you are using raw data as opposed to some ExpressionSet object. ggplot2 wants a "melted data frame" to render heatmaps.
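For readers unsure what "melted" means here: it is the long format with one record per (row, column, value) cell. A minimal sketch of the idea follows, with hypothetical probe and sample names. This is not ggplot2 or reshape code, just the shape of the data those tools expect.

```python
def melt(matrix, row_names, col_names):
    """Flatten a dense matrix into (row, column, value) records, one per cell."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(row_names)
        for j, c in enumerate(col_names)
    ]


# Toy data: names and values are invented for the example.
toy = [[0.5, 1.5],
       [2.5, 3.5]]
long_records = melt(toy, ["probe_1", "probe_2"], ["sample_A", "sample_B"])
for record in long_records:
    print(record)
```

Note that melting does not shrink anything: a matrix with 500,000 rows and m columns becomes 500,000 × m records, so the data still needs reducing first.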

Google tells me this is a memory limitation, so another alternative is to try running this on a Big-Ass Server™ or a large Bioconductor-branded Amazon EC2 instance.

ADD COMMENT • link 12.3 years ago by Jeremy Leipzig 22k

0

sorry, I meant gplots

ADD REPLY • link 12.3 years ago by Tonig ▴ 440

0

12.3 years ago

User 6762 • 0

You can generate it using corplot.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by User 6762 • 0

0

12.3 years ago

Steve Moss 2.3k

I think this may be a memory issue? Are you using 32-bit or 64-bit R and on what platform (Windows or Linux)? How much memory do you have?

See this link regarding Memory Limits in R

I would try running the same analyses on a 64-bit build of R on a 64-bit Linux system (if you aren't already). Consider using ulimit to limit other applications, shut down and start from a fresh system, and kill any unneeded apps.

It may be that you are just reaching the limits of the vector size and have to repartition your data and try again, as Chris suggests!

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by Steve Moss 2.3k

0

I am on Linux with 32-bit R. I have 16 nodes and 64 GB of RAM.

ADD REPLY • link 12.3 years ago by Tonig ▴ 440

0

The maximum addressable space on a 32-bit system is 2^32 bytes, or 4 GB of RAM! On a 64-bit system this increases to 2^64 bytes, or about 16 exabytes. What is your Linux build (uname -a)? I'd certainly install 64-bit R if you're playing with big data sets! If you run 32-bit R you won't be able to access all your memory!
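The address-space figures are easy to check. A small sketch of the arithmetic, using binary units (GiB, EiB) and the 64 GB of installed RAM mentioned earlier in the thread as an assumed input:

```python
GIB = 2 ** 30   # bytes in one gibibyte
EIB = 2 ** 60   # bytes in one exbibyte


def address_space_bytes(pointer_bits):
    """Bytes addressable with pointers of the given width."""
    return 2 ** pointer_bits


space_32 = address_space_bytes(32)
space_64 = address_space_bytes(64)

print(f"32-bit: {space_32 / GIB:.0f} GiB")
print(f"64-bit: {space_64 / EIB:.0f} EiB")

# Assumption taken from the thread: the machine has 64 GB of RAM installed.
installed_gib = 64
print(f"A 32-bit process can address at most "
      f"{space_32 / GIB / installed_gib:.1%} of {installed_gib} GiB")
```

In practice the usable limit for a 32-bit process is usually lower than the theoretical 4 GiB, since part of the address space is reserved by the operating system.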

ADD REPLY • link 12.3 years ago by Steve Moss 2.3k

0

2.5 years ago

Zhilong Jia ★ 2.2k

Using the kmeans_k parameter of pheatmap::pheatmap to cluster the rows first could let you visualise your data.

Note, kmeans_k: the number of kmeans clusters to make, if we want to aggregate the rows before drawing heatmap. If NA then the rows are not aggregated.
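To make the "aggregate the rows before drawing" idea concrete, here is a toy sketch of k-means that collapses many rows into k centroid rows, which are then what you would plot. This is plain Lloyd's algorithm written for illustration, not pheatmap's implementation. The evenly spaced initial centroids are a simplification I chose so the example is deterministic; real implementations use random or smarter initialisation.

```python
def kmeans_rows(rows, k, n_iter=20):
    """Plain Lloyd's k-means over rows; returns (centroids, labels).

    Assumption for the sketch: initial centroids are k evenly spaced rows.
    """
    step = len(rows) // k
    centroids = [list(rows[c * step]) for c in range(k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, row in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its member rows.
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy data: three "low" rows and three "high" rows (values invented).
toy = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
       [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
centroids, labels = kmeans_rows(toy, k=2)
print(labels)
print(centroids)
```

The heatmap then has k rows instead of 500,000, at the cost of having to pick k and losing within-cluster detail.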

ADD COMMENT • link 2.5 years ago by Zhilong Jia ★ 2.2k

score 13 · Accepted Answer · 2012-01-17

13

12.3 years ago

Chris Miller 22k

With really large data, even a big-ass server may not be enough. The best advice I can offer is to try reducing the size of your data set. Doing this will require intelligently thinking about what it is you're trying to represent. (related: are you really going to be able to pick out 500k individual points on a little graph the size of your computer screen?)

If, for example, you're plotting expression data from an exon array, maybe you could merge the data and plot per-gene instead of per-exon. If you're looking for patterns of differential expression, maybe you could plot just the top 1000 most differentially expressed. (and so on)
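A minimal sketch of the "plot just the top N" idea: rank rows by a score and keep the best ones. Here the score is plain variance across samples, which is only a stand-in I picked for the example. A real analysis would use a proper differential expression statistic, and the helper name is hypothetical.

```python
import heapq
import statistics


def top_variable_rows(matrix, k):
    """Indices of the k rows with the largest variance, highest first."""
    return heapq.nlargest(
        k,
        range(len(matrix)),
        key=lambda i: statistics.pvariance(matrix[i]),
    )


# Toy expression matrix: rows are features, columns are samples (values invented).
toy = [
    [1.0, 1.0, 1.0, 1.0],   # flat
    [0.0, 5.0, 0.0, 5.0],   # strongly varying
    [2.0, 2.1, 1.9, 2.0],   # nearly flat
    [0.0, 1.0, 2.0, 3.0],   # moderately varying
]
keep = top_variable_rows(toy, 2)
reduced = [toy[i] for i in keep]
print(keep)
```

With k = 1000 on the real data, the reduced matrix is small enough for any of the usual clustering and heatmap functions.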

ADD COMMENT • link 12.3 years ago by Chris Miller 22k

8

Another hilarious calculation showing Chris is correct: if you wanted to print such a heatmap to visualize it, and assign only 2 mm per data-row, the printout would be 1km long!
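The printout figure checks out, and the same arithmetic shows what happens on screen. A quick sketch, where the 1080-pixel screen height is my own assumption for illustration:

```python
MM_PER_KM = 1_000_000


def printout_km(n_rows, mm_per_row=2):
    """Length in km of a printout that gives each data row mm_per_row millimetres."""
    return n_rows * mm_per_row / MM_PER_KM


def rows_per_pixel(n_rows, pixel_height):
    """How many data rows must share each pixel row of the display."""
    return n_rows / pixel_height


n_rows = 500_000
length_km = printout_km(n_rows)
crowding = rows_per_pixel(n_rows, 1080)   # assumed screen height in pixels

print(f"Printout length: {length_km} km")
print(f"Rows per pixel on a 1080-pixel-high screen: {crowding:.0f}")
```

So on a typical display several hundred rows would be squeezed into every single pixel row.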

ADD REPLY • link 12.3 years ago by Michael 54k

5

Pre-processing and sampling parts of large data sets are the key to so-called "big data". People often seem to think that there must be a clever software solution, or that their software somehow has failings, when the simple answer is: if it won't fit easily in RAM, make it smaller.
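A sketch of the sampling route, assuming a uniform random sample of rows is acceptable for the question being asked. The sample size of 5,000 and the fixed seed are arbitrary choices for the example.

```python
import random


def sample_row_indices(n_rows, n_keep, seed=0):
    """Pick n_keep distinct row indices uniformly at random, returned sorted."""
    rng = random.Random(seed)   # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_rows), n_keep))


idx = sample_row_indices(500_000, 5_000)

# Pairwise distances needed for the sample, stored as 8-byte doubles.
n_pairs = len(idx) * (len(idx) - 1) // 2
sample_mib = n_pairs * 8 / 2 ** 20

print(f"{len(idx)} rows sampled, {n_pairs:,} distances, about {sample_mib:.0f} MiB")
```

The distance matrix for the sample is on the order of 100 MiB, which fits comfortably in memory even on a modest machine.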

ADD REPLY • link 12.3 years ago by Neilfws 49k

0

I am sometimes guilty of trying to analyse everything, wasting time in the process, whereas a random sample would suffice. However, sometimes everything is required - any chance of analysing the data on a chromosome by chromosome basis?

ADD REPLY • link 12.3 years ago by Ian 6.0k

score 8 · Accepted Answer · 2012-01-17

8

12.3 years ago

Michael 54k

It simply doesn't work, unless you have around 930 GB plus overhead (almost 1 terabyte!) of free, addressable main RAM to hold your distance matrix! Maybe twice or half that much, depending on storage mode.
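For anyone wondering where that figure comes from, here is a rough sketch of the arithmetic. It assumes the distance object keeps only the lower triangle, n(n-1)/2 values, as 8-byte doubles, and it reports binary gigabytes (GiB). The function names are mine.

```python
BYTES_PER_DOUBLE = 8
GIB = 2 ** 30


def dist_entries(n_rows):
    """Number of pairwise distances among n_rows rows: n * (n - 1) / 2."""
    return n_rows * (n_rows - 1) // 2


def dist_gib(n_rows):
    """Size in GiB of the lower-triangle distance vector (assumes 8-byte doubles)."""
    return dist_entries(n_rows) * BYTES_PER_DOUBLE / GIB


for n in (30_000, 500_000):
    print(f"{n:>7} rows: {dist_entries(n):,} distances, {dist_gib(n):,.1f} GiB")
```

For 500,000 rows that is about 125 billion distances, roughly 931 GiB. As far as I know, the element count alone also exceeds 2^31 - 1, the maximum vector length in R releases of that era, which would explain the "vector size specified is too large" error regardless of how much RAM is installed.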

ADD COMMENT • link 12.3 years ago by Michael 54k