Estimating required memory for WGCNA analysis
5.6 years ago
klkeysb ▴ 10

We are running WGCNA on ~90,000 genes in a single block with 48 threads and 192GB of memory using the blockwiseModules function.

WGCNA takes several dozen hours to compute the topological overlap matrix (TOM). We thought that 192GB would be sufficient for the analysis, but WGCNA runs out of memory right after saving the TOM, at the clustering step.

How can we estimate the memory required for blockwiseModules to complete successfully? We have included the output below:

 ..Working on block 1 .
    TOM calculation: adjacency..
    ..will use 48 parallel threads.
     Fraction of slow calculations: 0.000000
    ..connectivity..
    ..matrix multiplication (system BLAS)..
    ..normalization..
    ..done.
   ..saving TOM for block 1 into file output/100000/wgcna/TOM-block.1.RData
 ....clustering..
Error in fastcluster::hclust(as.dist(dissTom), method = "average") :
  Memory overflow.
Calls: blockwiseModules -> <Anonymous>
Execution halted
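
For reference, the call is essentially of this form (a minimal sketch, not our exact script; the soft-thresholding power and other argument values are illustrative):

    library(WGCNA)

    # datExpr: samples x genes expression matrix (~90,000 genes)
    net <- blockwiseModules(
      datExpr,
      power           = 6,                          # placeholder soft-thresholding power
      maxBlockSize    = 100000,                     # > gene count, forces a single block
      saveTOMs        = TRUE,
      saveTOMFileBase = "output/100000/wgcna/TOM",  # matches the path in the log above
      nThreads        = 48,
      verbose         = 3
    )
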
R RNA-Seq wgcna coexpression
5.6 years ago

I'm not so sure that 192GB is sufficient for a dataset of that size. Even if it were sufficient for just generating the correlation matrix, it would leave little room for other operations.

I think that you should request >200GB.
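
As a rough back-of-the-envelope (this assumes dense double-precision matrices; the exact number of copies blockwiseModules holds at once is my assumption, not something I have measured):

    n <- 90000                      # genes in the single block
    gb_per_matrix <- n^2 * 8 / 1e9  # one dense n x n matrix of doubles
    gb_per_matrix                   # ~64.8 GB per copy

    # The single-block pipeline keeps several such objects alive around the
    # same time (adjacency, TOM, the TOM dissimilarity handed to
    # fastcluster::hclust, ...), so three to four copies already pass 192 GB:
    3 * gb_per_matrix               # ~194 GB
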

Take a look at this article: “Blockwise” network analysis of large data
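
In practice, the blockwise approach just means letting blockwiseModules pre-cluster the genes into blocks via maxBlockSize and analyse each block separately. A minimal sketch (block size and power are illustrative; pick the largest block size your memory comfortably handles):

    library(WGCNA)

    net <- blockwiseModules(
      datExpr,
      power        = 6,      # your chosen soft-thresholding power
      maxBlockSize = 20000,  # genes are split into blocks of at most this size
      saveTOMs     = TRUE,
      nThreads     = 48,
      verbose      = 3
    )

    table(net$colors)        # module assignments across all blocks
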

Kevin


I have seen that article before. The operative line is here:

16 GB memory should be able to handle up to about 24,000 nodes; 32 GB should be enough (perhaps barely so) for 40,000 and so on.

Extrapolating linearly from those figures, 80k transcripts should need about 64GB of memory, so imagine our surprise when moving to ~90k transcripts suddenly overloads 192GB.

These heuristics don't seem reliable. Is there a better way to estimate the required memory? This would inform which node type we request before running WGCNA.


I believe the memory used will be system-dependent, and also dependent on your version of R (it's under constant development behind the scenes). You may consider reducing your dataset by, for example (a small filtering sketch follows the list):

  • eliminating genes with low variance
  • eliminating genes with nil or low expression
  • eliminating certain classes of genes (like pseudogenes, if they are in your dataset)
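
A minimal sketch of that kind of filtering (this assumes datExpr is a samples x genes matrix, as WGCNA expects; the cut-offs are arbitrary examples to adjust for your data):

    # keep genes expressed above a minimal level in a reasonable number of samples
    min_expr    <- 1
    min_samples <- 10
    datExprKept <- datExpr[, colSums(datExpr > min_expr) >= min_samples]

    # then keep, say, the 30,000 most variable of the remaining genes
    gene_vars <- apply(datExprKept, 2, var)
    n_keep    <- min(30000, ncol(datExprKept))
    datExprFiltered <- datExprKept[, order(gene_vars, decreasing = TRUE)[seq_len(n_keep)]]

    dim(datExprFiltered)    # samples x filtered genes

(Removing pseudogenes and other biotypes would additionally need a gene biotype annotation, which is not shown here.)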

Finally, you may try the Bioconductor support site ( https://support.bioconductor.org/t/Latest/ ), where the WGCNA developer is more active.

Thinking about it further, one could technically write the correlation matrix to disk as the calculations are under way and thereby save memory while it is being produced. You would later have to read the matrix back into your R session, but the peak memory required would be lower.

Edit, 29th August 2019: in fact, I have since learned that this is precisely how bigcor does it.
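
For what it's worth, a minimal sketch of that disk-backed approach using bigcor from the propagate package (argument names are from memory, so check ?bigcor; note the result is an on-disk ff matrix rather than an ordinary in-memory one):

    library(propagate)   # provides bigcor()

    # datExpr: samples x genes matrix; bigcor correlates the columns in chunks
    # and stores the full genes x genes correlation matrix on disk via ff
    bigCorMat <- bigcor(datExpr, fun = "cor", size = 2000, verbose = TRUE)

    # read a sub-block back into RAM only when you need it, e.g. the first 5 genes
    bigCorMat[1:5, 1:5]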
