Question: Estimating required memory for WGCNA analysis
klkeysb wrote:

We are running WGCNA on ~90,000 genes in a single block with 48 threads and 192GB of memory using the blockwiseModules function.

WGCNA takes several dozen hours to compute the topological overlap matrix. We thought that 192GB would be sufficient for the analysis, but WGCNA runs out of memory right after saving the TOM, at the clustering step.

How can we estimate the memory required for blockwiseModules to complete successfully? We have included the output below:

 ..Working on block 1 .
    TOM calculation: adjacency..
    ..will use 48 parallel threads.
     Fraction of slow calculations: 0.000000
    ..connectivity..
    ..matrix multiplication (system BLAS)..
    ..normalization..
    ..done.
   ..saving TOM for block 1 into file output/100000/wgcna/TOM-block.1.RData
 ....clustering..
Error in fastcluster::hclust(as.dist(dissTom), method = "average") :
  Memory overflow.
Calls: blockwiseModules -> <Anonymous>
Execution halted
coexpression rna-seq wgcna R
Kevin Blighe wrote:

I'm not so sure that 192GB is sufficient for a dataset of that size. Even if it is enough for generating the correlation matrix alone, it leaves little room for other operations.

I think that you should request >200GB.

Take a look at this article: “Blockwise” network analysis of large data
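
That article's approach essentially comes down to letting blockwiseModules split the genes into several smaller blocks instead of one ~90,000-gene block, via the maxBlockSize argument. A minimal sketch only; the power and other settings are placeholders, not recommendations for your data, and datExpr is assumed to be your samples x genes matrix:

    library(WGCNA)

    # maxBlockSize caps the number of genes per block; with ~25,000 genes per block
    # each TOM is far smaller than a single 90,000 x 90,000 matrix.
    # All other arguments below are illustrative only.
    net <- blockwiseModules(
      datExpr,                 # samples x genes expression matrix (assumed name)
      maxBlockSize = 25000,
      power        = 6,        # placeholder soft-thresholding power
      TOMType      = "signed",
      saveTOMs     = TRUE,
      nThreads     = 48
    )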

Kevin

klkeysb replied:

I have seen that article before. The operative line is here:

16 GB memory should be able to handle up to about 24,000 nodes; 32 GB should be enough (perhaps barely so) for 40,000 and so on.

By that calculation, 80k transcripts require 64GB of memory. Imagine our surprise when moving to ~90k transcripts suddenly overloads 192GB.

These heuristics don't seem reliable. Is there a better way to estimate the required memory? That would inform which node type we choose before running WGCNA.
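
A back-of-envelope check (our own sketch, not from the FAQ; dense_matrix_gib is just a throwaway helper): memory should scale with the size of a dense n x n double matrix, and blockwiseModules plus fastcluster::hclust hold several such matrices (adjacency, TOM, dissimilarity, working copies) at peak, so any multiplier on top of the per-matrix size is a guess.

    # GiB needed for one dense n x n matrix of doubles (8 bytes each)
    dense_matrix_gib <- function(n) n^2 * 8 / 1024^3

    dense_matrix_gib(24000)  # ~4.3 GiB per matrix
    dense_matrix_gib(40000)  # ~11.9 GiB per matrix
    dense_matrix_gib(90000)  # ~60.4 GiB per matrix -- three simultaneous copies already top 180 GiB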

Kevin Blighe replied:

I believe that the memory used will be system-dependent, and also dependent on your version of R (it's under constant development behind the scenes). You may consider reducing your dataset by, for example (see the sketch after the list):

  • eliminating genes with low variance
  • eliminating genes with nil or low expression
  • eliminating certain classes of genes (like pseudogenes, if they are in your dataset)
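
A minimal sketch of that kind of pre-filtering, assuming expr is a samples x genes matrix (as blockwiseModules expects) and with purely arbitrary cut-offs:

    # expr: numeric matrix of (log-scale) expression, samples in rows, genes in columns.
    # The thresholds below are placeholders -- tune them for your data.
    keep_expressed <- colMeans(expr) > 1                     # drop genes with near-zero expression
    gene_var       <- apply(expr, 2, var)
    keep_variable  <- gene_var > quantile(gene_var, 0.25)    # drop the least variable quartile

    expr_filtered  <- expr[, keep_expressed & keep_variable]
    dim(expr_filtered)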

Finally, you may try the Bioconductor support site ( https://support.bioconductor.org/t/Latest/ ), where the WGCNA developer is more active.

As I think about it, one could technically write the correlation matrix to disk as the calculations are under way, and in this way save on memory while it [the correlation matrix] is being produced. You would then read the matrix back into your R session later, but the peak memory required would be lower (a rough sketch of the idea is below). I've worked on ways around these issues, including memory and CPU usage in R (see R functions edited for parallel processing and https://github.com/kevinblighe/StatParallel ). I also have my own network analysis protocol ( Network plot from expression data in R using igraph ), but it's nowhere near as comprehensive as WGCNA yet.
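
A rough illustration of that idea (not WGCNA's own machinery; expr and the chunk size are assumptions): compute the correlation matrix in vertical strips and write each strip to disk, so only one strip of the full matrix is held in memory at a time.

    # expr: samples x genes matrix (assumed). Peak memory is roughly
    # n_genes x chunk_size doubles instead of the full n_genes x n_genes matrix.
    chunk_size <- 5000
    gene_idx   <- seq_len(ncol(expr))
    chunks     <- split(gene_idx, ceiling(gene_idx / chunk_size))

    for (i in seq_along(chunks)) {
      cols  <- chunks[[i]]
      strip <- cor(expr, expr[, cols])                        # n_genes x length(cols)
      saveRDS(strip, file = sprintf("cor_strip_%03d.rds", i))
      rm(strip); gc()
    }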
