Question

Huge matrix memory issue

0

Entering edit mode

7.3 years ago

dr_bantz ▴ 110

Hi everyone,

I have a WGS dataset consisting of ~2000 samples. I want to look at the co-occurrence of mutations (ie. how often a given pair of genes is mutated in the same strain). The way I'm doing this requires the creation of a 30,000 X 30,000 matrix to represent all pairwise comparisons of genes in the genome. I tried this in R but it just stalls.

Any suggestions? I'm most proficient and R and Python but I've been looking for an excuse to learn C/C++/Java for ages so this might be a good opportunity. Also, I have access to a compute cluster (which includes, amongst other things, some 96GB nodes).

Thanks in advance for your help!

genome • 1.6k views

ADD COMMENT • link 7.3 years ago by dr_bantz ▴ 110

0

Entering edit mode

What is it you're trying to do ? Do you want to compute this matrix or do you have it already and want to process it in some ways ? In the first instance, you could parallelize the computation of the matrix e.g. by row and reassemble the pieces you need in a full matrix at the end. If most of the elements are 0, you could try using a sparse matrix format (see for example the Matrix package in R). However, a 30 000 x 30 000 matrix should require "only" ~7 GB of RAM (assuming 8 bytes per cell).

ADD REPLY • link 7.3 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Each i,jth cell would contain the number of times gene i and gene j are mutated in the same train, so this would be have to be computed. The plan is to then make a second matrix with the Poisson probability for the value i,j based on the occurrence of mutation in gene i and j. Your parallelization idea might work nicely.

ADD REPLY • link 7.3 years ago by dr_bantz ▴ 110

0

Entering edit mode

R isn't anywhere near memory efficient. It's going to need more RAM than we can estimate. I suggest this task be computed as a list of genes, so each is a separate entity that does not need to be kept in memory. Call an optimized program like bedtools to calculate each row for you in one step.

ADD REPLY • link 7.3 years ago by karl.stamm 4.1k

0

Entering edit mode

At least on my Mac, R 3.3.2 appears to deal with matrices in an efficient way:

m <- matrix(numeric(2.3),nrow = 30000, ncol=30000) 
object.size(m)
7200000200 bytes

which is ~7 GB.

ADD REPLY • link 7.3 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

instead of a matrix can't you use a table with 3 columns LEFT TOP VALUE and then use the common linux tools ?

ADD REPLY • link 7.3 years ago by Pierre Lindenbaum 163k

score 0 · Answer 1 · 2017-04-10

0

Entering edit mode

7.3 years ago

Girolamo ▴ 140

I assume you have a sparse matrix you could represent the matrix as a combination of three arrays http://btechsmartclass.com/DS/U1_T14.html another possibility is to use shelve https://docs.python.org/2/library/shelve.html

ADD COMMENT • link 7.3 years ago by Girolamo ▴ 140

score 0 · Answer 2 · 2017-04-10

Shane Neph and I came up with a tool called byte-store to store encoded pairwise score data on disk in one byte per comparison, as well as offer random access querying, calculating byte offsets from sort-bed-sorted BED coordinates of interest. The approach in byte-store is useful for generate-once-lookup-often operations, like Monte Carlo sampling, etc.

The byte-store tool includes Pearson r and Spearman's rho correlation scoring functions, which map scores between -1 and 1 to a single byte value. If you want to learn some C, you could write a function pointer that implements a custom pairwise scoring function for your dataset, and you might remap the correlation score range to your dataset's particular score range through an easy rescale function (you specify your min-max and remap a score to this range). Your dataset would be a 900MB file (30k * 30k bytes).

score 0 · Answer 3 · 2017-04-12

Thanks for all the suggestions! In the end I went for a row-by-row appraoch, as suggested by Jean-Karim Heriche karl.stamm. Using this approach as it is would take 10 days to run on the cluster. However, the approach has some helpful advantages, one of which being that I am now able to embarrassingly parallelize the process by assigning subsets of rows to different processors on the cluster. Another trick I used was to remove the first gene of each row after the row has been analysed. This prevents reciprocal gene-gene combinations from being counted twice and effectively converts the matrix into an upper-triangular matrix. Because the rows lower down will be shorter, they will take less time to process, so I can assign more and more rows to a given processor as I iterate through the rows. By employing the above, I got the run time down to ~7 hours.