I have a WGS dataset consisting of ~2000 samples. I want to look at the co-occurrence of mutations (ie. how often a given pair of genes is mutated in the same strain). The way I'm doing this requires the creation of a 30,000 X 30,000 matrix to represent all pairwise comparisons of genes in the genome. I tried this in R but it just stalls.
Any suggestions? I'm most proficient and R and Python but I've been looking for an excuse to learn C/C++/Java for ages so this might be a good opportunity. Also, I have access to a compute cluster (which includes, amongst other things, some 96GB nodes).
Shane Neph and I came up with a tool called byte-store to store encoded pairwise score data on disk in one byte per comparison, as well as offer random access querying, calculating byte offsets from sort-bed-sorted BED coordinates of interest. The approach in byte-store is useful for generate-once-lookup-often operations, like Monte Carlo sampling, etc.
The byte-store tool includes Pearson r and Spearman's rho correlation scoring functions, which map scores between -1 and 1 to a single byte value. If you want to learn some C, you could write a function pointer that implements a custom pairwise scoring function for your dataset, and you might remap the correlation score range to your dataset's particular score range through an easy rescale function (you specify your min-max and remap a score to this range). Your dataset would be a 900MB file (30k * 30k bytes).
Thanks for all the suggestions! In the end I went for a row-by-row appraoch, as suggested by Jean-Karim Heriche karl.stamm. Using this approach as it is would take 10 days to run on the cluster. However, the approach has some helpful advantages, one of which being that I am now able to embarrassingly parallelize the process by assigning subsets of rows to different processors on the cluster. Another trick I used was to remove the first gene of each row after the row has been analysed. This prevents reciprocal gene-gene combinations from being counted twice and effectively converts the matrix into an upper-triangular matrix. Because the rows lower down will be shorter, they will take less time to process, so I can assign more and more rows to a given processor as I iterate through the rows. By employing the above, I got the run time down to ~7 hours.