Question: Parallelization to compute correlations
vincentpailler wrote:

Hello guys,

I want to compute correlations between OTUs (from metagenomic sequencing). I am working in R and would like to parallelize the computations because my data set is large.

What I want to do is perhaps simple. I work on a cluster with Slurm (I have many CPUs and a lot of memory). I would like to cut my data set into subsets (in increments of 500 OTUs, for example) and then compute the correlations between the OTUs of the different subsets.

The main goal is to fix a number of CPUs and an amount of RAM (for example, 32 CPUs with 260 GB of RAM), start the correlation computations on my subsets (500 OTUs, 1000 OTUs, 1500 OTUs... up to 5000 OTUs), and look at the scalability. I will then vary the number of CPUs and the amount of RAM.

In the end, I will be able to conclude: "with 32 CPUs and 260 GB of RAM, I can compute the correlations between X OTUs", "with 124 CPUs and 1 TB of RAM, I can compute the correlations between X OTUs", and so on.
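To make this concrete, here is roughly the kind of timing loop I have in mind (`compute_correlations()` is just a placeholder for the real correlation step, and the matrix is a toy stand-in for my OTU table):

```r
# placeholder for the real correlation step on a subset of OTUs
compute_correlations <- function(sub) cor(t(sub), method = 'spearman')

# toy stand-in for the OTU table: 2000 OTUs (rows) x 60 samples (columns)
set.seed(1)
data <- matrix(rexp(2000 * 60, rate = .1), nrow = 2000)

# time the correlation step on growing subsets of OTUs
timings <- lapply(seq(500, 2000, by = 500), function(n) {
  t <- system.time(compute_correlations(data[1:n, ]))
  data.frame(n_otus = n, elapsed_sec = unname(t['elapsed']))
})
do.call(rbind, timings)
```

The same loop would then be rerun under different Slurm allocations to compare elapsed times.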

My problem is that I don't really know how to organize my code, or how to use `mclapply()` (from the parallel package) to parallelize it.

```
library(optparse)

args <- commandArgs(trailingOnly = FALSE)

# get options
option_list = list(
  make_option(c("-s", "--subset"), type="character", default=NULL,
              help="Input file matrix")
);
opt_parser = OptionParser(usage = "Usage: %prog -f [FILE]", option_list=option_list,
                          description= "Description:")
opt = parse_args(opt_parser)

library(parallel)
library(doParallel)
library(compositional)

data = clr(data) # log-ratio transformation
data = data[1:opt$subset, ]

data = t(data)

cores <- 32
options('mc.cores' = cores)
registerDoParallel(cores)

parallel = mclapply(cor(data, method = 'spearman'))

save(parallel, file = sprintf("/home/vipailler/PROJET_M2/data/parallel_%s.RData", opt$subset))
```
modified 8 months ago by Kevin Blighe • written 8 months ago by vincentpailler

Vincent, I have edited my answer. Please read it again (at the top). Another user identified a better approach to do what you need.

Kevin Blighe wrote:

Edit July 18, 2019: forget my answer (below). It has been brought to my attention that fastcor can compute a large correlation matrix and return only the upper half of it. See here: A: R functions for parallel processing
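As a rough sketch of why that helps (this is plain base R, not the fastcor implementation from the linked answer): a correlation matrix is symmetric, so only the upper triangle needs to be kept, which roughly halves what has to be stored or written out:

```r
# toy data: 50 observations of 100 variables
set.seed(1)
otu_mat <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)
colnames(otu_mat) <- paste0('OTU', seq_len(ncol(otu_mat)))

cc <- cor(otu_mat, method = 'spearman')

# the matrix is symmetric (cor(x, y) == cor(y, x)), so keep unique pairs only
idx <- which(upper.tri(cc), arr.ind = TRUE)
pairs <- data.frame(
  otu1 = colnames(cc)[idx[, 1]],
  otu2 = colnames(cc)[idx[, 2]],
  rho  = cc[idx]
)
nrow(pairs)  # 100 * 99 / 2 = 4950 rows instead of 10000 matrix entries
```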

## ---------------------------------

Hi again Vincent,

Just for everyone else, this is a follow-up to a previous discussion about SNOW functionality and parallel implementation in R: Snowfall Parallelisation to compute Correlation Matrix

Vincent, I recommended `mclapply` in the previous thread because, as I understood it, you were using another package that then conducted the correlation analysis for you.

If you want to do the correlation yourself, `foreach` (which works on all platforms) is a better option. The following code works:

```
library(parallel)
library(doParallel) # also attaches foreach, which provides %dopar%
```

# Choose and set number of cores

```
cores <- detectCores() # grabs max available
cores <- 6             # ...or set the number of cores explicitly
```

Windows

```
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)

# when finished, release the workers and revert to sequential execution:
# stopCluster(cl)
# registerDoSEQ()
```

Mac / Linux / UNIX

```
options('mc.cores' = cores)
registerDoParallel(cores)
```

# Generate a random 2000 x 10000 data matrix (genes x samples)

```
mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664
```

# use `foreach` to parallelise the correlation

```
res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = TRUE) %dopar% {  # TRUE keeps rows in column order
  cor(mat[,i], mat, method = 'pearson')
}

res[1:10,1:10]
          Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000
```
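With thousands of columns, the one-column-per-task pattern above can create scheduling overhead; grouping columns into blocks (echoing Vincent's 500-OTU subsets) gives each worker more to do per dispatch. A sketch, with an arbitrary block size and a small toy matrix:

```r
library(doParallel)  # also attaches foreach
registerDoParallel(2)

set.seed(42)
mat2 <- matrix(rexp(2000 * 100, rate = .1), ncol = 100)
colnames(mat2) <- paste0('Sample', seq_len(ncol(mat2)))

block_size <- 25  # arbitrary; tune so each task is worth its dispatch cost
starts <- seq(1, ncol(mat2), by = block_size)

res_block <- foreach(s = starts,
                     .combine = rbind,
                     .inorder = TRUE) %dopar% {
  cols <- s:min(s + block_size - 1, ncol(mat2))
  # each task returns a block of rows of the final correlation matrix
  cor(mat2[, cols], mat2, method = 'pearson')
}
stopImplicitCluster()

dim(res_block)  # 100 x 100, identical to cor(mat2)
```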

# check some results to ensure that we have done it right

```
table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE
2500

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
TRUE
49729

table(cor(mat[,1], mat) == res[1,])
TRUE
10000

table(cor(mat[,6666], mat) == res[6666,])
TRUE
10000
```

I have done this now using R on Windows and it worked fine.

Kevin