Question: Snowfall Parallelisation to compute Correlation Matrix
vincentpailler wrote, 17 days ago:

Hi everyone,

I am trying to get to grips with parallelisation and the R package "snowfall".

My R code looks like this (propr is a package for computing correlations between counts from metagenomics data):

library(propr)

# read the OTU count table (OTUs in rows, samples in columns)
test <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv",
                   header = TRUE, row.names = 1, sep = "\t")

# transpose so that samples are rows and OTUs are columns
test <- t(test)

# compute the proportionality (rho) matrix between OTUs
propr <- propr(test, metric = "rho")

The correlation matrix I want to generate is huge (about 5 TB), which is why I am trying to use parallelisation to compute it. Computing power is not a problem: I work on a cluster with 12 TB of memory and many CPUs.

But I really don't understand how to fit my R code into a snowfall workflow. Would someone know how to do it?

Best, Vincent

Tags: parallel, snowfall
written 17 days ago by vincentpailler

Thanks for answering me.

How can I tell whether propr implements parallel functionality?

My code looks like this:

library(doParallel)
library(propr) 

cores <- 128
options('mc.cores' = cores)
registerDoParallel(cores)

read.table("truelength2.prok2.uniref2.rares.tsv", h=T, row.names=1, sep="\t")->data
mclapply(propr(t(data), metric="rho"))-> parallel
size=format(object.size(parallel), units="Gb")

Does it look fine or not?

Thanks

written 16 days ago by vincentpailler

No, that will not do anything. Also, why do you assign right with -> (just curious).

You should look up how mclapply() works. It functions in exactly the same way as lapply(). Just looking at your code, it may be something like:

mclapply(t(data), function(x) propr(x))

In pseudocode: apply the function propr() to t(data).
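To make the lapply()/mclapply() equivalence concrete, here is a minimal toy sketch of my own (the input chunks and core count are arbitrary placeholders, and mclapply() only forks on Mac/Linux):

library(parallel)

# split 1..1000 into 10 chunks of 100 values each
chunks <- split(1:1000, rep(1:10, each = 100))

# serial version
res_serial <- lapply(chunks, function(idx) sum(sqrt(idx)))

# parallel version: the same call, with mc.cores added
res_parallel <- mclapply(chunks, function(idx) sum(sqrt(idx)), mc.cores = 4)

identical(res_serial, res_parallel)  # TRUE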

I do not know anything about propr(), though. What is it doing to your data? You should study how it works internally, if possible, to see how it could be parallelised in different ways.

Sometimes one has to edit the internal code to enable parallelisation, like I did for clusGap: https://github.com/kevinblighe/clusGapKB

written 16 days ago by Kevin Blighe

I don't really know why I assign to the right; it's just a habit, I guess.

propr is a package that computes correlations between compositional data, so:

  • my data is a matrix of OTU abundances across different samples (about 900,000 OTUs and 40 samples)
  • propr() could help me compute the correlations between these OTUs, which gives a huge matrix ((900000*899999)/2 correlations; see the rough size check below)
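As a rough back-of-the-envelope check of my own (assuming the matrix is stored as 8-byte doubles), that is indeed several terabytes for the full dense matrix:

n <- 9e5              # number of OTUs
n * (n - 1) / 2       # unique correlations: ~4.05e11
n^2 * 8 / 2^40        # full n x n matrix of doubles: ~5.9 TiB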
written 16 days ago by vincentpailler

I do not see anything in the propr documentation that indicates that it is designed for parallel processing, so, even registering cores will have no effect.

I looked at the actual code of the function, too, and I can see that it is not doing anything related to parallel-processing. When you set it to do correlation, in fact, it just uses the cor() function from the base stats package.

Did you not try bigcor (from the propagate package)?

I cannot see everything that you are trying at your console, so, my suggestions may be irrelevant.

written 16 days ago by Kevin Blighe

Thanks for the answer

Actually, I tried bigcor and it looks fine. I only tested it on my computer with a reduced data set (correlations between 15,000 OTUs), since I can't work on the cluster at the moment.

I will try it on the cluster tomorrow with my main dataset (correlations between 900,000 OTUs), when I can allocate much more memory (up to 12 TB).

Is the number of CPUs allocated to compute the correlations between the 900,000 OTUs relevant?

written 16 days ago by vincentpailler

Using multiple CPU cores to calculate a correlation matrix can speed up generating the matrix; however, it will depend on how the correlation function is designed. I actually wrote a parallelised correlation function in 2016, but I was not happy with it, so I deleted it...

I think that bigcor can do it relatively quickly. It does so by computing the correlations in blocks and, I believe, saving these to disk in order to save memory.
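For reference, a minimal sketch of how bigcor might be called here, assuming it is bigcor() from the propagate package and reusing the transposed OTU matrix from the earlier code (the block size is an arbitrary choice of mine):

library(propagate)

# samples in rows, OTUs in columns, i.e. t(data) from the earlier code
mat <- t(data)

# compute the correlation matrix in blocks of 2000 columns at a time;
# the result is a disk-backed 'ff' matrix rather than an in-memory one
res <- bigcor(mat, fun = "cor", size = 2000, verbose = TRUE)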

Of course, generating the correlation matrix is one thing... afterwards, you will still have to filter the data.

written 16 days ago by Kevin Blighe
Answer: Kevin Blighe (Republic of Ireland) wrote, 17 days ago:

You likely have to register the cores that you want to use. Also, if propr does not implement parallel functionality, then not even registering the cores will do anything, and you will have to edit the function code to 'parallelise' it.

I am not familiar with 'Snowfall'; however, it likely relates to the fact that parallelisation on Windows (well, pseudo-parallelisation) is implemented via SNOW.

If you are indeed using Windows, cores / threads for certain functions have to be registered as a cluster object that implements SNOW functionality, as I show below.

Choose # of cores

library(parallel)
library(doParallel)

# grab the maximum number of cores available
cores <- detectCores()

# or explicitly choose the number of cores
cores <- 12

Windows

# create a SNOW (PSOCK) cluster and register it as the parallel back-end
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)

# when finished, stop the cluster and revert to sequential processing
stopCluster(cl)
registerDoSEQ()

Then, lapply() is implemented via parLapply(cl, ...)
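A self-contained toy example of that Windows route (core count and worker function are placeholders of mine):

library(parallel)
library(doParallel)

cores <- 4
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)

# parLapply() mirrors lapply(): square 1..10 on the cluster workers
res <- parLapply(cl, 1:10, function(i) i^2)

stopCluster(cl)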

Mac / Linux / UNIX

If on Mac / Linux, cores are registered via the 'mc.cores' option and by passing the number of cores to registerDoParallel():

# set the default core count for mclapply(); register the back-end for foreach()
options('mc.cores' = cores)
registerDoParallel(cores)

Then, lapply() is implemented via mclapply(...)
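The back-end registered with registerDoParallel() is also what foreach()/%dopar% dispatches to, so for completeness here is a minimal toy example of that route (numbers are my own, not from the thread):

library(doParallel)   # also attaches foreach and parallel

registerDoParallel(cores = 4)

# square 1..10 in parallel and combine the results into a vector
res <- foreach(i = 1:10, .combine = c) %dopar% { i^2 }

stopImplicitCluster()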

-----------------------------------

I have written more, here: R functions for parallel processing

Another function in which you may have interest is bigcor.

Kevin

written 17 days ago by Kevin Blighe