Question: Snowfall Parallelisation to compute Correlation Matrix
11 months ago by
vincentpailler wrote:

Hi everyone,

I am trying to get to grips with parallelisation and the R package "snowfall".

My R code looks like this (propr is a package that computes correlations between counts from metagenomics data):

library(propr)
test <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", h=T, row.names=1, sep="\t")
test <- t(test)
propr <- propr(test, metric="rho")

The correlation matrix I want to generate is very large (about 5 TB), which is why I am trying to learn parallelisation to compute it (I have 12 TB of memory and many CPUs; I work on a cluster, so computing power is not a problem).

But I really don't understand how to fit my R code into a snowfall call. Would someone know how to do it?

Best, Vincent

parallel snowfall • 524 views
modified 11 months ago • written 11 months ago by vincentpailler

Thanks for answering me.

How can I tell whether propr implements parallel functionality?

My code looks like this:


cores <- 128
options('mc.cores' = cores)

read.table("truelength2.prok2.uniref2.rares.tsv", h=T, row.names=1, sep="\t")->data
mclapply(propr(t(data), metric="rho"))-> parallel
size=format(object.size(parallel), units="Gb")

Does it look fine or not?


modified 11 months ago by Kevin Blighe • written 11 months ago by vincentpailler

No, that will not do anything. Also, why do you assign to the right with -> (just curious)?

You should look up how mclapply() works. It functions in exactly the same way as lapply(). Just looking at your code, it may be something like:

mclapply(t(data), function(x) propr(x))

In pseudo code: Apply the function propr() to t(data)
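For context, mclapply() really does take the same arguments and return the same kind of list as lapply(); here is a minimal sketch with a toy function standing in for propr():

```r
library(parallel)

# lapply() and mclapply() return identical lists; mclapply() just
# spreads the elements over forked worker processes (forking is
# Unix-only; on Windows, mc.cores must stay at 1)
res_serial   <- lapply(1:4, function(x) x^2)
res_parallel <- mclapply(1:4, function(x) x^2, mc.cores = 2)

identical(res_serial, res_parallel)   # TRUE
```

Note that this only helps if the work can be split into independent list elements in the first place.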

I do not know anything about propr(), though. What is it doing to your data? You should study how it works internally, if possible, to see how it could be parallelised in different ways.

Sometimes one has to edit the internal code to enable parallelisation, as I did for clusGap.

modified 11 months ago • written 11 months ago by Kevin Blighe

I don't really know why I assign to the right; it's just a habit I've picked up, I guess.

propr is a package that computes correlations between compositional data, so:

  • my data is a matrix of OTU abundances across samples (about 900 000 OTUs and 40 samples)
  • propr() could help me compute the correlations between these OTUs, which gives a huge matrix ((900000 × 899999)/2 correlations)
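A quick back-of-the-envelope check of that size, assuming the correlations are stored as 8-byte doubles (numbers taken from the post above):

```r
n <- 9e5                       # ~900 000 OTUs
n_pairs <- n * (n - 1) / 2     # unique OTU-OTU correlations
n_pairs                        # 404 999 550 000 pairs

# a full dense n x n double matrix would need:
bytes <- n^2 * 8
bytes / 1e12                   # 6.48 TB (roughly half that for one triangle)
```

which is in the same ballpark as the ~5 TB estimate in the question.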
written 11 months ago by vincentpailler

I do not see anything in the propr documentation that indicates that it is designed for parallel processing, so, even registering cores will have no effect.

I looked at the actual code of the function, too, and I can see that it is not doing anything related to parallel processing. When you set it to compute correlation, in fact, it just uses the cor() function from the base stats package.
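Since the correlation step ultimately comes down to cor(), one generic workaround (a sketch with a made-up helper name, not propr-specific) is to compute the matrix block by block: each block is an independent cor(x[, i], x[, j]) call, so the blocks can be farmed out with mclapply():

```r
library(parallel)

set.seed(1)
x <- matrix(rnorm(40 * 100), nrow = 40)   # 40 samples x 100 "OTUs"

# compute cor(x) in column blocks; each off-diagonal block is an
# independent job, so the blocks run in parallel across forked workers
block_cor <- function(x, size = 25, cores = 2) {
  idx   <- split(seq_len(ncol(x)), ceiling(seq_len(ncol(x)) / size))
  grid  <- expand.grid(i = seq_along(idx), j = seq_along(idx))
  grid  <- grid[grid$i <= grid$j, ]                 # upper triangle only
  blocks <- mclapply(seq_len(nrow(grid)), function(k) {
    cor(x[, idx[[grid$i[k]]]], x[, idx[[grid$j[k]]]])
  }, mc.cores = cores)
  out <- matrix(NA_real_, ncol(x), ncol(x))
  for (k in seq_len(nrow(grid))) {
    ri <- idx[[grid$i[k]]]; ci <- idx[[grid$j[k]]]
    out[ri, ci] <- blocks[[k]]
    out[ci, ri] <- t(blocks[[k]])                   # mirror by symmetry
  }
  out
}

all.equal(block_cor(x), cor(x))   # TRUE
```

On a real dataset you would also write each block straight to disk rather than assembling the full matrix in RAM, which is essentially what bigcor does.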

Did you not try the bigcor() function (from the propagate package)?

I cannot see everything that you are trying at your console, so, my suggestions may be irrelevant.

modified 11 months ago • written 11 months ago by Kevin Blighe

Thanks for the answer

Actually, I tried bigcor, which looks fine (I only tried it on my computer with a reduced dataset, correlations between 15 000 OTUs, as I can't work on the cluster at the moment).

I will try it on the cluster tomorrow with my main dataset (correlations between 900 000 OTUs), when I can allocate a lot more memory (up to 12 TB).

Is the number of CPUs allocated to compute the correlations between the 900 000 OTUs relevant?

written 11 months ago by vincentpailler

Using multiple CPU cores to calculate a correlation matrix can speed up generating it; however, that depends on how the correlation function is designed. I actually wrote a parallelised correlation function in 2016, but I was not happy with it, so I deleted it...

I think that bigcor can do it relatively quickly. It computes the correlations in sections and (I believe) saves these to disk in order to save memory.
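For what it's worth, bigcor() lives in the propagate package (worth verifying against ?bigcor on your install, as the argument names below are from memory); a minimal usage sketch:

```r
# Sketch, assuming propagate is installed: bigcor() computes the
# correlation matrix in 'size' x 'size' column blocks and returns a
# disk-backed 'ff' matrix, so the full result never has to sit in RAM.
have_propagate <- requireNamespace("propagate", quietly = TRUE)

if (have_propagate) {
  x <- matrix(rnorm(40 * 1000), nrow = 40)   # 40 samples x 1000 OTUs
  r <- propagate::bigcor(x, fun = "cor", size = 500, verbose = FALSE)
  print(r[1:5, 1:5])                         # pull small blocks into RAM
}
```

The `size` argument controls the block dimension, i.e. the RAM/speed trade-off.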

Of course, generating the correlation matrix is one thing... after, you will have to filter the data.

written 11 months ago by Kevin Blighe
11 months ago by
Kevin Blighe wrote:

You likely have to register the cores that you want to use. Also, if propr does not implement parallel functionality, then not even registering the cores will do anything, and you will have to edit the function code to 'parallelise' it.

I am not familiar with 'Snowfall'; however, the name likely relates to the fact that parallelisation on Windows (well, pseudo-parallelisation) is implemented via SNOW.

If you are indeed using Windows, cores / threads for certain functions have to be registered as a cluster object that implements SNOW functionality, as I show below.

Windows

Choose the number of cores:

library(parallel)

# grab max cores available
cores <- detectCores()

# or explicitly choose the number of cores
cores <- 12

cl <- makeCluster(getOption('cl.cores', cores), type='PSOCK')

Then, lapply() is implemented via parLapply(cl, ...)
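For example, a minimal PSOCK sketch (toy workload, two workers):

```r
library(parallel)

cl  <- makeCluster(2, type = 'PSOCK')          # two worker processes
res <- parLapply(cl, 1:4, function(x) x^2)     # parallel lapply()
stopCluster(cl)                                # always free the workers

identical(res, lapply(1:4, function(x) x^2))   # TRUE
```

Unlike forked workers, PSOCK workers are fresh R sessions, so any packages or variables the function needs must be exported to them (clusterExport() / clusterEvalQ()).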

Mac / Linux / UNIX

If on Mac / Linux, cores are registered via the 'mc.cores' option (or passed as a number of cores to registerDoParallel(), from the doParallel package):

options('mc.cores' = cores)

Then, lapply() is implemented via mclapply(...)


I have written more here: R functions for parallel processing

Another function in which you may have interest is bigcor.


modified 11 months ago • written 11 months ago by Kevin Blighe