Question: Snowfall Parallelisation to compute Correlation Matrix
vincentpailler wrote, 17 days ago:

Hi everyone,

I am trying to get to grips with parallelisation and the R package "snowfall".

My R code looks like this (propr is a package for computing correlations between counts from metagenomics data):

library(propr)

# read the OTU count table (OTUs in rows, samples in columns)
test <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv",
                   header = TRUE, row.names = 1, sep = "\t")

# transpose so that samples are rows and OTUs are columns
test <- t(test)

# compute the proportionality (rho) matrix between OTUs
propr <- propr(test, metric = "rho")

The correlation matrix I want to generate is huge (about 5 TB), which is why I am trying to use parallelisation to compute it. Computing power is not a problem: I work on a cluster with 12 TB of memory and many CPUs.

But I really don't understand how to fit my R code into a snowfall workflow. Would someone know how to do it?

Best, Vincent

Tags: parallel, snowfall
written 17 days ago by vincentpailler

Thanks for answering me.

How can I tell whether propr implements parallel functionality?

My code looks like this:

library(doParallel)
library(propr) 

cores <- 128
options('mc.cores' = cores)
registerDoParallel(cores)

read.table("truelength2.prok2.uniref2.rares.tsv", h=T, row.names=1, sep="\t")->data
mclapply(propr(t(data), metric="rho"))-> parallel
size=format(object.size(parallel), units="Gb")

Does it look fine or not?

Thanks

written 16 days ago by vincentpailler

No, that will not do anything. Also, why do you assign right with -> (just curious).

You should look up how mclapply() works. It functions in exactly the same way as lapply(). Just looking at your code, it may be something like:

mclapply(t(data), function(x) propr(x))

In pseudocode: apply the function propr() to t(data).
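To make the lapply()/mclapply() equivalence concrete, here is a minimal toy sketch of my own (the input chunks and core count are arbitrary placeholders, and mclapply() only forks on Mac/Linux):

library(parallel)

# split 1..1000 into 10 chunks of 100 values each
chunks <- split(1:1000, rep(1:10, each = 100))

# serial version
res_serial <- lapply(chunks, function(idx) sum(sqrt(idx)))

# parallel version: the same call, with mc.cores added
res_parallel <- mclapply(chunks, function(idx) sum(sqrt(idx)), mc.cores = 4)

identical(res_serial, res_parallel)  # TRUE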

I do not know anything about propr(), though. What is it doing to your data? You should study how it works internally, if possible, to see how it could be parallelised in different ways.

Sometimes one has to edit the internal code to enable parallelisation, like I did for clusGap: https://github.com/kevinblighe/clusGapKB

written 16 days ago by Kevin Blighe

I don't really know why I assign to the right; it's just a habit, I guess.

propr is a package that computes correlations between compositional data, so:

  • my data is a matrix of OTU abundances across different samples (about 900,000 OTUs and 40 samples)
  • propr() could help me compute the correlations between these OTUs, which gives a huge matrix ((900000*899999)/2 correlations; see the rough size check below)
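As a rough back-of-the-envelope check of my own (assuming the matrix is stored as 8-byte doubles), that is indeed several terabytes for the full dense matrix:

n <- 9e5              # number of OTUs
n * (n - 1) / 2       # unique correlations: ~4.05e11
n^2 * 8 / 2^40        # full n x n matrix of doubles: ~5.9 TiB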
written 16 days ago by vincentpailler

I do not see anything in the propr documentation that indicates that it is designed for parallel processing, so, even registering cores will have no effect.

I looked at the actual code of the function, too, and I can see that it is not doing anything related to parallel-processing. When you set it to do correlation, in fact, it just uses the cor() function from the base stats package.

Did you not try bigcor (from the propagate package)?

I cannot see everything that you are trying at your console, so, my suggestions may be irrelevant.

written 16 days ago by Kevin Blighe

Thanks for the answer

Actually, I tried bigcor and it looks fine. I only tested it on my computer with a reduced data set (correlations between 15,000 OTUs), since I can't work on the cluster at the moment.

I will try it on the cluster tomorrow with my main dataset (correlations between 900,000 OTUs), when I can allocate much more memory (up to 12 TB).

Is the number of CPUs allocated to compute the correlations between the 900,000 OTUs relevant?

written 16 days ago by vincentpailler

Using multiple CPU cores to calculate a correlation matrix can speed up generating the matrix; however, it will depend on how the correlation function is designed. I actually wrote a parallelised correlation function in 2016, but I was not happy with it, so I deleted it...

I think that bigcor can do it relatively quickly. It does so by computing the correlations in blocks and, I believe, saving these to disk in order to save memory.
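For reference, a minimal sketch of how bigcor might be called here, assuming it is bigcor() from the propagate package and reusing the transposed OTU matrix from the earlier code (the block size is an arbitrary choice of mine):

library(propagate)

# samples in rows, OTUs in columns, i.e. t(data) from the earlier code
mat <- t(data)

# compute the correlation matrix in blocks of 2000 columns at a time;
# the result is a disk-backed 'ff' matrix rather than an in-memory one
res <- bigcor(mat, fun = "cor", size = 2000, verbose = TRUE)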

Of course, generating the correlation matrix is one thing... afterwards, you will still have to filter the data.

written 16 days ago by Kevin Blighe
Answer: Kevin Blighe (Republic of Ireland) wrote, 17 days ago:

You likely have to register the cores that you want to use. Also, if propr does not implement parallel functionality, then not even registering the cores will do anything, and you will have to edit the function code to 'parallelise' it.

I am not familiar with 'Snowfall'; however, it likely relates to the fact that parallelisation on Windows (well, pseudo-parallelisation) is implemented via SNOW.

If you are indeed using Windows, cores / threads for certain functions have to be registered as a cluster object that implements SNOW functionality, as I show below.

Choose # of cores

library(parallel)
library(doParallel)

# grab the maximum number of cores available
cores <- detectCores()

# or explicitly choose the number of cores
cores <- 12

Windows

# create a SNOW (PSOCK) cluster and register it as the parallel back-end
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)

# when finished, stop the cluster and revert to sequential processing
stopCluster(cl)
registerDoSEQ()

Then, lapply() is implemented via parLapply(cl, ...)
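A self-contained toy example of that Windows route (core count and worker function are placeholders of mine):

library(parallel)
library(doParallel)

cores <- 4
cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)

# parLapply() mirrors lapply(): square 1..10 on the cluster workers
res <- parLapply(cl, 1:10, function(i) i^2)

stopCluster(cl)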

Mac / Linux / UNIX

If on Mac / Linux, cores are registered via the 'mc.cores' option and by passing the number of cores to registerDoParallel():

# set the default core count for mclapply(); register the back-end for foreach()
options('mc.cores' = cores)
registerDoParallel(cores)

Then, lapply() is implemented via mclapply(...)
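The back-end registered with registerDoParallel() is also what foreach()/%dopar% dispatches to, so for completeness here is a minimal toy example of that route (numbers are my own, not from the thread):

library(doParallel)   # also attaches foreach and parallel

registerDoParallel(cores = 4)

# square 1..10 in parallel and combine the results into a vector
res <- foreach(i = 1:10, .combine = c) %dopar% { i^2 }

stopImplicitCluster()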

-----------------------------------

I have written more, here: R functions for parallel processing

Another function in which you may have interest is bigcor.

Kevin

written 17 days ago by Kevin Blighe