Tutorial: Parallel processing in R
Written 22 months ago by Kevin Blighe

LAST UPDATED: MARCH 19th, 2019

-------------------------------------------

As R becomes ever more widely used in bioinformatics, the need for parallel processing grows with the sheer amount of data being produced.

This tutorial is a small introduction to how you can start to perform parallel processing in R.

1. Set up the system for parallel processing

[a] load doParallel and associated packages

require(doParallel)

[b] set cores / threads

cores <- 12

Note: you can automatically detect the number of CPU cores with detectCores() from the parallel package (loaded with doParallel):

cores <- detectCores()

[c] detect system and then register the cores appropriately

If the user is on Windows, cores / threads for certain functions have to be registered differently, i.e., as a cluster object that implements SNOW functionality. On Mac / Linux, they are registered via the 'mc.cores' option and by passing the number of cores to registerDoParallel()

system <- Sys.info()['sysname']

cl <- NULL
if (system == 'Windows') {
    cl <- makeCluster(getOption('cl.cores', cores))
    registerDoParallel(cl)
    # remember to call stopCluster(cl) when finished;
    # inside a function, use on.exit(stopCluster(cl))
} else {
    options('mc.cores' = cores)
    registerDoParallel(cores)
}
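Whichever branch runs, you can confirm that a parallel backend is actually registered with foreach's getDoParWorkers(); a minimal sketch (the worker count of 2 is purely illustrative):

```r
library(doParallel)

# register 2 workers (illustrative number)
registerDoParallel(2)

# how many workers the currently registered %dopar% backend will use
getDoParWorkers()
```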

[d] parallelise lapply()

If on Windows, use:

parLapply(cl, <code>)

If on Mac / Linux:

mclapply(<code>)
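As a minimal, self-contained sketch of both calls (the square() worker function and the input vector are purely illustrative):

```r
library(parallel)

square <- function(x) x^2  # toy worker function

if (Sys.info()['sysname'] == 'Windows') {
  cl <- makeCluster(2)                         # SNOW cluster on Windows
  res <- parLapply(cl, 1:10, square)
  stopCluster(cl)
} else {
  res <- mclapply(1:10, square, mc.cores = 2)  # forked workers on Mac / Linux
}

unlist(res)
```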

[e] parallelise a 'for' loop with foreach

foreach(i = 1:10000) %dopar% {
  prcomp(matrix(rexp(200, rate = .1 * i), ncol = 200))$x
}

-------------------------------------------

2. Practical examples

[a], regression analysis (any type of lm, glm, clogit, et cetera)

NB - code removed, as this is now implemented as the Bioconductor package 'RegParallel'
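The general idea, fitting one model per variable and row-binding the results with foreach, can be sketched as follows (toy data; the variable and column names are purely illustrative, and this is not the RegParallel implementation):

```r
library(doParallel)
registerDoParallel(2)

# toy data: a binary outcome and 3 random predictors (X1, X2, X3)
set.seed(1)
df <- data.frame(y = rbinom(100, 1, 0.5), matrix(rnorm(300), ncol = 3))
vars <- colnames(df)[-1]

# fit one logistic model per variable, in parallel
res <- foreach(v = vars, .combine = rbind) %dopar% {
  fit <- glm(as.formula(paste('y ~', v)), data = df, family = binomial)
  data.frame(variable = v,
             beta = unname(coef(fit)[v]),
             p = summary(fit)$coefficients[v, 4])
}
res
```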


[b], correlation analysis

See the full example given here: A: Parallelization to compute correlations

(i) Generate a 2000 x 10000 random data matrix (genes x samples)

mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664

(ii) use foreach to parallelise the correlation

res <- foreach(i = seq_len(ncol(mat)),
  .combine = rbind,
  .multicombine = TRUE,
  .inorder = FALSE,
  .packages = c('data.table', 'doParallel')) %dopar% {
  cor(mat[,i], mat, method = 'pearson')
}

res[1:10,1:10]
           Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000

(iii) check some results to ensure that we have done it right

table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE 
2500 

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
 TRUE 
49729

table(cor(mat[,1], mat) == res[1,])
 TRUE 
10000 

table(cor(mat[,6666], mat) == res[6666,])
 TRUE 
10000

[c], clusGapKB

This is an edit of the original clusGap, which computes the Gap Statistic of Tibshirani, Walther and Hastie ( https://statweb.stanford.edu/~gwalther/gap ). The edited code can be found on my GitHub account: https://github.com/kevinblighe/clusGapKB

For partitioning around medoids (PAM), we can create a custom function that just performs the clustering and, via cluster.only=TRUE, ONLY retains the cluster assignments for each k (the full pam object is not stored):

library(cluster) # provides pam()

CustomPAM <- function(x,k) list(cluster=pam(
  x, k,
  diss=FALSE,
  metric="euclidean",
  medoids=NULL,
  stand=FALSE,
  cluster.only=TRUE,
  do.swap=TRUE,
  keep.diss=FALSE,
  keep.data=FALSE,
  pamonce=TRUE,
  trace.lev=0))
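As a quick sanity check, the same wrapper also works with the stock clusGap() from the cluster package; a sketch on a small toy matrix (all names and sizes are illustrative):

```r
library(cluster)

CustomPAM <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE, pamonce = TRUE))

# two well-separated Gaussian blobs
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

gap <- clusGap(toy, FUNcluster = CustomPAM, K.max = 5, B = 20)
head(gap$Tab)
```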

Load the test 'airway' dataset and process it to rlog counts with DESeq2:

library(airway)
data('airway')
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds, betaPrior = FALSE)
rlogcounts <- rlog(dds, blind=TRUE) # or vstcounts <- vst(dds, blind=TRUE)
rlogcounts <- assay(rlogcounts)[1:5000,]

Run clusGapKB:

source("clusGapKB.R") # get from GitHub
gap <- clusGapKB(rlogcounts, FUNcluster=CustomPAM, K.max=20, B=250)

Can also just do k-means clustering:

gap <- clusGapKB(rlogcounts, FUNcluster=kmeans, K.max=20, B=250)


— modified 18 days ago • written 22 months ago by Kevin Blighe

I learnt the parallel + mclapply trick a few months ago, really useful in a few niche cases IMO.

— written 10 months ago by RamRS

@Kevin, I'm using your enhanced function; it works pretty well.

— written 10 months ago by krushnach80

Good work, Krushnach! Which function in particular?

— written 10 months ago by Kevin Blighe

Both clusGapKB and the correlation matrix calculation.

— written 9 months ago by krushnach80

Great. I have removed the correlation matrix function, though. It is 'on the shelf' for improvement.

— written 9 months ago by Kevin Blighe

Krushnach, I have re-added a correlation function that works more efficiently.

— written 4 months ago by Kevin Blighe

Great help, thanks. Will try this today.

— written 7 weeks ago by stephannie.baker

Thanks. Suppose I was writing a script / package for which I was uncertain of the user's OS beforehand. Is there a wrapper package that can choose which parallelisation tool to call?

— written 18 days ago by russhh

Hey russhh, yes, I do that in my package, RegParallel.

Basically, you just need the code from part 1 (a-c) above in order to register cores / threads on any system. The OS type will then be stored in the variable 'system'.

After that, foreach will work on any system. For parallelised lapply, you'll need to distinguish based on whether one is using Windows or not:

if (system == 'Windows') {
  parLapply(cl, ...)
} else {
  mclapply(...)
}
— modified 18 days ago • written 18 days ago by Kevin Blighe
Answer written 18 days ago by Nam Le Quang (National Institute of Animal Sciences, Ha Noi, Viet Nam):

For correlation analysis, fastCor is much better than foreach: https://rdrr.io/cran/HiClimR/man/fastCor.html

— written 18 days ago by Nam Le Quang

Well, that was a very un-scientific comment. Please define "much better".

— written 18 days ago by Kevin Blighe

"much better" means no more RAM needed and faster than foreach.

— written 17 days ago by Nam Le Quang

That's interesting. Would you happen to have any proof to support your claim? Maybe performance benchmarks of some sort?

— written 17 days ago by RamRS

Using the above code:

registerDoParallel(cores = detectCores()) # using 8 cores

set.seed(1)
mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))

foreach took too long; I couldn't wait and stopped it before it finished. RAM consumption was more than 4 GB.

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

fastCor took about 2 GB RAM:

library(HiClimR)

t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0

   user  system elapsed 
 19.484   3.122   5.669 

Now it is your choice.

— modified 14 days ago • written 14 days ago by Nam Le Quang

Yes, fastCor only calculates half the resulting correlation matrix - the 'upper' triangle. This is why [in part] it is quicker and uses less RAM. The code in this thread is primarily to show people how to enable parallel processing in R and then utilise this in their own functions - there are no good tutorials out there about how to do this. It should be pointed out, however, that the complete correlation matrix is required for certain downstream functions.

Thank you for your comments, Nam. Cảm ơn

— modified 14 days ago • written 14 days ago by Kevin Blighe

You know Vietnamese, that is wonderful!

Using upperTri=F as in the above example will give us the full correlation matrix. fastCor is fast because it uses an optimized BLAS library (ATLAS, OpenBLAS, or Intel MKL).

— written 14 days ago by Nam Le Quang

Thank you. I will look at the BLAS libraries.

— written 14 days ago by Kevin Blighe

Another reply to this: patience is a virtue, and is something that not many people appear to have:

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

    user   system  elapsed 
3319.020    40.404  1288.297 


require("HiClimR")
Loading required package: HiClimR
t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0
   user  system elapsed 
 45.492     1.780    47.398
— modified 14 days ago • written 14 days ago by Kevin Blighe

Thank you again for this. I think that fastCor (from the HiClimR package) is very useful to emphasise in this thread, so I have moved your comment to an answer.

— written 20 hours ago by Kevin Blighe

Are you going to implement fastCor in your parallel package?

— written 2 hours ago by krushnach80

No no, well, this tutorial (this web page) is mainly to allow users to set up their R environment for parallel processing. The user can then do anything that they want :)

— written 2 hours ago by Kevin Blighe