Tutorial: Parallel processing in R
2.7 years ago by Kevin Blighe

LAST UPDATED: November 2nd, 2019

-------------------------------------------

As R is used ever more widely in bioinformatics, the need for parallel processing grows with the sheer amount of data being produced.

This tutorial is a small introduction to how you can start to perform parallel processing in R.

1. set-up system for parallel processing

[a] load doParallel and associated packages

require(doParallel)

[b] set cores / threads

cores <- 12

Note: you can automatically detect the number of CPU cores with

cores <- detectCores()

(detectCores() is from the parallel package, which doParallel loads; note that makeCluster() returns a cluster object, not a core count, so its result should not be assigned to cores here.)

[c] detect system and then register the cores appropriately

If the user is on Windows, cores / threads for certain functions have to be registered differently, i.e., as a cluster object that implements SNOW functionality. If on Mac / Linux, they are registered via 'mc.cores' and as a number of cores passed to registerDoParallel().

system <- Sys.info()['sysname']

cl <- NULL
if (system == 'Windows') {
    cl <- makeCluster(getOption('cl.cores', cores))
    registerDoParallel(cl)
    # when finished, release the cluster with:
    #   stopCluster(cl); registerDoSEQ()
} else {
    options('mc.cores' = cores)
    registerDoParallel(cores)
}

[d] parallelise lapply()

If on Windows, use:

parLapply(cl, <code>)

If on Mac / Linux:

mclapply(<code>)
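As a concrete illustration of both branches (a toy example squaring numbers; any function can be substituted), assuming the cores value from step 1[b]:

```r
library(parallel)

cores <- 2
input <- 1:100

if (Sys.info()['sysname'] == 'Windows') {
  # SNOW-style cluster: workers are separate R processes
  cl <- makeCluster(cores)
  res <- parLapply(cl, input, function(x) x^2)
  stopCluster(cl)
} else {
  # forked workers, available on Mac / Linux only
  res <- mclapply(input, function(x) x^2, mc.cores = cores)
}

head(unlist(res))  # 1 4 9 16 25 36
```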

[e] parallelise a 'for' loop with foreach

foreach(i = 1:10000) %dopar% {
  prcomp(matrix(rexp(200, rate = .1 * i), ncol = 10))$x # 20 x 10 matrix
}
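By default foreach returns a list; the .combine argument collapses the per-iteration results instead. A small sketch with a registered doParallel backend:

```r
library(doParallel)  # also loads foreach

registerDoParallel(2)

# squares of 1..10, combined into a single numeric vector via .combine = 'c'
squares <- foreach(i = 1:10, .combine = 'c') %dopar% {
  i^2
}
sum(squares)  # 385
```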

-------------------------------------------

2. Practical examples

[a], regression analysis (any type of lm, glm, clogit, et cetera)

NB - code removed as this is now implemented as the Bioconductor package RegParallel.

[b], correlation analysis

(i) Generate a 2000 x 10000 (genes x samples) random data matrix

mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664

(ii) use foreach to parallelise the correlation

res <- foreach(i = seq_len(ncol(mat)),
  .combine = rbind,
  .multicombine = TRUE,
  .inorder = FALSE,
  .packages = c('data.table', 'doParallel')) %dopar% {
  cor(mat[,i], mat, method = 'pearson')
}
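The loop above creates one task per column, which carries scheduling overhead. A chunked variant, sketched here on a smaller matrix and using parallel::splitIndices to hand each worker a block of columns, returns the same matrix with fewer, larger tasks:

```r
library(doParallel)
library(parallel)

registerDoParallel(2)

set.seed(1)
small <- matrix(rexp(20000, rate = .1), ncol = 100)

# split the 100 columns into 4 blocks; one foreach task per block
chunks <- splitIndices(ncol(small), 4)
res2 <- foreach(idx = chunks, .combine = rbind) %dopar% {
  cor(small[, idx, drop = FALSE], small, method = 'pearson')
}

dim(res2)  # 100 100; diagonal entries are the self-correlations (all 1)
```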

res[1:10,1:10]
           Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000

(iii) check some results to ensure that we have done it right

table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE 
2500 

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
 TRUE 
49729

table(cor(mat[,1], mat) == res[1,])
 TRUE 
10000 

table(cor(mat[,6666], mat) == res[6666,])
 TRUE 
10000

[c], clusGapKB

This is an edit of the original clusGap, related to the Gap Statistic by Rob Tibshirani ( https://statweb.stanford.edu/~gwalther/gap ). The edited code is found on my GitHub account: https://github.com/kevinblighe/clusGapKB

Load testing airway data and process to rlog counts with DESeq2:

library(airway)
data('airway')
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds, betaPrior = FALSE)
rlogcounts <- rlog(dds, blind=TRUE) # or vstcounts <- vst(dds, blind=TRUE)
rlogcounts <- assay(rlogcounts)[1:5000,]

Run clusGapKB with 12 cores:

source("clusGapKB.R") # get from GitHub
gap <- clusGapKB(rlogcounts, FUNcluster=kmeans, K.max=20, B=250)
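To extract the optimal k from the result, cluster::maxSE can be applied to the gap table. A self-contained sketch on toy data using the stock cluster::clusGap, on the assumption that clusGapKB returns an object with the same $Tab structure:

```r
library(cluster)

set.seed(1)
# two well-separated 2-D clusters
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

gapstat <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50)
k <- maxSE(f = gapstat$Tab[, 'gap'], SE.f = gapstat$Tab[, 'SE.sim'],
           method = 'Tibs2001SEmax')
k  # expected: 2 for this toy data
```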


I learnt the parallel + mclapply trick a few months ago, really useful in a few niche cases IMO.

written 21 months ago by RamRS

@Kevin, I'm using your enhanced function; it works really well.

written 20 months ago by krushnach80

Good work, Krushnach! Which function in particular?

written 20 months ago by Kevin Blighe

Both clusGapKB and the correlation matrix calculation.

written 20 months ago by krushnach80

Great. I have removed the correlation matrix function, though. It is 'on the shelf' for improvement.

written 20 months ago by Kevin Blighe

Krushnach, I have re-added a correlation function that works more efficiently.

written 14 months ago by Kevin Blighe

Great help, thanks. I will try this today.

written 12 months ago by stephannie.baker

Thanks. Suppose I was writing a script / package for which I was uncertain of user's OS beforehand. Is there a wrapper package that can choose which parallelisation tool to call?

written 11 months ago by russhh

Hey russhh, yes, I do that in my package RegParallel

Basically, you just need the code from part 1(abc) (above) in order to register cores / threads on any system. The OS type will then be stored in the variable system

After that, foreach will work on any system. For parallelised lapply, you'll need to distinguish based on whether one is using Windows or not:

if (system == 'Windows') {
  parLapply(cl, ...)
} else {
  mclapply(...)
}
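That dispatch can be folded into a small wrapper; a minimal sketch (a hypothetical helper, not part of RegParallel), using only base parallel:

```r
library(parallel)

# hypothetical cross-platform lapply: SNOW cluster on Windows, forking elsewhere
cross_lapply <- function(X, FUN, cores = 2, ...) {
  if (Sys.info()['sysname'] == 'Windows') {
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl))  # valid here: we are inside a function
    parLapply(cl, X, FUN, ...)
  } else {
    mclapply(X, FUN, ..., mc.cores = cores)
  }
}

res <- cross_lapply(1:8, function(x) x * 10)
unlist(res)  # 10 20 30 40 50 60 70 80
```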
modified 11 months ago • written 11 months ago by Kevin Blighe
11 months ago by Nam Le Quang (National Institute of Animal Sciences, Ha Noi, Viet Nam):

For correlation analysis, fastCor is much better than foreach: https://rdrr.io/cran/HiClimR/man/fastCor.html

written 11 months ago by Nam Le Quang

Thank you again for this. I think that fastCor is a very useful function to emphasise in this thread, so I have moved your comment to an answer.

written 10 months ago by Kevin Blighe

Are you going to implement fastCor in your parallel package?

written 10 months ago by krushnach80

No, this tutorial (this web-page) is mainly to help users set up their R environment for parallel processing. The user can then do anything that they want :)

written 10 months ago by Kevin Blighe

Could you better define "much better"?

modified 10 months ago • written 11 months ago by Kevin Blighe

"Much better" means it needs less RAM and runs faster than foreach.

written 11 months ago by Nam Le Quang

That's interesting. Would you happen to have any proof to support your claim? Maybe performance benchmarks of some sort?

written 11 months ago by RamRS

Using the above code:

registerDoParallel(cores = detectCores()) # using 8 cores

set.seed(1)
mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))

foreach took too long; I could not wait and stopped it before it finished. RAM consumption was more than 4 GB.

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

fastCor took about 2 GB RAM:

library(HiClimR)

t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0

   user  system elapsed
 19.484   3.122   5.669

Now it is your choice.

modified 11 months ago • written 11 months ago by Nam Le Quang

Yes, fastCor only calculates half the resulting correlation matrix - the 'upper' triangle. This is why [in part] it is quicker and uses less RAM. The code in this thread is primarily to show people how to enable parallel processing in R and then utilise this in their own functions - there are no good tutorials out there about how to do this. It should be pointed out, however, that the complete correlation matrix is required for certain downstream functions.

Thank you for your comments, Nam. Cảm ơn

modified 11 months ago • written 11 months ago by Kevin Blighe

You know Vietnamese, that is wonderful!

Using upperTri=F as in the above example gives the full correlation matrix. fastCor is fast because it uses an optimized BLAS library (ATLAS, OpenBLAS, or Intel MKL).

written 11 months ago by Nam Le Quang

Thank you. I will look at BLAS library.

written 11 months ago by Kevin Blighe

Another reply to this: patience is a virtue, and is something that not many people appear to have:

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

    user   system  elapsed 
3319.020   40.404 1288.297 


require("HiClimR")
Loading required package: HiClimR
t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0
   user  system elapsed 
 45.492   1.780  47.398
modified 11 months ago • written 11 months ago by Kevin Blighe

In the edited version of the clusGap function, what is the "B" argument for? Got it: bootstrapping.

modified 8 months ago • written 8 months ago by krushnach80

Hey Krushnach, B is indeed the number of bootstraps. The Gap Statistic is, by now, a relatively slow way of identifying clusters in a dataset. There are much quicker methods out there.

written 8 months ago by Kevin Blighe

Other methods? Can you cite a few?

written 8 months ago by krushnach80

Oh, well, there are:

  • Elbow method
  • Silhouette method
  • Method employed in ConsensusClusterPlus (not sure what it uses)
  • Method employed in Seurat's FindClusters() (a k-nn approach - k-nearest neighbours, with a further community detection algorithm)
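For illustration, a minimal sketch of the silhouette method on toy data, using cluster::silhouette and choosing the k with the highest average width:

```r
library(cluster)

set.seed(1)
# two well-separated 2-D clusters
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

d <- dist(x)
avg_width <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, 'sil_width'])
})
best_k <- (2:6)[which.max(avg_width)]
best_k  # expected: 2 for this toy data
```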
written 8 months ago by Kevin Blighe

Regarding the Silhouette method: I was trying it this way: first find the optimum number of clusters using the Gap Statistic, then confirm it with the Silhouette method, but the two give rather dissimilar results.

As for the Seurat method you mention, I have read about it for single cell data. A conceptual doubt: would it work for bulk RNA-seq data? It would run, of course, and produce numbers, but would the result be as meaningful biologically as it is for single cell data?

written 8 months ago by krushnach80
Powered by Biostar version 2.3.0