Tutorial: Parallel processing in R

LAST UPDATED: November 2nd, 2019

-------------------------------------------

As R becomes ever more widely used in bioinformatics, the need for parallel processing grows, due to the sheer amount of data being produced.

This tutorial is a small introduction to how you can start to perform parallel processing in R.

1. Set up the system for parallel processing

[a] load doParallel and associated packages

require(doParallel)

[b] set cores / threads

cores <- 12

Note: you can automatically detect the number of CPU cores with

cores <- detectCores()

[c] detect system and then register the cores appropriately

If the user is on Windows, cores / threads for certain functions have to be registered differently, i.e., as a cluster object that implements SNOW functionality. On Mac / Linux, they are registered via the 'mc.cores' option and passed as a number of cores to registerDoParallel().

system <- Sys.info()['sysname']

cl <- NULL
if (system == 'Windows') {
    # Windows: register a SNOW-style cluster object
    cl <- makeCluster(getOption('cl.cores', cores))
    registerDoParallel(cl)
    # when finished, release the workers with stopCluster(cl)
    # (inside a function, prefer on.exit(stopCluster(cl)))
} else {
    # Mac / Linux: fork-based back-end
    options('mc.cores' = cores)
    registerDoParallel(cores)
}
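As a quick sanity check after registration, foreach can report which backend it will use. A small sketch (getDoParName() and getDoParWorkers() are standard foreach functions):

```r
library(doParallel)

registerDoParallel(cores = 2)  # register a small backend for the check

# foreach reports which backend is registered and how many workers it has
getDoParName()     # backend name, e.g. 'doParallelMC' on Mac / Linux
getDoParWorkers()  # should match the number of cores registered, here 2
```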

[d] parallelise lapply()

If on Windows, use:

parLapply(cl, <code>)

If on Mac / Linux:

mclapply(<code>)
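A minimal, self-contained comparison of the two routes (a sketch with a toy worker function; on Mac / Linux both calls will run in parallel, while on Windows mclapply only runs sequentially):

```r
library(parallel)

slow_sqrt <- function(x) { Sys.sleep(0.01); sqrt(x) }  # toy worker

# Windows-safe route: socket cluster
cl <- makeCluster(2)
res_par <- parLapply(cl, 1:8, slow_sqrt)
stopCluster(cl)

# Mac / Linux route: fork-based, no cluster object needed
res_mc <- mclapply(1:8, slow_sqrt, mc.cores = 2)

identical(res_par, res_mc)  # TRUE
```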

[e] parallelise a 'for' loop with foreach

foreach(i = 1:10000) %dopar% {
  # PCA on a random 10 x 20 matrix; the exponential rate varies with i
  prcomp(matrix(rexp(200, rate = .1 * i), ncol = 20))$x
}

-------------------------------------------

2. Practical examples

[a], regression analysis (any type of lm, glm, clogit, et cetera)

NB - code removed, as this is now implemented as the Bioconductor package RegParallel.

[b], correlation analysis

(i) Generate a 2,000 x 10,000 random data matrix (genes x samples)

mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664

(ii) use foreach to parallelise the correlation

res <- foreach(i = seq_len(ncol(mat)),
  .combine = rbind,
  .multicombine = TRUE,
  .inorder = TRUE) %dopar% {  # keep rows in column order for the checks below
  cor(mat[,i], mat, method = 'pearson')
}

res[1:10,1:10]
           Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000

(iii) check some results to ensure that we have done it right

table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE 
2500 

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
 TRUE 
49729

table(cor(mat[,1], mat) == res[1,])
 TRUE 
10000 

table(cor(mat[,6666], mat) == res[6666,])
 TRUE 
10000

[c], clusGapKB

This is an edit of the original clusGap function, which implements the Gap Statistic of Tibshirani and colleagues ( https://statweb.stanford.edu/~gwalther/gap ). The edited code can be found on my GitHub account: https://github.com/kevinblighe/clusGapKB

Load testing airway data and process to rlog counts with DESeq2:

library(airway)
data('airway')
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds, betaPrior = FALSE)
rlogcounts <- rlog(dds, blind=TRUE) # or vstcounts <- vst(dds, blind=TRUE)
rlogcounts <- assay(rlogcounts)[1:5000,]

Run clusGapKB with 12 cores:

source("clusGapKB.R") # get from GitHub
gap <- clusGapKB(rlogcounts, FUNcluster=kmeans, K.max=20, B=250)


I learnt the parallel + mclapply trick a few months ago, really useful in a few niche cases IMO.


@Kevin, I'm using your enhanced function; it works pretty well.


Good work, Krushnach! Which function in particular?


Both clusGapKB and the correlation matrix calculation.


Great. I have removed the correlation matrix function, though. It is 'on the shelf' for improvement.


Krushnach, I have re-added a correlation function that works more efficiently.


Great help, thanks. I will try this today.


Thanks. Suppose I was writing a script / package and was uncertain of the user's OS beforehand. Is there a wrapper package that can choose which parallelisation tool to call?


Hey russhh, yes, I do that in my package RegParallel.

Basically, you just need the code from part 1(a-c) (above) in order to register cores / threads on any system. The OS type will then be stored in the variable 'system'.

After that, foreach will work on any system. For parallelised lapply, you'll need to distinguish based on whether one is using Windows or not:

if (system == 'Windows') {
  parLapply(cl, ...)
} else {
  mclapply(...)
}
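Wrapped up as a single helper, that logic might look like this (a sketch; the name par_lapply is hypothetical, not from any package):

```r
library(parallel)

# Hypothetical cross-platform parallel lapply
par_lapply <- function(X, FUN, cores = 2, ...) {
  if (Sys.info()['sysname'] == 'Windows') {
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl))  # tidy up the cluster even on error
    parLapply(cl, X, FUN, ...)
  } else {
    mclapply(X, FUN, ..., mc.cores = cores)
  }
}

unlist(par_lapply(1:4, function(x) x^2))  # 1 4 9 16
```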
Nam Le Quang

For correlation analysis, fastCor is much better than foreach: https://rdrr.io/cran/HiClimR/man/fastCor.html


Thank you again for this. I think that fastCor is a very useful function to emphasise in this thread, so I have moved your comment to an answer.


Are you going to implement fastCor in your parallel package?


No, no. Well, this tutorial (this web page) is mainly to show users how to set up their R environment for parallel processing. The user can then do anything that they want :)


Could you define "much better" more precisely?


"much better" means no more RAM needed and faster than foreach.


That's interesting. Would you happen to have any proof to support your claim? Maybe performance benchmarks of some sort?


Using the above code:

registerDoParallel(cores = detectCores()) # using 8 cores

set.seed(1)
mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))

foreach took too long; I couldn't wait and stopped it before it finished. RAM consumption was more than 4 GB.

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

fastCor took about 2 GB RAM:

library(HiClimR)

t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0

   user  system elapsed
 19.484   3.122   5.669

Now it is your choice.


Yes, fastCor only calculates half the resulting correlation matrix - the 'upper' triangle - which is partly why it is quicker and uses less RAM. The code in this thread is primarily to show people how to enable parallel processing in R and then utilise it in their own functions; there are no good tutorials out there about how to do this. It should be pointed out, however, that the complete correlation matrix is required for certain downstream functions.

Thank you for your comments, Nam. Cảm ơn


You know Vietnamese, that is wonderful!

Using upperTri=F as in the above example will give us the full correlation matrix. fastCor is fast because it uses an optimized BLAS library (ATLAS, OpenBLAS, or Intel MKL).
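To check which BLAS / LAPACK build an R session is actually linked against (sessionInfo() has reported this since R 3.4):

```r
# sessionInfo() includes the paths of the linked BLAS / LAPACK libraries
si <- sessionInfo()
si$BLAS    # e.g. a path ending in libopenblas.so if OpenBLAS is in use
si$LAPACK
```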


Thank you. I will look at the BLAS libraries.


Another reply to this: patience is a virtue, and is something that not many people appear to have:

t0 <- proc.time(); res <- foreach(i = seq_len(ncol(mat)),
               .combine = rbind,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = c('data.table', 'doParallel')) %dopar% {
                 cor(mat[,i], mat, method = 'pearson')
               }; proc.time() - t0

   user  system elapsed
3319.020  40.404 1288.297


require("HiClimR")
Loading required package: HiClimR
t0 <- proc.time() ; cors <- fastCor(mat, nSplit=1, upperTri=F, optBLAS=T); proc.time() - t0
   user  system elapsed
 45.492   1.780  47.398

In the edited version of the clusGap function, what is the "B" argument for? Got it: bootstrapping.


Hey Krushnach, B is indeed the number of bootstraps. The Gap Statistic is, by now, a relatively slow way of identifying clusters in a dataset. There are much quicker methods out there.


Other methods? Can you cite a few?


Oh, well, there are:

  • Elbow method
  • Silhouette method
  • Method employed in ConsensusClusterPlus (not sure what it uses)
  • Method employed in Seurat's FindClusters() (a k-nn approach - k-nearest neighbours, with a further community detection algorithm)
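As a small sketch of the Silhouette method from the list above (using the standard cluster package; the simulated two-blob data and the candidate range 2-6 are arbitrary choices for illustration):

```r
library(cluster)

set.seed(1)
# two well-separated 2-D blobs, so the 'true' k is 2
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# average silhouette width for each candidate k; pick the k that maximises it
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  mean(silhouette(km$cluster, dist(x))[, 'sil_width'])
})
ks[which.max(avg_sil)]  # should recover k = 2 here
```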

"Silhouette method" i was trying this which i though this way first i would find the optimum number of cluster using gap statistic method then i will find using "Silhouette method" but both of them not so similar result .

But the seurat method you wrote about that i read for single cell , conceptual doubt would that work for bulk rna seq data? it will work of course just numbers but would that me meaningful in terms of biology as compared to single cell data
