Tutorial: R functions for parallel processing
18
gravatar for Kevin Blighe
18 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

LAST UPDATED: MARCH 19th, 2019

-------------------------------------------

As R is becoming ever more used in bioinformatics, the need for parallel processing is greater due to the sheer amounts of data that is being produced.

This tutorial is a small introduction to how you can start to perform parallel processing in R.

1. set-up system for parallel processing

[a] load doParallel and associated packages

require(doParallel)

[b] set cores / threads

cores <- 12

Note: you can automatically detect the number of CPU cores with

cores <- makeCluster(detectCores(), type='PSOCK')

[c] detect system and then register the cores appropriately

if user is on Windows, cores / threads for certain functions have to be registered differently, i.e., as a cluster object that implements SNOW functionality. If on Mac / Linux, they are registered via 'mc.cores' and as a number of cores to registerDoParallel()

system <- Sys.info()['sysname']

cl <- NULL
if (system == 'Windows') {
    cl <- makeCluster(getOption('cl.cores', cores))
    registerDoParallel(cl)
    registerDoSEQ()
    on.exit(stopCluster(cl))
} else {
    options('mc.cores' = cores)
    registerDoParallel(cores)
}

[d] parallelise lapply()

If on Windows, use:

parLapply(cl, <code>)

If on Mac / Linux:

mclapply(<code>)

[e] parallelise a 'for' loop with foreach

foreach(i = 1:10000) %dopar% {
  prcomp(matrix(rexp(200, rate = .1 * i), ncol = 200))$x
}

-------------------------------------------

2. Practical examples

[a], regression analysis (any type of lm, glm, clogit, et cetera)

NB - code removed as this is now implemented as Bioconductor package, 'RegParallel'

top

[b], correlation analysis

See the full example given here: A: Parallelization to compute correlations

(i) Generate 10000 x 2000 random data matrix

mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664

(ii) use foreach to parallelise the correlation

res <- foreach(i = seq_len(ncol(mat)),
  .combine = rbind,
  .multicombine = TRUE,
  .inorder = FALSE,
  .packages = c('data.table', 'doParallel')) %dopar% {
  cor(mat[,i], mat, method = 'pearson')
}

res[1:10,1:10]
           Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000

(iii) check some results to ensure that we have done it right

table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE 
2500 

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
 TRUE 
49729

table(cor(mat[,1], mat) == res[1,])
 TRUE 
10000 

table(cor(mat[,6666], mat) == res[6666,])
 TRUE 
10000

[c], clusGapKB

This is an edit of the original clusGap, related to the Gap Statistic by Rob Tibshirani ( https://statweb.stanford.edu/~gwalther/gap ). The edited code is found on my GitHub account: https://github.com/kevinblighe/clusGapKB

For partitioning around medoids (PAM), we can create a custom function that just performs the clustering and ONLY retains the medoids for each k:

CustomPAM <- function(x,k) list(cluster=pam(
  x, k,
  diss=FALSE,
  metric="euclidean",
  medoids=NULL,
  stand=FALSE,
  cluster.only=TRUE,
  do.swap=TRUE,
  keep.diss=FALSE,
  keep.data=FALSE,
  pamonce=TRUE,
  trace.lev=0))

Load testing airway data and process to rlog counts with DESeq2:

library(airway)
data('airway')
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds, betaPrior = FALSE)
rlogcounts <- rlog(dds, blind=TRUE) # or vstcounts <- vst(dds, blind=TRUE)
rlogcounts <- assay(rlogcounts)[1:5000,]

Run clusGapKB:

source("clusGapKB.R") # get from GitHub
gap <- clusGapKB(rlogcounts, FUNcluster=CustomPAM, K.max=20, B=250)

Can also just do k-means clustering:

gap <- clusGapKB(rlogcounts, FUNcluster=kmeans, K.max=20, B=250)

top

ADD COMMENTlink modified 15 hours ago • written 18 months ago by Kevin Blighe39k
2

I learnt the parallel + mclapply trick a few months ago, really useful in a few niche cases IMO.

ADD REPLYlink written 6 months ago by RamRS20k

@Kevin Im using your enhanced function works pretty cool..

ADD REPLYlink written 6 months ago by krushnach80470

Good work, Krushnach! Which function in particular?

ADD REPLYlink written 6 months ago by Kevin Blighe39k

both clusgap and the correlation matrix calculation

ADD REPLYlink written 5 months ago by krushnach80470

Great. I have removed the correlation matrix function, though. It is 'on the shelf' for improvement.

ADD REPLYlink written 5 months ago by Kevin Blighe39k

Krushnach, I have re-added a correlation function that works more efficiently.

ADD REPLYlink written 13 hours ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1195 users visited in the last hour