How To Run R Script On Multiple Core To Process Huge Biological Data?
6
2
Entering edit mode
7.6 years ago
jack ▴ 490

Hi all,

I've written an R script that processes expression data and builds around 100,000 models from it. The computational load is too high for a single core, yet when I run the script on our server it uses only one CPU. Is there a way to run an R script on multiple CPUs?

r programming • 15k views
0
Entering edit mode

Sorry, this is a pure programming question and therefore off-topic. Closing.

4
Entering edit mode

Parallelized R work is certainly relevant to bioinformatics. We also have several questions of this sort ("pure programming") open elsewhere (example currently on the front page: best free text editor for all popular languages (R, Python, Perl etc..)) so I'm going to reopen.

0
Entering edit mode

It is relevant and interesting, but still off-topic, just as pure Python programming is.

1
Entering edit mode

I'm not sure what this means, but I think we can assume good faith: a user who has asked other bioinformatics-focused questions presumably believes this question is also relevant to the site's mission, in the spirit of sharing and furthering knowledge. Reopening.

1
Entering edit mode

We would normally ask users to edit the question so as to indicate relevance to a bioinformatics research problem; then close if they do not comply. As it stands, this is a pure R programming question more suited to e.g. StackOverflow. Where it would be downvoted for insufficient prior research :)

1
Entering edit mode

SO loves a good downvote, for better or worse.

@OP: Perhaps you could give us an idea of what it is you want to compute, so we can improve our responses.

0
Entering edit mode

Seems like this thread was closed prematurely, then. But I still do not understand the reflexive presumption of bad faith or irrelevance (unless the post is obviously spam or something of that nature). We're all here to learn, and this subject matter is certainly applicable to many bioinformatic analyses.

0
Entering edit mode

I don't think the issue is a "reflexive presumption". We're just trying to keep the forum on-topic, so that when people visit, they see a bioinformatics Q&A, not a programming Q&A. The subject matter is often applicable, but we want that application to be explicit.

0
Entering edit mode

There is nothing reflexive about trying to maintain the site, and I don't presume anything about the good or bad faith (or perhaps better, the presumed intention; we are scientists, after all) of the OP. We are trying hard to keep the forum on-topic. There are many topics that bioinformatics builds on, and pure programming is one of them; nevertheless, I feel such questions belong elsewhere when no connection to biology is visible.

Also, this question was originally a single sentence, which is usually a sign of insufficient research, lack of detail, or laziness. By my standards, such questions are more likely to be closed.

0
Entering edit mode

I'm not going to engage in a close-reopen war, but you have just acknowledged that there is no obvious connection to bioinformatics, because you had to construct one in good faith. The relation to bioinformatics should be obvious from the question, not inferred. In fact, it is extremely obvious that it is off-topic.

7
Entering edit mode
7.6 years ago

Try the parallel, doMC, doParallel or multicore packages if you need to make something run on multiple threads in R.

BTW, you should ask this on an R forum.
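As a hedged sketch of the doParallel route suggested above (the cluster size and the toy computation are placeholders; in practice each iteration would fit one of your models, and note that multicore and snow were later folded into base R's parallel package):

```r
library(doParallel)

cl <- makeCluster(2)           # on a real server: parallel::detectCores()
registerDoParallel(cl)

# %dopar% farms iterations out to the workers; .combine collects results
squares <- foreach(i = 1:10, .combine = c) %dopar% i^2

stopCluster(cl)
```

Each %dopar% iteration runs in its own worker process, so any packages or data it needs must be available there (see foreach's .packages and .export arguments).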

5
Entering edit mode
7.6 years ago

GNU make with its -j option can be used to parallelize your R scripts:
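A minimal Makefile sketch of that idea (the chunks/ and results/ layout and the build_model.R script are hypothetical; each recipe fits one model):

```make
# One .rds result per input chunk; adjust patterns to your data layout
CHUNKS  := $(wildcard chunks/*.txt)
RESULTS := $(CHUNKS:chunks/%.txt=results/%.rds)

all: $(RESULTS)

results/%.rds: chunks/%.txt
	Rscript build_model.R $< $@
```

Invoking `make -j 8` then lets make schedule up to eight independent recipes at once, and re-running only rebuilds chunks whose inputs changed.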

3
Entering edit mode
7.6 years ago

Here's a bit of sample code from someone in my group who uses parallel:

### Example R code:
# setup
suppressPackageStartupMessages(library(parallel))

# NSLOTS is typically set by the cluster scheduler; fall back to 1 core
nslots <- as.integer(Sys.getenv("NSLOTS", unset = "1"))
if (is.na(nslots)) nslots <- 1
cat("Using ", nslots, " cores (of ", detectCores(), ")\n", sep = "")

# do work in parallel; the syntax is just like lapply
allResults <- mclapply(theFiles, doFile, mc.cores = nslots)
# allResults is a list of doFile return values, same length as theFiles

He notes:

The only problem I've really found so far is that it does not synchronize output to stdout. I get around this by using sink to redirect output per function call to an individual file.

2
Entering edit mode
7.6 years ago
pld 5.0k

I am a big fan of the snow library for R: you only have to add a few lines of code to replace your *apply calls with the parallel versions snow provides. It has other features too, but this is the easiest way I've found to parallelize R code. An MPI installation is needed only if you use MPI-type clusters (I've always used OpenMPI); simple socket clusters work without it.

http://www.sfu.ca/~sblay/R/snow.html

Check the examples of parApply from the link above.
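A minimal socket-cluster sketch along those lines (the matrix is a stand-in for real data; a SOCK cluster needs no MPI, and snow's interface was later absorbed into base R's parallel package):

```r
library(snow)

cl <- makeCluster(2, type = "SOCK")   # 2 socket workers, no MPI required
m  <- matrix(1:20, nrow = 4)

# parallel drop-in for apply(m, 1, sum): one row per task
row_sums <- parApply(cl, m, 1, sum)

stopCluster(cl)
```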

1
Entering edit mode
7.6 years ago
lkmklsmn ▴ 950

There are many different options for parallelizing your R code. Check the 'multicore' and 'snow' packages for parallelization inside R. Alternatively, 'Rscript MyRun.R &' puts the process in the background so you can start another one, and the server's OS will distribute the load across the CPUs.
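The background-job approach can be sketched from the shell; here 'sleep 0' stands in for the hypothetical 'Rscript MyRun.R <chunk>' call so the sketch runs anywhere:

```shell
# One background job per data chunk; the OS schedules them across CPUs
for chunk in 1 2 3 4; do
  sleep 0 &            # e.g.: Rscript MyRun.R "$chunk" &
done
wait                   # block until every background job has finished
echo "all jobs done"
```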

1
Entering edit mode
7.6 years ago

Parallelisation in R is primarily about how you design your functions and script. The other thing to consider is the architecture it runs on, since different packages use different methods of parallelisation. A heavy-duty Linux server with large shared memory makes parallelisation a lot easier, because fork() covers a lot of the problematic cases. Otherwise your implementation will be nodal, meaning each node needs its own copy of the workspace, which can add up quickly!
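To make the fork-versus-copy distinction concrete, here is a sketch using base parallel (the data size is illustrative; mclapply forks and is effectively Unix-only):

```r
library(parallel)

x <- rnorm(1e6)   # lives once in the parent process

# Fork-based (Linux/macOS): workers share the parent's memory
# copy-on-write, so x is not duplicated just to be read
res_fork <- mclapply(1:4, function(i) sum(x), mc.cores = 2)

# Socket-based (portable, or across machines): each worker is a fresh R
# process, so x must be exported -- i.e. copied to every node
cl <- makeCluster(2)
clusterExport(cl, "x")
res_sock <- parLapply(cl, 1:4, function(i) sum(x))
stopCluster(cl)
```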