Question

R loop and other languages

1

Entering edit mode

4.5 years ago

mel22 ▴ 100

Hi , I am trying to perform a big number of case control status permutations, this is my code:

Y<-matrix(0,1285,100000)
Y<-apply(Y,c(1,2), function(x) sample(c(0,1),1))
NAMES <- names(df[,51:7298])
res<-as.data.frame(matrix(nrow=100000,ncol=length(NAMES),dimnames=NULL))
for (s in 1:100000) {
  for(i in 51:7298)
  {
    fit<-glm(Y[,s]~df$cov1+df[,i]+df$cov2+ df$cov3,data=df,family=binomial())      
    sfit <-summary(fit)
    j=i-50
    res[s,j]<-as.numeric(sfit$coefficients[3,4])
    colnames(res)<-NAMES
   }}

It's too slow under R, will it be better to use C++ or Python ? I have a beginner level in both Python and C++language, any script suggestions are welcome !

Thank you

R • 932 views

ADD COMMENT • link 4.5 years ago by mel22 ▴ 100

1

Entering edit mode

If time really matters, C++ is the best.

ADD REPLY • link 4.5 years ago by shoujun.gu ▴ 350

1

Entering edit mode

Yours is an embarrassingly parallel problem. Learning to recognize such cases and how to handle them will go a long way in helping you deal with heavy computations. As already mentioned, check the various R packages available for this. Also this is where you'd make use of a compute cluster if you've access to one. Recoding in a fast language (C, C++...but not python) should be the absolute last resort (e.g. when all you can use is a laptop) or done as an exercise.

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Thank you very much Jean-Karim, The problem that I am failing even on a compute cluster. The loop I wrote take a lot of memory exceeding the cluster performance (Or may be I am not using the best way ... ) That's way I am lokking how to mek better this or to write it in anothjer language ... C++ still not easy for me as a beginner

ADD REPLY • link 4.5 years ago by mel22 ▴ 100

0

Entering edit mode

Start by ensuring that the outer loop sends jobs to different nodes of your cluster. Rewriting without parallelization may not solve your memory issue if your cluster nodes don't have the required RAM. The way your code is written you keep data for each run in memory so your memory use keeps on growing. Your res matrix is about 5.5 GB, your Y matrix is 1 GB so this piece of code needs at least 7 GB without including the data in df and space required to compute the regression model. Also stick with matrices if you don't need the flexibility/features of data frames. Accessing elements of a data frame can be several orders of magnitude slower than with a matrix as you can see with this:

library(microbenchmark)
mat <- matrix(runif(100000), nrow = 1000, ncol = 100)
df <- as.data.frame(mat)
microbenchmark(times=1000, unit="ms", mat[105,87], df[105,87])

On my machine, this shows the data frame to be ~50 times slower than the matrix.

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k

score 3 · Answer 1 · 2019-10-20

3

Entering edit mode

4.5 years ago

ATpoint 82k

I would rather invest into optimising code in a language you know well rather than trying out several ones at which you are a beginner. Check out parallelization in R, be it using foreach or functions like mclapply from the parallel package. Also, critically review your code towards redundant operations that might slow down execution. What exactly is this code doing and why? What is the input data format and what is the final output?

ADD COMMENT • link 4.5 years ago by ATpoint 82k

0

Entering edit mode

Thank you ATpoint, it's a code for a case-control study with genotyping data, I ned to perform a certain number of permutation and keep the value of each SNP after each permutation to compare with the observed value

ADD REPLY • link 4.5 years ago by mel22 ▴ 100

0

Entering edit mode

Your code is actually fine, it is simply a lot of data. I suggest you replace the outer loop by mclapply or any of the available parallelization commands, get a HPC node with a lot of memory and many CPUs and then wait till everything is finished. I do not know about memory usage but if you use, say a 72-core node if available, this should speed up things dramatically with mclapply.

ADD REPLY • link 4.5 years ago by ATpoint 82k