improve my slow R function
5
1
Entering edit mode
6.0 years ago
lilepisorus ▴ 30

I am trying to calculate the percentage of heterozygous sites for each sample. I wrote a R script using stringr to count the number of heterozygous sites and divided by the number of known sites (the number of all sites minus the missing sites "N"). Here is my function:

The column of my data frame is for each sample and the row is for each SNP site.

PercentPolymorphism <- function(df) {
countN=c(); 
knownSitesN=c();
heteroSitesN=c();
PercPolym=c()
for (i in 1:ncol(df)) {
    countN[i]= sum(str_count(df[, i], "N"));
    knownSitesN[i]=nrow(df) - countN[i]
    heteroSitesN[i] = sum(str_count(df[, i], c("R", "Y", "M", "K", "S", "W")));
    PercPolym[i] = heteroSitesN[i]/knownSitesN[i]
}#for

df.new=as.data.frame(rbind(knownSitesN, heteroSitesN, PercPolym, df))
return(df.new)
} #function

I have around 5000 samples and merely 1Mb SNP data. This function works with the data without throwing any error, but it takes very long to complete. 

Any one could suggest to modify my function in order to improve the computation efficiency ?

Thanks in advance!

 

Li

R • 3.8k views
ADD COMMENT
0
Entering edit mode

Please post example data.

ADD REPLY
0
Entering edit mode
An example is as follows:

   sample1   sample2   sample3  sample4   sample5  
Site1 A       T       C       T       W      W
Site2 A       W       M       N       N       N
Site3 R       A       A       A       N       A
Site4 A       T       C       W       T       T
Site5 G       A       N       N       A       A
ADD REPLY
0
Entering edit mode

This contains six sample columns, not five.

ADD REPLY
0
Entering edit mode

use a character matrix instead of a data frame

ADD REPLY
4
Entering edit mode
6.0 years ago
PercentPolymorphism <- function(df) {
countN=c(); 
knownSitesN=c();
heteroSitesN=c();
PercPolym=c()
countN = lapply(df, function(x) sum(str_count(x, "N")))
knownSitesN = nrow(df)-c(1:ncol(df))

and so on. You can also use mclapply() rather than just lapply(), to take advantage of multiple cores. Avoid for loops in R.

ADD COMMENT
3
Entering edit mode

While it's true that for-loops in R are slow, the use of *apply is not necessarily better. I think *apply functions use for loops anyway. Here's an example:

system.time({
## FOR LOOP
ilog<- rep(NA, 1000000)
for(i in 1:1000000){
    ilog[i]<- log(i)
}
})
   user  system elapsed
  1.401   0.004   1.403
 
system.time({
## SAPPLY
ilog<- sapply(1:1000000, log)
})
   user  system elapsed
  3.394   0.011   3.403

system.time({
+ ## LAPPLY
+ ilog<- lapply(1:1000000, log)
+ })
   user  system elapsed
  2.564   0.004   2.565

 

 

ADD REPLY
1
Entering edit mode

I borrowed ideas from this thread (http://stats.stackexchange.com/questions/6380/how-to-optimize-my-r-script-in-order-to-use-multicore) by using foreach and multicores:

library(foreach)
library(doMC)
registerDoMC()

#change the previous:   for (i in 1:ncol(df)) {         
foreach(i=1:ncol(df)) %dopar% { 

I am expecting it will largely reduce the computing time, as I am using a 32 core server.

 

 

ADD REPLY
0
Entering edit mode

You can use snow, which is a parallelization library built off of MPI. Really easy to use, you add a few lines of set up and then any place you have an apply, snow provides a parallel version you can just drop in.

ADD REPLY
0
Entering edit mode

Glad they've gotten faster, it's often annoying to write everything in a functional programming style.

ADD REPLY
1
Entering edit mode
6.0 years ago

To solve this problem you must first convert your data frame to a long format.

Your file:

site sample1 sample2 sample3 sample4 sample5 sample6
Site1 A T C T W W
Site2 A W M N N N
Site3 R A A A N A
Site4 A T C W T T
Site5 G A N N A A


Code:

> library(tidyr)
> library(dplyr)
> mydf = read.table('biostar.csv', header=T)
> mydf.long = mydf %>% gather(sample, genotype, -site)
> mydf.long %>% head
    site  sample genotype
1  Site1 sample1        A
2  Site2 sample1        A
3  Site3 sample1        R
4  Site4 sample1        A
5  Site5 sample1        G
6  Site1 sample2        T
7  Site2 sample2        W
8  Site3 sample2        A

EDIT: if you prefer the reshape2 library instead of tidyr:

> library(reshape2)
> mydf.long = mydf %>% 
   melt(id.vars=c('site'), variable.name='sample') %>% 
   rename(genotype=value)

Now I am not really sure about what your function must do, but the principle is to use group_by and summarise from dplyr.

> mysummary = mydf.long %>%
    group_by(sample) %>%
    summarise(
        totN         = sum(genotype == 'N', na.rm=T),  # in alternative length(site[genotype=='N'])
        knownSitesN  = sum(genotype != 'N', na.rm=T),
        heteroSitesN = sum (genotype %in% c("R", "Y", "M", "K", "S", "W"), na.rm=T),
        percPolyM    = heteroSitesN/knownSitesN
    )

> mysummary
   sample totN knownSitesN heteroSitesN percPolyM
1 sample1    0           5            1 0.2000000
2 sample2    0           5            1 0.2000000
3 sample3    1           4            1 0.2500000
4 sample4    2           3            1 0.3333333
5 sample5    2           3            1 0.3333333
6 sample6    1           4            1 0.2500000

Again, instead of doing operations on columns, you must reformat the dataframe to a long format, so that the sites are on the rows instead. Then you can use the standard group_by operations in dplyr to calculate your summaries. You never need to iterate on columns in R.

ADD COMMENT
0
Entering edit mode

I am trying to follow your scripts, but got an error when doing 

mydf.long = mydf %>% gather(sample, genotype, -site)

Error:

Error in .Call("reshape2_melt_dataframe", PACKAGE = "reshape2", data,  : 

  negative length vectors are not allowed

Do you have a clue?

 

Thanks

Li

ADD REPLY
0
Entering edit mode

It's a weird error message, because gather is from tidyr and %>% is from dplyr. None of these functions are from reshape2. Can you try from a clean workspace, loading only the tidyr and dplyr libraries?

ADD REPLY
0
Entering edit mode

If you prefer to use reshape2 instead of tidyr, you can use the melt function instead of gather. I'll update the example accordingly.

ADD REPLY
1
Entering edit mode
6.0 years ago
foo<-paste(sample(c("A","C","G","T","R","Y","M","K","S","W","N"),2500,rep=T),sep='',collapse='')
bar<-matrix(unlist(strsplit(foo,'')),ncol=500)

useful_stats<-function(x){
  countN<-sum(x=='N')
  known<-sum(x!='N')
  hetero<-sum(x %in% c("R", "Y", "M", "K", "S", "W"))
  percpoly<-as.numeric(hetero)/as.numeric(known)
  list("countN"=countN,"known"=known,"hetero"=hetero,"percpoly"=percpoly)
}

seems pretty fast

> system.time(apply(bar,2,useful_stats))
   user  system elapsed 
  0.018   0.000   0.019 
ADD COMMENT
0
Entering edit mode

It is good for only few columns, but when it extend to the whole dataset. I still got a memory issue:

Error in aperm.default(X, c(s.call, s.ans)) : 

  long vectors not supported yet: memory.c:1636

Calls: apply -> aperm -> aperm.default

Execution halted

 

I will try to truncate my dataset into several pieces and see if it goes through.

Thanks!

ADD REPLY
1
Entering edit mode
6.0 years ago

Pre-allocate space for your vectors; otherwise, you force R to reallocate space as the vectors grow. This can really slow down R loops (cf. http://musicallyut.blogspot.com/2012/07/pre-allocate-your-vectors.html ):

> some_vector <- vector("numeric", ncol(some_data_frame))
> # do stuff with some_vector...

This behaves with somewhat similar performance as dariober's recommendation to use rep():

> system.time({
+ ilog <- rep(NA, 1000000)
+ for (i in 1:1000000){
+     ilog[i] <- log(i)
+ }
+ })
   user  system elapsed 
  1.495   0.020   1.522 

> system.time({
+ ilog <- vector("numeric", 1000000)
+ for (i in 1:1000000) {
+     ilog[i] <- log(i)
+ }
+ })
   user  system elapsed 
  1.505   0.016   1.528 

Use rbindlist() instead of rbind() to make your data frame (cf. http://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind ).

See The R Inferno for a larger discussion on why growing objects in R is not ideal. It is a very useful guide on navigating some of R's pitfalls.

ADD COMMENT
0
Entering edit mode
6.0 years ago
komal.rathi ★ 3.7k

Not sure what your function is doing but you can do something like below by changing your function to work on vectors instead of dataframes/matrices. Use aaply to apply a function across columns in a parallel manner:

# load libraries
library(plyr)
library(doMC) 

# register 8 cores (you can give more depending on the access)
registerDoMC(8) 

# function definition
# not really sure what your function is doing 
# modify function to work on vectors instead of dataframe/matrix
# i.e. treat each column as vector
PercentPolymorphism <- function(x){
  countN = sum(x %in% "N")
  knownSitesN = length(x) - countN
  heteroSitesN = sum(x %in% c("R", "Y", "M", "K", "S", "W"))
  PercPolym = heteroSitesN/knownSitesN
  df = data.frame(knownSitesN,heteroSitesN,PercPolym)
}

# use parallel = TRUE
# margins = 2 to apply function on columns
res <- data.frame()
res <- rbind(res, aaply(.data = df, .margins = 2, .fun = PercentPolymorphism, .parallel = TRUE)) 
ADD COMMENT
0
Entering edit mode

Thanks!

I bet to write function on vectors is the most efficient way to do it. 

However, when go to step apply 

res <- rbind(res, aaply(.data = df, .margins = 2, .fun = PercentPolymorphism, .parallel = TRUE)) 

I still get the memory errors:

Error in aperm.default(X, c(s.call, s.ans)) : 

  long vectors not supported yet: memory.c:1636

Calls: as.data.frame -> rbind -> apply -> aperm -> aperm.default

Execution halted

 

I ended up by subset my big dataframe into several small pieces. I have around 5000 columns, and I do 1000 each time, and it can go through taking around 15-20 minutes. At last, I put the results file together. 

Thanks, everyone! If someone comes up with the idea, what is behind the limitation, I would love to hear. 

ADD REPLY

Login before adding your answer.

Traffic: 2497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6