Question

Code Optimization For Large Scale Data

1

Entering edit mode

12.4 years ago

Julien Textoris ▴ 430

Hi,

I'm trying to deal with mass spectroscopy data that I uploaded with MALDIquant libray. I have 384 spectra. In MALDIquant, spectra are stored in an object composed of a mass vector, an intensity vector and metaData. All spectra are in a list object. One problem to handle all spectra and for example to normalize them and align peaks is that all mass vectors are not equivalent (in length and in content).

To circumvent this, i'd like to reduced the data. Indeed, for the mass 1000 Da, for example, i have up to 12 isolated values :

spectra.bc[[1]]@mass[1:10]
 [1] 1000.129 1000.201 1000.273 1000.346 1000.418 1000.490 1000.563 1000.635 1000.708 1000.780

The typical length of a spectra is around 96000 values. I think for my purpose that a precision of 1 Da would be sufficient, so I would like to do compute a matrix of 10 001 rows (values of mass from 1000 to 10000) and intensities of each samples/spectra in columns, with something like this :

(sub is a list of 16 spectra. Each spectra s has masses in s@mass and intensities in s@intensity)

m = matrix(nr=10001,nc=length(sub),byrow=F,
       data=lapply(sub, function(s) {
          sapply(1000:10000,function(x) {
            median(s@intensity[which(round(s@mass)==x)])
          })         
       }))

But even with this portion of code, it takes about two minutes for one spectra to compute a thousand values (i ran a test with sapply(1000:2000, ...)). So it's very time consuming.

Do you know if i could improve the script to do the job faster ?

Thanks a lot,

Julien

Here is the description of the spectrum object :

MassSpectrum Object extends AbstractMassObject with slots :

.cache: "environment", internal use only mass: "vector", mass or mass-to-charge ratio intensity: "vector", intensities for measured mass-to-charge ratios metaData: "list", some metadata to describe the spectrum

spectra is a list of MassSpectrum objects.

The problem comes from the fact that if you have a list of 10 object (spectra) and do :

lapply(spectra, function(x) {length(x@mass)})

you don't get the same length for each mas vector.

In fact, i would like to put the intensities of each sample into a column of a matrix, but for that, i have to define a similar list of rows = masses for every sample. As i don't care for such precision, i imagined i could reduce the data by taking only one value of intensity for all intensities of similar masses ie when "round(s@mass) == x". The idea is that at start i have a vector of length 96000, and at the end, i will have a vector of length 10001, with the first 10 values transformed into one median, the next 12 values transformed into a second median, etc ...

I hope it is clearer that way but i'm not sure !

mass-spec code r • 2.7k views

ADD COMMENT • link 12.4 years ago by Julien Textoris ▴ 430

0

Entering edit mode

I don't know much about mass spec data so bear with me. So you basically want to: for every mass value, get the median of the list of intensities?

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

Yes, kind of, in fact i want to reduce the length of the vector, and i want to have a vector of the same length for all samples. At the beginning, i have for spectra 1 a list of mass like : 1000.129, 1000.201, ... , 1345.134, 1345.245, ... and for example for spectra 2 I would have a list of different masses : 1000.198, 1000.267, ... But in fact, i don't care about such precision, so i would like to get only one value for all 1000's values, and only one for all 1345's values. Then, it would be possible to have a matrix with a unique list of masses (integers from 1000 to 10000) as row.names(m)

ADD REPLY • link 12.4 years ago by Julien Textoris ▴ 430

0

Entering edit mode

Is your data in tab delimited format? Can you show me a couple of lines of the data?

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

In fact, i was talking of data already loaded in R, because data are in a proprietary format from bucker daltonics, and the library allows for automatic input. I'll put the description of the object in an answer (comment to short)

ADD REPLY • link 12.4 years ago by Julien Textoris ▴ 430

0

Entering edit mode

It might be faster to output it to text file and use python or something to deal with it. Loading it into a matrix and then working with it in R tend to be more process intensive.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Entering edit mode

Yes, i thought of that, but to do that, i would have to export the spectra one by one to separate text files. But that's a lead, and maybe i can use the function that was written to import data into R to automatically create these text files. Thanks for the idea !

ADD REPLY • link 12.4 years ago by Julien Textoris ▴ 430

score 2 · Answer 1 · 2011-11-29

Check out the PLYR package from Hadley Wickham you can use ddply, ldply ect...

Reasons for using PLYR:

1: Very easy to use. You tell the loop what format in and what format out. 2: Threading! You can thread loops over multiple processors very easily. This will drastically speed your computes.

happy hunting.

score 0 · Answer 2 · 2011-12-01

Thanks for the hint,

library PLYR seems really interesting, but after reading the manual, i read the foreach manual, and then, i think PLYR do not add a lot to foreach for me. Here is a first solution :

a = proc.time()
sub = spectra.bc[c(1:2)]
m = matrix(nr=501,nc=length(sub),byrow=F,
           data=unlist(lapply(sub, function(s) {
              sapply(1000:1500,function(x) {
                median(s@intensity[which(round(s@mass)==x)])
              })
           })))
b= proc.time()
b[3]-a[3]

This took 27s

and with foreach :

library(foreach)
library(doMC)
registerDoMC(4)

a = proc.time()
sub = spectra.bc[c(1:2)]

m = foreach(s = unlist(sub), .combine="cbind") %do% {
        foreach(x = 1000:1500, .combine="c", .packages="MALDIquant") %dopar% {
                median(s@intensity[which(round(s@mass)==x)])
        }
}
b = proc.time()
b[3]-a[3]

This time with 4 CPUs, it took 7s !

So this is a solution. But for people who don't have multiple core/processors, i think i will also explore the way to export data to text file and proces them outside R with perl or python to see if it's OK on just one processor in a reasonable amount of time.

Julien