Question: edgeR / deSeq parallel usage
1
gravatar for russhh
5.0 years ago by
russhh4.6k
UK, U. Glasgow
russhh4.6k wrote:

Hi.

I'm using edgeR for differential expression analysis of RNASeq data (counts downloaded from TCGA).

The workflow I'm using follows the GLM functionality of edgeR, as descirbed in the vignette for edgeR.

dge <- DGEList(...)
design <- model.matrix(...) # non-singular

y <- estimateGLMCommonDisp(dge, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = ncol(design) + c(-1,0))

Fitting the model, as above, can take ages on our server - seemingly because the dataset is pretty large, the design matrix is pretty large etc etc AND, only a single processor is used by edgeR.

I can't find any options within edgeR that can take advantage of parallel processing and was wondering whether any of the alternative RNASEq expression packages, such as DESeq, had this capability

All the best

Russ

edger rna-seq R • 3.9k views
ADD COMMENTlink modified 5.0 years ago by Michael Love1.9k • written 5.0 years ago by russhh4.6k
1

DESeq2 doesn't have any option for parallelism either. You could add this though (that's actually how DEXseq's parallelism works). You could also do the same with edgeR, though.

BTW, TCGA data is often from RSEM, which won't be integers. In that case, use limma/voom instead.

ADD REPLYlink written 5.0 years ago by Devon Ryan91k

The data I'm using is count based, it's a release from a while ago; the more recent lung RNASeq data is 'normalised' which basically destroys my chances of using the most appropriate tools for RNASeq data (the newer gene-level release even mocks me with a 'raw counts' column... of floats; though I believe you can munge the exon-level data into gene-level counts).

ADD REPLYlink written 5.0 years ago by russhh4.6k
2

Consider using limma::voom even for the "real count" based data, though -- especially on large datasetsThe "voom step" is lightening fast as compared to the "estimate dispersion" steps that both edgeR and DESeq(2) have to run through ... this depends on the size of the dataset, but we're talking less than 1 minute for the entirety of a limma::voom pipeline (variance estimation through lmFit and testing contrasts) to several (3+) hours for just estimating the dispersion (with DESeq2, at least -- I suspect edgeR would be close to the same speed).

ADD REPLYlink written 5.0 years ago by Steve Lianoglou5.0k

Thanks both for recommending limma/voom. I think this is probably the best way to go.

I attempted to speed edgeR up by splitting the dataset, fitting the dispersions in each split and then merging the data back together. I suspected this was flawed but tried it anyway and, though it was faster, the trended dispersions were quite different within the tails of average log CPM (way more than I would have expected).

So limma / voom looks like a really good option for me. Regards, Russ

ADD REPLYlink written 5.0 years ago by russhh4.6k

Just checking :) There have been a lot of posts both here and elsewhere with people trying to use "raw counts" that turn out to not be so raw.
 

ADD REPLYlink written 5.0 years ago by Devon Ryan91k

thanks, looks like my MPI/lapack setup is flawed (though I intiially thought I could alter the R source to use mclapply or similar)

ADD REPLYlink written 5.0 years ago by russhh4.6k

You should be able to just use bplapply (see the BiocParallel package), though you'll need to split things on your own. Keep in mind that this will mostly only work with the variance estimation, though that's probably the bottleneck anyway.

ADD REPLYlink written 5.0 years ago by Devon Ryan91k

" (the newer gene-level release even mocks me with a 'raw counts' column... of floats; though I believe you can munge the exon-level data into gene-level counts)."

Based on some detective work, those floating point "raw_counts" in the MAGE-TAB export appear to be the estimated count output from RSEM. RSEM does not simply output the number of fragments (equal to read count if single-end seq) that overlap with a gene by some definition, but includes an estimation of the contribution from reads that map to multiple locations using an Expectation-Maximization (EM) algorithm.

The author of RSEM suggests that these estimated counts can be used in methods expecting read counts (such as edgeR) by rounding them to the nearest integer value. I suspect this rescue method might work in many cases, but add substantial noise to some genes.

 

 

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by David Fredman990

It should be noted that the authors of tools like edgeR typically prefer people not do that and instead use things like limma::voom (I can't count how many times I've seen Gordon Smyth write that).

ADD REPLYlink written 5.0 years ago by Devon Ryan91k

maybe you could first exclude rows that have too low read counts in order to reduce amount of data

ADD REPLYlink written 5.0 years ago by Biomonika (Noolean)3.1k
5
gravatar for Michael Love
5.0 years ago by
Michael Love1.9k
United States
Michael Love1.9k wrote:

I'm currently working on a parallel wrapper for DESeq2. The dispersion estimation for parallel execution is easy, but I need a bit more time to restructure for the LFC calculation.

As Steve mentions, limma is very fast when you have many 100s of samples as in TCGA.

ADD COMMENTlink written 5.0 years ago by Michael Love1.9k
3

In DESeq2 version >= 1.5.54 you can do:

library(BiocParallel)
register(MulticoreParam(10))
DESeq(dds, parallel=10)

See the BiocParallel package for the backend options. More information under ?DESeq.

ADD REPLYlink written 4.9 years ago by Michael Love1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1504 users visited in the last hour