Question: DESeq2: Is dispersion estimation gene-wise or gene- and condition-wise?
3.3 years ago, Eachan Johnson (Cambridge, MA) wrote:

Hello all,

I have barcode count data corresponding to the viability of 25+ pooled bacterial strains under various conditions. The marginal distribution of untreated strain counts appears to be Negative Binomial.

I'm trying to use DESeq2 to analyze these data, using a matrix of strains ("genes") as rows and conditions as columns. Since the variation of counts between most conditions for most strains is very large, but between replicates is relatively small, it seems sensible to estimate dispersions (in this case) on a gene- and condition-wise basis.

The language in the DESeq2 vignettes and pre-print seems to suggest the dispersion estimates are "gene-wise". So if you run DESeq() followed by plotDispEsts(), does each point correspond to the variance estimate of a gene (in my case, strain) across conditions, or to the variance estimate between replicates of a gene under one condition?

I think the conceptual difference I'm talking about is the same as that between blind=TRUE and blind=FALSE in the rlog() and varianceStabilizingTransformation() functions.

Finally, if DESeq2 does estimate dispersions on a solely gene-wise basis, would it be reasonable for me to estimate the dispersions of my data subsetted by each condition in turn, and then feed those results into my whole DESeqDataSet object using dispersions()?

Many thanks for taking the time to read, and for any suggestions you might have.


3.3 years ago, vivekbhr wrote:

The estimateDispersions function estimates gene-wise dispersions using all columns, so there is one dispersion value per gene. However, a GLM based on the design (i.e. on within-group variances and means) is fit first, so the dispersion is estimated from the variability around the fitted condition means, and the differences between conditions are already accounted for. So I think you don't have to (and probably shouldn't) use condition-wise estimation.
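To make the distinction concrete, here is a minimal sketch in plain Python (not DESeq2 itself). It uses a simple method-of-moments estimate from the Negative Binomial relation var = mu + alpha * mu^2, which is an assumption chosen for illustration; DESeq2 uses a likelihood-based estimate with shrinkage. The function names and the counts for the single hypothetical strain are made up.

```python
from statistics import mean, variance

def moment_dispersion(mu, var):
    """Method-of-moments NB dispersion from var = mu + alpha * mu**2 (illustrative only)."""
    return max((var - mu) / mu ** 2, 0.0)

def blind_dispersion(counts_by_condition):
    """Ignore the design: treat every sample as a replicate of a single group."""
    pooled = [c for reps in counts_by_condition.values() for c in reps]
    return moment_dispersion(mean(pooled), variance(pooled))

def design_aware_dispersion(counts_by_condition):
    """One value per gene, built only from variability around each condition's own mean."""
    per_condition = [
        moment_dispersion(mean(reps), variance(reps))
        for reps in counts_by_condition.values()
    ]
    return mean(per_condition)

# Hypothetical counts for one strain: three conditions, each in duplicate.
strain = {
    "untreated": [1000, 1100],
    "drug_a": [100, 120],
    "drug_b": [5000, 4600],
}

blind = blind_dispersion(strain)
aware = design_aware_dispersion(strain)
print(f"blind (design ignored): {blind:.4f}")
print(f"design-aware:           {aware:.4f}")
```

With these made-up counts, the blind estimate comes out far larger than the design-aware one, because it absorbs the differences between conditions as if they were noise. The design-aware value is still a single number per strain, which is the sense in which the estimate is "gene-wise" while still respecting the conditions.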

By the way, why do you want to use DESeq2 here, with a dataset of only 25 rows?



Indeed, one should not split things by group before estimating dispersions. There is still a power increase with 25 rows when using DESeq2 (or similar) versus a straight GLM, though it's fairly small.

written 3.3 years ago by Devon Ryan

Thanks, Devon, for your advice. To add to the reason why I'm using DESeq2, I'm also trying to avoid reinventing the wheel.

written 3.3 years ago by Eachan Johnson

That's usually a pretty compelling reason :)

BTW, you might instead consider limma. I'm not entirely sure how well DESeq2 scales to the number of samples you have (limma has historically had better luck there, given how it works).

written 3.3 years ago by Devon Ryan

DESeq2 is slower than in the transpose situation, but not unmanageably so - it takes about a day on our server, which I can live with once I'm sure the output is meaningful for my application. I will be sure to investigate limma as well. Thanks, again!

written 3.3 years ago by Eachan Johnson

Thanks for your helpful input, Vivek. I'll re-read the DESeq2 paper.

To answer your question, the number of columns is the issue more than the number of rows. The experiment I'm working with has 50,000 conditions in duplicate, and I'm interested in differential outgrowth of the 25 strains.

written 3.3 years ago by Eachan Johnson

@Eachan wow, this sounds interesting.

written 3.3 years ago by vivekbhr

"50,000 conditions in duplicate" 

Out of curiosity, what experimental design or protocol allows you to test so many conditions? And yes, sounds interesting!

written 3.3 years ago by dariober