DESeq2: Is dispersion estimation gene-wise or gene- and condition-wise?
8.3 years ago

Hello all,

I have barcode count data corresponding to the viability of 25+ pooled bacterial strains under various conditions. The marginal distribution of untreated strain counts appears to be Negative Binomial.

I'm trying to use DESeq2 to analyze these data, using a matrix of strains ("genes") as rows and conditions as columns. Since the variation of counts between most conditions for most strains is very large, but between replicates is relatively small, it seems sensible to estimate dispersions (in this case) on a gene- and condition-wise basis.

The language in the DESeq2 vignettes and pre-print seems to suggest that the dispersion estimates are "gene-wise". So if you run DESeq() followed by plotDispEsts(), does each point correspond to the variance estimate of a gene (in my case, a strain) across conditions, or to the variance estimate between replicates of a gene under one condition?

I think the conceptual difference I'm talking about is the same as that between blind=TRUE and blind=FALSE in the rlog() and varianceStabilizingTransformation() functions.
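The distinction can be sketched with simulated data. The block below is a minimal illustration, not the poster's analysis: it assumes DESeq2 is installed, uses makeExampleDESeqDataSet() with hypothetical settings (500 rows, 8 samples, two conditions, strong condition effects), and mimics the blind setting by replacing the design with ~ 1 before estimating gene-wise dispersions.

```r
suppressPackageStartupMessages(library(DESeq2))

set.seed(42)
# Simulated counts with strong differences between two conditions.
# n, m and betaSD are illustrative values, not taken from the thread.
dds <- makeExampleDESeqDataSet(n = 500, m = 8, betaSD = 2)
dds <- estimateSizeFactors(dds)

# Design-aware: variability of replicates around their fitted condition means.
dds_design <- estimateDispersionsGeneEst(dds)

# "Blind": same counts, but the design ignores the condition labels,
# so condition differences are absorbed into the dispersion.
dds_blind <- dds
design(dds_blind) <- ~ 1
dds_blind <- estimateDispersionsGeneEst(dds_blind)

disp_design <- mcols(dds_design)$dispGeneEst
disp_blind  <- mcols(dds_blind)$dispGeneEst

# Rows with all-zero counts have NA dispersions, hence na.rm = TRUE.
median(disp_design, na.rm = TRUE)
median(disp_blind,  na.rm = TRUE)
```

In both cases there is still exactly one dispersion per row; what changes is whether the condition effect is modelled in the means or left in the dispersion.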

Finally, if DESeq2 does estimate dispersions on a solely gene-wise basis, would it be reasonable for me to estimate the dispersions of my data subsetted by each condition in turn, and then feed those results into my whole DESeqDataSet object using dispersions()?

Many thanks for taking the time to read, and for any suggestions you might have.

Eachan

RNA-Seq dispersion-estimation R DESeq2 • 5.0k views
8.3 years ago
vivekbhr ▴ 690

The estimateDispersions function estimates gene-wise dispersions using all columns. However, a GLM (based on within-group means and variances) is fitted first, so the differences between conditions are already accounted for. So I think you don't have to (and probably shouldn't) use condition-wise estimation.
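A small sketch of this point, assuming DESeq2 is installed. The counts are simulated (25 strains, 6 hypothetical conditions in duplicate, true dispersion 0.05), and fitType = "mean" is an illustrative choice to avoid fitting a trend curve to only 25 points; none of these values come from the original data.

```r
suppressPackageStartupMessages(library(DESeq2))

set.seed(1)
n_strains <- 25   # rows, as in the question
n_cond    <- 6    # hypothetical; far fewer than the real experiment
n_rep     <- 2    # duplicates

condition <- factor(rep(paste0("c", seq_len(n_cond)), each = n_rep))

# Hypothetical condition-specific mean counts, repeated for each replicate.
mu_cond <- matrix(2^runif(n_strains * n_cond, min = 6, max = 12),
                  nrow = n_strains)
mu <- mu_cond[, rep(seq_len(n_cond), each = n_rep)]

# Negative Binomial counts with dispersion 0.05 (size = 1 / dispersion).
counts <- matrix(
  rnbinom(length(mu), mu = mu, size = 1 / 0.05),
  nrow = n_strains,
  dimnames = list(paste0("strain", seq_len(n_strains)),
                  paste0("sample", seq_len(ncol(mu))))
)
col_data <- data.frame(condition = condition, row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = col_data,
                              design    = ~ condition)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, fitType = "mean")

dim(dds)                   # rows x columns of the count matrix
length(dispersions(dds))   # one dispersion per row, not per row and condition
```

The condition means differ widely by construction, yet each strain gets a single dispersion, estimated from the spread of replicates around the fitted condition means.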

By the way, why do you want to use DESeq2 here, with a dataset of only 25 rows?


Indeed, one should not split things by group before estimating dispersions. There is still a power increase with 25 rows when using DESeq2 (or similar) versus a straight GLM, though it's fairly small.


Thanks, Devon, for your advice. To add to the reason why I'm using DESeq2, I'm also trying to avoid reinventing the wheel.


That's usually a pretty compelling reason :)

BTW, you might instead consider limma. I'm not entirely sure how well DESeq2 scales to the number of samples you have (limma has historically had better luck there, given how it works).
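For anyone wanting to try that route, here is a minimal sketch assuming the limma package is installed and that the voom workflow is what is meant (the comment only names limma, so voom is an assumption). The counts are simulated with hypothetical dimensions (25 strains, 4 conditions in duplicate).

```r
suppressPackageStartupMessages(library(limma))

set.seed(7)
n_strains <- 25
n_cond    <- 4    # hypothetical; far fewer than the real experiment
n_rep     <- 2

condition <- factor(rep(paste0("c", seq_len(n_cond)), each = n_rep))

# Hypothetical condition-specific means, repeated for each replicate.
mu_cond <- matrix(2^runif(n_strains * n_cond, min = 6, max = 12),
                  nrow = n_strains)
mu <- mu_cond[, rep(seq_len(n_cond), each = n_rep)]
counts <- matrix(
  rnbinom(length(mu), mu = mu, size = 20),
  nrow = n_strains,
  dimnames = list(paste0("strain", seq_len(n_strains)),
                  paste0("sample", seq_len(ncol(mu))))
)

design <- model.matrix(~ condition)

# voom converts counts to log2-CPM with precision weights;
# lmFit and eBayes then fit and moderate a linear model per row.
v   <- voom(counts, design)
fit <- eBayes(lmFit(v, design))

# Second coefficient: condition c2 versus the reference condition c1.
res <- topTable(fit, coef = 2, number = Inf)
head(res)
```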


DESeq2 is slower than in the transposed situation (many rows, few columns), but not unmanageably so: it takes about a day on our server, which I can live with once I'm sure the output is meaningful for my application. I will be sure to investigate limma as well. Thanks again!


Thanks for your helpful input, Vivek. I'll re-read the DESeq2 paper.

To answer your question, the number of columns is the issue more than the number of rows. The experiment I'm working with has 50,000 conditions in duplicate, and I'm interested in differential outgrowth of the 25 strains.


@Eachan wow this sounds interesting..


"50,000 conditions in duplicate"

Out of curiosity, what experimental design or protocol allows you to test so many conditions? And yes, sounds interesting!
