Question: Clarification on how the DESeq2 Dispersion Curve is Generated
17 months ago by
brismiller10
Bellingham, WA, USA
brismiller10 wrote:

Hi everyone,

I have a clarification question about how the average expression versus dispersion curve is generated. The paper says that DESeq2 uses 'all samples' in making the plot, but does that mean all samples of a given sample type (genotype), or all samples regardless of genotype?

I am worried that gene dispersion information is being shared between genotypes, and I am wondering if this is valid. I understand that DESeq2 uses the correlation between average gene expression and dispersion for dispersion shrinkage, but does this assumption hold true between genotypes?

Quote from DESeq2 paper:

"Our DESeq method [4] detects and corrects dispersion estimates that are too low through modeling of the dependence of the dispersion on the average expression strength over all samples." (DESeq2 paper)

17 months ago by
Kevin Blighe48k
Kevin Blighe48k wrote:

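Yes, from what I understand, DESeq2 does not fit group-specific dispersion estimates: a single dispersion value is estimated for each gene using all samples together, and the mean used for the dispersion curve is the average over all samples regardless of genotype. The design still matters, though. As far as I understand, the dispersion is estimated around the fitted means implied by your design formula, so differences between genotypes that the design accounts for should not inflate it. In very large datasets, it may be more intuitive to calculate dispersion within your groups of interest and apply weightings, whilst, for smaller datasets, trying to do this could make the dispersion estimates very unstable and, it follows, distort your statistical interpretations of the data.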

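For genes with high counts, the dispersion is approximately:

variance / mean^2

...which is the same as CoV^2 (squared coefficient of variation). For genes with low counts, the Poisson part of the variance is no longer negligible, so the two differ. See here: https://support.bioconductor.org/p/88880/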
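To make the relationship concrete, here is a minimal sketch in Python (not DESeq2 code). It assumes the negative binomial variance relation variance = mean + dispersion * mean^2, which gives the method-of-moments estimate (variance - mean) / mean^2. The counts are made up for illustration, and the function names are my own.

```python
import statistics

def moment_dispersion(counts):
    """Method-of-moments NB dispersion: (variance - mean) / mean^2, floored at 0."""
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return max((var - mean) / mean ** 2, 0.0)

def squared_cv(counts):
    """Squared coefficient of variation: variance / mean^2."""
    mean = statistics.mean(counts)
    return statistics.variance(counts) / mean ** 2

# Hypothetical normalised counts for two genes across six samples
high_gene = [950, 1100, 1020, 870, 1210, 990]
low_gene = [3, 7, 2, 9, 4, 6]

for name, counts in [("high", high_gene), ("low", low_gene)]:
    print(name, round(moment_dispersion(counts), 4), round(squared_cv(counts), 4))
```

When the estimate is not floored at zero, the two quantities differ by exactly 1 / mean, which is why they nearly coincide for the high-count gene and diverge for the low-count one.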

I have my own summary of how DESeq2 models dispersion:

Part I

  1. Calculate the maximum-likelihood estimate (MLE) of dispersion for each gene in the dataset (black dots).

  2. Model the MLEs (red curve)

  3. From the model curve fit in 2, predict a value for each gene
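Part I can be sketched roughly as follows. DESeq2's default parametric trend has the form dispersion = a0 + a1 / mean; the sketch below fits that form by ordinary least squares on 1 / mean, which is a simplification of my own (DESeq2 itself uses an iterative gamma-family GLM fit). The input values and function names are hypothetical.

```python
def fit_trend(means, dispersions):
    """Least-squares fit of dispersion = a0 + a1 / mean (simplified stand-in)."""
    xs = [1.0 / m for m in means]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(dispersions) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, dispersions))
    a1 = sxy / sxx            # slope on 1 / mean
    a0 = y_bar - a1 * x_bar   # asymptotic dispersion at very high counts
    return a0, a1

def predict_trend(a0, a1, mean):
    """Value of the fitted curve (the 'red line') at a given mean count."""
    return a0 + a1 / mean

# Hypothetical per-gene mean counts and gene-wise dispersion estimates (step 1)
means = [5, 20, 100, 500, 2000]
gene_disp = [0.85, 0.24, 0.09, 0.058, 0.052]

a0, a1 = fit_trend(means, gene_disp)                    # step 2
fitted = [predict_trend(a0, a1, m) for m in means]      # step 3
print(round(a0, 4), round(a1, 4), [round(f, 4) for f in fitted])
```

The predicted values in `fitted` play the role of the per-gene values read off the red curve.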

Part II

  1. Fit an empirical Bayes model to the MLEs, using the values predicted from the curve in Part I, step 3 (above) as the prior means for each gene. In empirical Bayes statistics, the prior is estimated from the data themselves (here, from the trend across all genes), and each gene's own estimate is then 'shrunk' toward that prior.

  2. Predict values from this model (blue dots); these are the final dispersion estimates. What happens is that genes with lower counts tend to have higher, noisier dispersion estimates and are 'shrunk' more toward the red line than genes with higher counts, which have lower dispersion.
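The shrinkage in Part II can be illustrated with a deliberately simplified sketch. It is not DESeq2's actual procedure, which maximises a posterior with a log-normal prior centred on the trend. Instead it takes a precision-weighted average on the log scale, which captures the same idea: the noisier a gene's own estimate, the more it is pulled toward the curve. The parameters `sampling_var` and `prior_var` are hypothetical inputs here, not DESeq2 arguments.

```python
import math

def shrink_log_dispersion(gene_disp, trend_disp, sampling_var, prior_var):
    """Precision-weighted average of log gene-wise and log trend dispersion."""
    # Weight on the gene's own estimate: close to 1 when its estimate is precise
    w = prior_var / (prior_var + sampling_var)
    log_final = w * math.log(gene_disp) + (1 - w) * math.log(trend_disp)
    return math.exp(log_final)

# Same gene-wise estimate (0.40) and trend value (0.10) in both cases.
# Noisy estimate (e.g. few replicates): strong shrinkage toward the trend
few = shrink_log_dispersion(0.40, 0.10, sampling_var=1.0, prior_var=0.25)
# Precise estimate (e.g. many replicates): weak shrinkage
many = shrink_log_dispersion(0.40, 0.10, sampling_var=0.05, prior_var=0.25)

print(round(few, 4), round(many, 4))
```

Both results lie between the trend value and the gene-wise estimate; the noisier case ends up closer to the trend.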

That's my take, at least. Also see that of the developer on this subject:

Kevin
