14 months ago by

Republic of Ireland

RNA-seq raw count data 'naturally' follows a negative binomial distribution (Poisson-like), so, the DESeq2 authors model the data as such. By 'model the data', we merely imply that we build a regression model of the data such that we can make statistical inferences from it [the data].

So, after normalising the raw counts, the following occurs:

For each gene, a logistic regression model with the negative binomial as family is fit:

```
require(MASS)
gene1.model <- glm.nb(gene1 ~ CaseControl + ..., data=MyData)
gene2.model <- glm.nb(gene2 ~ CaseControl + ..., data=MyData)
*et cetera*
```

Once we have modeled each gene, a simple way to derive a P value for each model coefficient (i.e. CaseControl, etc) is by applying the Wald Test and selecting the coefficient of interest:

```
require(aod)
wald.test(b=coef(gene1.model), Sigma=vcov(gene1.model), Terms=c(2)) #term '2' would be CaseControl
```

The Wald test is a standard way to extract a P value from a regression fit.

Kevin

NB - this is *not* the exact code used by DESeq2, of course. This is just giving you a broad overview with some simple R functions. For one, DESeq2 models dispersion in addition to everything that I have mentioned above, and the Wald test is not used in each case to derive p-values in DESeq2.