14 months ago by
Republic of Ireland
RNA-seq raw count data 'naturally' follows a negative binomial distribution (Poisson-like), so, the DESeq2 authors model the data as such. By 'model the data', we merely imply that we build a regression model of the data such that we can make statistical inferences from it [the data].
So, after normalising the raw counts, the following occurs:
For each gene, a logistic regression model with the negative binomial as family is fit:
gene1.model <- glm.nb(gene1 ~ CaseControl + ..., data=MyData)
gene2.model <- glm.nb(gene2 ~ CaseControl + ..., data=MyData)
Once we have modeled each gene, a simple way to derive a P value for each model coefficient (i.e. CaseControl, etc) is by applying the Wald Test and selecting the coefficient of interest:
wald.test(b=coef(gene1.model), Sigma=vcov(gene1.model), Terms=c(2)) #term '2' would be CaseControl
The Wald test is a standard way to extract a P value from a regression fit.
NB - this is not the exact code used by DESeq2, of course. This is just giving you a broad overview with some simple R functions. For one, DESeq2 models dispersion in addition to everything that I have mentioned above, and the Wald test is not used in each case to derive p-values in DESeq2.