Question

DESeq2 modelling and Wald's test

5

7.3 years ago

bioinfo456 ▴ 150

Hi all,

Can you please explain to me the relation between Wald's test and Negative binomial generalized linear model? As for my understanding, the count data is modeled using negative binomial generalized linear model after which Wald's test is applied to figure out whether a particular gene is significant or not. Please correct me if I'm wrong.

RNA-Seq DESeq2 • 12k views

ADD COMMENT • link updated 2.1 years ago by Picasa ▴ 680 • written 7.3 years ago by bioinfo456 ▴ 150

score 15 · Accepted Answer · 2018-04-08

7.3 years ago

Kevin Blighe 89k

RNA-seq raw counts are typically overdispersed relative to a Poisson distribution (their variance exceeds their mean), and the negative binomial distribution accommodates this, so the DESeq2 authors model the data as such. By 'model the data', we simply mean that we fit a regression model to the data so that we can make statistical inferences from it.

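So, after accounting for differences in sequencing depth (DESeq2 estimates a size factor per sample and uses it inside the model rather than fitting to pre-normalised counts), the following occurs: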

For each gene, a generalized linear model (GLM) with the negative binomial as family and a log link is fit:

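library(MASS)  # provides glm.nb()
# One negative binomial GLM per gene; '...' stands for any further covariates
gene1.model <- glm.nb(gene1 ~ CaseControl + ..., data=MyData)
gene2.model <- glm.nb(gene2 ~ CaseControl + ..., data=MyData)
# et cetera, one model per gene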

Once we have modeled each gene, a simple way to derive a P value for each model coefficient (e.g. CaseControl) is to apply the Wald test, which compares the estimated coefficient to its standard error, and to select the coefficient of interest:

library(aod)  # provides wald.test()
# Terms=2 selects the second coefficient, i.e. CaseControl (the first is the intercept)
wald.test(b=coef(gene1.model), Sigma=vcov(gene1.model), Terms=2)

The Wald test is a standard way to extract a P value from a regression fit.
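To make the two steps above concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the 6 vs 6 design, the simulated means and dispersion, and the layout of MyData (one row per sample, with a column of raw counts for the gene and a column for the covariate). The Wald statistic is computed by hand as estimate divided by standard error, and the result is compared with the P value that summary() reports.

```r
library(MASS)  # provides glm.nb()

set.seed(1)

# Hypothetical data: one row per sample, 6 controls and 6 cases.
n_per_group <- 6
MyData <- data.frame(
  CaseControl = factor(rep(c("control", "case"), each = n_per_group),
                       levels = c("control", "case"))
)

# Simulated raw counts for one gene: mean 100 in controls, 400 in cases,
# negative binomial with size (1 / dispersion) = 10.
mu <- ifelse(MyData$CaseControl == "case", 400, 100)
MyData$gene1 <- rnbinom(nrow(MyData), mu = mu, size = 10)

# Negative binomial GLM; glm.nb() uses a log link by default.
gene1.model <- glm.nb(gene1 ~ CaseControl, data = MyData)

# Coefficient table: estimate, standard error, z value, P value.
coef_table <- coef(summary(gene1.model))

# Wald test by hand for the case-vs-control coefficient.
beta   <- coef_table["CaseControlcase", "Estimate"]    # log fold change
se     <- coef_table["CaseControlcase", "Std. Error"]
z      <- beta / se
p_wald <- 2 * pnorm(-abs(z))                           # two-sided P value

# P value as reported by summary(); it should match p_wald.
p_summary <- coef_table["CaseControlcase", "Pr(>|z|)"]

print(c(log_fold_change = beta, std_error = se, z = z, p = p_wald))
```

Note that the coefficient is on the natural-log scale here, whereas DESeq2 reports log2 fold changes.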

Kevin

Note: this is not the exact code used by DESeq2, of course. It only gives a broad overview using some simple R functions. For one, DESeq2 also estimates a dispersion for each gene, sharing information across genes, in addition to everything mentioned above, and the Wald test is not the only way to derive p-values in DESeq2: a likelihood ratio test is also available.
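As a rough illustration of the alternative to the Wald test, the sketch below runs a likelihood ratio test with MASS on simulated counts. It is not DESeq2's implementation (in DESeq2 the corresponding call is DESeq() with test="LRT" and a reduced design); the data, the group sizes and the variable names are assumptions made up for the example.

```r
library(MASS)  # provides glm.nb() and a logLik() method for its fits

set.seed(2)

# Hypothetical 6 vs 6 design with simulated negative binomial counts.
group  <- factor(rep(c("control", "case"), each = 6),
                 levels = c("control", "case"))
counts <- rnbinom(12, mu = ifelse(group == "case", 400, 100), size = 10)

# Full model with the group term, reduced model without it.
full    <- glm.nb(counts ~ group)
reduced <- glm.nb(counts ~ 1)

# Likelihood ratio statistic: twice the gain in log-likelihood,
# compared to a chi-squared distribution with 1 degree of freedom
# (one coefficient was dropped). Theta is re-estimated in each model.
lr_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))
p_lrt   <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Wald P value for the same coefficient, for comparison.
p_wald <- coef(summary(full))["groupcase", "Pr(>|z|)"]

print(c(lr_stat = lr_stat, p_lrt = p_lrt, p_wald = p_wald))
```

The two tests usually agree closely when the effect is well estimated; they answer the same question (is the coefficient zero?) in different ways.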

ADD COMMENT • link 6.2 years ago by Kevin Blighe 89k

1

Thank you so much for that.

ADD REPLY • link 7.3 years ago by bioinfo456 ▴ 150

0

Kevin, thanks for your explanation. May I ask one naive question? Why do we need to fit a GLM before performing the Wald test itself (as far as I understand, it is roughly a simple t-test)? Why not just perform a Wald test on the count data directly?

ADD REPLY • link 6.1 years ago by Denis ▴ 320

3

A Wald test requires a coefficient and its standard error; the coefficient is then tested for difference from 0. Yes, in a way that's sort of like a single group T-test, but you'd still need to perform a fit first in order to derive the coefficient.

ADD REPLY • link 6.1 years ago by Devon Ryan 105k

0

Sorry, but it's still not quite clear to me. I thought that a T-test for two independent groups (as in the majority of experiments) would be more intuitive and obvious. So, why a single group T-test? In addition, when I perform a T-test in R, for example, I don't need anything except the trait observations in the two groups. In particular, no coefficients are required for that. So why do I need to estimate some coefficient in DESeq2 before the test itself? Could you elaborate on this please (just for understanding)? Finally, as I understand it, the GLM output already contains p-values. So why not just use these to test for DE genes?

ADD REPLY • link 6.1 years ago by Denis ▴ 320

2

We're talking about different things with the single-group T-test, just ignore that.

You can do a standard T-test for RNAseq data, your power will just be terrible. Packages like limma were developed to get around this and can be used with RNAseq. You estimate a coefficient with a T-test too (a T-test is a special case of a linear model, and hence of a GLM), you're just not aware of it. The p-values from the GLM (in particular, those shown by summary()) are from a Wald test. You don't have to run that explicitly as a separate step, but it can be convenient to do so to more easily extract the information you want (it also allows more flexibility, since you can use contrasts).
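A small sketch of the point that a T-test estimates a coefficient too, using simulated data (the group labels, sizes and means are made up for illustration). A pooled-variance two-sample T-test and a linear model with a group term give the same P value, and the model coefficient is simply the difference between the group means.

```r
set.seed(3)

# Hypothetical measurements for two groups of 6 samples each.
group <- factor(rep(c("A", "B"), each = 6))
y     <- rnorm(12, mean = ifelse(group == "B", 6, 5), sd = 1)

# Classic two-sample T-test with pooled variance.
tt <- t.test(y ~ group, var.equal = TRUE)

# The same comparison written as a linear model.
fit        <- lm(y ~ group)
coef_table <- coef(summary(fit))

p_ttest <- tt$p.value
p_lm    <- coef_table["groupB", "Pr(>|t|)"]

# The 'groupB' coefficient is the difference in group means (B minus A).
diff_means <- mean(y[group == "B"]) - mean(y[group == "A"])

print(c(p_ttest = p_ttest, p_lm = p_lm,
        coefficient = coef(fit)[["groupB"]], diff_means = diff_means))
```

The sign of the t statistic differs between the two outputs because t.test() reports A minus B, but the P values agree.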

ADD REPLY • link 6.1 years ago by Devon Ryan 105k

0

Hi Kevin, I also have a question related to DESeq2 modelling and testing. In my case, I have two groups (6 samples in each group). I get 1405 DEGs from the test (group A vs group B). But when I tried to compare one sample of one group against the other group (sample 1 from group A vs group B), I only got 11 DEGs. In fact, I compared every single sample from group A against group B, and the result is very weird:

sample 1 from group A vs group B: 11 DEGs
sample 2 from group A vs group B: 245 DEGs
sample 3 from group A vs group B: 35 DEGs
sample 4 from group A vs group B: 21 DEGs
sample 5 from group A vs group B: 7 DEGs
sample 6 from group A vs group B: 20 DEGs

My question is: even if we take differences between samples into account, the number of DEGs is still very small compared to 1405. What is going wrong?

Thank you in advance !

ADD REPLY • link 3.5 years ago by FantasticAI ▴ 60

0

Your power for finding differences with a single sample is extremely low. With more samples you have more statistical power, so you find more changes.
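The effect of sample size on power can be seen in a quick simulation. This is only a sketch under simplifying assumptions: it uses normally distributed values (think of log-scale expression) and an ordinary linear model as a stand-in, not negative binomial counts or DESeq2, and the effect size, standard deviation and helper name sim_p are invented for the example.

```r
set.seed(4)

# Simulate one gene with a true difference between groups A and B and
# return the P value for the group term of a linear model.
sim_p <- function(n_a, n_b, effect = 1, sd = 0.5) {
  group <- factor(rep(c("A", "B"), times = c(n_a, n_b)))
  y     <- rnorm(n_a + n_b, mean = ifelse(group == "A", effect, 0), sd = sd)
  coef(summary(lm(y ~ group)))["groupB", "Pr(>|t|)"]
}

n_sim <- 2000

# 6 vs 6 samples, and a single sample from A vs 6 samples from B.
p_6v6 <- replicate(n_sim, sim_p(6, 6))
p_1v6 <- replicate(n_sim, sim_p(1, 6))

# Power = fraction of simulations in which the true difference is detected.
power_6v6 <- mean(p_6v6 < 0.05)
power_1v6 <- mean(p_1v6 < 0.05)

print(c(power_6v6 = power_6v6, power_1v6 = power_1v6))
```

With the same true effect in every simulation, the 6 vs 6 design detects it far more often than the 1 vs 6 design, which is the same pattern as 1405 DEGs versus a few dozen.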

ADD REPLY • link 3.4 years ago by Devon Ryan 105k

0

Hi,

In

gene1.model <- glm.nb(gene1 ~ CaseControl + ..., data=MyData)

I am not sure I understand what the object MyData looks like.

1) Is it the mean and dispersion of gene 1 for each sample?

Let's say I have 3 replicates in group A and 3 replicates in group B; will we then have 2*6 = 12 values?

ADD REPLY • link 2.1 years ago by Picasa ▴ 680