4 weeks ago by

University College London Cancer Institute

Sure, that's possible, but I don't know to which distribution your data has been normalised. In this example below, I just produce binomially-distributed random data and model it as such. In DESeq2 for RNA-seq, data is modeled as a negative binomial family with adjustments for size factors. DESeq2 also includes a dispersion parameter for each gene for statistical inference, but I guess that you don't need to do that here. You have not stated exactly how your data was normalised.

Let's create some fake data:

```
fakedata <- matrix(rbinom(10*20, 100, .1), ncol=10)
rownames(fakedata) <- paste0("sample", c(1:nrow(fakedata)))
fakedata <- data.frame(c(rep("control", 10), rep("case", 10)), fakedata)
colnames(fakedata) <- c("CaseControl", paste0("gene", c(1:10)))
head(fakedata, 14)
CaseControl gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
sample1 control 10 12 6 10 13 11 12 10 8
sample2 control 6 17 8 10 10 14 10 8 8
sample3 control 10 15 7 10 2 11 7 14 11
sample4 control 13 7 7 16 17 8 10 9 6
sample5 control 13 10 11 9 14 9 12 8 11
sample6 control 13 7 9 11 11 10 8 6 8
sample7 control 5 14 9 7 5 11 4 13 13
sample8 control 19 11 15 11 10 9 11 13 12
sample9 control 13 8 15 8 13 13 5 12 9
sample10 control 10 13 5 11 9 7 8 10 7
sample11 case 11 13 10 8 13 13 9 9 7
sample12 case 10 10 12 17 13 11 13 6 9
sample13 case 8 13 12 10 10 11 15 13 12
sample14 case 10 12 15 11 11 7 7 11 13
```

## -------------------------------------------------

Convert our outcome to categorical and ensure that 'control' is the reference level:

```
fakedata$CaseControl <- factor(fakedata$CaseControl, levels=c("control","case"))
```

## -----------------------------------------------

Check the distribution

```
hist(data.matrix(fakedata[,2:ncol(fakedata)]))
```

Looks binomial to me; so, we will set the family as binomial during modelling.

## ----------------------------------------------

What we can then do is fit a logistic regression model for each gene and derive a p-value via the Wald test. Here, I am just doing it for *gene1*. DESeq2 does this for each gene via fitNbinomGLMs.

```
model <- glm(CaseControl ~ gene1, data=fakedata, family=binomial(link="logit"))
```

Take a look at the model coefficients:

```
coef(model)
(Intercept) gene1
2.3563115 -0.2374463
```

## -----------------------------------------------

We then apply the Wald test on whichever model coefficient ('term') we want. Our gene is the second coefficient:

```
library(aod)
wald.test(b=coef(model), Sigma=vcov(model), Terms=2)
Wald test:
----------
Chi-squared test:
X2 = 1.9, df = 1, P(> X2) = 0.17
```

Not statistically significant for *gene1* in this random example.

## -------------------------------------------------

Note that the Wald test can actually be used to derive a single p-value for the coefficients for more than 1 gene combined:

```
model <- glm(CaseControl ~ gene1 + gene5 + gene7, data=fakedata, family=binomial(link="logit"))
coef(model)
(Intercept) gene1 gene5 gene7
-0.25601459 -0.22555748 -0.03603168 0.29738610
wald.test(b=coef(model), Sigma=vcov(model), Terms=2:4)
Wald test:
----------
Chi-squared test:
X2 = 4.3, df = 3, P(> X2) = 0.23
```

Trust that this assists. Note that if your data has indeed been normalised to a negative binomial distribution, then you can fit a model via the glm.nb function from the MASS package. DESeq2's model functions are quite specific for that particular program, from what I gather.

Kevin