Question

Differential Expression analysis using Wald-Test

0

Entering edit mode

5.9 years ago

Duckula ▴ 50

Hello everyone, I am interested in using wald-test for comparing two groups in my expression data and find variable genes. I already know the test is implemented in packages such as deseq2 but since I already have my normalized gene expression matrix, I can't use such packages. I am wondering if any of you has any idea regarding to a package in R that can be used for such purpose.

Thanks

R RNA-Seq • 5.1k views

ADD COMMENT • link updated 5.9 years ago by Kevin Blighe 87k • written 5.9 years ago by Duckula ▴ 50

score 6 · Answer 1 · 2018-05-23

Sure, that's possible, but I don't know to which distribution your data has been normalised. In this example below, I just produce binomially-distributed random data and model it as such. In DESeq2 for RNA-seq, data is modeled as a negative binomial family with adjustments for size factors. DESeq2 also includes a dispersion parameter for each gene for statistical inference, but I guess that you don't need to do that here. You have not stated exactly how your data was normalised.

Let's create some fake data:

fakedata <- matrix(rbinom(10*20, 100, .1), ncol=10)
rownames(fakedata) <- paste0("sample", c(1:nrow(fakedata)))
fakedata <- data.frame(c(rep("control", 10), rep("case", 10)), fakedata)
colnames(fakedata) <- c("CaseControl", paste0("gene", c(1:10)))

head(fakedata, 14)
         CaseControl gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
sample1      control    10    12     6    10    13    11    12    10     8
sample2      control     6    17     8    10    10    14    10     8     8
sample3      control    10    15     7    10     2    11     7    14    11
sample4      control    13     7     7    16    17     8    10     9     6
sample5      control    13    10    11     9    14     9    12     8    11
sample6      control    13     7     9    11    11    10     8     6     8
sample7      control     5    14     9     7     5    11     4    13    13
sample8      control    19    11    15    11    10     9    11    13    12
sample9      control    13     8    15     8    13    13     5    12     9
sample10     control    10    13     5    11     9     7     8    10     7
sample11        case    11    13    10     8    13    13     9     9     7
sample12        case    10    10    12    17    13    11    13     6     9
sample13        case     8    13    12    10    10    11    15    13    12
sample14        case    10    12    15    11    11     7     7    11    13

-------------------------------------------------

Convert our outcome to categorical and ensure that 'control' is the reference level:

fakedata$CaseControl <- factor(fakedata$CaseControl, levels=c("control","case"))

-----------------------------------------------

Check the distribution

hist(data.matrix(fakedata[,2:ncol(fakedata)]))

Looks binomial to me; so, we will set the family as binomial during modelling.

----------------------------------------------

What we can then do is fit a logistic regression model for each gene and derive a p-value via the Wald test. Here, I am just doing it for gene1. DESeq2 does this for each gene via fitNbinomGLMs.

model <- glm(CaseControl ~ gene1, data=fakedata, family=binomial(link="logit"))

Take a look at the model coefficients:

coef(model)
(Intercept)       gene1 
  2.3563115  -0.2374463

-----------------------------------------------

We then apply the Wald test on whichever model coefficient ('term') we want. Our gene is the second coefficient:

library(aod)
wald.test(b=coef(model), Sigma=vcov(model), Terms=2)
Wald test:
----------

Chi-squared test:
X2 = 1.9, df = 1, P(> X2) = 0.17

Not statistically significant for gene1 in this random example.

-------------------------------------------------

Note that the Wald test can actually be used to derive a single p-value for the coefficients for more than 1 gene combined:

model <- glm(CaseControl ~ gene1 + gene5 + gene7, data=fakedata, family=binomial(link="logit"))

coef(model)
(Intercept)       gene1       gene5       gene7 
-0.25601459 -0.22555748 -0.03603168  0.29738610 

wald.test(b=coef(model), Sigma=vcov(model), Terms=2:4)
Wald test:
----------

Chi-squared test:
X2 = 4.3, df = 3, P(> X2) = 0.23

Trust that this assists. Note that if your data has indeed been normalised to a negative binomial distribution, then you can fit a model via the glm.nb function from the MASS package. DESeq2's model functions are quite specific for that particular program, from what I gather.

Kevin