Question: Differential Expression analysis using Wald-Test
gravatar for Duckula
11 months ago by
Duckula20 wrote:

Hello everyone, I am interested in using wald-test for comparing two groups in my expression data and find variable genes. I already know the test is implemented in packages such as deseq2 but since I already have my normalized gene expression matrix, I can't use such packages. I am wondering if any of you has any idea regarding to a package in R that can be used for such purpose.


rna-seq R • 749 views
ADD COMMENTlink modified 11 months ago by Kevin Blighe41k • written 11 months ago by Duckula20
gravatar for Kevin Blighe
11 months ago by
Kevin Blighe41k
Guy's Hospital, London
Kevin Blighe41k wrote:

Sure, that's possible, but I don't know to which distribution your data has been normalised. In this example below, I just produce binomially-distributed random data and model it as such. In DESeq2 for RNA-seq, data is modeled as a negative binomial family with adjustments for size factors. DESeq2 also includes a dispersion parameter for each gene for statistical inference, but I guess that you don't need to do that here. You have not stated exactly how your data was normalised.

Let's create some fake data:

fakedata <- matrix(rbinom(10*20, 100, .1), ncol=10)
rownames(fakedata) <- paste0("sample", c(1:nrow(fakedata)))
fakedata <- data.frame(c(rep("control", 10), rep("case", 10)), fakedata)
colnames(fakedata) <- c("CaseControl", paste0("gene", c(1:10)))

head(fakedata, 14)
         CaseControl gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
sample1      control    10    12     6    10    13    11    12    10     8
sample2      control     6    17     8    10    10    14    10     8     8
sample3      control    10    15     7    10     2    11     7    14    11
sample4      control    13     7     7    16    17     8    10     9     6
sample5      control    13    10    11     9    14     9    12     8    11
sample6      control    13     7     9    11    11    10     8     6     8
sample7      control     5    14     9     7     5    11     4    13    13
sample8      control    19    11    15    11    10     9    11    13    12
sample9      control    13     8    15     8    13    13     5    12     9
sample10     control    10    13     5    11     9     7     8    10     7
sample11        case    11    13    10     8    13    13     9     9     7
sample12        case    10    10    12    17    13    11    13     6     9
sample13        case     8    13    12    10    10    11    15    13    12
sample14        case    10    12    15    11    11     7     7    11    13


Convert our outcome to categorical and ensure that 'control' is the reference level:

fakedata$CaseControl <- factor(fakedata$CaseControl, levels=c("control","case"))


Check the distribution



Looks binomial to me; so, we will set the family as binomial during modelling.


What we can then do is fit a logistic regression model for each gene and derive a p-value via the Wald test. Here, I am just doing it for gene1. DESeq2 does this for each gene via fitNbinomGLMs.

model <- glm(CaseControl ~ gene1, data=fakedata, family=binomial(link="logit"))

Take a look at the model coefficients:

(Intercept)       gene1 
  2.3563115  -0.2374463


We then apply the Wald test on whichever model coefficient ('term') we want. Our gene is the second coefficient:

wald.test(b=coef(model), Sigma=vcov(model), Terms=2)
Wald test:

Chi-squared test:
X2 = 1.9, df = 1, P(> X2) = 0.17

Not statistically significant for gene1 in this random example.


Note that the Wald test can actually be used to derive a single p-value for the coefficients for more than 1 gene combined:

model <- glm(CaseControl ~ gene1 + gene5 + gene7, data=fakedata, family=binomial(link="logit"))

(Intercept)       gene1       gene5       gene7 
-0.25601459 -0.22555748 -0.03603168  0.29738610 

wald.test(b=coef(model), Sigma=vcov(model), Terms=2:4)
Wald test:

Chi-squared test:
X2 = 4.3, df = 3, P(> X2) = 0.23

Trust that this assists. Note that if your data has indeed been normalised to a negative binomial distribution, then you can fit a model via the glm.nb function from the MASS package. DESeq2's model functions are quite specific for that particular program, from what I gather.


ADD COMMENTlink modified 10 months ago • written 11 months ago by Kevin Blighe41k

Dear Kevin, Thanks a lot for your comprehensive answer. I really do appreciate it! I am going to try it right away. So far from what I have read, I just got one simple question, is there a way in R that I pass all my variables at once for my linear model instead of using ~ x1,+x2+.... ? or I should loop over it ?

Thanks a lot again !

ADD REPLYlink written 11 months ago by Duckula20

I believe you should loop over your genes and test each independently via the Wald test. After you have processed all genes, then you could think about creating a 'merged' model with multiple statistically significant genes in the same formula. That final formula then can be considered a preliminary 'gene signature', that needs to be further put to the test in terms of its predictive ability to distinguish or categorical variables of interest. I have posted some code on that topic, too: A: Resources for gene signature creation

To help with the creation of a loop, take a look here: One-way ANOVA in R for many observations Essentially you create a formula via the as.formula function, and supply that to glm. For the purposes of automation and ease of output, you can directly access the Wald test p-value with:

wald.test(b=coef(model), Sigma=vcov(model), Terms=2)$result$chi2[3]

For example, you can append that to an output file or simply add it to a vector or data frame with each loop.

ADD REPLYlink written 11 months ago by Kevin Blighe41k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1841 users visited in the last hour