Question: Why Would P-Value Histogram Skew Right For Anovas?
4
Eric40 wrote:

I have a linear regression model on someone else's unpublished microarray data in which I'm testing the significance of certain 2nd and 3rd-order variable interactions via ANOVA on the nested model (i.e with and without the specific interactions). For many of the interactions, however, I am obtaining a p-value distribution that's skewed towards high p-values. Can incorrect model selection lead to something like that? Could it be because I encoded the model matrix incorrectly? Might the skewed distribution reflect some underlying structure in the data? Am I choosing the wrong model vs. null model? ## The Model

There are three cell types, dummy-encoded with the two variables cell.type.b and cell.type.c, and there are three drug treatment conditions (drug1, drug2, and drug1+drug2) encoded by the absence or presence of each drug.

``````intercept    cell.type.b    cell.type.c    drug1    drug2
1            0              0              0        0
1            0              0              0        1
1            0              0              1        0
1            0              0              1        1
1            1              0              0        0
1            1              0              0        1
1            1              0              1        0
1            1              0              1        1
1            0              1              0        0
1            0              1              0        1
1            0              1              1        0
1            0              1              1        1
``````

I also have (but did not include in the table for simplicity) the 6 interaction parameters

``````(cell.type.b + cell.type.c)*(drug1 + drug2 + drug1*drug2).
``````

The null hypothesis is that the model without a particular interaction cell.type x drug.type is not significantly worse than a model including that interaction.

## Calculating p-values

To calculate the p-values I have the following R function I essentially copied-and-pasted from the Leek and Storey sva R package. It's just a vectorized version of using the hat matrix to extract residuals. Models are input in matrices like the table above; my model has additional columns for the interactions, of course. In this case, there is a separate F statistic calculated for each row the data matrix (i.e. each gene of the microarray data set):

``````f.pvalue <- function(dat,mod,mod0){
# This is a function for performing
# parametric f-tests on the data matrix
# dat comparing the null model mod0
# to the alternative model mod.
n <- dim(dat)
m <- dim(dat)
df1 <- dim(mod)
df0 <- dim(mod0)
p <- rep(0,m)
Id <- diag(n)

resid <- dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod))
resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))

p <-  1-pf(fstats,df1=(df1-df0),df2=(n-df1))
return(p)
}
``````
statistics microarray R • 5.3k views
modified 20 months ago by RamRS27k • written 9.9 years ago by Eric40

Hi @Eric, I have seen this pattern in 2 different project while analyzing second level interactions using the R-Maanova package. I did not find any particular reason or explanation, so any sensible reply to your question is also of high interest to me :) Did you use any particular package for the analysis or did you simply use a standard anova test implemented in R? Cheers

Hi @Eric. I'm using the hat matrix to extract residuals from the microarrays and then directly calculate F statistics, so I'm using the standard R function `pf(q, df1, df2)` to get p-values.

1
David W4.7k wrote:

One possibility is that your models are missing an important covariate.

Imagine a simple case in which you are looking at 3 drugs across 3 cells types...

``````cells <- rep(c("A", "B", "C"), each=30)
drugs <- rep(c("I", "II","II"), 30)
``````

...and the drugs don't have an effect on what you're measuring but the cell-types do...

``````cell_type_effects <- c("A"=0, "B"=1, "C"=2)
drug_effects <- c("I"=0, "II"=0, "II"=0)
``````

...if you run a bunch of studies looking only at the drug-effects and failing to include the cell type you won't get a uniform distrbution of p-values, rather you'll ahve a right-skew

``````simulate_study <- function(){
y <- rnorm(90, cell_type_effects[cells] + drug_effects[drugs],1) # since all drug_effects are 0 true model  y ~ cell + normally distributed error
res <- anova(lm(y ~ drugs))[]
return(res)
}
pvals <- replicate(1e4,simulate_study())
hist(pvals)

mean(pvals < 0.1)
##  0.0327
mean(pvals > 0.9)
##  0.13
``````

It's not the only way you could wind up with the distribution of p-values you have. But it's probably worth thinking about

0
Ryan D3.3k wrote:

The most likely explanation--it would seem to me--is that you are violating some of the assumptions of the ANOVA test. I would check the normality of your signal data and try plotting the values of an equivalent non-parametric test like the Kruskal-Wallis.