Question: Why Would P-Value Histogram Skew Right For Anovas?
gravatar for Eric
9.9 years ago by
Princeton, NJ
Eric40 wrote:

I have a linear regression model on someone else's unpublished microarray data in which I'm testing the significance of certain 2nd and 3rd-order variable interactions via ANOVA on the nested model (i.e with and without the specific interactions). For many of the interactions, however, I am obtaining a p-value distribution that's skewed towards high p-values. Can incorrect model selection lead to something like that? Could it be because I encoded the model matrix incorrectly? Might the skewed distribution reflect some underlying structure in the data? Am I choosing the wrong model vs. null model?

P-value histogram

The Model

There are three cell types, dummy-encoded with the two variables cell.type.b and cell.type.c, and there are three drug treatment conditions (drug1, drug2, and drug1+drug2) encoded by the absence or presence of each drug.

intercept    cell.type.b    cell.type.c    drug1    drug2
1            0              0              0        0
1            0              0              0        1
1            0              0              1        0
1            0              0              1        1
1            1              0              0        0
1            1              0              0        1
1            1              0              1        0
1            1              0              1        1
1            0              1              0        0
1            0              1              0        1
1            0              1              1        0
1            0              1              1        1

I also have (but did not include in the table for simplicity) the 6 interaction parameters

(cell.type.b + cell.type.c)*(drug1 + drug2 + drug1*drug2).

The null hypothesis is that the model without a particular interaction cell.type x drug.type is not significantly worse than a model including that interaction.

Calculating p-values

To calculate the p-values I have the following R function I essentially copied-and-pasted from the Leek and Storey sva R package. It's just a vectorized version of using the hat matrix to extract residuals. Models are input in matrices like the table above; my model has additional columns for the interactions, of course. In this case, there is a separate F statistic calculated for each row the data matrix (i.e. each gene of the microarray data set):

f.pvalue <- function(dat,mod,mod0){
  # This is a function for performing
  # parametric f-tests on the data matrix
  # dat comparing the null model mod0
  # to the alternative model mod. 
  n <- dim(dat)[2]
  m <- dim(dat)[1]
  df1 <- dim(mod)[2]
  df0 <- dim(mod0)[2]
  p <- rep(0,m)
  Id <- diag(n)

  resid <- dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod))
  resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))

  rss1 <- resid^2 %*% rep(1,n)
  rss0 <- resid0^2 %*% rep(1,n)

  fstats <- ((rss0 - rss1)/(df1-df0))/(rss1/(n-df1))
  p <-  1-pf(fstats,df1=(df1-df0),df2=(n-df1))
statistics microarray R • 5.3k views
ADD COMMENTlink modified 20 months ago by RamRS27k • written 9.9 years ago by Eric40

Hi @Eric, I have seen this pattern in 2 different project while analyzing second level interactions using the R-Maanova package. I did not find any particular reason or explanation, so any sensible reply to your question is also of high interest to me :) Did you use any particular package for the analysis or did you simply use a standard anova test implemented in R? Cheers

ADD REPLYlink written 9.9 years ago by Eric Normandeau10k

Hi @Eric. I'm using the hat matrix to extract residuals from the microarrays and then directly calculate F statistics, so I'm using the standard R function pf(q, df1, df2) to get p-values.

ADD REPLYlink modified 20 months ago by RamRS27k • written 9.9 years ago by Eric40
gravatar for David W
7.0 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

One possibility is that your models are missing an important covariate.

Imagine a simple case in which you are looking at 3 drugs across 3 cells types...

cells <- rep(c("A", "B", "C"), each=30) 
drugs <- rep(c("I", "II","II"), 30)

...and the drugs don't have an effect on what you're measuring but the cell-types do...

cell_type_effects <- c("A"=0, "B"=1, "C"=2)
drug_effects <- c("I"=0, "II"=0, "II"=0)

...if you run a bunch of studies looking only at the drug-effects and failing to include the cell type you won't get a uniform distrbution of p-values, rather you'll ahve a right-skew

simulate_study <- function(){
   y <- rnorm(90, cell_type_effects[cells] + drug_effects[drugs],1) # since all drug_effects are 0 true model  y ~ cell + normally distributed error
   res <- anova(lm(y ~ drugs))[[5]][1]
 pvals <- replicate(1e4,simulate_study())

mean(pvals < 0.1)
## [1] 0.0327
mean(pvals > 0.9)
## [1] 0.13

It's not the only way you could wind up with the distribution of p-values you have. But it's probably worth thinking about

ADD COMMENTlink modified 20 months ago by RamRS27k • written 7.0 years ago by David W4.7k
gravatar for Ryan D
7.0 years ago by
Ryan D3.3k
Ryan D3.3k wrote:

The most likely explanation--it would seem to me--is that you are violating some of the assumptions of the ANOVA test. I would check the normality of your signal data and try plotting the values of an equivalent non-parametric test like the Kruskal-Wallis.

ADD COMMENTlink written 7.0 years ago by Ryan D3.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1617 users visited in the last hour