**Eric** wrote:

I have a linear regression model on someone else's unpublished microarray data, in which I'm testing the significance of certain second- and third-order variable interactions via ANOVA on nested models (i.e., with and without the specific interactions). For many of the interactions, however, I am obtaining a p-value distribution that is skewed towards high p-values. Can incorrect model selection lead to something like this? Could it be because I encoded the model matrix incorrectly? Might the skewed distribution reflect some underlying structure in the data? Am I choosing the wrong model vs. null model?

## The Model

There are three cell types, dummy-encoded with the two indicator variables cell.type.b and cell.type.c, and three drug treatments (drug1, drug2, and drug1+drug2) encoded by the absence or presence of each drug (so each cell type also has an untreated condition):

```
intercept cell.type.b cell.type.c drug1 drug2
1 0 0 0 0
1 0 0 0 1
1 0 0 1 0
1 0 0 1 1
1 1 0 0 0
1 1 0 0 1
1 1 0 1 0
1 1 0 1 1
1 0 1 0 0
1 0 1 0 1
1 0 1 1 0
1 0 1 1 1
```
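
In R, a design matrix with exactly this layout can be built from a factor coding via `model.matrix()`. The sketch below is an illustration rather than the original code; the data frame and variable names are assumptions:

```r
# Sketch (hypothetical names): rebuilding the main-effects design matrix
# with model.matrix(), using treatment (dummy) contrasts for cell.type.
d <- expand.grid(drug2 = 0:1, drug1 = 0:1,
                 cell.type = factor(c("a", "b", "c")))
mod0.main <- model.matrix(~ cell.type + drug1 + drug2, data = d)
colnames(mod0.main)
# "(Intercept)" "cell.typeb" "cell.typec" "drug1" "drug2"
```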

I also have (but did not include in the table for simplicity) the 6 interaction parameters

```
(cell.type.b + cell.type.c)*(drug1 + drug2 + drug1*drug2).
```
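
For reference, the same six interaction columns fall out of a crossed model formula; again a sketch with assumed names (note that `cell.type * drug1 * drug2` also adds the plain drug1:drug2 column, saturating the 12-condition design):

```r
# Sketch: full model matrix including the six cell.type x drug
# interaction columns (2 cell-type dummies x 3 drug terms).
d <- expand.grid(drug2 = 0:1, drug1 = 0:1,
                 cell.type = factor(c("a", "b", "c")))
mod.full <- model.matrix(~ cell.type * drug1 * drug2, data = d)
ncol(mod.full)  # 12: intercept + 5 other columns + 6 cell.type interactions
```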

The null hypothesis for each test is that the coefficients of a particular cell.type x drug.type interaction are zero, i.e., that the model without that interaction fits no worse than the model including it.
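
For a single gene this is the standard nested-model comparison with `lm()` and `anova()`; a sketch with simulated data and assumed variable names:

```r
# Sketch: nested-model F-test for one gene.
# y is simulated noise standing in for one gene's expression values.
set.seed(1)
d <- expand.grid(drug2 = 0:1, drug1 = 0:1,
                 cell.type = factor(c("a", "b", "c")))
y <- rnorm(nrow(d))
fit0 <- lm(y ~ cell.type + drug1 * drug2, data = d)                # no cell.type x drug terms
fit1 <- lm(y ~ cell.type * (drug1 + drug2) + drug1:drug2, data = d) # + 4 second-order terms
anova(fit0, fit1)  # F-test on the cell.type:drug coefficients
```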

## Calculating p-values

To calculate the p-values I use the following R function, essentially copied from Leek and Storey's sva package. It is a vectorized version of using the hat matrix to extract residuals. Models are input as matrices like the table above; my model matrices have additional columns for the interactions, of course. A separate F statistic is calculated for each row of the data matrix (i.e., each gene of the microarray data set):

```
f.pvalue <- function(dat, mod, mod0) {
  # Parametric F-tests comparing the null model mod0 to the
  # alternative model mod, for every row of the data matrix dat.
  n   <- dim(dat)[2]   # number of arrays
  df1 <- dim(mod)[2]   # parameters in the full model
  df0 <- dim(mod0)[2]  # parameters in the null model
  Id  <- diag(n)
  # Residuals via the hat matrix, vectorized over all genes at once
  resid  <- dat %*% (Id - mod  %*% solve(t(mod)  %*% mod)  %*% t(mod))
  resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))
  rss1 <- resid^2  %*% rep(1, n)  # residual sum of squares, full model
  rss0 <- resid0^2 %*% rep(1, n)  # residual sum of squares, null model
  fstats <- ((rss0 - rss1) / (df1 - df0)) / (rss1 / (n - df1))
  p <- 1 - pf(fstats, df1 = (df1 - df0), df2 = (n - df1))
  return(p)
}
```
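
One way to localize the problem is to run this function on pure-noise data: under the null the p-values should be roughly uniform on [0, 1], so if noise comes back uniform but the real data skew high, the issue is likely in the design or model rather than in the test code. A self-contained sketch (the function is repeated from above so the snippet runs on its own; all data are simulated, and the model formulas are assumptions):

```r
# Sanity check (sketch): on pure-noise data the F-test p-values should be
# approximately uniform, so mean(p < 0.05) should be close to 0.05.
f.pvalue <- function(dat, mod, mod0) {
  n   <- dim(dat)[2]
  df1 <- dim(mod)[2]
  df0 <- dim(mod0)[2]
  Id  <- diag(n)
  resid  <- dat %*% (Id - mod  %*% solve(t(mod)  %*% mod)  %*% t(mod))
  resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))
  rss1 <- resid^2  %*% rep(1, n)
  rss0 <- resid0^2 %*% rep(1, n)
  fstats <- ((rss0 - rss1) / (df1 - df0)) / (rss1 / (n - df1))
  1 - pf(fstats, df1 = df1 - df0, df2 = n - df1)
}

set.seed(42)
d <- expand.grid(drug2 = 0:1, drug1 = 0:1,
                 cell.type = factor(c("a", "b", "c")))
mod0 <- model.matrix(~ cell.type + drug1 * drug2, data = d)
mod  <- model.matrix(~ cell.type * (drug1 + drug2) + drug1:drug2, data = d)
dat  <- matrix(rnorm(2000 * nrow(d)), nrow = 2000)  # 2000 "genes" x 12 arrays
p <- f.pvalue(dat, mod, mod0)
mean(p < 0.05)  # should be close to 0.05 if the test is calibrated
```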

Hi @Eric, I have seen this pattern in two different projects while analyzing second-level interactions using the R/maanova package. I did not find any particular reason or explanation, so any sensible reply to your question is of high interest to me as well :) Did you use any particular package for the analysis, or did you simply use a standard ANOVA test implemented in R? Cheers

Hi, I'm using the hat matrix to extract residuals from the microarrays and then directly calculating F statistics, so I'm using the standard R function

`pf(q, df1, df2)`

to get p-values.

*written 9.9 years ago by Eric*
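
One small numerical note on this: `1 - pf(q, df1, df2)` (as in the function above) underflows to exactly 0 once the true p-value drops below machine epsilon, whereas `lower.tail = FALSE` computes the upper tail directly. For very significant genes the two forms differ:

```r
# With a large F statistic, the subtraction loses the upper tail entirely:
1 - pf(100, 4, 200)                   # exactly 0 (rounding)
pf(100, 4, 200, lower.tail = FALSE)   # tiny but positive
```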