Design matrix for microarray experiment with covariates, cofactors, multiple factors, and one treatment?
1
0
Entering edit mode
2.2 years ago
bkleiboeker ▴ 370

Hi all, I am working to analyze some microarray data (GSE116560) with hopes of comparing gene expression between patients on a ventilator versus those not on a ventilator for three different timepoints (ARDS diagnosis day 1, 4, or 8). My experimental matrix can be represented by design:

set.seed(123)
design<-data.frame("id"=1:68,"ARDS_day"=c(rep(1,30),rep(4,22),rep(8,16)),"age"=sample(18:84,replace=
T,size = 68),"gender"=rep(c("male","female"),68/2),"vfd_days"=rep(c(0,10,0,25),17))
design$vent_status<-ifelse(design$vfd_days==0,"on_vent","off_vent")
design$condition<-paste0(design$vent_status,"_",design\$ARDS_day)

design<-model.matrix(~age+gender+condition,design)
design


then I will fit the linear model and run emperical bayes as follows:

fit <- lmFit(y,design)
cont.matrix <- makeContrasts(day4_on_vs_off_vent = conditionon_vent_4 - conditionoff_vent_4,
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)


I am quite confused however. If anyone could help me with these questions I'd be greatly appreciative!

1. Should I use an intercept ("model.matrix(~1+age+gender+...")? I understand that this is usually best for numeric covariates (e.g. age) and shouldn't matter for gender, but feel unsure what I should do if I have both types.

2. How do I make the appropriate contrast for on vent vs. off vent for ARDS day1? Design has a column "conditionon_vent_1" but no column "conditionoff_vent_1." Does this mean "conditionoff_vent_1" is represented by (Intercept)?

I have been looking at the limma userguide extensively and still feel confused. Thanks for any guidance!

Brian

mircoarray R matrix design limma • 1.1k views
1
Entering edit mode
2.2 years ago
LChart 4.3k
1. Should I use an intercept ("model.matrix(~1+age+gender+...")? I understand that this is usually best for numeric covariates (e.g. age) and shouldn't matter for gender, but feel unsure what I should do if I have both types.

In this case it doesn't matter -- you have a categorial covariate (gender) in the model. If you don't include an intercept, R will encode this as a vector (0, 0, 0, 1, 1, 1) for male and (1, 1, 1, 0, 0, 0) for female. If you include an intercept, R will encode only (1, 1, 1, 0, 0, 0) for female (assuming the factor has alphabetical levels). In other words, R will either give you a model with BOTH indicators (no intercept) or only ONE indicator (if you include an intercept).

Your condition is also categorial, which also triggers this logic.

1. How do I make the appropriate contrast for on vent vs. off vent for ARDS day1?

Because your variable of interest was specified last, and there are multiple categorial variables, it had its first level dropped to avoid multicollinearity. To specify contrasts for a variable of interest always specify the variable of interest first. Compare:

> dat <- data.frame(
+   gender=c('m', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'),
+   condition=c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a')
+ )
>
> model.matrix(~ condition + gender + 0, dat)
conditiona conditionb conditionc genderm
1           1          0          0       1
2           0          1          0       0
3           0          0          1       1
4           1          0          0       0
5           0          1          0       1
6           0          0          1       0
7           1          0          0       1
8           0          1          0       0
9           0          0          1       1
10          1          0          0       0
>
> model.matrix(~ gender + condition + 0, dat)
genderf genderm conditionb conditionc
1        0       1          0          0
2        1       0          1          0
3        0       1          0          1
4        1       0          0          0
5        0       1          1          0
6        1       0          0          1
7        0       1          0          0
8        1       0          1          0
9        0       1          0          1
10       1       0          0          0

1. Does this mean "conditionoff_vent_1" is represented by (Intercept)?

In a way yes, and in a way no. You can certainly get the effect by contrasting against the intercept, but the algebra is slightly different as in this case the model matrix looks like

> model.matrix(~ condition + gender + 1, dat)
(Intercept) conditionb conditionc genderm
1            1          0          0       1
2            1          1          0       0
3            1          0          1       1
4            1          0          0       0
5            1          1          0       1
6            1          0          1       0
7            1          0          0       1
8            1          1          0       0
9            1          0          1       1
10           1          0          0       0


If you want to contrast conditiona and conditionb you need to specify an equation that, when applied to the above indicators, would give you 1 -1 0 1 -1 0 1 -1 0 1 (the encoding of conditiona- the encoding of conditionb).

0
Entering edit mode

This is extremely helpful, thank you. Do you have any insight into why it's best to specify the variable of interest first?

I think I have figured out what I needed to with the help of your answer!

1
Entering edit mode

When there are multiple categorial variables, at most one can retain all of its levels in the model. If two or more variables are "one-hot encoded" (made into indicator variables) it results in a condition called "multicollinearity" which means the linear model cannot be fit, because effects are non-identifiable (see https://discourse.pymc.io/t/real-life-example-on-housing-price-regression-advice-requested/3050/5).

To address this common issue, model.matrix will retain all levels of the 1st variable; and drop the 1st level of every subsequent variable.

Therefore, to ensure your variable of interest retains all of its levels, specify it first.