Question: Making a contrast matrix with multiple co-variates for differential expression analysis
Hi, there!

I am trying to build a contrast matrix in order to fit a linear model. It is a basic comparison between different histologic types of tumors: benign or BL, early stage, and late stage. The goal is to investigate whether FFPE (formalin-fixed, paraffin-embedded) material differs from fresh-frozen material in terms of methylation pattern (we're using Illumina's EPIC array). To that end, we collected FFPE and fresh-frozen samples from the same patients.

The basic experiment looks something like this:

> clindata
   Subject Material_Source  Tumor_stage        ID2
1     P235            FFPE Benign_or_BL  P235_FFPE
2     P432            FFPE Benign_or_BL  P432_FFPE
3     P421            FFPE        Early  P421_FFPE
4      P93            FFPE        Early   P93_FFPE
5     P876            FFPE        Early  P876_FFPE
6     P543            FFPE         Late  P543_FFPE
7     P532            FFPE         Late  P532_FFPE
8     P152            FFPE         Late  P152_FFPE
9     P235           Fresh Benign_or_BL P235_Fresh
10    P432           Fresh Benign_or_BL P432_Fresh
15    P421           Fresh        Early P421_Fresh
16     P93           Fresh        Early  P93_Fresh
17    P876           Fresh        Early P876_Fresh
24    P543           Fresh         Late P543_Fresh
25    P532           Fresh         Late P532_Fresh
26    P152           Fresh         Late P152_Fresh

Here clindata$Subject is the patient ID, and the following two columns give the source of material and the tumor stage, respectively. clindata$ID2 is the concatenation of the values in clindata$Subject and clindata$Material_Source.

So, now comes my question: how do I build the contrast matrix for a comparison between different tumor stages while accounting for the patient and material source variables?

My idea is the following:

# preparing data:
# TS holds the tumor stage levels, which name the first columns of the design
TS <- factor(clindata$Tumor_stage)
SubMS <- factor(clindata$ID2)

# designing the matrix:
design <- model.matrix(~0+Tumor_stage+ID2, data=clindata)
colnames(design) <- c(levels(TS), levels(SubMS)[-1])

I can then run the lmFit() and makeContrasts() functions on the array data. Of course, the n for each group is rather small, but this is just an example (more samples will be added to each group in the final experiment). My question is:

Does that design make sense?
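
One way to sanity-check this is to compare the number of columns in the design with its rank. Below is a minimal base-R sketch that rebuilds the table above and does that for the design I proposed. The helper vectors `subject` and `stage` are just names I made up for the illustration, and the `qr()` rank check is only a quick diagnostic, not part of the limma workflow itself. Since ID2 is unique to every sample, I would expect more coefficients than samples here.

```r
# Rebuild the sample table shown above (base R only, no array data needed).
subject <- c("P235", "P432", "P421", "P93", "P876", "P543", "P532", "P152")
stage   <- c("Benign_or_BL", "Benign_or_BL", "Early", "Early", "Early",
             "Late", "Late", "Late")

clindata <- data.frame(
  Subject          = rep(subject, times = 2),
  Material_Source  = rep(c("FFPE", "Fresh"), each = length(subject)),
  Tumor_stage      = rep(stage, times = 2),
  stringsAsFactors = TRUE
)
clindata$ID2 <- factor(paste(clindata$Subject, clindata$Material_Source, sep = "_"))

# The design from the question: tumor stage plus the merged Subject/Material ID.
design <- model.matrix(~0 + Tumor_stage + ID2, data = clindata)

n_samples   <- nrow(design)      # one row per sample
n_coef      <- ncol(design)      # one column per coefficient to estimate
design_rank <- qr(design)$rank   # number of linearly independent columns

cat("samples:", n_samples, " coefficients:", n_coef, " rank:", design_rank, "\n")
```

If the rank is lower than the number of coefficients, not every coefficient can be estimated, which is the part I am unsure how to handle.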

Would you suggest anything different, e.g. (A) treating all 3 variables separately, instead of merging the 2 covariates into one covariate; or (B) considering only "Subject" as a covariate, since the paired comparison would already account for one sample being FFPE and the other fresh-frozen?
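
To make (A) and (B) concrete, here is a rough base-R sketch of what the two alternative design matrices could look like for the table above, again with a column-count versus rank check. The formulas are my own guess at how each option would be written, and `summarise_design` is a hypothetical helper defined only for this example. Note that every patient has a single tumor stage in my table, so the stage columns may overlap with the Subject columns.

```r
# Rebuild the sample table shown above (base R only).
subject <- c("P235", "P432", "P421", "P93", "P876", "P543", "P532", "P152")
stage   <- c("Benign_or_BL", "Benign_or_BL", "Early", "Early", "Early",
             "Late", "Late", "Late")

clindata <- data.frame(
  Subject          = rep(subject, times = 2),
  Material_Source  = rep(c("FFPE", "Fresh"), each = length(subject)),
  Tumor_stage      = rep(stage, times = 2),
  stringsAsFactors = TRUE
)

# Option (A): tumor stage, material source and patient as separate terms.
design_A <- model.matrix(~0 + Tumor_stage + Material_Source + Subject,
                         data = clindata)

# Option (B): tumor stage with patient as the only covariate.
design_B <- model.matrix(~0 + Tumor_stage + Subject, data = clindata)

# Hypothetical helper: number of columns versus number of independent columns.
summarise_design <- function(d) c(columns = ncol(d), rank = qr(d)$rank)

summary_A <- summarise_design(design_A)
summary_B <- summarise_design(design_B)

print(rbind(A = summary_A, B = summary_B))
```

A rank below the column count in either option would suggest that tumor stage and patient cannot both be fitted as fixed terms in that form.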

Any help is greatly appreciated here. Thanks!

