During my PhD and early postdoc I performed differential gene expression analyses more than once. In each case, I used the classic workflow of htseq + DESeq2.
I know how it works, I know the pitfalls, I know the possible biases, but one theoretical question remains open for me, although I have read the DESeq2 manual more than once.
If I have a dataset that looks like this:
Samples    Timepoint  Treatment  Replicate
Sample_1   D1         A          R1
Sample_2   D1         A          R2
Sample_3   D1         A          R3
Sample_4   D1         B          R1
Sample_5   D1         B          R2
Sample_6   D1         B          R3
Sample_7   D2         A          R1
Sample_8   D2         A          R2
Sample_9   D2         A          R3
Sample_10  D2         B          R1
Sample_11  D2         B          R2
Sample_12  D2         B          R3
And I want to compare Treatment B vs Treatment A in both timepoints:
D1, B vs A
D2, B vs A
I would set the design variable in DESeqDataSetFromMatrix this way: I would combine Timepoint and Treatment into a column called Condition, which contains entries such as c("D1_A", "D1_B", "D2_A", "D2_B"). These are indeed the four "conditions" that I want to compare.
Then, I would call it like DESeqDataSetFromMatrix( ... , design = ~ Condition).
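To make the setup concrete, here is a minimal sketch of how I build that column, using base R only. The data frame name coldata and the counts object mentioned in the comments are hypothetical placeholders, and the DESeq2 calls are left commented out because no count matrix is included here.

```r
# Hypothetical sample table matching the layout shown above
coldata <- data.frame(
  Timepoint = factor(rep(c("D1", "D2"), each = 6)),
  Treatment = factor(rep(rep(c("A", "B"), each = 3), times = 2)),
  Replicate = factor(rep(c("R1", "R2", "R3"), times = 4)),
  row.names = paste0("Sample_", 1:12)
)

# Combine Timepoint and Treatment into a single four-level factor
coldata$Condition <- factor(
  paste(coldata$Timepoint, coldata$Treatment, sep = "_")
)

# Inspect the levels and the design matrix implied by ~ Condition
print(levels(coldata$Condition))
design_matrix <- model.matrix(~ Condition, data = coldata)
print(head(design_matrix))

# With a count matrix (hypothetical object `counts`, genes x 12 samples):
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(countData = counts,
#                               colData   = coldata,
#                               design    = ~ Condition)
# dds <- DESeq(dds)
# res_D1 <- results(dds, contrast = c("Condition", "D1_B", "D1_A"))
# res_D2 <- results(dds, contrast = c("Condition", "D2_B", "D2_A"))
```

With this design, each of the two comparisons I care about is a single contrast between two levels of Condition.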
However, the DESeq2 manuals also show other ways of treating your samples. For example, I could have specified something like
DESeqDataSetFromMatrix( ... , design = ~ Timepoint + Treatment + Timepoint:Treatment). What I don't understand is what the third part of this design syntax, Timepoint:Treatment, means in mathematical and biological terms.
Does it indicate a scenario where the Timepoint and Treatment columns are not entirely independent in the experimental setup (e.g. if "treatment" were "viral load", which I could expect to grow with time regardless of how I treat the samples)?