I have a dataset with a variety of samples that vary in Age (Young/Old) and Sex (M/F).
I'm interested in testing a few hypotheses, including (Q1) "What genes are DE as a product of Sex?" and (Q2) "What genes are DE as a result of the interaction of Age and Sex?"
To answer Q1, I originally imported data like so:
# Attempt 1
library(DESeq2)
myData <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable_forAllMySamples,
                                     directory = pathToHTSeq,
                                     design = ~Sex)
dds <- DESeq(myData)
This produced a very large DE gene list.
Later, I redid this analysis with a different design matrix including interaction and contrasts, like so:
# Attempt 2
myData <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable_forAllMySamples,
                                     directory = pathToHTSeq,
                                     design = ~Sex + Age + Sex:Age)
dds <- DESeq(myData)
# results() takes the argument `contrast` (singular)
res <- results(dds, contrast = c('Sex', 'M', 'F'))
However, this produced a MUCH smaller list of DE genes.
My understanding is that extracting this contrast should test the main effect of Sex in my dataset (so, Sex effects regardless of Age).
I had expected that to give the same result as using the design ~Sex on its own, but it looks like that isn't the case. Why is that?
Is that because Attempt 2's design "controls" for Age and any interaction effects, but Attempt 1 does not? Can anyone help me understand a bit better what is being tested in Attempt 1, or point me towards resources to strengthen my understanding of what it was doing?
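To make the comparison concrete, here is a minimal sketch of how the coefficients of the two designs can be inspected with base R's model.matrix(). The sample table below is hypothetical (one sample per group, not my real data), and I set F and Young as the reference levels explicitly; by default R orders factor levels alphabetically, so Old would be the reference for Age unless it is releveled.

```r
# Hypothetical 2x2 layout, one sample per Sex/Age group (illustration only)
samples <- data.frame(
  Sex = factor(c("F", "M", "F", "M"), levels = c("F", "M")),
  Age = factor(c("Young", "Young", "Old", "Old"), levels = c("Young", "Old"))
)

# Attempt 1: a single Sex coefficient; Age is ignored entirely
mm1 <- model.matrix(~ Sex, data = samples)
colnames(mm1)

# Attempt 2: main effects plus the Sex:Age interaction
mm2 <- model.matrix(~ Sex + Age + Sex:Age, data = samples)
colnames(mm2)

# Difference between the M and F rows within each Age group:
# which coefficients make up the M vs F comparison in that group
m_vs_f_young <- mm2[2, ] - mm2[1, ]  # only SexM is involved
m_vs_f_old   <- mm2[4, ] - mm2[3, ]  # SexM plus SexM:AgeOld
m_vs_f_young
m_vs_f_old
```

If I read this correctly, the SexM coefficient in the interaction design is the M vs F difference within the reference Age level only, while the M vs F difference in the other Age level also includes the interaction term. That would mean the contrast in Attempt 2 is not the same comparison as the single Sex coefficient in Attempt 1, which pools over Age.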
Possibly relevant: in a PCA plot of my rlog-normalized data, the samples clustered very well by Sex, and less well by Age.
Thank you very much for your help!