DESeq2 design matrix with 3 factors
2.3 years ago
kand3e ▴ 50

Hi Everyone,

I'm a complete novice to DEG analysis and linear models, and I have some questions about setting up the design matrix. I have read some posts in this forum about similar experimental designs, but they don't really have the answers I'm looking for. My experiment was designed as follows:

1) Two genotype groups (Genotype: WT vs. KO)

2) Two treatment conditions for each genotype group (Condition: Ctrl vs. Trt)

3) Equal numbers of both sexes in each genotype group under each treatment condition (Sex: F vs. M)

When we first designed this experiment, sex was not a factor we considered; the main purpose was just to see whether the expression profiles of the two genotypes differ at steady state (Ctrl) and after stimulation (Trt). We included equal numbers of both sexes in each group just in case of sex bias. However, when we ran a PCA, we actually saw some differences between the sexes within each genotype, and these differences increased further after treatment.

Now, the questions we would like to answer are:

1) If we just want to see how genotype and treatment interact (E.g.: Ctrl WT vs Trt WT || Ctrl KO vs Trt KO || Ctrl WT vs Ctrl KO || Trt WT vs Trt KO), should I use a design=~Genotype+Condition+Genotype:Condition and follow the comparison setups here

or now knowing there are variations in sex, use a design=~Sex+Genotype+Condition+Genotype:Condition (to take care of differences in sex) and still follow the same comparison setups as indicated in the link above?
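To make the four comparisons in question 1 concrete, here is a minimal sketch in plain Python (not DESeq2 itself) of how each one maps onto the coefficients of a Genotype + Condition + Genotype:Condition model. It assumes WT and Ctrl are the reference levels, and the coefficient labels are illustrative names, not necessarily the ones DESeq2 would print.

```python
# Illustrative coefficient labels, assuming WT and Ctrl are the reference levels.
COEFS = ["Intercept", "Genotype_KO", "Condition_Trt", "GenotypeKO.ConditionTrt"]

def design_row(genotype, condition):
    """Design-matrix row for one Genotype x Condition cell."""
    g = int(genotype == "KO")
    c = int(condition == "Trt")
    return [1, g, c, g * c]  # the interaction column is the product g * c

def contrast(cell_a, cell_b):
    """Coefficient weights for the comparison 'cell_a vs cell_b'."""
    return [a - b for a, b in zip(design_row(*cell_a), design_row(*cell_b))]

comparisons = {
    "Trt WT vs Ctrl WT": contrast(("WT", "Trt"), ("WT", "Ctrl")),
    "Trt KO vs Ctrl KO": contrast(("KO", "Trt"), ("KO", "Ctrl")),
    "Ctrl KO vs Ctrl WT": contrast(("KO", "Ctrl"), ("WT", "Ctrl")),
    "Trt KO vs Trt WT": contrast(("KO", "Trt"), ("WT", "Trt")),
}

for name, weights in comparisons.items():
    terms = [coef for coef, w in zip(COEFS, weights) if w != 0]
    print(f"{name}: {' + '.join(terms)}")
```

Under these assumptions, the treatment effect in KO is the Condition coefficient plus the interaction coefficient, and the genotype effect after treatment is the Genotype coefficient plus the interaction coefficient. Adding Sex as an extra additive term adds one more column but leaves these four weight patterns unchanged.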

2) If we also want to see how gene expression differs between the sexes within a genotype group, and between the two genotype groups under each treatment condition (e.g. F vs M in Ctrl WT || F vs M in Ctrl KO || F vs M in Trt WT || F vs M in Trt KO || F Ctrl WT vs F Ctrl KO || M Ctrl WT vs M Ctrl KO || F Trt WT vs F Trt KO || M Trt WT vs M Trt KO), how should I set up the design matrix? I have very limited knowledge of how interaction terms work and I'm not sure what I should do in order to get all those comparisons. I would really appreciate it if someone could provide some advice.

Also, I've read in some other posts that for complex design such as this, maybe it's better to name each sample using all three factors (e.g. F_Ctrl_WT, F_Trt_WT etc.) and just use the "contrast" command to call out the groups I'm interested in comparing. Will this work? How is this different than using the "~A+B+C+A:C+B:C" type of setup?
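As a rough illustration of the combined-factor idea, the sketch below (plain Python, not DESeq2) builds one label per Sex/Condition/Genotype cell and writes each comparison from question 2 as a +1/-1 weighting over the eight group means. The label format follows the F_Ctrl_WT example above; everything else is an assumption made for illustration.

```python
from itertools import product

sexes = ["F", "M"]
conditions = ["Ctrl", "Trt"]
genotypes = ["WT", "KO"]

# One combined level per cell of the 2 x 2 x 2 design, e.g. "F_Ctrl_WT".
groups = ["_".join(cell) for cell in product(sexes, conditions, genotypes)]

def contrast_vector(numerator, denominator):
    """Weights over the group means: +1 for numerator, -1 for denominator."""
    return [(g == numerator) - (g == denominator) for g in groups]

# F vs M within each Condition/Genotype cell.
sex_comparisons = [(f"F_{c}_{g}", f"M_{c}_{g}")
                   for c, g in product(conditions, genotypes)]
# WT vs KO within each Sex/Condition cell.
genotype_comparisons = [(f"{s}_{c}_WT", f"{s}_{c}_KO")
                        for s, c in product(sexes, conditions)]

for num, den in sex_comparisons + genotype_comparisons:
    print(f"{num} vs {den}: {contrast_vector(num, den)}")
```

With a single combined factor, every pairwise comparison is just a difference between two of the eight group means, which is why it is easy to pull out with a contrast. In DESeq2 terms this would correspond to a design on a combined column (say a hypothetical column named group) and a call such as results(dds, contrast=c("group", "F_Ctrl_WT", "M_Ctrl_WT")). Such a model is equivalent to the full three-way factorial A*B*C. A formula like ~A+B+C+A:C+B:C is different: it leaves out the A:B and A:B:C terms, so it fits six coefficients instead of eight and cannot represent every cell mean independently.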

Thanks so much for your help!

rna-seq deseq2 Forum • 2.4k views

For 1), I would indeed use a factorial model, since the design is then as simple as ~factor, followed by whichever contrasts you want. To keep things simple, wouldn't it be better to fit a new model for 2), again full factorial, to compare e.g. F_Ctrl_WT vs M_Ctrl_WT? See the DESeq2 manual; it discusses interactions and identical factorial designs.


Thanks ATpoint! So for question 1), should I add Sex to the design to account for variation in gene expression between the sexes, or just leave it as ~Genotype+Condition+Genotype:Condition? I actually tried running both designs as a test, and I identify a few more DEGs with Sex included than without it. Do you know why that might be?


Try to determine, first, if Sex is a confounding factor. Leave it out of the formula and then generate a PCA bi-plot. If you notice any stratification based on Sex, then maybe include it in the design formula. These types of things are 'executive' decisions that you as an analyst will have to make repeatedly in your career.

By leaving it in your formula, you are essentially then 'controlling for' the effect of Sex when deriving test statistics for Condition / Genotype. However, you would not want to control for something if it's not necessary.


Hi Kevin, Thanks so much for your suggestion.

We did the PCA analysis, and the samples are separated by both genotype (smaller separation) and treatment (larger separation) on PC1, and on PC2 we do see a very distinct separation of male and female. Therefore, I guess I should control for the effect of Sex in the formula. But what I couldn't really understand is why, after I add the Sex effect to the design matrix, the list of DEGs I get is actually even longer than if I leave it out; I thought it would be the other way around :S

Anyway, would you have any recommendations on how I should go about targeting question 2? Those are the specific pairwise comparisons we are interested in, so it would be great if you could give me some insight into how to set up the design matrix for those comparisons. Thanks a lot for your help!


I am not the best person to answer in detail why you would find more statistically significant differentially expressed genes after including Sex in the design formula; however, try to think of it this way: by not including it and not controlling for the effect of sex, the true condition and genotype effects for some genes will actually be masked by differences relating to sex, and those effects only become apparent after you control for sex. I am trying to think of an example but somewhat struggling... The best thing would be to actually produce box-plots of these 'new' genes and see how their profiles differ according to all parameters in the design formula.
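One possible toy example of this masking effect, sketched in plain Python with made-up numbers: a single gene with a modest treatment effect and a large sex effect, tested with and without sex in the model. This is an ordinary least-squares analogy only; DESeq2 fits a negative binomial GLM, so the sketch shows the principle and not DESeq2's actual test.

```python
from math import sqrt
from statistics import mean

# Hypothetical log-scale expression values for one gene (made up for illustration).
samples = [
    ("Ctrl", "F", 5.0), ("Ctrl", "F", 5.2),
    ("Ctrl", "M", 9.1), ("Ctrl", "M", 8.9),
    ("Trt", "F", 6.1), ("Trt", "F", 5.9),
    ("Trt", "M", 10.0), ("Trt", "M", 10.2),
]

def group_mean(index, level):
    """Mean expression over samples whose factor at `index` equals `level`."""
    return mean(s[2] for s in samples if s[index] == level)

grand = mean(s[2] for s in samples)
cond_mean = {c: group_mean(0, c) for c in ("Ctrl", "Trt")}
sex_mean = {x: group_mean(1, x) for x in ("F", "M")}

# The design is balanced, so the Trt - Ctrl estimate is the same in both models.
estimate = cond_mean["Trt"] - cond_mean["Ctrl"]

def t_statistic(fitted, n_params):
    """t-statistic for the condition effect given a model's fitted values."""
    rss = sum((y - fitted(c, x)) ** 2 for c, x, y in samples)
    residual_var = rss / (len(samples) - n_params)
    std_err = sqrt(residual_var * (1 / 4 + 1 / 4))  # 4 samples per condition
    return estimate / std_err

# ~ condition: the sex differences end up in the residuals.
t_unadjusted = t_statistic(lambda c, x: cond_mean[c], n_params=2)
# ~ sex + condition: additive fit for a balanced design.
t_adjusted = t_statistic(lambda c, x: cond_mean[c] + sex_mean[x] - grand, n_params=3)

print(f"estimate={estimate:.2f}  t(~condition)={t_unadjusted:.2f}  "
      f"t(~sex+condition)={t_adjusted:.2f}")
```

The estimated treatment effect is identical in both fits. What changes is the residual variance: without sex in the model, the male/female difference is counted as noise and inflates the standard error, so the test statistic is small. Once sex is modelled, the residual variance shrinks and the same effect becomes clearly detectable, which is one way a longer DEG list can arise.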

I think that, for the second part, you may need an interaction term. However, you already have an interaction term, from what I understand. In that case, you may have to create a new 'merged' parameter, like, GenotypeSex, and use that in an interaction with Condition.

There are very good examples listed at the end of the manual entry page for DESeq2::results. Have you looked there? Just type ?DESeq2::results in the R console and scroll down to the end of the entry page.


Hi Kand3e,

I have a problem similar to yours, getting more DEGs with an extra factor included in the design than not having it. After 10 months do you have an answer for question 1) ?


There is no answer. It seems that you may require advice from a statistician; so, I would encourage you to seek that locally at your university. A fundamental understanding of regression model formulae would also help.

For example:

• ~ condition means that we are testing condition
• ~ condition + sex means that we are testing condition, while adjusting for the effects of sex
• ~ condition + sex + BMI means that we are testing condition, while adjusting for the effects of sex and BMI
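To show what these formulae translate to, here is a minimal sketch in plain Python of the design-matrix columns behind ~ condition and ~ condition + sex. The sample sheet is hypothetical, Ctrl and F are assumed to be the reference levels, and the column labels are only styled after typical DESeq2 coefficient names.

```python
# Hypothetical sample sheet: (condition, sex). Ctrl and F are the reference levels.
samples = [("Ctrl", "F"), ("Ctrl", "M"), ("Trt", "F"), ("Trt", "M")]

def design_matrix(samples, adjust_for_sex):
    """Return (column names, rows) for ~ condition or ~ condition + sex."""
    columns = ["Intercept", "condition_Trt_vs_Ctrl"]
    if adjust_for_sex:
        columns.append("sex_M_vs_F")
    rows = []
    for condition, sex in samples:
        row = [1, int(condition == "Trt")]  # intercept, condition indicator
        if adjust_for_sex:
            row.append(int(sex == "M"))     # extra column absorbs the sex effect
        rows.append(row)
    return columns, rows

for adjust in (False, True):
    columns, rows = design_matrix(samples, adjust)
    print(columns)
    for row in rows:
        print(" ", row)
```

Each adjusting variable simply adds a column, and the condition coefficient is then estimated with that column's effect accounted for. The order of additive terms does not change the fitted model; in DESeq2 it only determines which variable results() reports by default (the last one in the design), and any other coefficient can still be requested explicitly.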

https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/homework/DGE_assignment_2_answer_key.R In this tutorial, they say the last term is the one being tested while adjusting for the effects of the others. Are you saying something similar? I'm confused...