Question: How is the design in DESeq2 work?
8
3.2 years ago by
MAPK1.7k
MAPK1.7k wrote:

Hi All, I am working on RNAseq data analysis using DESeq2 R package. I havemy code `dds <- DESeqDataSetFromMatrix(countData = count.mat , colData = cond, design = ~Strain + Time)` to create the matrix. My confusion is with the design formula `design = ~Strain + Time` where I have `Strain` and `Time` variables to compare in my colData for my countData matrix.

This is my `cond` matrix

``````           Strain Time
count1           1    1
count2           1    1
count4           1    2
count5           1    2
count13          2    1
count14          2    1
count16          2    2
count17          2    2
countX           2    3
``````

First, I want to understand the difference between these four designs that could go in the function `DESeqDataSetFromMatrix`:

``````a) `design = ~Strain + Time + Strain:Time`
b) `design = ~Strain + Time`
c) `design = ~Time`
and d) `design = ~Strain`
``````

Second,

My understanding is that the DESeq2 takes the last variable in the design formula (here Time) as a control variable, so to test for different samples in Time group, I have these codes below. So, I want to know what the outputs of `resultNames(ddsTC)` really mean?

``````ddsTC <- DESeqDataSet(dds, ~ Strain + Time ) ##For time
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~Time )    #For Time

resultsNames(ddsTC)
[1] "Intercept.1"     "Time_3_vs_1.1"   "Time_2_vs_1.1"   "Strain_2_vs_1.1"
``````
deseq2 • 11k views
ADD COMMENTlink
modified 3.2 years ago by Devon Ryan98k • written 3.2 years ago by MAPK1.7k
3

Hey, To the best of my knockledge A) `design = ~Strain + Time` means that deseq2 will test the effect of the Time (the last factor), controlling for the effect of Strain (the fi rst factor), so that the algorithm returns the fold change result only from the effect of time. B) `design = ~Time` here the algorithm will return the fold change that result from time without correcting for fold change that result from strain C) `design = ~Strain` same as above

So in my understanding Deseq2 treats the first factor as a co-variate and tries to eliminate the fold change that result because of this co-variate.

ADD REPLYlink written 3.2 years ago by tarek.mohamed270
1

Sorry to up this message but when we apply the design " design = ~Strain + Time ", how exactly deseq2 control the effect of Strain in order to test the Time ? It's not clear for me ..

ADD REPLYlink written 19 months ago by darbinator230
1

It's part of the GLM, that's how.

ADD REPLYlink written 19 months ago by Devon Ryan98k

Thank you so much for your answer. So what does `design = ~Strain + Time + Strain:Time` mean? Also, do you know what are the outputs of `resultsNames(ddsTC)` comparing?

ADD REPLYlink written 3.2 years ago by MAPK1.7k
3

Regarding 'design = ~Strain + Time + Strain:Time` ,

Here you added an interaction term (how time is interacting with stain in relation to regulation of gene expression). So this design will return the effect of time on the reference level of strain (1 or 2 depends on your setting). Using contrast () you can look for the effect of time on the other level in "strain"

Alternatively you can group the strain (with its different levels) and time ((with its different levels) into one factor, lets call it ALL. and by using contrast () you can look for the difference in log2 fold change between any combination of levels.

dds\$ALL <- factor(paste0(dds\$time, dds\$strain)) design(dds) <- ~ ALL

dds <- DESeq(dds) resultsNames(dds) results(dds, contrast=c("ALL","12", "11"))

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by tarek.mohamed270
10
3.2 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

This ends up not being a questions about DESeq2, but about how linear models work (the same nomenclature is used for ANOVAs). tarek.mohamed basically explained this in the comments, but I'll point out that everything from limma to a standard ANOVA works exactly like this.

My understanding is that the DESeq2 takes the last variable in the design formula (here Time) as a control variable, so to test for different samples in Time group, I have these codes below. So, I want to know what the outputs of resultNames(ddsTC) really mean?

It doesn't take the last variable as a control, it just uses that one for plotting and output by default. So if you do `design=Strain + Time` and then do `results()` then the results will be for the `Time` effect.

`resultNames()` is giving you the names of the coefficients in your model.

``````ddsTC <- DESeqDataSet(dds, ~ Strain + Time ) ##For time
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~Time )    #For Time
``````

Note that you're testing the effect of `Strain` above, not `Time` as suggested by the comment.

I strongly recommend familiarizing yourself with basic linear models. A pretty hefty chunk of the statistics used in bioinformatics boils down to that.

ADD COMMENTlink written 3.2 years ago by Devon Ryan98k
4

Just posted this on twitter, but cross-posting here too:

The syntax is originally from here: https://www.jstor.org/stable/2346786?seq=1#metadata_info_tab_contents

It is sometimes called Wilkinson (& Rogers) syntax.

In the vignette, there are diagrams on interpreting interactions. A brief introduction to the design formula is in the workflow (which proceeds at a slower pace than the vignette). There are numerous design examples in `?results`. Users often miss the help pages, but these are a good resource for learning how the functions work.

Also see our edX material on linear models (Week 3):

http://rafalab.github.io/pages/harvardx.html

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Michael Love2.1k

Is there a new version of that class still offered?

ADD REPLYlink written 22 months ago by ariel110
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1276 users visited in the last hour