Question: How is the design in DESeq2 work?
0
gravatar for MAPK
8 months ago by
MAPK1.2k
United States
MAPK1.2k wrote:

Hi All, I am working on RNAseq data analysis using DESeq2 R package. I havemy code dds <- DESeqDataSetFromMatrix(countData = count.mat , colData = cond, design = ~Strain + Time) to create the matrix. My confusion is with the design formula design = ~Strain + Time where I have Strain and Time variables to compare in my colData for my countData matrix.

This is my cond matrix

           Strain Time
count1           1    1
count2           1    1
count4           1    2
count5           1    2
count13          2    1
count14          2    1
count16          2    2
count17          2    2
countX           2    3

First, I want to understand the difference between these four designs that could go in the function DESeqDataSetFromMatrix:

a) `design = ~Strain + Time + Strain:Time`
b) `design = ~Strain + Time`
c) `design = ~Time`
and d) `design = ~Strain`

Second,

My understanding is that the DESeq2 takes the last variable in the design formula (here Time) as a control variable, so to test for different samples in Time group, I have these codes below. So, I want to know what the outputs of resultNames(ddsTC) really mean?

ddsTC <- DESeqDataSet(dds, ~ Strain + Time ) ##For time
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~Time )    #For Time

resultsNames(ddsTC) 
 [1] "Intercept.1"     "Time_3_vs_1.1"   "Time_2_vs_1.1"   "Strain_2_vs_1.1"
deseq2 • 547 views
ADD COMMENTlink modified 8 months ago by Devon Ryan80k • written 8 months ago by MAPK1.2k
2

Hey, To the best of my knockledge A) design = ~Strain + Time means that deseq2 will test the effect of the Time (the last factor), controlling for the effect of Strain (the fi rst factor), so that the algorithm returns the fold change result only from the effect of time. B) design = ~Time here the algorithm will return the fold change that result from time without correcting for fold change that result from strain C) design = ~Strain same as above

So in my understanding Deseq2 treats the first factor as a co-variate and tries to eliminate the fold change that result because of this co-variate.

ADD REPLYlink written 8 months ago by tarek.mohamed140

Thank you so much for your answer. So what does design = ~Strain + Time + Strain:Time mean? Also, do you know what are the outputs of resultsNames(ddsTC) comparing?

ADD REPLYlink written 8 months ago by MAPK1.2k
1

Regarding 'design = ~Strain + Time + Strain:Time` ,

Here you added an interaction term (how time is interacting with stain in relation to regulation of gene expression). So this design will return the effect of time on the reference level of strain (1 or 2 depends on your setting). Using contrast () you can look for the effect of time on the other level in "strain"

Alternatively you can group the strain (with its different levels) and time ((with its different levels) into one factor, lets call it ALL. and by using contrast () you can look for the difference in log2 fold change between any combination of levels.

dds$ALL <- factor(paste0(dds$time, dds$strain)) design(dds) <- ~ ALL

dds <- DESeq(dds) resultsNames(dds) results(dds, contrast=c("ALL","12", "11"))

ADD REPLYlink modified 8 months ago • written 8 months ago by tarek.mohamed140
2
gravatar for Devon Ryan
8 months ago by
Devon Ryan80k
Freiburg, Germany
Devon Ryan80k wrote:

This ends up not being a questions about DESeq2, but about how linear models work (the same nomenclature is used for ANOVAs). tarek.mohamed basically explained this in the comments, but I'll point out that everything from limma to a standard ANOVA works exactly like this.

My understanding is that the DESeq2 takes the last variable in the design formula (here Time) as a control variable, so to test for different samples in Time group, I have these codes below. So, I want to know what the outputs of resultNames(ddsTC) really mean?

It doesn't take the last variable as a control, it just uses that one for plotting and output by default. So if you do design=Strain + Time and then do results() then the results will be for the Time effect.

resultNames() is giving you the names of the coefficients in your model.

ddsTC <- DESeqDataSet(dds, ~ Strain + Time ) ##For time
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~Time )    #For Time

Note that you're testing the effect of Strain above, not Time as suggested by the comment.

I strongly recommend familiarizing yourself with basic linear models. A pretty hefty chunk of the statistics used in bioinformatics boils down to that.

ADD COMMENTlink written 8 months ago by Devon Ryan80k

I find it pretty astonishing (but also symptomatic of the entire R universe) that the official DESeq2 tutorials neither bother to explain the nomenclature by themselves, or even provide pointers as to where one can read up on it." Analyzing RNA-seq data with DESeq2" mentions ANOVA exactly once, and quite a distance away from where they introduce designs. And when they talk about designs, they just sort of head into it without any asides. I understand that formulae like these are pretty standard for R, but with packages like these, you likely get a lot of non-programmer crowd, Python crowd etc, who have never seen this and would not know what they need to google to find out...

ADD REPLYlink modified 2 days ago • written 2 days ago by kirill.grigorev0

I'd be astonished if a specialized package like DESeq2 bothered to explain standard statistics nomenclature, it's sort of assumed that if you're going to be using a package for statistical analysis that you'll have a basic background in statistics. I find that assumption entirely fair.

ADD REPLYlink written 2 days ago by Devon Ryan80k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1453 users visited in the last hour