Question: Time course experiments in DESeq2
1
gravatar for Chris Gene
2.6 years ago by
Chris Gene60
United Kingdom
Chris Gene60 wrote:

Hello, I'm having a hard time understanding the time course experimental work flow in DESeq2. I am using the F1000 paper as a template (Love MI, Anders S, Kim V and Huber W. RNA-Seq workflow: gene-level exploratory analysis and differential expression [version 1; referees: 2 approved]. F1000Research 2015, 4:1070 (doi: 10.12688/f1000research.7035.1))

Experiment: I have 2 genotypes, 3 timepoints per genotype. 4 replicates in each genotype-condition pair (3 in one). I have a counts table generated in HTSeq - 23 samples.

coldata <- data.frame(sample=colnames(data),
                          genotype=as.factor(c(rep("het",4), rep("wt",4), rep("het",4), rep("wt",4),rep("het",3), rep("wt",4))),
                          time=as.factor(c(rep("es",8), rep("np",8),rep("nsc",7))))

sample genotype time 1 es1 het es
2 es2 het es
3 es3 het es
4 es4 het es
5 es5 wt es
6 es6 wt es
7 es7 wt es
8 es8 wt es
9 np1 het np
10 np2 het np
11 np3 het np
12 np4 het np
13 np5 wt np
14 np6 wt np
15 np7 wt np
16 np8 wt np
17 nsc1 het nsc
18 nsc2 het nsc
19 nsc4 het nsc
20 nsc5 wt nsc
21 nsc6 wt nsc
22 nsc7 wt nsc
23 nsc8 wt nsc

I have tried it 2 different ways, with slightly different results. My questions are: 1. What is the difference between the two? I have used both elements as factors. 2. It's not clear how the relevel is worning under these conditions 3. Using the same results table, can I compare the intermediate timepoints, i.e, 1st and 2nd then 2nd and 3rd, generating a esults table for each, or would I have to separate the analysis entirely?

Thanks!

Here goes what I have done:

dds <- DESeqDataSetFromMatrix(countData=data, colData=coldata, design=~ time + genotype:time + genotype)

colData(dds)$genotype <- relevel(colData(dds)$genotype, "wt")

ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata, 
                                design=~ time + genotype:time + genotype)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)

Results: log2 fold change (MLE): timensc.genotypewt LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' DataFrame with 4 rows and 6 columns

summary(resTC)

out of 29372 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 9, 0.031% LFC < 0 (down) : 20, 0.068% outliers [1] : 6, 0.02% low counts [2] : 9434, 32% (mean count < 3) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

resultsNames(ddsTC)

[1] "Intercept" "genotype_wt_vs_het" "condition_np_vs_es"
[4] "condition_nsc_vs_es" "genotypewt.conditionnp" "genotypewt.conditionnsc"

Second iteration:

resTC2 <- results(ddsTC, contrast=c("genotype","het","wt"))
head(resTC2[order(resTC2$padj),],4)

log2 fold change (MLE): genotype het vs wt LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' DataFrame with 4 rows and 6 columns

summary(resTC2)

out of 29372 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 16, 0.054% LFC < 0 (down) : 13, 0.044% outliers [1] : 6, 0.02% low counts [2] : 9434, 32% (mean count < 3)

> sessionInfo()

R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.10.5 (Yosemite)

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] sva_3.18.0 genefilter_1.52.1
[3] mgcv_1.8-12 nlme_3.1-126
[5] PoiClaClu_1.0.2 hexbin_1.27.1
[7] RColorBrewer_1.1-2 vsn_3.38.0
[9] ggplot2_2.1.0 pheatmap_1.0.8
[11] DESeq2_1.10.1 RcppArmadillo_0.6.100.0.0 [13] Rcpp_0.12.4 SummarizedExperiment_1.0.2 [15] Biobase_2.30.0 GenomicRanges_1.22.4
[17] GenomeInfoDb_1.6.3 IRanges_2.4.8
[19] S4Vectors_0.8.11 BiocGenerics_0.16.1
[21] maptools_0.8-39 sp_1.2-2

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Chris Gene60
2
gravatar for Michael Love
2.6 years ago by
Michael Love1.7k
United States
Michael Love1.7k wrote:

hi Chris,

See the note in the function help page ?results about the LRT p-values, and then also read the section in the DESeq2 vignette about the likelihood ratio test:

vignette("DESeq2")

Note that you're only changing the log fold change not the p-values with your code. You should perform a Wald test on the interaction terms for individual time points if you are interested in testing those:

results(dds, name="...", test="Wald")

Also, you should set WT to the reference level, so the comparisons are het vs WT not WT vs het. Read the DESeq2 vignette on factor levels.

ADD COMMENTlink written 2.6 years ago by Michael Love1.7k
0
gravatar for Charles Warden
2.6 years ago by
Charles Warden5.8k
Duarte, CA
Charles Warden5.8k wrote:

I would say it is OK to use DESeq2/DESeq/edgeR for comparison of two timepoints (or some sort of analysis that focuses around comparing two groups, which admittedly looks like it applies in this case), and you can Google come recommendations on how to set up time-series analysis with those packages.

However, I would recommend using limma-voom if you want to identify expression changes over time (due to a continuous variable): Maybe you will have better luck with this package.

https://bioconductor.org/packages/release/bioc/html/limma.html

See pages 69 and 46 in the limma usersguide.pdf file

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Charles Warden5.8k
2

DESeq2 and edgeR are both capable of analyzing complex designs, such as time series with interaction terms, leveraging the same base R functionality as limma: the formula class and the model.matrix function. Where did you pick up that DESeq2 or edgeR are only for two group comparisons? You can even include smoothing functions like splines for DESeq2, edgeR or limma equally.

ADD REPLYlink written 2.6 years ago by Michael Love1.7k

DESeq and edgeR both use a negative binomial distribution to calculate p-values. Even in the example above, time is being defined as a factor and not a continuous, numeric variable.

If the goal was to identify genes that change as a function of time without the genotype information, this would be problematic. In this case, "time" is defined a little differently than I would expect (as "es", "np", and "esc"). If instead time was defined as a continuous variable (I don't know what is appropriate in this case, but if you timepoints of 1, 2, and 3 days and you wanted a single p-value to reflect the change in expression over time), then I would expect DEseq to give an error message. For example, I get this error message if I try to run analysis on just time (for a different set of samples):

> dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~ time)
converting counts to integer mode
the design formula contains a numeric variable with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

If I go back to your tutorial (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670015/#__sec31title ), the question that I am concerned about would be "Can you use DESeq2 to identify changes that occur as a function of time (in this case, minutes) in only the WT samples?"

However, I do apologize for being off-topic due to 1) the way time is defined here and 2) the fact that there are still different genotypes being compared. The tutorial is an excellent resource for people whose goal is to identify genes that change between groups at multiple time points.

ADD REPLYlink modified 2.5 years ago • written 2.6 years ago by Charles Warden5.8k
2

Thanks for replying, and I hope I can clear some things up:

The fact that DESeq2 and edgeR use a NB and limma-voom uses a log linear model isn't really relevant here.

It's fine to use time as a continuous variable. For example, you could use a design of ~genotype + f(time) + genotype:f(time) where you use some smooth function for f, such as smoothing splines, to find genes where there is a genotype-specific change in the dependence of expression on time. Or you can use ~f(time) and compare to ~1, or whatever you like. There are many ways in base R to specific this f, ns(), bs(), poly() etc., and users need to be familiar with these approaches to produce reasonable results.

That's not an error. That's just a message.

The point of this message is, sometimes users will input a column called "treatment", with levels 1,2,3,4, etc. And this column can easily be a numeric type, using the default R functions for reading in the colData. However, that would be a big modeling mistake, treatment 4 is not (treatment 2)^2, and so on.

So this message says essentially, "are you sure that's what you want to do?". It just prints the message and let's you continue with your analysis.

In R there are 3 levels of communication with the user:

  • message (here is some information)
  • warning (pay attention! but I'll let you continue) and
  • error (something is wrong and I won't let you continue).

These look like:

 > message("hi")
 hi
 > warning("hi")
 Warning message:
 hi
!> stop("hi")
 Error: hi
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Michael Love1.7k
0
gravatar for Chris Gene
2.6 years ago by
Chris Gene60
United Kingdom
Chris Gene60 wrote:

Thank you Michael.
I just realized I hadn't releveled the data at all.

When performing a Wald test, I'm not sure whether I should include the contrast argument. As I understand from the documentation, if I provide the interaction term under name, and the contrast, the latter will take priority and the function will ignore the name.
Perhaps my confusion stems from the fact that I haven't really understood what the last two interaction terms are referring to.

Now I have the following:

    ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata, 
                                design=~ time + genotype:time + genotype)
ddsTC$genotype <- relevel(ddsTC$genotype, ref="wt")

ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)

# log2 fold change (MLE): timensc.genotypehet
# LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype'
# DataFrame with 4 rows and 6 columns
# baseMean log2FoldChange lfcSE stat

I have the following terms:

    resultsNames(ddsTC)
## "Intercept"           "time_np_vs_es"       "time_nsc_vs_es"     "genotype_het_vs_wt"  "timenp.genotypehet"  "timensc.genotypehet"
  1. If I use the last term "timensc.genotypehet", is that comparing the last timepoint (nc) with the first, and genotype het vs wt?

  2. When I releveled, it was for genotype. Should I assume that for the time factor it levels based on alphabetical order? From the second and third interaction terms that's what it seems to be doing - which happens to be fine here, given the alphabetical order reflects the time, but in future experiments I would want to know that I am applying the formula correctly.

In another post on cell type, you suggest grouping the factors:
https://support.bioconductor.org/p/67600/#67612
Would that be appropriate here?
What is the difference between grouping the factors, and using a list for the contrast argument?

Thank you
Chris

ADD COMMENTlink written 2.6 years ago by Chris Gene60
3

Don't use 'contrast', just use 'name'. The interaction terms tell you the additional difference between genotypes at a given timepoint, beyond the difference at time=0. E.g. timenp.genotypehet is the additional difference between het vs WT at timepoint NP. These are usually the terms that investigators are interested in, because, if the two genotypes have identical profile over time (except allowing for a shift at all time points), then you may say nothing is "interesting" for that gene.

The grouping factors suggestion you link to doesn't let you focus on the differences between genotypes at a given time point, after accounting for the shift at time=0, which seems to be the interesting question for time series across conditions.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Michael Love1.7k
2

And name="genotype_het_vs_wt" or equivalently contrast=c("genotype","het","WT") gives you the het vs WT difference at time=0. This could be of interest.

ADD REPLYlink written 2.6 years ago by Michael Love1.7k

Thanks Michael. That's very helpful.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Chris Gene60

Thanks Mochael. That's very helpful.

ADD REPLYlink written 2.6 years ago by Chris Gene60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1080 users visited in the last hour