DESeq2 design formula considerations
2
0
Entering edit mode
23 months ago
nhaus ▴ 360

Hello,

I am trying to perform a differential gene expression analysis using DESeq2, but I am unsure what design formula I should use. The experimental setup is straight forward. We have two tissues (lung and heart) and for each tissue, we are looking at two genotypes (MT and WT). I am interested in finding differentially expressed genes between the tissues of the same genotype.

My first thought was, that I would simply combine the tissue and genotype to a new variable and then perform differential expression analysis, so that I would ultimately have two contrasts: lung_MT_vs_heart_MT and lung_WT_vs_heart_WT. The corresponding design would simply be design = ~ tissue_genotype.

However, I somehow have the feeling that this is not the correct way to do it. Accounting for the genotype by doing design = ~ genotype + tissue, is also not what I want, because I think that this would give me the differentially expressed genes between the tissues, while simply correcting for the genotype. But I want the differentially expressed genes between tissues for each genotype.

I would very much appreciate if anybody could give me some input if what I am doing is correct.

Cheers!

differential RNAseq expression analysis DGE DESeq2 • 3.8k views
ADD COMMENT
4
Entering edit mode
23 months ago
LauferVA 4.4k

nhaus - This question has been asked many times both here as well as on other fora, for instance the bioconductor forum. Mike Love (the author of the DESeq2 software) has probably answered this question 50-60 times, if I had to guess, so you can find a lot on this subject by googling.

However, to guide you a bit more, I would encourage you to consider the following resources:

  1. The DESeq2 Bioconductor page, which contains many of the rest of the links below.
  2. The DESeq2 Vignette; in particular, the subsection on interactions.
  3. Both the DESeq2 Reference Manual and the Limma Reference Manual.

In addition to these, I want to mention a few terms that you can search and read about that you may not come across in searches like the above...

Background Reading (the basics):

  1. Go to Google or Wikipedia and search first for "Multiple Linear Regression" then for "Main effects versus interaction effects".

If you want to get more technical:

  1. To really (really) understand what is happening in a given model or with a given software, you need to understand what is meant by the terms general linear modeling and then generalized linear modeling.
  2. Once you have familiarity with those two, note that DESeq2 is a form of GLM called Count Regression.
  3. Once you have familiarity with 1 & 2, re-visit the term "Sums of Squares". Not all softwares calculate Sums of Squares (SS) in the same way (SAS and R default settings differed in this regard the last time I checked). There are different ways in which to calculate SS. To provide a quantitative/precise answer to the question you've asked above, you'd need to know a bit about how that's being done.

I know this may seem like minutiae or a pedantic point. Maybe it is, but it can definitely influence your results. As an example of why this could be important to you, consider:

Gene Expression ~ Genotype + Treatment Group + Age

versus

Gene Expression ~ Treatment Group + Age + Genotype

Depending on the type of SS being used (Type I, Type II, Type III etc.) simply the order of the terms alone may or may not influence the effect size estimates (beta, odds ratio) generated ...

OK, final comment. Above, you asked, which is "correct"? Consider the example just above. Which is correct? Well, it depends on exactly what you are trying to study, and exactly what question you are trying to ask.

If you have a chance to review the above and are still unsure which is best for you, please let me know.

ADD COMMENT
1
Entering edit mode

Thank you very much for taking the time to write such a comprehensive answer and providing the pointers for further reading. The _interactions_ section of the DESeq2 vignette was exactly what I needed!

ADD REPLY
0
Entering edit mode

Hello again - glad to hear it.

I think a good test of understanding might be if you can state in words the difference between (for instance):

0.  read count ~ genotype + tissue

1.  read count ~ tissue + genotype

2.  read count ~ (tissue * genotype)

3.  read count ~ genotype + tissue + (tissue * genotype)
ADD REPLY
0
Entering edit mode
23 months ago
LChart 4.3k

But I want the differentially expressed genes between tissues for each genotype.

The simplest solution is to simply run two design = ~ tissue separately for each genotype. This is far more straightforward than trying to build the correct contrasts out of a crossed design ~ tissue * genotype = tissue + genotype + tissue:genotype.

The logFC values are immediately interpretable (and you can plot them against one another, gene for gene); and the loss of power doing it this way versus the full model is really quite small. I almost always use this approach.

ADD COMMENT
1
Entering edit mode

In general, it is not better to literally make two subsets of data; it's better to have the information from all the samples together for normalization and dispersion estimates. It is very easy and readable to compare one subset of samples to another in the first way the OP described

However, the two tissues are so different to each other, in this particular case, I'd separate them.

ADD REPLY
0
Entering edit mode

nhaus - I think both this answer and @swbarnes2's reply have substantial merit. Suppose one wanted to restate the discussion in more abstract terms used in statistical theory. In that case, you might restate this particular answer & comment as "pooled versus stratified study design"

Ok, but ... which is better?

In general, it is possible for a pooled study design to have better statistical power than a stratified analysis or the reverse. For whole transcriptome studies using {gene expression microarray OR RNAseq OR scRNAseq}, because research labs are operating on a set budget and because each sample is costly, often the number of samples (biological replicates) per experiment is fairly low ... As such, in this particular context I tend to agree with @swbarnes2 - usually a pooled analysis (with appropriately calibrated contrasts) will outperform a stratified analysis. However, I have seen cases in which the reverse is true.

To read more on this, I would consider a google query such as the following:

"statistical analysis of gene expression data" "pooled versus stratified study design" "statistical power"

ADD REPLY

Login before adding your answer.

Traffic: 1170 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6