Multivariable Design in EdgeR (Design matrix / model.matrix) - Tilde Use
12 weeks ago
Noah E. ▴ 20

I am using edgeR for the first time and am not confident about how to set up the design matrix for an experiment/dataset with more than one variable.

Let's say I have variables a, b, c, and d. There are 2 possibilities for each of the 4 variables (therefore 8 distinct groups total). I would like to investigate the following:

  1. The effect of each variable independently (e.g. a1 v a2, b1 v b2, this would result in 4 separate comparisons)
  2. The effect of each of the variables altogether (e.g. comparing across all 8 of the possible groups).

Based on page 14 of the edgeR guide, it seems that the tilde (~) can be used to specify something that may contribute to a difference in count values but is not the primary variable you want to investigate. Other posts, however, suggest that the tilde in R is used to separate the dependent variable from the independent variables.

What is the use of the tilde in the model.matrix function/design in both of the circumstances that I would want above? For further illustration:

  1. I want to see the effect of a while 'removing' the potential mediators b, c, and d. Would I use the following? This is more or less an educated guess.
design <- model.matrix(~ b + ~c + ~d + a)
  2. I want to see the effects of a, b, c, and d altogether. Would I use the following? (Again, this is little more than a guess; my reasoning is that putting no tilde on any of the variables indicates they should all be evaluated.)
design <- model.matrix(a + b + c + d)

The tilde is part of R formula notation. See ?formula for more information.
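As a minimal sketch of what that means for a design matrix: the formula passed to model.matrix is one-sided, so a single tilde at the start introduces all of the explanatory variables, and no response appears on the left because the counts are supplied to edgeR separately. The sample sheet below (two hypothetical two-level factors named a and b, two replicates per combination) is made up purely for illustration and uses base R only.

```r
# Hypothetical sample sheet: two two-level factors, two replicates per combination.
# expand.grid() returns character inputs as factors, with levels in the order given.
samples <- expand.grid(a = c("a1", "a2"), b = c("b1", "b2"), rep = 1:2)

# One-sided formula: one tilde, then every explanatory variable joined by '+'.
design <- model.matrix(~ a + b, data = samples)

# One row per sample, one column per model coefficient.
print(colnames(design))
print(dim(design))
```

Each non-intercept column is named after a variable and its non-reference level, so the column names show which comparison each coefficient represents.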

There isn't enough information about your variables and experimental design to give good advice. Can you provide more information?


Sure thing! In the attached experimental design, I would like to see:

Experimental Design

  • Differential expression of genes between WT and KO
  • Difference of puromycin selection
  • Difference between timepoints

For each of the above, would it be better to look at each 'pairwise' or 'groupwise'? For example, for WT versus KO, should I look at 1 v. 4, 2 v. 5, and 3 v. 6, OR would it be better to compare across all 3 for each (1-3 versus 4-6)?

Also, is there any way to further understand the interaction between variables with the experimental design outlined above? For instance, is there any means to evaluate differences across all of the groups? I have done so in the past with DESeq2 analyses.

Thank you!


For further context, page 22 of the edgeR guide shows the following way to "find genes different between any of the three groups":

qlf <- glmQLFTest(fit, coef=2:3)

Would this means of calculating differential expression (namely only accounting for it in the coef parameter) be considered sufficient if no proper design matrix is made earlier on? Also, how can I appropriately leverage the coef parameter with, say, 6 groups?


For your first question, your regression formula will be ~ Genotype + Puromycin + Time. If you code the Genotype column as a factor with WT as the base level, the Genotype coefficient (KO relative to WT) will be the answer to your first question.
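A sketch of that setup in base R, under loud assumptions: the sample sheet below is a hypothetical fully crossed layout with two replicates, not the actual design from the attached image, and the level labels ("no"/"yes", "0d"/"8d") are invented for illustration.

```r
# Hypothetical, fully crossed sample sheet (NOT the poster's actual layout).
samples <- expand.grid(
  Genotype  = c("KO", "WT"),
  Puromycin = c("no", "yes"),
  Time      = c("0d", "8d"),
  Replicate = 1:2
)

# Make WT the base (reference) level so the genotype coefficient reads "KO vs WT".
samples$Genotype <- relevel(factor(samples$Genotype), ref = "WT")

# Additive model: each coefficient is one variable's effect adjusted for the others.
design <- model.matrix(~ Genotype + Puromycin + Time, data = samples)
print(colnames(design))

# With edgeR loaded and a DGEList `y` (assumed, not created here), the genotype
# effect could then be tested by naming its column, for example:
#   fit <- glmQLFit(y, design)
#   qlf <- glmQLFTest(fit, coef = "GenotypeKO")
```

Naming the coefficient explicitly avoids relying on its position in the design matrix.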

For question two, do you want the effect of puromycin selection independent of genotype and time? If not, what question do you want answered more specifically?

For question three, are you looking for the difference in the effect of puromycin over time in WT versus KO?
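Questions of this kind (one pair of groups, an average over groups, or whether an effect differs between genotypes) are often written as contrasts on a design with one coefficient per group. The following is only a sketch: it assumes six groups labelled G1 to G6 with three replicates each, with G1-G3 being WT and G4-G6 being KO as in the numbering used above, and the particular contrasts are hypothetical examples. The contrast vectors are built by hand in base R; limma's makeContrasts can generate the same vectors from expressions.

```r
# Hypothetical grouping: six groups, three replicates each.
group <- factor(rep(paste0("G", 1:6), each = 3))

# No-intercept design: one column (one mean) per group.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Pairwise comparison: KO group 4 versus WT group 1.
pair_4v1 <- c(G1 = -1, G2 = 0, G3 = 0, G4 = 1, G5 = 0, G6 = 0)

# Groupwise comparison: average of the KO groups versus average of the WT groups.
ko_vs_wt <- c(G1 = -1, G2 = -1, G3 = -1, G4 = 1, G5 = 1, G6 = 1) / 3

# Interaction-style comparison (difference of differences):
# (G5 - G4) - (G2 - G1), i.e. does the 1-to-2 change differ between genotypes?
interaction_12 <- c(G1 = 1, G2 = -1, G3 = 0, G4 = -1, G5 = 1, G6 = 0)

contrasts <- cbind(pair_4v1, ko_vs_wt, interaction_12)
print(contrasts)

# With edgeR loaded and a fitted object `fit` from glmQLFit(y, design) (assumed),
# one contrast at a time could be tested, for example:
#   qlf <- glmQLFTest(fit, contrast = contrasts[, "ko_vs_wt"])
```

Each contrast column sums to zero, which is a quick check that it compares groups rather than testing a group mean against zero.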


Thank you for the reply!

  1. For question 1: does your proposed model.matrix (with ~Genotype + Puromycin + Time) perform comparison #1 or #2 in the picture? In other words, is this saying:
  • "Compare all of the WT genotype samples with all of the KO genotype samples, regardless of Puromycin treatment and/or time" (option 1)
  • "Compare each WT genotype sample with its corresponding KO genotype sample" (option 2)

Experimental Design

I gather it is option #1 but wanted to be 100% sure. I also recognize there may be a third option that I am not fully considering. I guess I do not understand how the +Puromycin + Time works.

  2. For question 2: I would like to know how to set it up both independently of genotype and time and while taking them into account. I think both analyses may be useful to have. Continuing with the example above, I would want to know how to set up 'option 1' and 'option 2' (or whatever other option you think is most appropriate).

  3. For question 3: I would like to see how the '0d' samples compare to the '8d' samples. Again, I would want to know how to perform 'option 1' (all 0d versus all 8d) and 'option 2' (pairwise comparison).

I hope this helps! I apologize for my inexperience.

