Question: [edgeR usage] Comparing categories of fewer input variables for differential gene expression
2
moxu460 wrote:

Suppose you have tumor and normal samples treated with compound A with 0, 1, 2 mM for 0, 1, 2 hours, and you want to do DGE analysis using edgeR. You want to answer the question "which genes are differentially expressed in response to compound A": after taking into account of the impact of disease condition, time of treatment, and concentration of A (noted as [A]), you want to focus on the relationship of treatment by A and gene expression, and don't care about disease condition, time of treatment, not even [A].

How do you do it in edgeR?

In R, assuming the log(expression) being normally distributed, you could simply do glm(log(expression) ~ disease_condition + [A] + time. Then you only need to check the slope of [A] and its p-value, right? But since log(expression) is not normally distributed, we cannot use glm in R.

While in edgeR, things seem to be much more complicated. First, both time and [A] cannot be considered to be continuous (correct me if I am wrong); Second, I don't know how to do the DGE analysis.

For simplicity, let's forget about time in the problem -- just consider disease condition and [A] (0 & 1 mM), each with a replicate. Your gene count file looks like the following (T for tumor, N for normal; A0 for [A] = 0, A1 for [A] = 1; column 2~9 are gene counts for 8 samples, one column for one sample):

``````GENE T_A0 T_A0 T_A1 T_A1 N_A0 N_A0 N_A1 N_A1
GENE_A ...
GENE_B ...
...
``````

You will have four levels: T_A0, T_A1, N_A0, N_A1.

Your code snippet looks like the following:

``````y <- DGEList(counts=d, group=grp);
y <- calcNormFactors(y);
dsgn <- model.matrix(~0+grp); # not allowing intercept
y <- estimateDisp(y, dsgn);
fit <-glmFit(y, dsgn)
``````

The question is how to construct your test (or contrast). Assuming the levels are ordered as above, what would be your contrast? c(-0.5, 0.5, -0.5, 0.5) or something else? Or is there a better approach? Or am I terribly wrong?

rna-seq next-gen R • 1.8k views
modified 3.2 years ago • written 3.2 years ago by moxu460
2
Devon Ryan95k wrote:

FYI, you presumably mean `lm()` rather than `glm()`, since edgeR uses a glm (with a bit of empirical bayes added on for power).

Anyway, if you don't think there's any interaction between tumor status and the treatment then it's easier to allow an intercept:

``````foo = data.frame(status=factor(c("T","T","T","T","N","N","N","N")), treatment=factor(c(0,0,0,0,1,1,1,1)))
dsgn = model.matrix(~status + treatment, data=foo)
y = calcNormFactors(y); # global normalization
y = estimateDisp(y, dsgn);
fit = glmFit(y, dsgn)
lrt = glmLRT(fit, coef=3) # treatment1
``````

If you wanted to omit the intercept and instead use groups (as in your example), then the contrast is `c(-0.5, 0.5, -0.5, 0.5)`. That's easier to understand if one instead uses `makeContrasts((T_A1 + N_A1)/2 - (T_A0 + N_A0)/2), grp)` (i.e., "give me the average of the treated minus the average of the untreated)...but maybe that's just me.

In reality, you're probably interested in the interaction.

You are simply awesome! Knowledgable, humble, and willing to help! A golden combination. :)

A few more questions:

• Why interaction discourages the use of intercept? Is it a general statistics concept? I have not learned this in school.

• Why it's easier to allow an intercept? I am wrapping edgeR in a perl script. If an intercept is used, I need to figure out which group is used as the baseline, which would be one more thing to worry about.

• By agreeing upon contrast = c(-0.5, 0.5, -0.5, 0.5), you've approved my methodology?

• How about simply use the "glm.nb" (negative binomial glm) function in library "MASS" to do glm.nb(y ~ disease + [A] + time)? It seems to be much easier and it treats [A] as a continuous variable instead of a categorical one, and thus more powerful?

Thanks again.

2

Have you actually tried to use glm.nb() to analyze RNA-seq data? That would be much more complicated than edgeR as well being less powerful and increasing the false discovery rate.

Because glm.nb() does not use bayesian shrinkage so it's not as powerful. This part I can understand. But interns of ease of use, for glam.nb() you could simply do glm.nb(y ~ disease + [A] + time) or (y ~ factor(disease) + [A] + time) if you want to treat disease as a categorical variable, just like the way you use "glm". For edgeR, you will have to assign a group to each sample when you generate "model.matrix", you will have to worry about "levels", and be very careful about which "coef" corresponds to which category, and up till now, I am still trying to figure how to to deal with a continuous variable.

If edgeR is more powerful, that's a big enough reason to use it. I am just wondering why you guys think edgeR is easier. :) Sorry, no offence, just curious.

If you went the glm.nb() route you'd first need to `apply()` all of that and then figure out how to deal with the size factor normalization (I know for a fact that I'd screw that up a few times before getting it right). I guess I see edgeR as being just like glm.nb in terms of usage. glm.nb is just making a model matrix internally, which has its pros and cons (I assume one can also pass in a model matrix).

1
1. Using an intercept doesn't discourage using an interaction, in fact it generally makes it more straight forward to determine the interaction results.
2. Most biologists (myself included) were never taught the concept of a contrast matrix in the one statistics course that we took as undergrads. Since using an intercept puts everything in the ANOVA context in which most of us were taught statistics I've found it simpler for most people to use intercepts and simply test coefficients.
3. Yes
4. `glm.nb()` has much less power. You can treat concentration as continuous too in edgeR; in your example there was only one concentration of `A` so there was no difference between categorical/continuous. One downside to using a continuous variable for `A` is that it's saying that the response to `A` is linear with dose. Having spent my undergrad years working in pharmacology I've found that's typically not the case.

If your exemplar code, if you want to take "treatment" as continuous, would you use "treatment=c(0,0,0,0,1,1)" and omitting the "factor"? Also, would you do change anything else? For instance, would "coef=3" in the lrt line still hold?

Thanks!

1

Correct, I just have a habit of making everything a factor. I don't think anything else would need to be changed, you could look at the output matrix to double check that (namely, that the third coefficent (column name) is what you expect). Things become different when you add in more concentrations and there I would generally prefer not to use a continuous variable for concentration...since there's likely not a linear dependency (but if you happen to expect otherwise then go for it).

Finally get things working! Feel so happy!

However, still a bit confused about when interaction is involved:

• If the design matrix is ~ status * treatment, (assuming status is defined as categorical and treatment as continuous in the "foo" definition, which is a little different from your treatment definition, but I guess it's OK), you will see four terms in the model: ~ intercept + status + treatment + status:treatment. In such a case, if you use "coef=4", you are comparing the interaction term with intercept after offsetting the impact of status & treatment, right?

• If the design matrix is ~ 0 + status * treatment, you will see four terms in the model as well: ~ factor(status)T + factor(status)N + treatment + factor(status)T:treatment. If you want to find out the DE genes with status:treatment interaction, how could you construct your contrast (or coef is possible)? contrast = c(-0.5, 0, -0.5, 1)? We cannot use coef = 4, right? I tried several contrast compositions but the results are drastically different from those obtained through the method allowing intercept described above.

• BTW, what does logFC in the edgeR output mean for a continuous predictor variable? logFC seems to be a pair-wise comparison term and only applies to a categorical variable, so it does not make perfect sense to be applied to a continuous variable.

Thanks!

1
1. You're not comparing the interaction term, but rather the sum of `status` and `treatment` to what you actually see when both are combined. Interactions look at additional effects of combinations of two or more conditions/covariates.
2. That's a good way to confuse yourself and is the exact reason I encourage you to phrase things in terms of a factorial experiment, it makes it easier to ask the questions you want to. For what the actual contrast would be in that case, let's label your groups `A`, `B`, `C`, and `D`, in the order you presented them. Then an interaction is `(D - C) - (A - B) = D - C + B - A` (i.e., the difference of group differences), so `c(-1, 1, -1, 1)`...but that's making life more difficult than needed. In section 3.3.1 of the edgeR user guide they give an example of an experiment somewhat similar to your full one, so have a look at that.
3. For all GLMs (`glm.nb()`, edgeR, DESeq2, limma, etc.), log2FC for a continuous variable is relative to the "unit change" in that variable. As an example, suppose you've expose a cell line to some chemical at concentrations of 0mM, 1mM, 2.5mM, 5mM, ... 50mM and do `expression ~ concentration`. The log2FC is per millimolar change in concentration. This is important to keep in mind, since we're doing everything on computers and they don't have infinite precision. It's best to keep your units in some sort of biologically meaningful range, which is less an issue for concentrations and tends to be more of an issue for things like age or weight.

Your replies have always been very helpful. The most difficult part for me is contrast -- not sure how to construct one sometimes. However, my question today is not about contrast but about logFC.

If you model is: ~ A * factor(B), and you test "any" interaction, then you will have multiple logFC for different categorical values of B. If you need to do GeneSet Enrichment Analysis from Broad Institute, which requires ranking the genes from high score to low score so that up- or down- regulated genesets can be sorted out, then you need a metrics to rank the genes. Typically, I use FDR to filter genes in, and then rank them by logFC. A major benefit of using logFC is that it has signs which suggests up- or down- regulation. Since there are multiple logFC related to factor B, what metrics shall we use to rank the genes for GSEA? They don't always agree with each other in terms of signs (directions).

Thanks!

Each coefficient has its own biological meaning, so perhaps you want to do GSEA separately on each.

The 2nd bullet point, "contrast=c(1, -1, -1, 1)" generated results similar (but not identical) to those generated in the 1st bullet point.

But still, what does logFC mean for a continuous variable?