Help with linear regression model design
2
0
Entering edit mode
4.3 years ago
mforde84 ★ 1.3k

Hi,

I have RNAseq for some liver metastases which have been classified into three independent groups based upon their expression profiles.

I did pairwise differential expression analyses on these groups using both DESeq and limma, e.g., grp1 - grp2, grp1 - grp3, and grp2 - grp3. There were no significant DE following multiple testing corrections for either of the methods (not necessarily unexpected).

I then tried a GLM analysis with limma's lmFit() using the following design, e.g.,:

groups <- factor(c(1,1,1,2,2,2,3,3,3))
design <- model.matrix(~ 0 + groups)
contrasts <- makeContrasts(group1, group2, group3, group1 - group2, group1 - group3, group2 - group3, levels=design)
fit <- lmFit(data, design)
fit2 <- lmFit(fit, contrasts)
fit2 < eBayes(fit2)


When looking at results, all of the group by group contrasts (e.g., group1 - group2) are consistent with the pairwise tests showing no significantly differential expressed genes. However, when I look at the individual group coefficients, e.g.,:

topTable(fit2, coef="group1")


Nearly every gene is strongly differentially expressed. For example, for group1 nearly 90% of gene annotations showed adjusted pvalues less than 0.05. So I've come to the conclusion that I'm misinterpreting what these individual group coefficients mean, that my design is incorrect, and that I'm not reporting the right statistics. After all, how can group1 not be statistically different from group2 or group3, but strongly different from the combination of group2 + group3?

limma differential expession • 1.9k views
3
Entering edit mode
4.3 years ago
Steven Lakin ★ 1.5k

It's nice that you were able to see that the pairwise analyses in DESeq and limma are equivalent to regression and then fitting on the contrasts by hand; they are exactly the same process, only DESeq and limma handle them for you in the background if you specify the model in the way you originally did it, therefore they should give the same results.

For the single group analysis, in order to understand why "nearly every gene is strongly differentially expressed," we have to look at what exactly we are doing by "fitting contrasts." In short, there is a reason why contrasts are constructed pairwise: they are meant to sum to zero (the first term is represented by a 1 in a vector and the second term by a -1 in the same vector). From an algebraic sense, what we are doing is subtracting all data described by the second term from all data described by the first term. If the first term effect is larger than the second term, we end up with a positive number; if the second term effect is larger than the first, we end up with a negative number. If the difference between the two terms is large, we have a significant effect difference (differential gene expression); if the difference between the two terms is small, we have an insignificant effect difference.

However, if we only set one term equal to 1 (as you did by calling "group1"), all we are doing is subtracting nothing from that term, so we end up with all of the data modeled by group1. By definition, if these genes are different from zero, they will be significantly expressed, since we are comparing them to zero gene expression.

Here is a table that shows how the vectors for contrasts look like (vectors are rows, they sum to zero, read more on wikipedia if interested):

g1   g2   g3   g4
1   -1    0    0
0    0    1   -1
1    1   -1   -1


Row 1 is: g1 - g2

Row 2 is: g3 - g4

Row 3 is: avg(g1 + g2) - avg(g3 + g4)

In contrast, by calling "group1", it probably constructed this vector:

g1    g2    g3
1     0     0

0
Entering edit mode

Ah, that makes sense. So essentially I'm just looking at an intercept for each group.

One more follow up if you don't mind.

In order to capture the combined group contrasts (e.g., between g1 - (g2 + g3)) I've tried creating multiple additional factors which have the appropriate group memberships:

groups <- factor(c(1,1,1,2,2,2,3,3,3))
g1 <- factor(c(1,1,1,2,2,2,2,2,2))
g23 <- factor(c(2,2,2,1,1,1,1,1,1))
g2 <- factor(c(2,2,2,1,1,1,2,2,2))
g13 <- factor(c(1,1,1,2,2,2,1,1,1))
g3 <- factor(c(2,2,2,2,2,2,1,1,1))
g12 <- factor(c(1,1,1,1,1,1,2,2,2))
design <- model.matrix(~ 0 + groups + g1 + g23 + g2 + g13 + g3 + g12)
contrasts <- makeContrasts(group1 - group2, g1 - g23, levels=design)
fit <- lmFit(data, design)


However, I get an error from lmFit saying the coefficients for the new factors are not calculable. I assume this is because they are dependent. I'm not sure how to set up the contrasts without them though. The only time I got reasonable results was when I used one binary factor for a single comparison that included an intercept:

g1 <- factor(c(1,1,1,2,2,2,2,2,2))
design <- model.matrix(~ 1 + g1)


I'm not sure his is a correct design either. What am I overlooking with respect to model.matrix and makeContrasts?

0
Entering edit mode

A very nice and clear answer indeed

0
Entering edit mode
g1    g2    g3
1     0     0


Yes, that is correct. I tried:

g1    g2    g3
1     -1    -1


But that gave incorrect results. As you noted before it doesn't sum to 0.

0
Entering edit mode

I figured it out. To do the comparison you would do something like this:

makeContrasts(group1 - (group2 + group3) / 2, ....)

Thanks again for the help.

0
Entering edit mode
4.3 years ago
mforde84 ★ 1.3k

sorry meant as comment

0
Entering edit mode

What you did in your very first design (not this answer), where you found no DE genes, was correct. What you've done above is incorrect; stick with your intuition on this one. It's not uncommon for groups to be insignificant on DE analysis, especially with cancer expression analysis.

0
Entering edit mode

I agree that the pairwise test is much more straightforward both in terms of implementation and even interpretation, though my bosses are specifically asking for the contrasts between say g1 - (g2 + g3). I don't think the results will really change but I just do what they tell me to do and report the results. I mean it's not that big a deal for me to run 3 separate analysis for each combination comparison, I was just curious if there was a more streamlined way of doing this with one test. Either way, I appreciate the help on this.