Question: edger model matrix design of linear model
2
gravatar for teabonng
10 months ago by
teabonng20
teabonng20 wrote:

Hi,

I am trying to do differential expression based on RNA-seq data. I only have one variable, which is the timepoint. I want to analyze gene expression relative to the 0 h time point. I already set up the contrasts. Now, I am trying to set up the design matrix. What is the difference between

design <- model.matrix(~timepoint)

and

design <- model.matrix(~0 + timepoint).

There is a big difference in the results.

ADD COMMENTlink modified 10 months ago • written 10 months ago by teabonng20

I don't see a difference. I could be wrong.

Why not try both and compare the results?

ADD REPLYlink modified 10 months ago • written 10 months ago by goodez460

Hi, there is a big difference in the results. The results using the design <- model.matrix(~timepoint) makes much more biological sense.

ADD REPLYlink written 10 months ago by teabonng20

Yes I don't think model.matrix(~0 + timepoint) makes much sense. How can you use zero as a term in your model to explain differences?

ADD REPLYlink modified 10 months ago • written 10 months ago by goodez460
4
gravatar for andrew.j.skelton73
10 months ago by
London
andrew.j.skelton735.6k wrote:

The clue here is to look at the coefficients, or columns derived in the design matrix. Your non intercept model design, will create a coefficient that describes the difference between the two levels of your timepoint variable. Your non-intercept model will create a coefficient for each level. Here's an example using your non-intercept model:

library(tidyverse)
library(limma)
# Make a Phenotype Table
foo       <- data.frame(SampleType = paste0(rep(c("A","B"), each = 3)), 
                        Reps       = rep(1:3, 2)) %>%
             mutate(Names = paste0(SampleType,Reps))
# Make some fake gene expression data
genes.foo <- matrix(rnorm(6*100),ncol = 6) %>% 
             `rownames<-`(paste0("Gene_",1:100)) %>% 
             `colnames<-`(foo$Names)

#non-intercept model
model.foo <- model.matrix(~0 + SampleType, data = foo) %>% 
             `colnames<-`(gsub("SampleType","",colnames(.)))

model.foo
            A           B
1           1           0
2           1           0
3           1           0
4           0           1
5           0           1
6           0           1


conts.foo      <- c("A_Vs_B" = "A-B")
contmap.foo    <- makeContrasts(contrasts = conts.foo, levels = colnames(model.foo)) %>% 
                  `colnames<-`(names(conts.foo))
fit.foo     <- lmFit(genes.foo, model.foo) %>% 
               contrasts.fit(contmap.foo) %>% 
               eBayes


> fit.foo$coefficients %>% head
        Contrasts
             A_Vs_B
  Gene_1 -0.6722254
  Gene_2 -1.1571439
  Gene_3 -0.1021150
  Gene_4  0.8687436
  Gene_5 -0.9600460
  Gene_6  1.3139228

topTable(fit.foo, coef = "A_Vs_B", number = Inf, p.value = 0.05, lfc = log2(1.5))

Alternatively, here's your intercept model

model.foo <- model.matrix(~SampleType, data = foo)

fit.foo     <- lmFit(genes.foo, model.foo) %>% 
               eBayes

> model.foo
  (Intercept) SampleTypeB
1           1           0
2           1           0
3           1           0
4           1           1
5           1           1
6           1           1

# Here SampleTypeB is actually the difference relative to the first level (SampleTypeA)
> fit.foo$coefficients %>% head
       (Intercept) SampleTypeB
Gene_1  -0.2920748   0.6722254
Gene_2  -1.0356549   1.1571439
Gene_3   0.2477702   0.1021150
Gene_4   0.5016585  -0.8687436
Gene_5   0.3153554   0.9600460
Gene_6   0.2836491  -1.3139228

topTable(fit.foo, coef = "SampleTypeB", number = Inf, p.value = 0.05, lfc = log2(1.5))

edit: Changed the data frame initialisation.

ADD COMMENTlink modified 10 months ago • written 10 months ago by andrew.j.skelton735.6k

in line 4, it should be: foo <- data.frame(SampleType = paste0(rep(c("A","B"), each = 3)), Reps = rep(1:3, 2), Names = paste0(rep(c("A","B"), each=3), rep(1:3, 2) ) )

set.seed() before generating the data will be helpful.

ADD REPLYlink modified 10 months ago • written 10 months ago by Zhilong Jia1.4k

You're right, the data frame initialisation was wrong. I mostly wrote this as pseudocode and ran bits in R...I probably should have written that as a disclaimer. Setting the seed has benefits in reproducibility, but as it's proof of concept, and dummy data, it doesn't really matter here.

ADD REPLYlink written 10 months ago by andrew.j.skelton735.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour