Question: edgeR: Correct pipeline for DE analysis with multiple conditions and batches
0
gravatar for misconstruction
3.2 years ago by
European Union
misconstruction20 wrote:

I'm using edgeR to test for DE genes in an expression matrix. For each sample if have a condition and a batch. I want to use the glm-functionality in edgeR to test for DE between conditions, while taking into account batches.

Some example data to show my problem. Say if have a count matrix, EM,  of 10 sample with the following labels:

EM = EM # Imagine matrix of counts here...

conditions = c("con1", "con1", "con1", "con2", "con2", "con2", "con3", "con3", "con3", "con3")

batches = c("batch1", "batch2", "batch3", "batch1", "batch2", "batch3", "batch1", "con2", "con3", "con3")

The pipeline could look like this:

dge = DGEList(counts=EM) # Create object

dge = calcNormFactors(dge, method='TMM') # Normalize library sizes using TMM

design = model.matrix(~0+conditions+batches) # Create design matrix for glm

colnames(design) = c(levels(conditions), levels(batches)[2:length(levels(batches))]) # Set prettier column names

dge = estimateGLMCommonDisp(dge, design) # Estimate common dispersion

dge = estimateGLMTagwiseDisp(dge, design) # Estimate tagwise dispersion

fit = glmFit(dge,design) # Fit glm

pair_vector = sprintf("%s-%s", "con1", "con3") # Samples to be compared
pair_contrast = makeContrasts(contrasts=pair_vector, levels=design) # Make contrast

lrt = glmLRT(fit, contrast=pair_contrast) # Likelihood ratio test

 

My questions:

1) The design matrix: There is no baseline conditions, so I remove the intersect with the 0+. Is this necessary to do for batches as well, even though they are not directly used as contrasts?

2) Does the glmFit take into account norm-factors for library sizes?

rna-seq edger de R • 6.9k views
ADD COMMENTlink modified 3.2 years ago by Devon Ryan73k • written 3.2 years ago by misconstruction20

Do you want to just compare "con3" and "con2" vs. "con1" or do all of the pairwise comparisons? Your setup is more targeted toward comparing things versus con1 and including an intercept rather than using contrasts.

ADD REPLYlink written 3.2 years ago by Devon Ryan73k

I want to test all pairwise comparisons between conditions (no condition can be considered baseline'). I omitted the for-loop for clarity.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by misconstruction20
4
gravatar for Devon Ryan
3.2 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

There are two ways to go about this. Firstly, either allow an intercept and then just use a contrast for the cond3 to cond2 comparison, or don't allow an intercept and group things different (e.g., group=sprint("%s.%s",batch,condition)).

If we just allow an intercept then:

design <- model.matrix(~conditions+batches)
dge = estimateGLMCommonDisp(dge, design)
dge = estimateGLMTagwiseDisp(dge, design)
fit = glmFit(dge,design)
lrt2vs1 <- glmLRT(fit, coef=2)
lrt3vs1 <- glmLRT(fit, coef=3)
lrt3vs2 <- glmLRT(fit,contrast=c(0,-1,1,0,0))

BTW, have a read through the edgeR user's guide, particularly section 3.4.2, which has an almost identical example (it's a really well done user's guide).

ADD COMMENTlink written 3.2 years ago by Devon Ryan73k

Thanks for the response! I would like to directly use the names of conditions when doing the testing, rather than using an Intercept column and coef - how would you go about that?

ADD REPLYlink written 3.2 years ago by misconstruction20

You can use names for the coef= argument, but I believe that contrast only takes numeric vectors and matrices. You should be able to use my example to determine how to do one without an intercept.

ADD REPLYlink written 3.2 years ago by Devon Ryan73k

Hi Devon

I found a interesting but naive staff.

When I  use 

design1 <- model.matrix(~conditions) # with Intercept column

d <- estimateDisp(d, design, robust=TRUE)

fit <- glmFit(d, design)

lrt <- glmLRT(fit)

and 

design2 <- model.matrix(~0+conditions) # without Intercept

e.g. design2 like
  conditionN conditionT
1          1          0
2          1          0
3          0          1
4          0          1

 

The design2 result is much more liberal than design1.

Why?

Which do you recommend? (I don't have muti-factor, the comparison is based on same timepoint)  .Thank you!

ADD REPLYlink modified 22 months ago • written 22 months ago by super20

They end up being identical, though you MUST explicitly specify the contrast with the second method.

ADD REPLYlink written 22 months ago by Devon Ryan73k

I got. design1  is 

lrt <- glmLRT(fit)

coef = last column(default),

design2 is sth like

lrt <- glmLRT(fit,contrast=(-1,1))

Am I right?

ADD REPLYlink written 22 months ago by super20

That looks correct at least

ADD REPLYlink written 22 months ago by Devon Ryan73k

I made a quick script for this.

pdacDGE <- function(group,ref.groups,comp.groups) {
  contrast <- c(0,0,0,0,0,0,0,0,0,0,0,0,0)
  for (i in 1:length(ref.groups)) {
    contrast[which(levels(group) == ref.groups[i], arr.ind = TRUE)] <- -(1/length(ref.groups))
  }
  for (j in 1:length(comp.groups)) {
    contrast[which(levels(group) == comp.groups[j], arr.ind = TRUE)] <- 1/length(comp.groups)
  }
  print(contrast)
  lrt<-glmLRT(glm.fit,contrast = contrast)
  tt<-topTags(lrt, n=500000, p.value = .05, sort.by = "logFC")
  print(contrast)
  print(length(rownames(tt)))
  return(tt)
}
ADD REPLYlink modified 7 months ago • written 7 months ago by dcrowe130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 704 users visited in the last hour