Question

Design and contrast matrices for including batch effect

0

Entering edit mode

3.3 years ago

kris275 ▴ 10

Hello everyone,

I am working on a RNA-seq project that includes samples from normal tissue, adenomas, adenocarcinomas, liver metastases and was done in five batches. As I have previously read, it is best if batch effect is included in the design matrix for linear model fitting- shown in limma guide and (http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html).

However, in the latter article, when they write contrasts, the lanes are not included, not accounted for if i understand correctly. A similar story is shown in limma guide chapter 17.1.5, where the DEGs are for the final coefficient.

I performed the analysis in two different ways, the first as such:

group2<- factor(c("N","A","N","A","N","A","N","A","N","A",
                 "C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N",
                 "C","L","N","C","L","N","C","L","N","C","C","C",        "N","L","N","C","L","N","C","L","N","C","L","N","C","L","N","L","N","L","N","L","N","C","L","N","C","L","N","C","L","N","L","N","C","L","N","C","L","N","C","L","N","L","N","C","N","C","L", 
  "N","A","N","A","N","A","N","A","N","A","N","A","N","A","N","A"),levels = c("N","A","C","L"))

lane2 <- factor(c(1,1,1,1,1,1,1,1,1,1,
                 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
                 3,3,3,3,3,3,3,3,3,3,3,3,
                 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
                 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5))

design <- model.matrix(~0+group2+lane2) 

cont.matrix <- makeContrasts(NvsA=N-A, 
                             NvsC=N-C,
                             NvsL=N-L,
                             AvsC=A-C,
                             AvsL=A-L,
                             CvsL=C-L,
                             levels=colnames(design))

The contrast matrix is then

Contrasts
Levels NvsA NvsC NvsL AvsC AvsL CvsL
    N     1    1    1    0    0    0
    A    -1    0    0    1    1    0
    C     0   -1    0   -1    0    1
    L     0    0   -1    0   -1   -1
    L2    0    0    0    0    0    0
    L3    0    0    0    0    0    0
    L4    0    0    0    0    0    0
    L5    0    0    0    0    0    0

and the results include DEGs for NvsA NvsC NvsL AvsC AvsL CvsL. If I understand correctly, this does not account for batch effect and is similar to what the rnaseq-123 article from the link above did?

The other option is similar to the answer to this question https://support.bioconductor.org/p/69374/. I used an interaction type design and averaged the cases from which batch each sample came from and only included the interactions that had a sample inside (otherwise linear model did not work as coefficients were not calculable):

designy1 <- model.matrix(~0+group2:lane2)
colnames(designy1)
colnames(designy1) <- c("N1","A1","C1","L1","N2","A2","C2","L2","N3","A3","C3","L3","N4","A4","C4","L4","N5","A5","C5","L5")
designy2 <- designy1[,c(1:2,5,7,9,11:13,15:18)]

The contrast matrix then looked like this

contr.matrixy <- makeContrasts(NC= (N1+N2+N3+N4+N5)/5-(C2+C3+C4)/3, 
                               nVSa= (N1+N2+N3+N4+N5)/5-(A1+A5)/2, 
                               AVSC= (A1+A5)/2-(C2+C3+C4)/3,
                               NVL= (N1+N2+N3+N4+N5)/5-(L3+L4)/2,
                               CVL= (C2+C3+C4)/3-(L3+L4)/2,
                               AVL=(A1+A5)/2-(L3+L4)/2,
                               levels=colnames(designy2))

I then as standard used voom normalisation, linear model fitting, etc to get to DEGs.

vy <- voom(countsbindlist4I,design = designy2, plot=TRUE)
fity <- lmFit(vy, designy2)
fit2y <- contrasts.fit(fity, contr.matrixy)
efity <- eBayes(fit2y)
dt<- decideTests(efity)
summary(dt)

Is any of the solutions I used okay, or did I miss the mark with the second one? Thank you all for your help and insights.

Best regards

R contrasts RNA-Seq design batch-effect • 1.1k views

ADD COMMENT • link updated 3 months ago by Ram 43k • written 3.3 years ago by kris275 ▴ 10

score 1 · Answer 1 · 2021-01-14

1

Entering edit mode

3.3 years ago

swbarnes2 14k

Illumina lanes generally do not add technical variance, so you shouldn't include them in the design.

ADD COMMENT • link 3.3 years ago by swbarnes2 14k

0

Entering edit mode

Dear swbarnes2,

thank you for your reply. In this case the experiment was done in different batches, the batches were included as "lanes". I could have clarified this better with another variable name, but the batch to batch variability is definitely there- visible with dendrograms and mds.

All help still wanted :)

Thank you all.

ADD REPLY • link 3.3 years ago by kris275 ▴ 10