Question

Batch effect in DESeq2 - multiple factor or SVA?

2

Entering edit mode

8.6 years ago

Chris Gene ▴ 80

I have RNASeq data that I am analysing using DESeq2. I have 2 genotypes (WT and mutant). The RNASeq was run in 3 separate batches (Batch1, 2 and 3; all batches contained some WT and mutant samples). I ran DESeq using design=~genotype + batch, re-leveled to genotype=WT. I then tried including surrogate variables based on the batch using SVA as follows below. I get a different number of DE genes in each analysis.

When I run a PCA and sample distances, the samples cluster more by batch than by genotype (particularly the 3rd batch where I know the quality of the RNA was poor).

I'm not sure which results I should be actually using. It seems that using SVA is a better way to exclude technical variation. Am I correct? When applying SVA, should the batch be included in the original design, or just the genotype?

I appreciate any help! Thanks in advance!

dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ genotype, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
# output:
# Number of significant surrogate variables is: 2

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + genotype

ddssva <- DESeq(ddssva)
ressva <- results(ddssva,contrast=c("genotype","HET","WT"))

DESeq2 SVA RNA-Seq • 13k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.6 years ago by Chris Gene ▴ 80

0

Entering edit mode

Thank you for your reply.

As I understand, you are suggesting I use ~ genotype,svseq instead of genotype, colData in the model matrix as I did above?

The different number is something like 40 to 400. The numbers aren't huge to start, but it still is a 10 fold difference. Is that unreasonable?

Many thanks for your help

best wishes
Chris

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by Chris Gene ▴ 80

0

Entering edit mode

Hi Vivek,

Sorry to bother again, but when I try to do the above it won't let me add svseq to the design.

error:

> design(ddssva) <- formula(~ genotype + svseq)
Error in validObject(object) :
  invalid class "DESeqDataSet" object: all variables in design formula must be columns in colData

My colData contains the information on sample, genotype and batch:

     row.names    sample    genotype    run
1    WT1    WT1    WT    run1
2    WT2    WT2    WT    run1
3    WT3    WT3    WT    run1
4    WT5    WT5    WT    run2
5    WT6    WT6    WT    run2
6    WT9    WT9    WT    run3
7    HET1    HET1    HET    run1
8    HET2    HET2    HET    run1
9    HET3    HET3    HET    run1
10    HET4    HET4    HET    run2
11    HET5    HET5    HET    run3
12    HET7    HET7    HET    run3

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by Chris Gene ▴ 80

1

Entering edit mode

Yes, since I assumed a very simple colData design for demo. I updated my answer now. But you don't need to put "sample" as a column in there.. You wouldn't need even the "Run" column if you are estimating batches in unsupervised way.

ADD REPLY • link 8.6 years ago by vivekbhr ▴ 690

0

Entering edit mode

Thank you Vivek. Very helpful.

ADD REPLY • link 8.6 years ago by Chris Gene ▴ 80

Ram · Answer 1 · 2015-09-02

Update: After looking at your colData.

Create matrix for svaseq like this:

mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

Then do unsupervised svaseq

svseq <- svaseq(dat, mod, mod0, n.sv=2)

Now use this as input to DESeq design, along with genotype

dds$SV1 <- svseq$sv[,1]
dds$SV2 <- svseq$sv[,2]
design(dds) <- ~ genotype + SV1 + SV2

I wouldn't expect as much difference as you told, if you implement both methods properly..

Ram · Answer 2 · 2015-09-02

Estimating batches by svaseq followed by providing covariates in the DESeq design is expected to perform either equally good or better than adding batch as a factor in glm.. So design matrix for deseq could be something like:

model.matrix(as.formula(paste0("~ genotype",svseq))

where svseq is the output from svseq. Then you can continue with DESeq ..

Different number of significant genes is expected since the covariate estimation in different in svseq vs blocking design (multifactor glm). I wouldn't expect too much difference though..