Question: Batch effect in DESeq2 - multiple factor or SVA?
1
gravatar for Chris Gene
3.6 years ago by
Chris Gene60
United Kingdom
Chris Gene60 wrote:

I have RNASeq data that I am analysing using DESeq2. I have 2 genotypes (WT and mutant). The RNASeq was run in 3 seperate batches (Batch1, 2 and 3; all batches containd some WT and mutant samples). I ran DESeq using design=~genotype + batch, releveled to genotype=WT. I then tried including surrogate variables based on the batch using SVA as follows below. I get a different number of DE genes in each analysis.

Whe I run a PCA and sample distances, the samples cluster more by batch than by genotype (particulalrly the 3rd batch where I know the quality of the RNA was poor).

I'm not sure which results I should be actually using. It seems that using SVA is a better way to exclude technical variation. Am I correct? When applying SVA, should the batch be included in the original design, or just the genotype?

I appreciate any help! Thanks in advance!

 

dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ genotype, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
# output:
# Number of significant surrogate variables is:  2

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + genotype

ddssva <- DESeq(ddssva)
ressva <- results(ddssva,contrast=c("genotype","HET","WT"))

 

 

rna-seq deseq2 sva • 6.9k views
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Chris Gene60
2
gravatar for vivekbhr
3.6 years ago by
vivekbhr510
Germany
vivekbhr510 wrote:

Update: After looking at your colData.

Create matrix for svaseq like this:

mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

Then do unsupervised svaseq 

svseq <- svaseq(dat, mod, mod0, n.sv=2)

Now use this as input to DESeq design, along with genotype

dds$SV1 <- svseq$sv[,1]
dds$SV2 <- svseq$sv[,2]
design(dds) <- ~ genotype + SV1 + SV2 

I wouldn't expect as much difference as you told, if you implement both methods properly..

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by vivekbhr510
1
gravatar for vivekbhr
3.6 years ago by
vivekbhr510
Germany
vivekbhr510 wrote:

Estimating batches by svaseq followed by providing covariates in the DESeq design is expected to perform either equally good or better than adding batch as a factor in glm.. So design matrix for deseq could be something like:

model.matrix(as.formula(paste0("~ genotype",svseq))

where svseq is the output from svseq. Then you can continue with DESeq ..

Different number of significant genes is expected since the covariate estimation in different in svseq vs blocking design (multifactor glm). I wouldn't expect too much difference though..

ADD COMMENTlink written 3.6 years ago by vivekbhr510
0
gravatar for Chris Gene
3.6 years ago by
Chris Gene60
United Kingdom
Chris Gene60 wrote:

Thank you for your reply.

As I understand, you are suggesting I use "~ genotype",svseq instead of genotype, colData in the model matrix as I did above?

The different number is something like 40 to 400. The numbers aren't huge to start, but it still is a 10 fold difference. Is that unreasonable?

Many thanks for your help

best wishes

Chris

 

ADD COMMENTlink written 3.6 years ago by Chris Gene60
0
gravatar for Chris Gene
3.6 years ago by
Chris Gene60
United Kingdom
Chris Gene60 wrote:

Hi Vivek,

 

Sorry to bother again, but when I try to do the above it won't let me add svseq to the design.

error:

> design(ddssva) <-  formula(~ genotype + svseq)
Error in validObject(object) :
  invalid class “DESeqDataSet” object: all variables in design formula must be columns in colData

 

My colData contains the information on sample, genotype and batch:


     row.names    sample    genotype    run
1    WT1    WT1    WT    run1
2    WT2    WT2    WT    run1
3    WT3    WT3    WT    run1
4    WT5    WT5    WT    run2
5    WT6    WT6    WT    run2
6    WT9    WT9    WT    run3
7    HET1    HET1    HET    run1
8    HET2    HET2    HET    run1
9    HET3    HET3    HET    run1
10    HET4    HET4    HET    run2
11    HET5    HET5    HET    run3
12    HET7    HET7    HET    run3

ADD COMMENTlink written 3.6 years ago by Chris Gene60
1

Yes, since I assumed a very simple colData design for demo. I updated my answer now. But you don't need to put "sample" as a column in there.. You wouldn't need even the "Run" column if you are estimating batches in unsupervised way.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by vivekbhr510

Thank you Vivek. Very helpful.

ADD REPLYlink written 3.6 years ago by Chris Gene60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 645 users visited in the last hour