Question: Batch correction in DESeq2
gravatar for Arindam Ghosh
9 months ago by
Arindam Ghosh300
Arindam Ghosh300 wrote:

For RNA-seq data analysis using DESeq2, a recommended method for batch effect removal is to introduce the batch in the design of the experiment as design = ~ batch + condition.

The presence of batch was already known from experiment design and also detected by PCA biplot on the log transformed raw counts. Post DESeq2 how can we assure that the batches have been correctly taken care of? PCA plot on counts(dds, normalized=TRUE) did not how any changes.

Correcting the batch using ComBat brings out the desired grouping, however, I understand the output from Combat should not be used with DESeq2. Is there any other suggested tool for differential gene expression analysis post batch correction using ComBat?

batch effect combat deseq2 • 1.2k views
ADD COMMENTlink modified 9 months ago by Kevin Blighe63k • written 9 months ago by Arindam Ghosh300
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe63k
Kevin Blighe63k wrote:

You could include batch in the design formula, as you have done, and this will then be 'accounted for' (the statistical inferences will be adjusted for batch) when you perform a differential expression analysis testing for condition. batch is essentially treated as a covariate in the regression model, just as we do in association studies for sex, smoking status, BMI, etc.

If you then want to directly remove the batch effect for downstream analyes, you would do it on the variance-stabilised or rlog expression values using limma::removeBatchEffect(). There was a recent question on Bioconductor: (follow the link to the DESeq2 vignette)

ADD COMMENTlink modified 9 months ago • written 9 months ago by Kevin Blighe63k

Hello Kevin,

I am one of the confused ones regarding removing batch effects. How is 'accounted for' different from 'removal'? When is it correct to use the 'accounted for' (i.e. using the ~batch+cond formula) approach and when is it correct for removal of the batch effect? Thank you for shedding light on these dark areas.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by jfo20

Hello jfo, when we 'account for' the batch effect, batch is being treated as a covariate / confounder in the statistical model. In regression, when we want to adjust for confounders, say smoking status, we usually include them like this in our design formula:

~ gene + smokingStatus

Then, when we derive our test statistics for gene, we known that these test statistics will have been adjusted for the effect of smoking status. The same logic applies when we are adjusting for batch and when we are designing the formula for DESeq2, Limma, EdgeR, etc.

However, in none of this is the underlying data modified. We are merely modifying / adjusting our statistical inferences based on smokingStatus, batch, etc.

So, accounting for batch in this way is only really plausible when we are performing a differential expression analysis, in the case of expression studies.


The above is all well and good until we actually want to take our data and use it downstream. It would make no sense to perform 'machine learning' stuff if we knew that there was a batch effect in the data. The downstream program may have a way to account for batch just like the above, but usually not. In this scenario, we have to actually aim to modify our underlying data. This usually involves estimating the effect of batch and then directly adjusting the underlying data so that the batch effect is effectively wiped away.

ADD REPLYlink written 5 weeks ago by Kevin Blighe63k

Thanks @Kevin, I do like your way of clarifying issues! I appreciate if you answer my question. I don't want to do differential expression analysis, I just want to normalize my RNAseq data and then do some clustering and classification .. Based on your explanation I realized after adding the batch in the design formula, I also have to remove the batch effect, right?

ADD REPLYlink written 5 weeks ago by Rahil150

after adding the batch in the design formula, I also have to remove the batch effect, right?


here is a code that I tried for clustering on DESeq2 normalized counts:

dds <- DESeqDataSetFromMatrix(countData=counts, colData=factors, design = ~ Batch + Covariate)
dds <- DESeq(dds)
vsd <- vst(dds, blind=FALSE)
mat <- assay(vsd)
mat <- limma::removeBatchEffect(mat, vsd$Batch)
assay(vsd) <- mat
counts_batch_corrected <- assay(vsd)
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Arindam Ghosh300
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1072 users visited in the last hour