Hi Everyone
I’m working with bulk RNA-seq data and using the limma-voom pipeline. I want to correct for batch effects after voom transformation, and then perform differential expression analysis (DEG).
dge <- calcNormFactors(dge) # Normalizes for library size
design <- model.matrix(~ group, data = metadata) # Only includes group; batch not included here
v <- voom(dge, design) # Converts to log2-CPM and estimates mean–variance trend
# Remove batch effect after voom using linear model
v$E <- removeBatchEffect(v$E, batch = batch, design = model.matrix(~ group, data = metadata))
# v$E can now be used for visualization (PCA, heatmaps, etc.)
# DEG analysis
design2 <- model.matrix(~ group, data = metadata)
fit <- lmFit(v$E, design2)
fit <- eBayes(fit)
results <- topTable(fit, coef = "groupTreatment", number = Inf)
My questions: Is this pipeline statistically valid for DEG analysis?
Does removeBatchEffect() preserve the group differences after correction?
Is this approach preferred over including batch in the design matrix directly?
I’ve read that removeBatchEffect() used for DEG in some cases so I’d appreciate clarification or recommended best practices.
Thank you, Sumeet
TLDR: You don't need to run removeBatchEffects, because voom will automatically remove the batch effects if you tell it about them.
Hi Ian,
Thank you for your response.
Hi ATpoint and i.sudbery
Thank you for the helpful explanation, I really appreciate the clarity.
Following up:
I used removeBatchEffect() on voom$E using a broad group variable (e.g., Healthy, Disease, Treated, etc.) to generate a batch-corrected expression matrix.
Now, I’d like to reuse this corrected v$E to perform multiple DEG analyses based on other metadata columns that were not included in the original design. For example:
Responder vs Non-Responder (from a Response_Status column)
Inflamed vs Uninflamed (from an Inflammation_Status column)
Comparisons across treatment arms (e.g., Drug A vs Drug B)
My plan is:
Reuse the already batch-corrected v$E matrix.
For each comparison, define a new design matrix based on the relevant metadata.
Run lmFit(v$E, design) and eBayes() to extract differential expression.
Is this a statistically sound approach, or should I explicitly model batch again in each new DEG contrast?
Thanks again for your time and insights!
You should explicitly model batch in each new DEG contrast.