Batch correction in DESeq2 - limited impact for unsupervised downstream analyses
Entering edit mode
3 months ago
Jane • 0


I need to normalize gene expression (RNASeq) data to perform downstream analyses on it:

  1. unsupervised deconvolution using ICA
  2. pathway quantification using decoupleR. I want to see which pathways are important in my cohort and if some pathways are specific to subgroups of clinical features (histology, stage, ...).

Actually, to perform unsupervised deconvolution, I do not have to worry about batch effect since ICA should capture a batch in one component. But I have to pay attention to this to perform pathway quantification.

In my dataset, I have a strong batch effect since 2/3 of the samples were captured using the Kapa kit and the ones which failed were captured using the Takara kit.

I used DESeq2 in several studies some years ago to perform DEA and data visualization ago but, I did not have to remove a batch. I would appreciate advice to get an expression matrix of the normalized counts, corrected for this batch effect. I discussed already the input needed for decoupleR here. It seems possible to use normalized, log normalized or vst counts.

My first idea was to get corrected log normalized counts but it does not seem possible. If I am correct, it is impossible to simply put the batch in the design of the model to do that, since the normalized counts are not affected by the model.

Then, I saw the following solution: after normalization and dispersion estimation, use vsd and limma::removeBatchEffect. I read that it should be better to use vst(dds, blind = FALSE).

Using blind=FALSE, the sample information provided in the design formula are used.

I wanted first to provide only the batch information in the design formula. Nevertheless, to explain the counts, it seems better to provide the main clinical information (gender, histology, grade, age, ...) and batch.

My question is the following: if I introduce all these variables, can I somehow introduce a bias / a signal? Is it the best strategy to perform an unsupervised pathway analysis afterwards?

I am not sure my concern is very clear, but the idea is that I want to provide input data to decoupleR on which there is no batch effect and no effect added artificially when computing vst.

If it is the way to go, could you recommend "how many" variables should be used in the model? I have at least 6 of them, but from the unsupervised deconvolution, I see that some have no effect. Should I use only the 2 - 3 - 4 with the highest effect (+ the batch)?

Thanks in advance for your feedback

deseq2 • 727 views
Entering edit mode
3 months ago

Try ComBat-seq in the sva package. It adjusts the actual counts rather than just accounting for batch effects in the model. It also lets you specify biological variables with signal you'd like to retain. I've found it does a decent job.

As for "how many" variables should be used, that really depends on the data. Use the ones with the largest effect and see how things look pre/post correction and go from there.

Entering edit mode

jared.andrews07, thank you for your feedback anf advice! I have never tried ComBat-seq but have heard about. I will try, thanks. Out of curiosity, why would you recommend ComBat-seq over limma::removeBatchEffect here? Is it to apply a minimum set of changes on the data? (I guess vst transformation is not needed). Or is it for the batch correction by itself? Thanks in advance

Entering edit mode

Is it to apply a minimum set of changes on the data? (I guess vst transformation is not needed).

Mostly this. It adjusts the raw counts, so you have more options downstream given you get integer counts back out. In terms of the actual correction, I've never used limma's function, so I can't really comment on that. I presume it probably works fine.

Entering edit mode

Ok, thanks a lot for your feedback.

Entering edit mode

Hello jared.andrews07

Sorry, I have 2 more questions for you. I am now testing ComBat-seq.

A. I wonder if, prior batch correction, it is useful to remove genes that are all 0 in one of the batches. Actually, this question holds for any batch correction method. In my experiment, ~40 samples were sequenced using the Takara kit and ~80 using the Kapa kit. ~5000 genes are 0 for all samples sequenced with the Takara kit and ~4000 genes are 0 for all samples sequenced with the Kapa kit. (There are only ~2000 genes that are 0 for all samples)

Is it usefull to perform this filtering prior to batch correction or do these genes actually help for the batch correction and should be kept in the matrix?

B. After removing the genes that are all 0 in one of the batches, I ran:

adjusted_data <- sva::ComBat_seq(counts=counts(dds), batch = df_condition$kit, covar_mod = df_condition[,-4])

Found 2 batches
Using null model in ComBat-seq.
Adjusting for 6 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data

then, I want to use DESeq2 to perform normalization:

dds <- DESeqDataSetFromMatrix(countData = adjusted_data,
                              colData =  df_condition[,-4], 
                              design = ~MainGroups_histology + gender + simple_stage)

but I get the following error:

converting counts to integer mode
Erreur dans validObject(.Object) : 
  objet de classe “DESeqDataSet” incorrect: NA values are not allowed in the count matrix

There is no NA in adjusted_data:


Do you know what could be the issue?

Thanks in advance for your help

Entering edit mode

I just solved my issue. As pointed out in other discussions, some gene expressions were very high (e12, even e16). For those you would have the same issue, I was more stringent on the genes filtered before applying Combat-Seq, requiring more reads to keep a gene and this solved the error, as the gene expressions decreased significantly.


Login before adding your answer.

Traffic: 1602 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6