Batch effects : ComBat or removebatcheffects (limma package) ?
1
6
Entering edit mode
7.3 years ago
lessismore ★ 1.4k

Hey all,

Reading from many sources about how to correct for known batch effects present in a data expression compendium, I am now willing to understand which kind of batch effect correction method to apply. For this, I am trying 2 options:

  1. removeBatchEffects function in Limma package
  2. ComBat

Both programs adjust the dataset for known sources of variation that have to be supplied as a "batch" vector.

From what I understood with Limma, it's possible to retain the biological expected variation with the "covariates" option. "The covariates argument allows correction for one or more continuous numeric effects, similar to the analysis of covariance method in statistics."

Two questions:

  1. Could someone explain me what does this sentence mean?
  2. What's the difference between the 2 and why/when using one of the two programs?

Thanks in advance

limma sva Combat batch-effect • 47k views
ADD COMMENT
28
Entering edit mode
7.0 years ago

In order to better understand the batch correction methods, I highly recommend reading the following manuscript two or three times: https://academic.oup.com/biostatistics/article/17/1/29/1744261/Methods-that-remove-batch-effects-while-retaining

The conclusion that you should get from reading this is that correcting for batch directly with programs like ComBat is best avoided. If at all possible, include batch as a covariate in all of your statistical models / tests. Programs like DESeq2 allow you to include batch (like any other factor) as a covariate during normalisation.

Kevin



Edit 1: 17th June 2018 ...and again in the early hours of October 8, 2019

It should be pointed out that there is a difference between modelling a batch effect and directly modifying data in order to counteract a batch effect. Programs like ComBat aim to directly modify your data in an attempt to eliminate batch effects (it literally 'subtracts' out the modeled effect, which can result in the infamous negative values after employing ComBat). After employing ComBat, statistical tests are conducted on the modified data, with batch not appearing in the design formula.

However, by including 'batch' as a covariate in a design formula for the purpose of differential expression analysis, one is simply modeling the effect size of the batch, without actually modifying your data. The modeled effect size is then taken into account when statistical inferences are made.

If you feel that you still need to correct for batch in your workdlow, then follow the guidance, here: Why after VST are there still batches in the PCA plot?.


Edit 2: 17th June 2018.

Between ComBat and the removeBatchEffect() function from limma, I would use the limma function.


Edit 3: 15th October 2019.

A further answer given here: Batch correction by using ComBat function in sva package


Edit 4: 28th April 2021.

I have learned that ComBat-Seq is now available, which is designed for batch-adjustment of bulk RNA-seq raw counts - this may be useful. Please see: https://github.com/zhangyuqing/ComBat-seq

Previously, [crazy] people had been using the original ComBat —which was designed on array data— for batch adjustment of raw counts.

ADD COMMENT
5
Entering edit mode

I agree to what Kevin is proposing. Apriori deciding on the batch is really not a good idea. I know people will beg to differ but you will tend to lost true signals as well or fit your data in a way that will have drop-outs. I would try to work out with the sva package and find out the total number of covariates and then use them as factor levels in the statistical model which should be taken care by DESeq2. I have done that already with DESeq2, I have to go back and check its compatibility with limma and edgeR and I am sure one can do that. Another problem is batch effects do not get fully nullified, you are trying to reducing its effect, for projection purposes of normalized data having reduced for batch effect is fine for exploratory PCA or MDS, but that is not used for DE testing, the formula of your DE design should have it either as factor passed or identify the covariates with SVA and pass the value of them in your design. These are mostly multifactorial design. This will to some extent take care of the reduction and over-influence the model for such conditions. Removing is not a good idea and you tend to overfit or lose important informations. Now why I say not using combat or removeBatch function as one cannot be sure what are the batches that might influence the noise, be it operator, different runs, different machines , different library kits, different year of data and different laboratories. So better to throw in the data to find out the confounders that one has and try modeling step-wise using them in design and find out at which surrogate variables you have best representation of your phenotype or condition-specific hypothesis.

ADD REPLY
2
Entering edit mode

Thank your for sharing this. I am trying to explain your idea in R code using edgeR package. Could you check whether it is right?

group <- factor(c("mut","mut","mut","wt","wt","wt"))
group <- relevel(group,ref="wt")

seq_batch <- factor(c(1,2,3,1,2,3))
design <- model.matrix(~group+seq_batch)

##### just get normalized counts with batch corrected for new exploratory analysis
norm_cpn <- removeBatchEffect(normalizeVSN(y), batch=seq_batch, design=design) # y is a DEGList object

##### include batch in your differential analysis
y <- estimateDisp(y, design) 
fit <- glmQLFit(y, design, robust=T)
results <- glmQLFTest(fit)
ADD REPLY
0
Entering edit mode

I think that it's more or less correct - yes. You may want to take a quick look here, where a similar question was asked: https://support.bioconductor.org/p/76837/

In your example, you produce variance-stabilised counts with the batch effect subtracted out. Downstream, in your comparisons, you're then conducting the comparison with batch in the design formula. This all looks okay to me.

You would later subset your norm_cpn object with your statistically significant genes for the purposes of, for example, clustering and heatmap generation.

ADD REPLY
0
Entering edit mode

Thank you for the tips. I have one more detail question though. According to the clustering and heatmap before batch removal, I have notice batch 1 and 3 are clustered within each group, while all batch 2 samples are clustered together regardless group. In reality, all batch are prepared independently, and I know there should have three batches. So in this case, should I used

seq_batch <- factor(c(1,2,1,1,2,1)) # based on the heatmap

or use

seq_batch <- factor(c(1,2,3,1,2,3)) # based on the reality
ADD REPLY
0
Entering edit mode

My preference would be the 'reality' configuration, but it's these type of judgement calls that can only really be made when one has gone through the entire study's analysis and has taken into account other factors.

ADD REPLY
0
Entering edit mode

So i have a similar problem as i was looking into micro-array data sets so in one of the datasets it contains the control and the test samples but the other two it contains only treated samples no control sample . The datasets im trying to use are as follows "GSE100927" contains control and test , GSE24702 and GSE37824 contains only test .

Now since the data contains common tissue type which is heart tissue , and the sample numbers are large , if I have to find DE i can only in case of the first data sets as it is with control but now with others as well as they are from different platform, there comes the batch issue.

As i read for microarray normalization can be done independently as different array platform contain their own method of normalizing the data .Now if i have to combine the rest two data-sets how can i do that ? Would that be correct if i compare with control sample i have with the rest of the datasets? can doing batch correction helps?

"How to integrate multiple data sets from microarray platform prior meta-analysis?" i have seen this where you talked about if the data comes from multi platform but again i would like to know your suggestion

ADD REPLY
0
Entering edit mode

Yes, I mean, if you are really desperate to get some results, then you could follow my advice in the other thread to which you have linked. I think that it is okay, given the set-up that you have. It would not be okay in the following situation: one study contains just Condition A, while the other contains just Condition B. It is not okay in this situation because the condition that you want to study (A vs B) is 100% confounded by batch.

ADD REPLY
0
Entering edit mode

Hello Kevin Blighe

I am working with the Biomark HD Fluidigm high throughput qPCR transcriptomics datasets (consists of 240 target genes and 8 housekeeping genes). The dataset consists of 250 samples processed in 5 different batches (50 samples/batch). Dataset of each batch was normalized to calculate Delta Ct (Difference between the Target gene and Geometric mean of 8 housekeeping gene (i.e GM_RG)). Then, transformed to - Delta Ct (negative delta Ct) because we want to apply downstream analyses like ANOVA to log-scale normalized relative expression values, then just using -1 * Delta Ct as readout is quite valid too. I understand from various posts and threads that either ComBat or removebatcheffects (limma package) would be best to use. My question is what data type should be used as inputs in these packages, Delta Ct or Negative Delta Ct (-1*Delta Ct).

ADD REPLY
0
Entering edit mode

Kevin May I know whether your comments and edits were mainly based on the fact that "Previously, [crazy] people had been using the original ComBat —which was designed on array data— for batch adjustment of raw counts."? Indeed, in the publication associated with ComBat-seq, the author did a simulation and found that (in Figure 3) ComBat-seq has better performance than treating batch as a covariate when the dispersion is dramatically different. It has comparable performance when the dispersion has no changes.

Do you think we should instead choose between ComBat-seq and fitting as covariate according to the gene dispersion info? (whether the estimation of dispersion across batches is accurate could be another issue). Any suggestion?

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6