Question: Batch effects: ComBat or removeBatchEffect (limma package)?
lessismore550 (Mexico) wrote, 16 months ago:

Hey all,

Having read from many sources about how to correct for known batch effects in an expression data compendium, I now want to understand which kind of batch-effect correction method to apply. I am comparing two options:

1) the removeBatchEffect() function in the limma package
2) ComBat (sva package)

Both programs adjust the dataset for known sources of variation, supplied as a "batch" vector. From what I understood, with limma it is possible to retain the expected biological variation via the "covariates" option: "The covariates argument allows correction for one or more continuous numeric effects, similar to the analysis of covariance method in statistics."

Two questions: 1. Could someone explain what this sentence means? 2. What is the difference between the two programs, and why/when should one be used over the other?

Thanks in advance

Tags: limma • combat • batch effects • sva
Kevin Blighe33k (Republic of Ireland) wrote, 13 months ago:

Good afternoon, my friend,

In order to better understand the batch correction methods, I highly recommend reading the following manuscript two or three times: https://academic.oup.com/biostatistics/article/17/1/29/1744261/Methods-that-remove-batch-effects-while-retaining

The conclusion that you should get from reading this is that correcting for batch directly with programs like ComBat is best avoided. If at all possible, include batch as a covariate in all of your statistical models / tests. Programs like DESeq2 allow you to include batch (like any other factor) as a covariate during normalisation.
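For illustration, a minimal sketch of including batch as a covariate in a DESeq2 model (the count matrix and sample-table column names here are hypothetical, not from the thread):

```r
library(DESeq2)

# Hypothetical sample table: 'batch' and 'condition' are factors
coldata <- data.frame(
  condition = factor(c("mut", "mut", "wt", "wt")),
  batch     = factor(c(1, 2, 1, 2))
)

# Batch enters the design as a covariate; the condition of interest
# conventionally goes last in the formula
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData   = coldata,
                              design    = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds)  # inference for condition, adjusted for batch
```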

Kevin

---------------------------------------

Edit 1: 17th June 2018.

It should be pointed out that there is a difference between modelling a batch effect and directly modifying counts in order to counteract a batch effect. Programs like ComBat aim to directly modify your counts in an attempt to eliminate batch effects (it literally 'subtracts' out the modelled effect, which can result in the infamous negative values after employing ComBat). After employing ComBat, statistical tests are conducted on the modified counts, with batch not appearing in the design formula.

However, by including 'batch' as a covariate in the design formula for differential expression analysis, one is simply modelling the effect sizes of the batch (without actually adjusting your raw or normalised counts); these estimates are then used to adjust the statistical inferences when calculating p-values for your condition of interest. This does not modify the underlying data. In DESeq2, however, the vst() and rlog() transformations can take the design formula (batch and anything else in it) into account by setting blind=FALSE, and this is the procedure recommended by Michael Love when the aim is to use the transformed counts for downstream analyses.
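A sketch of that workflow, assuming a DESeqDataSet `dds` built with batch in its design and a 'batch' column in its sample table; the extra removeBatchEffect() step on the transformed values is a common practice for exploratory plots, not something mandated by DESeq2 itself:

```r
library(DESeq2)
library(limma)

# blind = FALSE makes the transformation aware of the design formula
vsd <- vst(dds, blind = FALSE)

# For exploratory plots only, batch can additionally be regressed out of
# the transformed values; the raw counts in dds remain untouched
assay(vsd) <- removeBatchEffect(assay(vsd), batch = vsd$batch)
plotPCA(vsd, intgroup = "condition")
```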

Edit 2: 17th June 2018.

Between ComBat and the removeBatchEffect() function from limma, I would use the limma function.
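A minimal sketch of that limma call (object names hypothetical); note that the `design` argument tells removeBatchEffect() which biological variation to protect from removal:

```r
library(limma)

batch <- factor(c(1, 1, 2, 2))
group <- factor(c("wt", "mut", "wt", "mut"))

# 'logexpr' is assumed to be a matrix of log-expression values (e.g. log-CPM);
# the design here contains only the biology to preserve, not the batch
design <- model.matrix(~ group)
corrected <- removeBatchEffect(logexpr, batch = batch, design = design)
```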

Further edits:

  • 16th August 2018
I agree with what Kevin is proposing. Deciding on the batches a priori is really not a good idea. I know people will beg to differ, but you will tend to lose true signal as well, or fit your data in a way that produces drop-outs. I would work with the sva package to find the total number of covariates, and then use them as factor levels in the statistical model, which DESeq2 can take care of. I have already done this with DESeq2; I would have to go back and check its compatibility with limma and edgeR, but I am sure it can be done.

Another problem is that batch effects never get fully nullified; you are only reducing their influence. Having normalised data with the batch effect reduced is fine for projection purposes, such as exploratory PCA or MDS, but that data is not used for DE testing: the design formula of your DE analysis should either include batch as a factor, or include the covariates identified with SVA. These are mostly multifactorial designs. This will, to some extent, take care of the reduction without over-influencing the model. Removing batch effects outright is not a good idea, and you tend to overfit or lose important information.

The reason I would not use ComBat or removeBatchEffect() is that one cannot be sure which batches influence the noise: operator, different runs, different machines, different library kits, different years of data, different laboratories. It is better to throw in the data to find out which confounders you have, model step-wise using them in the design, and find out with which surrogate variables you get the best representation of your phenotype- or condition-specific hypothesis.
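A sketch of the surrogate-variable approach described above, using svaseq() from the sva package (the `norm_counts` matrix and `coldata` table are hypothetical placeholders):

```r
library(sva)

# Full model includes the condition of interest; null model does not
mod  <- model.matrix(~ condition, data = coldata)
mod0 <- model.matrix(~ 1, data = coldata)

# Estimate surrogate variables from the normalised count matrix
svobj <- svaseq(norm_counts, mod, mod0)

# Append the surrogate variables to the design used for DE testing
design <- cbind(mod, svobj$sv)
```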

written 13 months ago by vchris_ngs4.6k

Thank you for sharing this. I am trying to express your idea in R code using the edgeR package. Could you check whether it is right?

group <- factor(c("mut","mut","mut","wt","wt","wt"))
group <- relevel(group,ref="wt")

seq_batch <- factor(c(1,2,3,1,2,3))
design <- model.matrix(~group+seq_batch)

##### just get batch-corrected normalised values for new exploratory analysis
# y is a DGEList object; normalizeVSN() expects a matrix, and the design
# passed to removeBatchEffect() should contain only the biology to preserve
norm_cpn <- removeBatchEffect(normalizeVSN(y$counts), batch=seq_batch, design=model.matrix(~group))

##### include batch in your differential analysis
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit, coef=2) # coef=2 tests mut vs wt; the default (last coefficient) would test batch
written 3 months ago by sckinta480

I think that it's more or less correct - yes. You may want to take a quick look here, where a similar question was asked: https://support.bioconductor.org/p/76837/

In your example, you produce variance-stabilised counts with the batch effect subtracted out. Downstream, in your comparisons, you're then conducting the comparison with batch in the design formula. This all looks okay to me.

You would later subset your norm_cpn object with your statistically significant genes for the purposes of, for example, clustering and heatmap generation.
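For example, assuming `results` and `norm_cpn` from the code above, and a hypothetical FDR cut-off of 0.05 (pheatmap is one assumed choice of plotting function; any heatmap function works):

```r
library(edgeR)
library(pheatmap)

# Keep batch-corrected values for the significant genes only
tt  <- topTags(results, n = Inf)$table
sig <- rownames(tt)[tt$FDR < 0.05]
pheatmap(norm_cpn[sig, ], scale = "row")
```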

written 3 months ago by Kevin Blighe33k

Thank you for the tips. I have one more detailed question, though. In the clustering and heatmap before batch removal, I noticed that batches 1 and 3 cluster within each group, while all batch-2 samples cluster together regardless of group. In reality, all batches were prepared independently, and I know there should be three batches. So, in this case, should I use

seq_batch <- factor(c(1,2,1,1,2,1)) # based on the heatmap

or use

seq_batch <- factor(c(1,2,3,1,2,3)) # based on the reality
written 3 months ago by sckinta480

My preference would be the 'reality' configuration, but these are the type of judgement calls that can only really be made once one has gone through the entire study's analysis and taken other factors into account.

written 3 months ago by Kevin Blighe33k