How to assess RNA-Seq batch correction success?
1
0
Entering edit mode
2.4 years ago
borkk ▴ 130

Hi

When batch correcting gene expression data one could either do a two step procedure (e.g. Combat + DE analysis) or in a single step where batch is incorporated into the stats as a covariate.

This discussion (Batch effects : ComBat or removebatcheffects (limma package) ?) and the paper it references indicate the second way is preferable.

However the first method had the benefit of allowing a 'corrected' dataset to be assessed using PCA, clustering.

Is there any way (e.g. in DESeq2, edgeR or Limma) to do a similar assessment of batch correction sucess using the single-step method?

RNA-Seq DESeq2 Limma Combat batch correction • 878 views
1
Entering edit mode

edgeR has the removeBatchEffect() function which allows plotting of the corrected values.

1
Entering edit mode
2.4 years ago

Use removeBatchEffect() from Limma

0
Entering edit mode

Thanks for your quick reply Kevin. That would involve the 2 step procedure which is less preferable. I guess my question really is whether there is some statistic we can get after applying the group+batch test that we might be able to apply to uncorrected data to assess how well batch was accounted for in the modelling e.g. with PCA

1
Entering edit mode

If you generate a PCA bi-plot before and after, you could correlate the loadings for the PCs to batch and see which are statistically significant. For example, if PC1 'drives' batch and is correlated at p<0.0001 before batch removal, then hopefully this effect vanishes after batch removal.

Also, before batch, if you transform your loadings to Z scale, you will likely see differences (absolute Z scores) > 3, which are reflective of batch. After batch removal, these should be less. |Z| 1.96 is equivalent to p 0.05

0
Entering edit mode

Thanks for the suggestion Kevin. I might be missing something here but with the single step procedure there wouldn't be any 'corrected' data that I could use for the 'after' PCA. I presume you still mean using removeBatchEffect() anfd have a corrected dataset to compare. However since consensus is that it is 'better' to model batch in the DE analysis, how do we show that batch was modellled sufficiently well and accounted for? Or is this sort of thing just taken as a given?

2
Entering edit mode

I see what you mean. Including the batch in the model in this way is essentially saying that batch's effect is additive, and this will be handled by the underlying regression model.

Most / All that I know make the assumption that batch has been modeled correctly. For this to work, all that I can say is that the batch effect should be consistent across your samples; otherwise, it will be difficult to model.

Im not sure that there is anything inherent to limma, EdgeR, or DESeq2 to check how well a particular model term has been fit. If we were just fitting the model using our own functions, then it would be easier to check.

You could probably simply check the batch term and generate a results table from it. If batch is real, you should obviously see statistically significant differences. If many p-values are NA, it may indicate that they failed Cook's test / independent filtering / have high variance, which is indicative of a problem.

Michael Love and Gordon Smyth likely have better suggestions - they are mostly on Bioconductor support forum.

As another practical example, in the past, I had a very consistent batch effect and used DESeq2. Batch was included in the design formula only. When I transformed the normalised counts via rlog or vst, I set blind=FALSE, which means that the model formula was 'visible' to the transformation. This itself was sufficient to remove the batch effect.

1
Entering edit mode

Thanks very much for the tips Kevin.