Best method for batch correction of three datasets
12 weeks ago
CTLong ▴ 90

Hi all,

The biological question for my project requires integrating two to three publicly available datasets and performing pairwise DE analysis between various conditions. As much as we would like to generate a single dataset containing all the conditions (to minimize batch effects), we do not have the capacity to do so.

The two RNA-seq datasets that I would like to use were sequenced on the same platform with the same capture method. The datasets have the following sample sizes:

Dataset 1: Condition A (n = 100) and Condition B (n = 300)
Dataset 2: Condition C (n = 50) and Condition D (n = 20)
Dataset 3 (optional): Condition B (n = 30) and Condition C (n = 100)

In this regard, what is the "best" method to account for the batch effect while preserving biological differences in the differential expression results? Also, would incorporating dataset 3 (which shares conditions with datasets 1 and 2) and using RUV-seq improve the correction?

12 weeks ago

If you batch correct all three datasets together, datasets 1 and 3 have condition B in common and datasets 2 and 3 have condition C in common, so you should be able to get some (perhaps good, perhaps bad) estimate of the dataset-specific effects.

The best way to do the correction depends on what you want to do downstream. If you wish to do DE, then I would just build a single dataset from all 3 datasets, build the sample info table with two columns, one for dataset and one for condition, and then use the design ~dataset + condition in your favourite DE tool.
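Whether a design with both a dataset term and a condition term can be estimated at all depends on the conditions that are shared between datasets. The sketch below checks this for the sample layout given in the question by building a treatment-coded design matrix and testing whether it is full rank. It is written in Python with NumPy purely as an illustration; the names `LAYOUT`, `design_matrix` and `is_full_rank` are made up for this example and are not part of any DE tool.

```python
import numpy as np

# Sample layout from the question: (dataset, condition, number of samples)
LAYOUT = [
    ("ds1", "A", 100), ("ds1", "B", 300),
    ("ds2", "C", 50), ("ds2", "D", 20),
    ("ds3", "B", 30), ("ds3", "C", 100),
]

def design_matrix(layout):
    """Treatment-coded design for ~ dataset + condition.

    The first level of each factor (alphabetical order) is the reference.
    """
    datasets = sorted({d for d, _, _ in layout})
    conditions = sorted({c for _, c, _ in layout})
    rows = []
    for d, c, n in layout:
        row = [1.0]                                          # intercept
        row += [float(d == lvl) for lvl in datasets[1:]]     # dataset indicators
        row += [float(c == lvl) for lvl in conditions[1:]]   # condition indicators
        rows.extend([row] * n)                               # one row per sample
    return np.array(rows)

def is_full_rank(layout):
    """True if every coefficient of the additive design is estimable."""
    X = design_matrix(layout)
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

with_ds3 = is_full_rank(LAYOUT)
without_ds3 = is_full_rank([row for row in LAYOUT if row[0] != "ds3"])
print("all three datasets, full rank:", with_ds3)
print("datasets 1 and 2 only, full rank:", without_ds3)
```

With datasets 1 and 2 alone, dataset and condition are completely confounded, so the additive design is not full rank. Dataset 3 is what links the two through conditions B and C. Full rank only means the coefficients can be estimated; it says nothing about whether the additive assumption actually holds.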

There are some assumptions here, the primary amongst them being that the effect of the batch is linear in negative binomial GLM (NB GLM) coefficient space.
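Written out, the assumption looks roughly like this. The notation follows the usual DESeq2-style NB GLM, which is an assumption on my part since no specific tool has been fixed: K is the count for gene g in sample j, s_j a size factor, alpha_g the dispersion, and d(j) and c(j) the dataset and condition of sample j.

```latex
K_{gj} \sim \mathrm{NB}(\mu_{gj}, \alpha_g), \qquad
\log \mu_{gj} = \log s_j + \beta_{g0}
  + \beta^{\mathrm{dataset}}_{g,\,d(j)}
  + \beta^{\mathrm{condition}}_{g,\,c(j)}
```

There is no dataset-by-condition interaction term, so each dataset is assumed to shift a gene's log expression by the same amount whatever the condition.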

You might want to try taking your combined 600x19,000 matrix, running DESeq2::vst on it, then limma::removeBatchEffect on the transformed values, and doing a PCA of the results. I wouldn't use data corrected this way for DE, but it will give you a feeling for whether linear effects removal can remove batch effects.
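To make "linear effects removal" concrete, here is a minimal sketch of the idea on simulated log-scale data: fit condition and batch terms jointly for each gene, then subtract only the fitted batch component. This is not limma's code, only an illustration in the same spirit. The toy data, the sample sizes and the helper names (`one_hot`, `batch_gap`) are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy layout mirroring the question, 10 samples per group:
# ds1: A, B   ds2: C, D   ds3: B, C
batch = np.repeat([0, 0, 1, 1, 2, 2], 10)   # 0 = ds1, 1 = ds2, 2 = ds3
cond = np.repeat([0, 1, 2, 3, 1, 2], 10)    # 0 = A, 1 = B, 2 = C, 3 = D
n_genes = 50

def one_hot(labels):
    """Indicator columns for each level except the first (the reference)."""
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, :]).astype(float)[:, 1:]

X_cond = np.column_stack([np.ones(len(cond)), one_hot(cond)])  # intercept + condition
X_batch = one_hot(batch)                                       # batch indicators

# Simulated log-scale expression: condition effect + additive batch shift + noise
true_cond = rng.normal(0, 1, size=(X_cond.shape[1], n_genes))
true_batch = rng.normal(0, 2, size=(X_batch.shape[1], n_genes))
noise = rng.normal(0, 0.1, size=(len(cond), n_genes))
Y = X_cond @ true_cond + X_batch @ true_batch + noise

# Fit condition and batch terms together, then remove only the batch part
X = np.column_stack([X_cond, X_batch])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
batch_coef = coef[X_cond.shape[1]:]
Y_corrected = Y - X_batch @ batch_coef

def batch_gap(M):
    """Mean absolute ds1-vs-ds3 difference for condition B, averaged over genes."""
    in_b = cond == 1
    ds1_mean = M[in_b & (batch == 0)].mean(axis=0)
    ds3_mean = M[in_b & (batch == 2)].mean(axis=0)
    return float(np.abs(ds1_mean - ds3_mean).mean())

print("ds1 vs ds3 gap in condition B, before:", batch_gap(Y))
print("ds1 vs ds3 gap in condition B, after: ", batch_gap(Y_corrected))
```

The simulated batch effect is additive by construction, so the correction works well here. On real data the useful signal is how much dataset structure remains in the PCA after such a correction.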

RUV-seq is usually used to identify unseen factors in the data. Here we know the factors (which dataset a sample is from).


Thanks for the very descriptive reply. I will give it a try nonetheless, because this is pretty much the only way to address the biological question I have in mind. I've read papers that use RUVg for batch correction on the basis of negative control genes, using these as anchors instead of making other assumptions. But I agree with you that limma::removeBatchEffect or directly blocking for batch during DE analysis are probably the most straightforward approaches, given that the batches are known and the effect is linear.

