Question

Best method for batch correction of three datasets

9 months ago

CTLong ▴ 120

Hi all,

The biological question in my project requires integrating two to three publicly available datasets to perform pairwise DE analysis between various conditions. As much as we would like to generate a single dataset containing all the conditions (to minimize batch effects), we do not have the capacity to do so.

The two RNA-seq datasets that I would like to use were sequenced on the same platform with the same capture method. The datasets have the following sample sizes:

Dataset 1: Condition A (n = 100) and Condition B (n = 300)
Dataset 2: Condition C (n = 50) and Condition D (n = 20)
Dataset 3 (optional): Condition B (n = 30) and Condition C (n = 100)

In this regard, what is the "best" method to account for the batch effect while preserving biological differences in the differential expression results? Also, would incorporating dataset 3 (which shares conditions with datasets 1 and 2) and using RUV-seq benefit the correction?

RNA-seq • 742 views

updated 8 months ago by ATpoint 84k • written 9 months ago by CTLong ▴ 120

ATpoint · Answer 1 · 2023-12-05

If you batch correct all three datasets together, then since datasets 1 and 3 have condition B in common, and datasets 2 and 3 have condition C in common, you should be able to get some (perhaps good, perhaps bad) estimate of the dataset-specific effects.
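To make the role of the shared conditions concrete, here is a minimal Python sketch (not part of any DE package) that checks which conditions are linked through common datasets. The function name `condition_groups` is made up for illustration, and the sketch assumes an additive dataset + condition model, under which two conditions are comparable only if they fall in the same linked group.

```python
from collections import defaultdict

# Which conditions were measured in which dataset (taken from the question).
DATASETS = {
    "dataset1": ["A", "B"],
    "dataset2": ["C", "D"],
    "dataset3": ["B", "C"],
}

def condition_groups(datasets):
    """Group conditions that are linked through shared datasets.

    Conditions in the same group can be compared under an additive
    dataset + condition model; conditions in different groups cannot,
    because their difference is confounded with a dataset effect.
    (Hypothetical helper, written for this illustration only.)
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen nodes become roots.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every condition to the dataset it was measured in.
    for name, conditions in datasets.items():
        for cond in conditions:
            parent[find(("cond", cond))] = find(("ds", name))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "cond":
            groups[find(node)].add(node[1])
    return sorted(sorted(g) for g in groups.values())

without_ds3 = {k: v for k, v in DATASETS.items() if k != "dataset3"}
print(condition_groups(without_ds3))
print(condition_groups(DATASETS))
```

With datasets 1 and 2 alone, A/B and C/D form two separate groups, so any comparison across them cannot be separated from the dataset effect. Adding dataset 3 joins all four conditions into one group.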

The best way to do the correction depends on what you want to do downstream. If you wish to do DE, then I would just build a single dataset from all 3 datasets, build the sample info table with two columns, one for dataset and one for condition, and then use the design ~ dataset + condition in your favourite DE tool.
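As a rough illustration of what that sample info table and design look like, here is a small Python sketch. It is not the DE tool's own code: the helper names `sample_table` and `design_matrix` and the sample IDs are invented for the example, and the dummy coding (first level of each factor as reference) is an assumption that mirrors the usual default.

```python
from collections import Counter

# (dataset, condition, n) as listed in the question.
COUNTS = [
    ("dataset1", "A", 100), ("dataset1", "B", 300),
    ("dataset2", "C", 50), ("dataset2", "D", 20),
    ("dataset3", "B", 30), ("dataset3", "C", 100),
]

def sample_table(counts):
    """One row per sample, with a dataset column and a condition column."""
    return [
        {"sample": f"{ds}_{cond}_{i + 1}", "dataset": ds, "condition": cond}
        for ds, cond, n in counts
        for i in range(n)
    ]

def design_matrix(rows):
    """Dummy-code dataset + condition; the first level of each is the reference."""
    ds_levels = sorted({r["dataset"] for r in rows})
    cond_levels = sorted({r["condition"] for r in rows})
    columns = (["intercept"]
               + list(ds_levels[1:])
               + [f"condition{c}" for c in cond_levels[1:]])
    matrix = [
        [1]
        + [int(r["dataset"] == d) for d in ds_levels[1:]]
        + [int(r["condition"] == c) for c in cond_levels[1:]]
        for r in rows
    ]
    return columns, matrix

rows = sample_table(COUNTS)
columns, matrix = design_matrix(rows)
print(len(rows), columns)
print(Counter((r["dataset"], r["condition"]) for r in rows))
```

Each sample gets one indicator for its dataset and one for its condition, so the condition coefficients are estimated after accounting for a per-dataset offset.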

There are some assumptions here, the primary one being that the effect of the batch is linear in NB-GLM coefficient space, i.e. each dataset shifts a gene's log expression by the same amount whatever the condition.
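A toy Python sketch of that assumption for a single gene may help. All numbers below are invented for illustration. The point is only that, if the dataset effect is additive on the log scale, it cancels within a dataset, so a contrast never observed within one dataset (A vs C) can be chained through the shared condition B.

```python
import math

# Hypothetical values for one gene on the natural-log scale (illustration only).
LOG_CONDITION_MEAN = {"A": 2.0, "B": 3.0, "C": 4.5, "D": 1.0}
LOG_DATASET_SHIFT = {"dataset1": 0.0, "dataset2": 0.8, "dataset3": -0.4}

def expected_count(dataset, condition):
    """Mean count under the additive assumption: log(mu) = condition + dataset."""
    return math.exp(LOG_CONDITION_MEAN[condition] + LOG_DATASET_SHIFT[dataset])

def log_fc(dataset, cond_a, cond_b):
    """Within-dataset log fold change; the dataset shift cancels out."""
    return math.log(expected_count(dataset, cond_a) / expected_count(dataset, cond_b))

# A vs C, chained through B: (A - B in dataset1) + (B - C in dataset3).
a_vs_c = log_fc("dataset1", "A", "B") + log_fc("dataset3", "B", "C")

# Naive comparison across datasets absorbs the dataset2 shift.
naive = math.log(expected_count("dataset1", "A") / expected_count("dataset2", "C"))

print(a_vs_c, naive)
```

Here the chained estimate matches the true difference of 2.0 - 4.5 on the log scale, while the naive cross-dataset comparison is off by the dataset 2 shift. If the batch effect differed between conditions (an interaction), this cancellation would no longer hold.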

You might want to try taking your combined 600 x 19,000 matrix, running DESeq2::vst on it, then limma::removeBatchEffect on the transformed values (it expects log-scale data), and doing a PCA of the results. I wouldn't use data corrected this way for DE, but it will give you a feeling for whether linear effect removal can remove the batch effects.

RUV-seq is usually used to identify unseen factors in the data. Here we know the factors (which dataset a sample is from).