Question

Batch correction with ComBat for samples from multiple sampling sites

0

Entering edit mode

6 days ago

bobia9193 ▴ 20

Hello all,

I have a dataset containing samples collected from multiple sampling sites, which I will use to build a machine learning model. The number of samples in each site ranges from 2 to 26. After running PCA, I found that samples from the same sites were clustered together, indicating a batch effect.

I now want to remove batch effect using ComBat in the sva package in R. The issue is that this package can only run on the whole dataset, but I need to split the dataset into training and testing sets to avoid data leakage. I have done research and know that after splitting, I can apply ComBat with the option ref.batch to set the training set as a reference for adjusting the test set. Hence, I can avoid data leakage.

My questions are:

How can I deal with this case? The training set will have several sampling sites. Should I select one site randomly (e.g., the site with the largest number of samples) and use it as a reference? For example, the dataset has 100 samples, 80 samples for training and 20 samples for testing. In the training set, I will include all samples from Site 1 (26 samples) and add randomly samples from other sites up to 80. Next, I will use Site 1 as a reference to adjust the training and testing sets.
Are there any better methods in this case?

I appreciate all your feedback. Thank you.

batch-correction ComBat • 417 views

ADD COMMENT • link updated 21 hours ago by Kevin Blighe 89k • written 6 days ago by bobia9193 ▴ 20

0

Entering edit mode

Is this really a batch effect? We usually use the word "batch" to mean a technical artifact that has no biological significance, like differences caused by different people handling different samples. But this looks like a real, and strong, biological difference cause by location.

ADD REPLY • link 4 days ago by swbarnes2 15k

0

Entering edit mode

I think it is a batch effect. Or if it is not, how can I exclude the effect of that environmental factor? Thanks.

ADD REPLY • link 4 days ago by bobia9193 ▴ 20

score 0 · Answer 1 · 2025-11-06

Hi bobia9193,

Thanks for the detailed description of your setup—sounds like a classic case of site-specific technical variation confounding your biological signal, and it's smart that you're prioritising a train/test split to keep things honest for ML.

To your first question: yes, the ref.batch option in sva::ComBat is designed precisely for scenarios like yours, where you want to adjust the test set without retraining the model on it (avoiding leakage). The key is that ref.batch tells ComBat which batch(es) to "protect" during adjustment—i.e., it fits the correction parameters on the reference and pulls everything else towards it. Since your training set spans multiple sites, you don't want to arbitrarily pick just one site as the reference (e.g., your Site 1 with 26 samples). That could bias the correction towards that site's specific profile, potentially over- or under-correcting samples from other sites in both train and test.

Instead, here's a cleaner approach:

Treat the entire training set as your reference "batch" by assigning a dummy batch label to all train samples (e.g., batch = "train" for all 80 samples, regardless of site). For the test set, assign batch = "test".
Run ComBat on the combined (train + test) dataset, but specify ref.batch = "train". This fits the correction solely on the train data and adjusts the test samples to match the train's overall distribution—without leaking test info back into the model.
Crucially, after correction, remove the dummy batch column from your final features before feeding into ML.

This way, you're not forcing a single site to dominate, and it handles the multi-site heterogeneity in train naturally. Your example of including all of Site 1 + random others for train is fine for balancing, but don't tie the reference to Site 1 specifically.

As for implementation, something like this in R (assuming your data is a matrix expr with rows=features, cols=samples, and pheno has batch and site columns):

# Dummy batch for train/test
pheno$batch <- ifelse(pheno$split == "train", "train", "test")  # assuming you have a 'split' column

# Run ComBat with ref
library(sva)
expr_corrected <- ComBat(dat = expr, batch = pheno$batch, ref.batch = "train",
                         mod = model.matrix(~site, data=pheno))  # include site as covariate if you want to preserve site effects

# Now subset back to train/test as needed, drop batch column

The mod argument lets you protect biological covariates (like site, if relevant for your ML target).

To your second question: while ComBat is solid for this, there are a few "better" (or at least more ML-friendly) alternatives, depending on your data scale and goals—especially since you're building a predictive model, not just doing DE:

Model batch explicitly in ML: Skip correction altogether and include site (or other batch proxies) as a feature in your model (e.g., one-hot encoded). Tree-based methods like random forests or gradient boosting handle this well without assuming a parametric form for the batch effect. Quick to implement and interpretable.
limma::removeBatchEffect(): If you prefer parametric correction, this is often more flexible than ComBat for RNA-seq-like data. It subtracts batch effects while preserving your design (e.g., site or other vars). Apply it separately to train (fit on train only), then transform test using the same coefficients. See the limma user's guide (section on batch adjustment).
Harmony or fastMNN: If your data is high-dimensional (e.g., after PCA/dimensionality reduction), these mutual nearest neighbours methods integrate batches non-parametrically and work great for train/test scenarios. Harmony (R package) can be fit on train and projected onto test. They're popular in single-cell but scale to bulk too.
Assess success: Post-correction, always re-run PCA on train-only (pre-ML) to confirm sites no longer drive the top PCs. For test, check that its distribution overlaps train's without obvious shifts.

I'd lean towards explicit modeling in ML if your sites aren't too numerous (<10-15 levels), as it avoids any correction artifacts. If correction is a must, the dummy-batch ComBat tweak above should do the trick without much hassle.

Kevin