Hi bobia9193,
Thanks for the detailed description of your setup—sounds like a classic case of site-specific technical variation confounding your biological signal, and it's smart that you're prioritising a train/test split to keep things honest for ML.
To your first question: yes, the ref.batch option in sva::ComBat is designed precisely for scenarios like yours, where you want to adjust the test set without retraining the model on it (avoiding leakage). The key is that ref.batch tells ComBat which batch(es) to "protect" during adjustment—i.e., it fits the correction parameters on the reference and pulls everything else towards it. Since your training set spans multiple sites, you don't want to arbitrarily pick just one site as the reference (e.g., your Site 1 with 26 samples). That could bias the correction towards that site's specific profile, potentially over- or under-correcting samples from other sites in both train and test.
Instead, here's a cleaner approach:
- Treat the entire training set as your reference "batch" by assigning a dummy batch label to all train samples (e.g.,
batch = "train" for all 80 samples, regardless of site). For the test set, assign batch = "test".
- Run
ComBat on the combined (train + test) dataset, but specify ref.batch = "train". This fits the correction solely on the train data and adjusts the test samples to match the train's overall distribution—without leaking test info back into the model.
- Crucially, after correction, remove the dummy batch column from your final features before feeding into ML.
This way, you're not forcing a single site to dominate, and it handles the multi-site heterogeneity in train naturally. Your example of including all of Site 1 + random others for train is fine for balancing, but don't tie the reference to Site 1 specifically.
As for implementation, something like this in R (assuming your data is a matrix expr with rows=features, cols=samples, and pheno has batch and site columns):
# Dummy batch for train/test
pheno$batch <- ifelse(pheno$split == "train", "train", "test") # assuming you have a 'split' column
# Run ComBat with ref
library(sva)
expr_corrected <- ComBat(dat = expr, batch = pheno$batch, ref.batch = "train",
mod = model.matrix(~site, data=pheno)) # include site as covariate if you want to preserve site effects
# Now subset back to train/test as needed, drop batch column
The mod argument lets you protect biological covariates (like site, if relevant for your ML target).
To your second question: while ComBat is solid for this, there are a few "better" (or at least more ML-friendly) alternatives, depending on your data scale and goals—especially since you're building a predictive model, not just doing DE:
- Model batch explicitly in ML: Skip correction altogether and include site (or other batch proxies) as a feature in your model (e.g., one-hot encoded). Tree-based methods like random forests or gradient boosting handle this well without assuming a parametric form for the batch effect. Quick to implement and interpretable.
- limma::removeBatchEffect(): If you prefer parametric correction, this is often more flexible than ComBat for RNA-seq-like data. It subtracts batch effects while preserving your design (e.g., site or other vars). Apply it separately to train (fit on train only), then transform test using the same coefficients. See the limma user's guide (section on batch adjustment).
- Harmony or fastMNN: If your data is high-dimensional (e.g., after PCA/dimensionality reduction), these mutual nearest neighbours methods integrate batches non-parametrically and work great for train/test scenarios.
Harmony (R package) can be fit on train and projected onto test. They're popular in single-cell but scale to bulk too.
- Assess success: Post-correction, always re-run PCA on train-only (pre-ML) to confirm sites no longer drive the top PCs. For test, check that its distribution overlaps train's without obvious shifts.
I'd lean towards explicit modeling in ML if your sites aren't too numerous (<10-15 levels), as it avoids any correction artifacts. If correction is a must, the dummy-batch ComBat tweak above should do the trick without much hassle.
Kevin
Is this really a batch effect? We usually use the word "batch" to mean a technical artifact that has no biological significance, like differences caused by different people handling different samples. But this looks like a real, and strong, biological difference cause by location.
I think it is a batch effect. Or if it is not, how can I exclude the effect of that environmental factor? Thanks.