Batch correction with ComBat for samples from multiple sampling sites
18 hours ago
bobia9193 ▴ 20

Hello all,

I have a dataset of samples collected from multiple sampling sites, which I will use to build a machine learning model. The number of samples per site ranges from 2 to 26. After running PCA, I found that samples from the same site cluster together, which suggests a site-level batch effect.

I now want to remove the batch effect using ComBat from the sva package in R. The issue is that ComBat is run on the whole dataset at once, but I need to split the dataset into training and test sets to avoid data leakage. From what I have read, after splitting I can run ComBat with the ref.batch option, setting the training set as the reference batch against which the test set is adjusted. This should avoid data leakage.
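To make the "training set as reference" idea concrete, here is a minimal sketch in Python using only the standard library. It is not ComBat: it is a plain per-feature location and scale alignment, without ComBat's empirical Bayes shrinkage or covariate handling. The function name `align_to_reference` and the simulated data are made up for illustration. The only point it shows is that the reference (training) values are left untouched while the test values are moved toward them.

```python
import random
import statistics

def align_to_reference(train, test):
    """Shift and rescale each feature of `test` to match `train`.

    `train` and `test` are lists of samples; each sample is a list of
    feature values. `train` is only read, never modified, which mirrors
    how a reference batch is left unchanged. This is a simplified
    stand-in, not the ComBat algorithm.
    """
    n_features = len(train[0])
    adjusted = [list(row) for row in test]
    for j in range(n_features):
        ref_col = [row[j] for row in train]
        new_col = [row[j] for row in test]
        ref_mean, ref_sd = statistics.mean(ref_col), statistics.stdev(ref_col)
        new_mean, new_sd = statistics.mean(new_col), statistics.stdev(new_col)
        for i, row in enumerate(test):
            # Standardize within the test set, then map onto the reference scale.
            z = (row[j] - new_mean) / new_sd if new_sd > 0 else 0.0
            adjusted[i][j] = ref_mean + z * ref_sd
    return adjusted

# Simulated data (an assumption): 80 training and 20 test samples, 3 features,
# with the test set shifted and rescaled to mimic a batch effect.
rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(80)]
test = [[rng.gauss(2.0, 1.5) for _ in range(3)] for _ in range(20)]
test_adj = align_to_reference(train, test)
```

Note that the adjustment still uses the test set's own mean and standard deviation. Nothing from the test set flows into the training data, but the test samples are corrected as a group rather than one at a time.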

My questions are:

  1. How should I handle the fact that the training set itself contains several sampling sites? Should I pick a single site (e.g., the one with the most samples) and use it as the reference? For example, suppose the dataset has 100 samples, with 80 for training and 20 for testing. The training set would include all samples from Site 1 (26 samples), plus samples drawn at random from the other sites until it reaches 80. I would then use Site 1 as the reference batch to adjust both the training and test sets.
  2. Are there any better methods in this case?
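
The split described in question 1 can be sketched as follows, in Python with the standard library only. The site sizes are hypothetical: only the total of 100, the range of 2 to 26, and Site 1 having 26 samples come from the post. The helper name `split_with_reference_site` is also made up.

```python
import random

# Hypothetical site sizes (they sum to 100). Only Site1 = 26 and the
# 2-to-26 range come from the question; the rest are invented.
site_sizes = {"Site1": 26, "Site2": 20, "Site3": 18, "Site4": 14,
              "Site5": 10, "Site6": 6, "Site7": 4, "Site8": 2}

# Each sample is identified by (site, index within site).
samples = [(site, i) for site, n in site_sizes.items() for i in range(n)]

def split_with_reference_site(samples, ref_site, n_train, seed=0):
    """Put every sample from `ref_site` in the training set, then fill the
    training set up to `n_train` with randomly chosen samples from the
    other sites. Whatever is left over becomes the test set."""
    rng = random.Random(seed)
    ref = [s for s in samples if s[0] == ref_site]
    others = [s for s in samples if s[0] != ref_site]
    rng.shuffle(others)
    n_extra = n_train - len(ref)
    return ref + others[:n_extra], others[n_extra:]

train, test = split_with_reference_site(samples, "Site1", n_train=80)
print(len(train), len(test))  # 80 20

# Sites that appear in the test set but have no training samples at all.
train_sites = {site for site, _ in train}
unseen_sites = {site for site, _ in test} - train_sites
```

With a purely random fill, a small site (such as one with only 2 samples) may end up entirely in the test set, so it is worth checking `unseen_sites` before estimating any site-level adjustment from the training data.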

I appreciate all your feedback. Thank you.

batch-correction ComBat