Hello all,
I have a dataset containing samples collected from multiple sampling sites, which I will use to build a machine learning model. The number of samples per site ranges from 2 to 26. After running PCA, I found that samples from the same site clustered together, which suggests a batch effect.
I now want to remove the batch effect using ComBat from the sva package in R. The issue is that ComBat runs on the whole dataset at once, but I need to split the data into training and test sets to avoid data leakage. From what I have read, after splitting I can run ComBat with the ref.batch option, setting the training set as the reference batch against which the test set is adjusted. That should avoid the leakage.
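To make the idea concrete, here is a minimal Python sketch of what I understand "adjusting toward a reference batch" to mean: the reference data are left untouched and every feature of the other batch is shifted and rescaled to match the reference mean and standard deviation. This is a plain per-feature location/scale adjustment, not ComBat's empirical Bayes model, and the function name `adjust_to_reference` and the toy numbers are made up for illustration.

```python
from statistics import mean, pstdev

def adjust_to_reference(ref_rows, other_rows):
    """Shift/scale each feature of other_rows to the mean and SD of ref_rows.

    Both arguments are lists of samples; each sample is a list of feature values.
    ref_rows is only read, never modified (it plays the role of the reference batch).
    """
    adjusted_cols = []
    # zip(*rows) turns the sample-by-feature layout into one tuple per feature
    for ref_col, oth_col in zip(zip(*ref_rows), zip(*other_rows)):
        ref_mu, ref_sd = mean(ref_col), pstdev(ref_col)
        oth_mu, oth_sd = mean(oth_col), pstdev(oth_col)
        # Guard against a constant feature in the non-reference batch
        scale = ref_sd / oth_sd if oth_sd > 0 else 1.0
        adjusted_cols.append([(x - oth_mu) * scale + ref_mu for x in oth_col])
    # Back to one list per sample
    return [list(row) for row in zip(*adjusted_cols)]

# Toy data (hypothetical): 3 samples x 2 features per set
train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]   # reference, stays as-is
test = [[11.0, 5.0], [13.0, 6.0], [15.0, 7.0]]    # gets adjusted
test_adj = adjust_to_reference(train, test)
```

The point of the sketch is only that nothing estimated from the test set is ever applied to the training data, which is the property I am hoping ref.batch gives me.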
My questions are:
- How should I handle the fact that the training set itself will contain several sampling sites? Should I pick a single site (e.g., the site with the largest number of samples) and use it as the reference? For example, suppose the dataset has 100 samples, with 80 for training and 20 for testing. In the training set, I would include all samples from Site 1 (26 samples) and randomly add samples from the other sites until it reaches 80. Then I would use Site 1 as the reference to adjust both the rest of the training set and the test set.
- Are there any better methods in this case?
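
In case it helps, the split I describe in the first question can be sketched as follows. This is only an illustration in Python (my real pipeline is in R); the site sizes other than Site 1 and the helper name `split_with_reference_site` are invented for the example.

```python
import random

def split_with_reference_site(site_of, ref_site, n_train, seed=0):
    """Put every sample of ref_site into training, then fill up with random
    samples from the other sites; whatever is left becomes the test set.

    site_of maps sample id -> site label.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ref = sorted(s for s, site in site_of.items() if site == ref_site)
    others = sorted(s for s, site in site_of.items() if site != ref_site)
    n_extra = n_train - len(ref)
    if n_extra < 0 or n_extra > len(others):
        raise ValueError("n_train is incompatible with the site sizes")
    rng.shuffle(others)
    train = ref + others[:n_extra]
    test = others[n_extra:]
    return train, test

# Hypothetical site sizes: 100 samples in total, between 2 and 26 per site
site_sizes = {"Site1": 26, "Site2": 20, "Site3": 18, "Site4": 14,
              "Site5": 12, "Site6": 8, "Site7": 2}
site_of = {f"{site}_{i}": site
           for site, n in site_sizes.items() for i in range(n)}

train_ids, test_ids = split_with_reference_site(site_of, "Site1", n_train=80)
print(len(train_ids), len(test_ids))  # 80 20
```

Note that with this split the test set never contains Site 1 samples, and small sites may end up entirely in training or entirely in testing, which is part of what I am unsure about.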
I appreciate all your feedback. Thank you.