Remove batch effects on the train set to avoid information leakage

8 months ago

JACKY ▴ 140

I want to apply limma's removeBatchEffect function to my data, but only after splitting it into train and test sets. I'm aware that applying batch correction before the split can leak information from the test set into training, so I want to avoid that. Until now, I've been batch correcting my entire dataset as follows:

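# Batch labels: one entry per sample (column) of TPM
cancer.type <- metdata$Cancer_Type
# Corrects all samples at once, so test samples influence the fit
correctedTPM <- limma::removeBatchEffect(TPM, batch = cancer.type)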

I'd like to change my approach: first correct the training set, then use the parameters estimated on the training set to correct the test set. This is analogous to best practice for data scaling, where the scaling parameters are learned on the training data only. Is there a way to do this in R with removeBatchEffect or another technique?
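To make the "fit on train, apply to test" idea concrete, here is a minimal base-R sketch. It is not limma's API: `fit_batch_offsets` and `apply_batch_offsets` are hypothetical helper names, and the toy matrix is simulated. It assumes a genes x samples matrix on a log scale, a single batch factor with no other covariates, and that every batch in the test set also appears in the training set. Under those assumptions, the per-gene offsets (batch mean minus the average of the batch means) follow the same idea as a sum-to-zero batch term, but they are estimated from the training samples only.

```r
# Hypothetical helpers (not part of limma).
# Learn per-gene batch offsets from the training samples only.
fit_batch_offsets <- function(x, batch) {
  # x: genes x samples matrix (assumed log-scale); batch: one label per sample
  batch <- as.factor(batch)
  batch_means <- matrix(0, nrow(x), nlevels(batch),
                        dimnames = list(rownames(x), levels(batch)))
  for (b in levels(batch)) {
    batch_means[, b] <- rowMeans(x[, batch == b, drop = FALSE])
  }
  # Centre on the unweighted mean of the batch means, so the offsets
  # sum to zero across batches for every gene
  batch_means - rowMeans(batch_means)
}

# Subtract previously learned offsets from any set of samples.
apply_batch_offsets <- function(x, batch, offsets) {
  batch <- as.character(batch)
  unseen <- setdiff(batch, colnames(offsets))
  if (length(unseen) > 0) {
    stop("batch not seen in training: ", paste(unseen, collapse = ", "))
  }
  x - offsets[, batch, drop = FALSE]
}

# Toy data: 5 genes, 12 samples, batch B shifted upwards by 2
set.seed(1)
expr <- matrix(rnorm(5 * 12), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:12)))
batch <- rep(c("A", "B"), each = 6)
expr[, batch == "B"] <- expr[, batch == "B"] + 2

train_idx <- c(1:4, 7:10)
test_idx  <- c(5:6, 11:12)

# Offsets come from the training samples only ...
offsets <- fit_batch_offsets(expr[, train_idx], batch[train_idx])
# ... and are then applied unchanged to both sets
train_corrected <- apply_batch_offsets(expr[, train_idx], batch[train_idx], offsets)
test_corrected  <- apply_batch_offsets(expr[, test_idx],  batch[test_idx],  offsets)
```

The test samples never contribute to `offsets`, which is the point of the exercise. Note that this only makes sense when `batch` is a technical batch label; subtracting per-group means removes whatever signal separates the groups.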

r limma batch-effect • 552 views

updated 8 months ago by Ram 43k • written 8 months ago by JACKY ▴ 140

I've seen bad experimental design where biological variables get confounded with sequencing batches, but this is the first time I'm encountering wanton disregard for biology and abuse of batch correction techniques.

8 months ago by Ram 43k
