Question

Batch effect correction & normalization after training/test split for gene expression data

2.2 years ago

mmitra ▴ 60

Hi all, I have a few questions about building a binary machine learning classifier from bulk RNA-seq samples. I have bulk RNA-seq data for around 200 samples, which come from 4 different datasets. Each sample belongs to either class A or class B. The samples are split 70:30 into training and test sets in a balanced way (equal representation of both classes). I have the following three questions:

1. Do I perform the batch correction (using ComBat-seq, with dataset as the batch) after splitting the samples into training and test sets, i.e. independently for the training set and the test set? The output of ComBat-seq is a batch-corrected raw count matrix.
2. If I do the batch correction separately for the training and test sets (as described in 1), the next step would be to normalize the training and test count matrices separately and independently using DESeq2. Is that OK? I know that when scaling the data, the mean from the training set should be used for the test set. Is there something similar we need to do for the DESeq2 normalization, i.e. normalize the test set using parameters from the training-set normalization? (And does the same apply to the batch effect correction?)
3. Lastly, I also want to use a single-cell dataset as a test set. What if some of the features of the full trained model (trained on bulk RNA-seq) are not present in the single-cell dataset? How do I deal with that? I will not know whether those features were not detected or are not expressed in the cells, so I am guessing I cannot set them to zero for all the cells. Or can I?
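To make question 3 concrete, here is a minimal Python sketch (not from any of the tools mentioned in this thread) of lining up a new dataset against the training feature list. The function name `align_to_training_features` and the gene names are made up for illustration. It marks unmeasured genes as NaN rather than zero, so "not measured" stays distinguishable from "measured with zero counts"; what to do with those genes afterwards is still an open modelling decision.

```python
import numpy as np

def align_to_training_features(X_new, new_genes, train_genes):
    """Reorder columns of X_new (samples x genes) to match train_genes.

    Genes absent from new_genes are filled with NaN, not zero.
    Returns the aligned matrix and a boolean mask of the genes found.
    """
    col_of = {g: j for j, g in enumerate(new_genes)}
    aligned = np.full((X_new.shape[0], len(train_genes)), np.nan)
    present = np.zeros(len(train_genes), dtype=bool)
    for k, g in enumerate(train_genes):
        j = col_of.get(g)
        if j is not None:
            aligned[:, k] = X_new[:, j]
            present[k] = True
    return aligned, present

# Hypothetical toy data: 4 model features, 3 genes in the single-cell matrix.
train_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
sc_genes = ["GENE_C", "GENE_A", "GENE_X"]
X_sc = np.array([[5.0, 0.0, 2.0],
                 [1.0, 3.0, 0.0]])

X_aligned, present = align_to_training_features(X_sc, sc_genes, train_genes)
missing = [g for g, p in zip(train_genes, present) if not p]
print(missing)  # ['GENE_B', 'GENE_D']
```

With the missing genes listed explicitly, one option is to refit the classifier on only the features shared by both datasets instead of imputing values for the absent ones.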

I would really appreciate any help. I am new to the ML field, so thinking about all these questions. Thanks so much!

RNA-seq batch-effect normalization • 2.5k views

ADD COMMENT • link updated 5 months ago by Ram 44k • written 2.2 years ago by mmitra ▴ 60

Are you hoping to account for known batch effects (i.e. you have metadata which tells you some of the experiment was done on one day and the rest the next and there is a big difference between results) or do you think there are unknown effects? These situations require different solutions.

ADD REPLY • link 2.2 years ago by BioInfoBeginner ▴ 50

Thanks for addressing this. Yes, the 4 datasets come from different papers (different research groups). So, I was thinking of correcting for batch effects treating each dataset as a different batch. Each dataset also uses a different cell type but the two classes (A and B) are the same. So there could be biological variations on top of the batch effects. But the main objective is to classify samples into A or B classes irrespective of cell type (general classifier).

ADD REPLY • link 2.2 years ago by mmitra ▴ 60

After a literature search, I found two recent papers that address batch correction (and normalization):

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8485848/
https://academic.oup.com/bioinformatics/article/33/3/397/2608637

The second one seems more applicable to my workflow. It discusses normalization and batch-effect correction methods for training/test sets. The results are based on microarray datasets, so I have to look into how it could be applied to RNA-seq datasets.
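For RNA-seq counts, one way to carry over the "fit on training, apply to test" idea is to freeze the reference used by median-of-ratios normalization. The sketch below is my own minimal NumPy version, not DESeq2's code and not the method from either paper; `fit_reference` and `size_factors` are hypothetical helper names. The per-gene geometric means are computed from the training samples only, and each test sample gets its size factor against that fixed reference.

```python
import numpy as np

def fit_reference(train_counts):
    """Per-gene mean of log counts over training samples (samples x genes).

    A gene with a zero count in any training sample gets -inf and is
    excluded from the reference, as in median-of-ratios normalization.
    """
    with np.errstate(divide="ignore"):
        return np.log(train_counts).mean(axis=0)

def size_factors(counts, log_ref):
    """Median-of-ratios size factor per sample against a fixed reference."""
    usable = np.isfinite(log_ref)
    factors = np.empty(counts.shape[0])
    for i, row in enumerate(counts):
        ok = usable & (row > 0)  # skip zeros so the log is defined
        factors[i] = np.exp(np.median(np.log(row[ok]) - log_ref[ok]))
    return factors

# Toy counts (samples x genes); the third gene has a zero in training.
train = np.array([[10.0, 20.0, 0.0, 40.0],
                  [20.0, 40.0, 5.0, 80.0]])
test = np.array([[30.0, 60.0, 7.0, 120.0]])

log_ref = fit_reference(train)          # learned from training data only
train_sf = size_factors(train, log_ref)
test_sf = size_factors(test, log_ref)   # test sample never alters log_ref

train_norm = train / train_sf[:, None]
test_norm = test / test_sf[:, None]
```

The point of the design is that adding or removing test samples cannot change the normalization of the training data. As far as I know, DESeq2's `estimateSizeFactors` accepts a `geoMeans` argument for supplying such a reference, but check the documentation before relying on that.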

ADD REPLY • link 2.2 years ago by mmitra ▴ 60

For what it's worth, I recently developed a method for batch correction that conditions on confounders, and that is designed to work according to the train-test paradigm. Paper is here: https://arxiv.org/abs/2203.12720, and code is here: https://github.com/calvinmccarter/condo-adapter. Hope this helps!