Question

Should differential expression analysis be incorporated in cross validation for training machine learning models?

7 weeks ago

yordany.perdigon • 0

Hello, I'm conducting some experiments using TCGA-LUAD clinical and RNA-Seq count data. I'm building machine learning models for survival prediction (Random Survival Forests, Survival Support Vector Machines, etc.).

In several papers, I’ve noticed that differential expression analysis is often used as a first step to reduce dataset dimensionality. However, I’m not entirely sure how this step should be integrated into the modeling pipeline.

Specifically, should the differential expression analysis be incorporated within the cross-validation process?

My current idea is to select appropriate samples for the DE analysis (tumor vs. adjacent normal tissue), filter the genes based on the DE results, and then run the cross-validation experiments on this reduced dataset. The tumor samples used in the DE step would be excluded from cross-validation; the adjacent normal samples are not used for model training in any case.

Would this approach be correct? I’m concerned about potential data leakage if DE is done prior to cross-validation.

RNA-seq DEA TCGA Learning Machine • 1.0k views

ADD COMMENT • link 6 weeks ago by yordany.perdigon • 0

score 1 · Answer 1 · 2025-10-14

Yes, there would potentially be data leakage if you perform the DE before dividing your dataset into folds. It's unclear how much of a problem such data leakage would be. If you want to test that, divide your dataset into training and testing sets. Perform DE on the training set only, and select the DE genes. Then perform cross-validation by dividing the training set into folds and learning models. Then test the resulting model on the held-out testing set that was not used for DE. You could compare this to a setup where you divide your data into folds and compute the DE genes separately on the training part of each fold.
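To make the per-fold variant concrete, here is a minimal Python sketch. It is not a real DE analysis: `select_genes` is a hypothetical stand-in that ranks genes by the absolute difference in group means, where in practice you would run a proper DE tool (for example DESeq2 or edgeR) on the counts. `fit_and_score` is also a hypothetical callback, the place where your survival model would be trained and evaluated. The only point it illustrates is that the held-out samples never enter the gene selection.

```python
import random
from statistics import mean


def make_folds(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_genes(X, groups, sample_idx, top_n):
    """Stand-in for DE (assumption, not a real DE test): rank genes by the
    absolute difference in mean expression between group 1 and group 0,
    using ONLY the samples listed in sample_idx."""
    a = [i for i in sample_idx if groups[i] == 1]
    b = [i for i in sample_idx if groups[i] == 0]
    if not a or not b:
        raise ValueError("both groups must be present in the training samples")
    scores = []
    for g in range(len(X[0])):
        diff = abs(mean(X[i][g] for i in a) - mean(X[i][g] for i in b))
        scores.append((diff, g))
    scores.sort(reverse=True)
    return sorted(g for _, g in scores[:top_n])


def cv_with_per_fold_selection(X, groups, k, top_n, fit_and_score, seed=0):
    """Run k-fold CV, redoing the gene selection inside every fold.

    fit_and_score(train_idx, test_idx, genes) is a user-supplied callback
    (hypothetical here) that trains on train_idx and scores on test_idx,
    restricted to the selected genes."""
    results = []
    for held_out in make_folds(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Selection sees the training samples of this fold only.
        genes = select_genes(X, groups, train, top_n)
        results.append(fit_and_score(train, held_out, genes))
    return results
```

If tumor and adjacent normal samples come from the same patients, it is probably safer to build the folds by patient rather than by sample, so that a patient's normal tissue does not contribute to the DE step of a fold in which their tumor sample is held out.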

If the point at which you perform the DE doesn't make much difference, then you could go back and perform the DE and cross-validation with the original training and testing sets merged.