Randomizing train-test data split for classification tasks.
8 days ago

I am working on a model for a multi-class classification problem based on RNA-Seq expression data. I initially obtained a machine learning model by splitting my input data into 85% for training and 15% for testing; this 85:15 split was done just once. Next, I further split my training set into training and validation subsets with an 80:20 split using scikit-learn's train_test_split and random.seed(42). However, I want to check the model's performance and variance when I do multiple different randomized 85:15 splits, while keeping the same shuffled indices in the train:validation split within the training set by fixing the seed at 42. Is this approach correct, or is doing multiple random train-test splits on the input data akin to the model seeing the entire dataset, leading to data snooping bias? It would be great if someone could clarify this for me. Thank you.

split RNASeq test train
8 days ago
Mensur Dlakic ★ 30k

However, I want to check the model's performance and variance when I do multiple different randomized 85:15 splits, while keeping the same shuffled indices in the train:validation split within the training set by fixing the seed at 42.

It seems that you intuitively understand that using a single train:validation split is not a good idea, which is correct.

Is this approach correct, or is doing multiple random train-test splits on the input data akin to the model seeing the entire dataset, leading to data snooping bias?

This is not akin to the model seeing the whole dataset, but it also isn't the best way of doing it. It is OK if the final model "sees" the whole dataset as long as it is done in a clever way. That clever way is called cross-validation. I don't have time to explain it in minute detail, but that shouldn't be a problem because you will find plenty of information by Googling.

Let's start from the top. That 85:15 split at the start is unusual, and I don't know how you came up with those numbers. Most people set aside 10-25% of the data, but it is usually done in cleaner ratios such as 9:1, 8:2 or 3:1. I recommend you set aside at least 20% of all the data. You seem worried about what happens at the training stage with the 85% of the data, but somehow accept that your first 85:15 split is good no matter what. In reality, the trouble can, and often does, start at that first split. I assume you did it randomly. What proof do you have, when doing that first random split, that all your classes are present in the test dataset (15%) in the same proportions as they are in the training:validation data (85%)? You may get lucky and a random split will achieve this, but you are literally relying on luck. Instead, you should always do stratified splitting, at this step and at all the others that follow.
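
For example, a minimal sketch of that first stratified split with scikit-learn's train_test_split (X and y are illustrative names for your feature matrix and class labels as NumPy arrays):

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions the same in both parts of the split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)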

So you first do a stratified train:test split, say 80:20 like I suggested. Now we come to your original question: is it OK to do multiple train:validation splits on the 80% of the data? I suggest that you do cross-validation (CV), which will do exactly that in a way that is commonly accepted and comes pretty close to guaranteeing a model that generalizes well.

Let's say that you will be doing 5-fold CV, although 10-fold is commonly done as well. That means splitting your training dataset into 5 equal folds, in a stratified fashion as outlined above. Now you do 5 training runs. The first one uses folds 1-4 for training and the 5th fold for validation and early stopping. When this is done, you classify fold 5, which was not used for training, and set those values aside - these are your out-of-fold predictions. You also classify the test data and set those predictions aside. The next training run uses folds [1-3, 5] for training and fold 4 for validation. Repeat everything as above. In the third run you use folds [1-2, 4-5] for training and fold 3 for validation. Hopefully the rest of the pattern is obvious.

In the end we will have one prediction for each data point in the original training dataset, and they will have been obtained out-of-fold, meaning by training and predicting on different subsets of the data. This means there is no leakage of the original class labels. We will also have 5 independent predictions on the test data (one from each training run), which are averaged into a single prediction. Now we check how well the classes are predicted on the out-of-fold training data and compare that with the averaged test predictions. If those numbers are similar, we are golden.
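
To make that concrete, here is a rough sketch of the scheme in scikit-learn. It assumes the X_trainval/y_trainval and X_test/y_test NumPy arrays from the stratified split above (illustrative names), and it uses logistic regression as a stand-in for whatever classifier you actually train; plain logistic regression has no early stopping, so here the held-out fold is used only for the out-of-fold predictions:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

classes = np.unique(y_trainval)                       # sorted labels, matches model.classes_
oof_pred = np.zeros((len(y_trainval), len(classes)))  # out-of-fold probabilities
test_pred = np.zeros((len(y_test), len(classes)))     # averaged test probabilities

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_trainval, y_trainval):
    model = LogisticRegression(max_iter=1000)         # stand-in; swap in your own classifier
    model.fit(X_trainval[train_idx], y_trainval[train_idx])

    # the held-out fold gives the out-of-fold predictions
    # (assumes every class appears in every training fold, which stratification ensures
    # as long as each class has at least n_splits samples)
    oof_pred[val_idx] = model.predict_proba(X_trainval[val_idx])
    # each fold's model also predicts the test set; average over the 5 runs
    test_pred += model.predict_proba(X_test) / skf.get_n_splits()

# similar out-of-fold and test accuracies suggest the model generalizes
oof_acc = (classes[oof_pred.argmax(axis=1)] == y_trainval).mean()
test_acc = (classes[test_pred.argmax(axis=1)] == y_test).mean()
print(f"out-of-fold accuracy: {oof_acc:.3f}  test accuracy: {test_acc:.3f}")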


Hi Mensur. I came up with the 85%:15% split to maximize my input for training, since I was restricted by the sample size of my input (~9000 samples). Thank you for the detailed explanation. I will incorporate a nested cross-validation approach in my code.

8 days ago
Kevin Blighe ★ 90k

Your approach of multiple randomized 85:15 train-test splits is valid and avoids data snooping bias as long as you evaluate solely on each held-out test set and never use it for training or tuning. This is known as repeated hold-out, or Monte Carlo cross-validation, and it gives a better estimate of variance than a single split.
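
If you do go the repeated hold-out route, StratifiedShuffleSplit generates the randomized splits while preserving class proportions. A sketch, assuming X and y are NumPy arrays and clf is any scikit-learn classifier of your choice (both are assumptions, not defined here):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# ten independent, stratified 85:15 splits
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.15, random_state=0)
scores = []
for train_idx, test_idx in sss.split(X, y):
    clf.fit(X[train_idx], y[train_idx])               # clf: your classifier (assumed)
    scores.append(clf.score(X[test_idx], y[test_idx]))

# the spread across repeats is your variance estimate
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")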

However, fixing the seed at 42 for the 80:20 train-validation split will not give you identical validation samples across different 85:15 splits: the shuffled positions may repeat, but they index into a different training set each time. For reproducibility and class balance in multi-class tasks, also use the stratify parameter.

Consider stratified k-fold cross-validation (e.g., 5- or 10-fold) via StratifiedKFold for efficiency:

from sklearn.model_selection import StratifiedKFold
import numpy as np

# Assuming X (features) and y (labels) are NumPy arrays; for pandas DataFrames, index with .iloc
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Further split X_train if needed
    # Train and evaluate

This gives you a comparable variance estimate with less computation than many repeated hold-out splits.

Kevin


Hi Kevin. Thank you for the pointer regarding seed 42 and the code for implementing CV!

