However, I want to check the model's performance and variance when I do multiple different randomized 85:15 splits, while getting the same shuffled indices for the train:validation subset split within the training set by fixing the seed at 42.
It seems that you intuitively understand that using a single train:validation split is not a good idea, which is correct.
Is this approach correct, or is doing multiple random test-train splits on the input data akin to the model seeing the entire dataset and leading to data snooping bias?
This is not akin to the model seeing the whole dataset, but it also isn't the best way of doing it. It is OK if the final model "sees" the whole dataset, as long as it is done in a clever way. That clever way is called cross-validation. I don't have time to explain it in minute detail, but that shouldn't be a problem, because you will find plenty of information by Googling it.
Let's start from the top. That 85:15 split at the start is unusual, and I don't know how you came up with those numbers. Most people set aside 10-25% of the data, but it is usually done in cleaner ratios such as 9:1, 8:2 or 3:1. I recommend you set aside at least 20% of all the data. You seem worried about what happens at the training stage with the 85% of the data, but you somehow accept that your first 85:15 split is good no matter what. In reality, the trouble can, and often does, start at that first split. I assume you did it randomly. What proof do you have, when doing that first random split, that all your classes are present in the test dataset (15%) in the same proportions as they are in the training:validation data (85%)? You may get lucky and a random split will achieve this, but you are literally relying on luck. Instead, you should always do stratified splitting, at this step and at all the steps that follow.
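As a minimal sketch of what a stratified split looks like with scikit-learn (the toy data, class balance, and variable names are just placeholders for whatever you actually have):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for your ~9000-sample dataset (shapes and class balance are made up).
X, y = make_classification(n_samples=9000, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Stratified 80:20 train:test split - class proportions are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
```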
So you first do a stratified train:test split, say 80:20 like I suggested. Now we come to your original question: is it OK to do multiple train:validation splits on that 80% of the data? I suggest that you do cross-validation (CV), which does exactly that in a way that is commonly accepted and comes pretty close to guaranteeing a model that generalizes well.
Let's say you will be doing 5-fold CV, although 10-fold is commonly done as well. That means splitting your training dataset into 5 equal folds, in a stratified fashion as outlined above. Now you do 5 training runs. The first one uses folds 1-4 for training and the 5th fold for validation and early stopping. When this is done, you classify fold 5, which was not used for training, and set those predictions aside - these are your out-of-fold predictions. You also classify the test data and set those predictions aside. The next training run is done on folds [1-3, 5] and validation is done on fold 4. Repeat everything as above. In the third run you use folds [1-2, 4-5] for training and fold 3 for validation. Hopefully the rest of the pattern is obvious.
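To make that loop concrete, here is a rough sketch with StratifiedKFold, continuing from the split above (the RandomForestClassifier is only a placeholder, not a recommendation for your problem):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Continuing from the stratified split above (X_train, y_train, X_test).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_preds = np.zeros(len(y_train), dtype=int)   # one out-of-fold prediction per training sample
test_preds = []                                 # one set of test-set probabilities per fold

for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
    clf = RandomForestClassifier(random_state=fold)   # placeholder model - substitute your own
    clf.fit(X_train[tr_idx], y_train[tr_idx])
    # (With a model that supports early stopping, you would also pass the validation fold
    #  X_train[val_idx], y_train[val_idx] for that purpose.)

    # Classify the held-out fold - these samples were not seen during this fit.
    oof_preds[val_idx] = clf.predict(X_train[val_idx])

    # Classify the untouched test set with this fold's model and keep it for averaging later.
    test_preds.append(clf.predict_proba(X_test))
```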
In the end we will have 1 out-of-fold prediction for each data point in the original training dataset, obtained by training and predicting on different subsets of the data, which means there is no leakage of the original class labels. We will also have 5 independent predictions (one from each training run) on the test data, which are averaged into a single prediction. Now we check how well the classes are predicted on the out-of-fold training data and compare that with the averaged predictions on the test data. If those numbers are similar, we are golden.
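Continuing the sketch above, the averaging and the comparison could look like this (accuracy is just a stand-in for whatever metric fits your problem):

```python
from sklearn.metrics import accuracy_score

# Average the 5 sets of test-set class probabilities, then pick the most probable class
# (column order follows clf.classes_, which is 0..n_classes-1 for this toy data).
avg_test_proba = np.mean(test_preds, axis=0)
avg_test_pred = np.argmax(avg_test_proba, axis=1)

oof_acc = accuracy_score(y_train, oof_preds)      # out-of-fold performance on the training data
test_acc = accuracy_score(y_test, avg_test_pred)  # performance of the averaged prediction on the test set

print(f"out-of-fold accuracy: {oof_acc:.3f}  test accuracy: {test_acc:.3f}")
```

If the two numbers are close, no single split is doing the heavy lifting and the model is likely to generalize.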
Hi Mensur. I came up with the 85:15 split to maximize the data available for training, since I was restricted by my sample size (~9000 samples). Thank you for the detailed explanation. I will incorporate the nested cross-validation approach in my code.