Question: How to split dataset as train and test data not randomly in Python
necnec wrote, 11 weeks ago:

How can I split my dataset into training and test sets by deciding myself which data points go into the training set and which into the test set? I do not want Python to select them randomly; I want the user to decide. Is that possible in Python?

I have a small dataset (20 data points grouped into two classes: 10 in class-1 and 10 in class-2), with 30 features each. I have a second dataset which is even smaller (10 data points, again grouped into two classes). I want to build my model using the first dataset and then use the second (smaller) dataset to validate the model externally. The aim is to see how accurate the model is on new data, which is why I don't want to mix the two datasets.
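To illustrate what a deliberate (non-random) split looks like, here is a minimal sketch using hand-picked indices; the arrays and index lists are placeholders for your real data, not part of the question:

```python
# A hand-picked, non-random train/test split. X and y here are toy
# NumPy arrays standing in for real data; the index lists are
# placeholders for whichever points you choose yourself.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 30))   # 20 data points, 30 features
y = np.repeat([0, 1], 10)       # two balanced classes

train_idx = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]  # chosen by you, not randomly
test_idx = [i for i in range(20) if i not in train_idx]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (10, 30) (10, 30)
```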

Thanks in advance.

Tags: machine learning, python
modified 11 weeks ago by Mensur Dlakic • written 11 weeks ago by necnec

I don't think you should manually decide which data points go into the test vs. training set; doesn't that defeat the purpose of training an algorithm?

But ok. What do you want to separate on?

written 11 weeks ago by NRC

Hi NRC,

Thank you for the reply. Maybe I didn't explain my problem clearly, as I am new to this area. I have a small dataset (20 data points grouped into two classes: 10 in class-1 and 10 in class-2), with 30 features each. I have a second dataset which is even smaller (10 data points, again grouped into two classes). I want to build my model using the first dataset and then use the second (smaller) dataset to validate the model externally. The aim is to see how accurate the model is on new data, which is why I don't want to mix the two datasets. I hope it is clear now; please let me know if it is not.
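What you describe needs no splitting function at all: keep the two datasets in separate arrays, fit on the first, and score on the second. A minimal sketch, with toy arrays and a logistic-regression model standing in as assumptions for your real data and classifier:

```python
# Train on one dataset and validate on the other, with no random
# splitting. The arrays and the model choice are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 30))  # first dataset: 20 points, 30 features
y_train = np.repeat([0, 1], 10)
X_test = rng.normal(size=(10, 30))   # second dataset: external validation
y_test = np.repeat([0, 1], 5)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)              # train only on the first dataset
accuracy = model.score(X_test, y_test)   # evaluate only on the second
print(round(accuracy, 3))
```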

written 11 weeks ago by necnec
Mensur Dlakic (USA) wrote, 11 weeks ago:

Maybe you already know this, but I will say it just in case: it is unlikely that you will be able to make a model that will generalize well based on 20 data points.

With such small datasets, a commonly used approach is leave-one-out (LOO) cross-validation (CV). It is a special case of N-fold CV where the number of folds equals the number of data points. In your case, that means taking 1 data point out of 20 to use for internal validation and training on the remaining 19. Repeat that another 19 times, each time holding out a different data point for internal validation. You will have 20 models when you are done, which means 20 predictions on your external validation data; average those to get a final prediction. Scikit-learn's model selection module has a LeaveOneOut class that will automate most of this process for you.
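The scheme above can be sketched as follows; the toy arrays and the logistic-regression model are assumptions standing in for the real data and whatever classifier is actually used:

```python
# LOO-style ensemble: train 20 models, each missing one data point,
# then average their predictions on the external validation set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 30))   # first dataset: 20 points, 30 features
y_train = np.repeat([0, 1], 10)       # two balanced classes
X_external = rng.normal(size=(10, 30))  # external validation set

loo = LeaveOneOut()
external_probs = []
for fit_idx, _ in loo.split(X_train):  # 20 splits, each leaving one point out
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    external_probs.append(model.predict_proba(X_external)[:, 1])

# Average the 20 models' class-1 probabilities for the final prediction
mean_probs = np.mean(external_probs, axis=0)
final_pred = (mean_probs >= 0.5).astype(int)
print(final_pred.shape)  # (10,)
```

The held-out point in each split can also be scored to estimate internal CV accuracy, but with only 20 points that estimate will be noisy.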

written 11 weeks ago by Mensur Dlakic
Powered by Biostar version 2.3.0