splitting test and train data
13 months ago

Hello everyone. I have a data set containing 1406 class 0 and 1406 class 1 instances. I want to split it into training and test sets with Python's sklearn library, and I want the training set to remain balanced after splitting. Does sklearn handle this, or do I need to do it myself? I would appreciate your help.

machine learning python scikit-learn • 574 views

If you are splitting proteins into a training and test set you also want to eliminate pairs of homologous proteins across the training/test set, otherwise you might just end up learning how to recognise homology. As an absolute minimum there shouldn't be any sequences in the test set with >30% sequence identity to the training set (e.g. using blastclust). However it is better to split taking into account evolutionary classifications such as ECOD/CATH, as proteins can be homologous below 30% sequence identity. See https://www.nature.com/articles/s41580-019-0176-5 for more.
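
One way to keep related proteins on the same side of the split is a group-aware splitter. The sketch below is only an illustration under stated assumptions: the data are synthetic, and the `clusters` array is a hypothetical stand-in for cluster or superfamily IDs (for example from a clustering tool or an ECOD/CATH assignment) that you would have to supply yourself. It uses sklearn's `GroupShuffleSplit`, which never places members of one group in both sets.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 proteins, 20 composition features each
n = 200
X = rng.random((n, 20))
y = rng.integers(0, 2, size=n)

# Hypothetical cluster IDs (assumption): proteins sharing an ID are
# treated as homologous and must stay on the same side of the split
clusters = rng.integers(0, 40, size=n)

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=clusters))

# No cluster should appear in both the training and the test set
shared = set(clusters[train_idx]) & set(clusters[test_idx])
print(len(train_idx), len(test_idx), len(shared))
```

Note that `GroupShuffleSplit` does not stratify by class, so the class balance of each set should be checked afterwards. `test_size` here refers to the proportion of groups, not of individual proteins.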


I have already removed similar sequences with CD-HIT to reduce redundancy, using a 40% cutoff. But I will read the article. Thank you so much for your valuable help.


Is this a bioinformatics question? It's not obvious what your data are from the description "class 0" and "class 1".


I'm sorry, I should have mentioned what my data are. Yes, it's a bioinformatics question. My classes represent thermophilic and mesophilic proteins, and each protein is described by a feature vector of length 20 (its amino acid composition).


Can you also define what you mean by having 'balanced' data sets in this context?


If you have more instances of the thermophilic class than of the other class (here, the mesophilic class), your results will be biased toward the majority class (in this example, the thermophilic proteins). Hence, to obtain reliable results, you should balance your data set before training. You can read more here


I see, you mean balanced in terms of pure numbers.

I'm no ML expert, but intuitively I would assume you can simply randomly choose an equal number from each class since your input data is already balanced?


Yes, it's already balanced, but I'm not sure whether it will remain balanced after splitting. I could do the split myself, but I want to do it with Python's scikit-learn library, and I'm not sure whether scikit-learn handles this.


You can follow this link here and look for the response by Guiem Bosch. If you try that, it might work. I would have tested it myself, but you did not provide a small example of the code and data that you tried.


Yes, you can split your train and test data with sklearn. https://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/

You can check the above site for many other examples with code

13 months ago
Mensur Dlakic ★ 11k

Any class with the word Stratified in its name in sklearn's model_selection module can be used for this purpose. Since you are starting from a perfectly balanced dataset, you are almost guaranteed to end up with (nearly) balanced train and test datasets even with a plain random split.
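
For a single train/test split, a minimal sketch of the stratified approach could look like the following. It uses `train_test_split` with its `stratify` argument. The feature matrix is random synthetic data standing in for the real 20-dimensional amino acid compositions, and the 80/20 ratio and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 2812 proteins x 20 features,
# with 1406 instances of class 0 and 1406 of class 1
X = rng.random((2812, 20))
y = np.array([0] * 1406 + [1] * 1406)

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Per-class counts in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Because the subset sizes can be odd numbers, the two classes may differ by one instance within a subset.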

This can be done manually as well, by sorting your data numerically and putting odd and even lines into separate files, which gives a 50/50 split. The only requirement is that the class label is in the first column.
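
The manual approach can be sketched in plain Python as below. This is an illustration only: the rows are synthetic, each row is assumed to hold the class label first and the feature vector second, and the sort is done on the class label alone rather than on the whole line.

```python
import random

rnd = random.Random(0)

# Synthetic rows: (class_label, feature_vector), class label first
rows = [(label, [rnd.random() for _ in range(20)])
        for label in [0] * 1406 + [1] * 1406]
rnd.shuffle(rows)  # simulate an arbitrary input order

# Sort by class label so each class forms one contiguous block
rows.sort(key=lambda row: row[0])

# Alternate rows between the two sets (odd vs. even lines)
train = rows[0::2]
test = rows[1::2]

train_class0 = sum(1 for label, _ in train if label == 0)
train_class1 = sum(1 for label, _ in train if label == 1)
print(len(train), len(test), train_class0, train_class1)
```

With an even number of instances per class, as here, each half receives exactly the same number of rows from both classes.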


Thank you so much. Yes, you are right. Because my data set is balanced, scikit-learn gave me balanced sets after splitting, even with random splitting.

13 months ago

To make the train and test data follow the class proportions of the original data, you can use the StratifiedShuffleSplit class.
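
A minimal sketch of how this might be used, assuming synthetic data in place of the real feature vectors and an arbitrary 80/20 split:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 1406 instances per class, 20 features each
X = rng.random((2812, 20))
y = np.array([0] * 1406 + [1] * 1406)

# n_splits=1 yields a single stratified train/test partition
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(np.bincount(y_train), np.bincount(y_test))
```

Setting `n_splits` to a larger value and looping over `sss.split(X, y)` gives several independent stratified splits.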


I appreciate your help and response.