Question

splitting test and train data

0

4.1 years ago

saber mohammadi ▴ 20

Hello everyone. I have a data set containing 1406 class 0 and 1406 class 1 instances. I want to split it into training and test sets with Python's scikit-learn (sklearn) library, and I want the training set to stay balanced after splitting. Does sklearn handle this, or do I have to take care of it myself? I would appreciate your help.

machine learning python scikit-learn • 2.6k views

ADD COMMENT • link updated 4.1 years ago by Mensur Dlakic ★ 27k • written 4.1 years ago by saber mohammadi ▴ 20

2

If you are splitting proteins into a training and test set you also want to eliminate pairs of homologous proteins across the training/test set, otherwise you might just end up learning how to recognise homology. As an absolute minimum there shouldn't be any sequences in the test set with >30% sequence identity to the training set (e.g. using blastclust). However it is better to split taking into account evolutionary classifications such as ECOD/CATH, as proteins can be homologous below 30% sequence identity. See https://www.nature.com/articles/s41580-019-0176-5 for more.
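As a rough sketch of the idea above, scikit-learn's GroupShuffleSplit can keep every member of a cluster on the same side of the split. The cluster labels here are made up for illustration; in practice they would have to come from a sequence clustering or an ECOD/CATH assignment, which this example does not compute. Note that test_size refers to the fraction of clusters, not of proteins, and that this splitter does not balance the classes by itself.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 200
X = rng.random((n, 20))                     # placeholder feature vectors
y = rng.integers(0, 2, size=n)              # placeholder class labels
cluster_ids = rng.integers(0, 40, size=n)   # hypothetical cluster/family labels

# Whole clusters go either to train or to test, never to both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=cluster_ids))

# Clusters that appear on both sides of the split (should be none).
shared = set(cluster_ids[train_idx]) & set(cluster_ids[test_idx])
print("clusters shared between train and test:", len(shared))
```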

ADD REPLY • link 4.1 years ago by jgreener ▴ 390

0

I have already removed similar sequences with CD-HIT to reduce redundancy, using a 40% cutoff. But I will read the article. Thank you so much for your valuable help.

ADD REPLY • link 4.1 years ago by saber mohammadi ▴ 20

0

Is this a bioinformatics question? It's not obvious what your data are from the description "class 0" and "class 1".

ADD REPLY • link 4.1 years ago by Joe 21k

0

I'm sorry, I should have mentioned the type of my data. Yes, it's a bioinformatics question. My classes represent thermophilic and mesophilic proteins, and each protein is described by a feature vector of length 20 (its amino acid composition).
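For readers unfamiliar with this kind of feature vector, here is a minimal sketch of how a 20-value amino acid composition could be computed. The function name, the residue ordering, and the example peptide are assumptions made for illustration, not the code actually used for this data set.

```python
from collections import Counter

# The 20 standard amino acids; this ordering is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Non-standard characters (X, B, gaps, ...) are ignored in the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

features = aa_composition("MKVLAAGIVGLL")  # made-up peptide
print(len(features))
```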

ADD REPLY • link 4.1 years ago by saber mohammadi ▴ 20

0

Can you also define what you mean by having 'balanced' data sets in this context?

ADD REPLY • link 4.1 years ago by Joe 21k

0

If you have more instances of the thermophilic class than of the other class (here, the mesophilic class), your results will be biased toward the majority class (in this example, toward the thermophilic proteins). Hence, to obtain reliable results you should balance your data set before training. You can read more here

ADD REPLY • link 4.1 years ago by saber mohammadi ▴ 20

0

I see, you mean balanced in terms of pure numbers.

I'm no ML expert, but intuitively I would assume you can simply randomly choose an equal number from each class since your input data is already balanced?

ADD REPLY • link 4.1 years ago by Joe 21k

0

Yes, it's already balanced, but I'm not sure whether it will remain balanced after splitting. I could do the split myself, but I want to do it with Python's scikit-learn library, and I'm not sure whether scikit-learn handles this issue.

ADD REPLY • link 4.1 years ago by saber mohammadi ▴ 20

0

You can follow this link here and look for the response by Guiem Bosch; that approach might work. I would have tested it, but you did not provide a small example of the code and data that you tried.

ADD REPLY • link 4.1 years ago by botloggy ▴ 10

0

Yes, you can split your data into training and test sets with sklearn: https://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/

The site above has many other examples with code.

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

score 1 · Answer 1 · 2020-03-31

1

4.1 years ago

Mensur Dlakic ★ 27k

Any class in sklearn's model_selection module with the word Stratified in its name can be used for this purpose. Since you are starting from a perfectly balanced data set, even a purely random split will almost certainly give you train and test sets that are close to balanced.
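For a single split, the stratify argument of train_test_split is probably the shortest route. The sketch below uses random numbers as a stand-in for the real feature matrix; the 80/20 ratio and the seeds are arbitrary choices. Because the resulting subset sizes are odd here, the two classes can differ by one instance in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 2812 proteins x 20 features.
rng = np.random.default_rng(0)
X = rng.random((2812, 20))
y = np.array([0] * 1406 + [1] * 1406)  # 1406 instances per class

# stratify=y preserves the class ratio of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```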

This can also be done manually by sorting your data numerically and putting the odd and even lines into separate files. The only requirement is that the class label is in the first column.
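A pure-Python sketch of that manual approach, working on in-memory rows rather than files. The rows are randomly generated placeholders, and the sort here uses only the class column (an assumption; sorting whole lines numerically would work too as long as the class comes first). With an even number of rows per class, the result is an exact 50/50 split with equal class counts in both halves.

```python
import random

random.seed(0)

# Placeholder rows: class label in the first column, then 20 feature values.
rows = [[label] + [random.random() for _ in range(20)]
        for label in [0] * 1406 + [1] * 1406]
random.shuffle(rows)

# Sort on the class column so each class forms one contiguous block.
# list.sort is stable, so the shuffled order within a class is preserved.
rows.sort(key=lambda row: row[0])

# Alternate rows between the two subsets (the odd and even lines).
train = rows[0::2]
test = rows[1::2]

print(sum(row[0] == 0 for row in train), sum(row[0] == 1 for row in train))
print(sum(row[0] == 0 for row in test), sum(row[0] == 1 for row in test))
```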

ADD COMMENT • link 4.1 years ago by Mensur Dlakic ★ 27k

0

Thank you so much. Yes, you are right. Because my data set is balanced, scikit-learn gave me balanced sets after splitting, even with a random split.

ADD REPLY • link 4.1 years ago by saber mohammadi ▴ 20

score 0 · Answer 2 · 2020-03-31

0

4.1 years ago

Arup Ghosh 3.2k

To make the train and test sets follow the class proportions of the original data, you can use the StratifiedShuffleSplit class.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
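A short usage sketch, with random numbers standing in for the real features; the number of splits, the test fraction, and the seed are arbitrary. Unlike a single train/test split, this yields several independent stratified splits as index arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real data: 2812 proteins x 20 features.
rng = np.random.default_rng(0)
X = rng.random((2812, 20))
y = np.array([0] * 1406 + [1] * 1406)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(splitter.split(X, y))

for i, (train_idx, test_idx) in enumerate(splits):
    # Class counts in each subset; they should match to within one instance.
    print(i, np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```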