Question

XGBoost on the unbalanced data


3.0 years ago

mrashad ▴ 80

Dear all, I am new to machine learning and am trying to apply XGBoost to find the feature importance and plot the ROC (AUC) curve for my data, but the samples are unbalanced: the control group has 24 samples while the diseased group has 153. I tried downsampling the diseased class, but I don't know whether to downsample the whole dataset before splitting it into training and testing sets, or after the split. If after, should I downsample the testing data or the training data, and why?

I hope someone can explain this to me and point me to some informative tutorials. Regards,

unbalanced_data • 3.2k views

ADD COMMENT • link 3.0 years ago by mrashad ▴ 80

score 1 · Answer 1 · 2021-04-20


3.0 years ago

Mensur Dlakic ★ 27k

In my experience, datasets with an imbalance factor smaller than 10 are usually not a problem, and you can train on them almost without any adjustment. I think a bigger problem in your case is that you simply don't have enough data points.

If you still want to account for the imbalance, I would not do it by downsampling, because you would be throwing away valuable data points and you are already short on data. XGBoost has a parameter called scale_pos_weight which re-weights the positive class according to the ratio of the two classes. Specifically, it is the [negative / positive] ratio, so if your control is labeled 0 and disease is labeled 1, this ratio would be 24 / 153 = 0.157, which down-weights the disease samples. And yes, AUC is the best metric to monitor for imbalanced datasets.
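As a rough illustration of the arithmetic above, the following sketch computes the imbalance factor and the scale_pos_weight value from a label vector built to match the counts in the question. The helper names class_ratios and make_classifier are made up for this example, and the label vector is a stand-in for the real data. The XGBoost import is kept inside make_classifier so the ratio calculation runs even without xgboost installed.

```python
from collections import Counter


def class_ratios(labels, positive=1):
    """Return (imbalance_factor, scale_pos_weight) for binary labels.

    Assumes both classes are present in `labels`.
    """
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    # Majority / minority, the "imbalance factor" mentioned above.
    imbalance_factor = max(n_pos, n_neg) / min(n_pos, n_neg)
    # negative / positive, the ratio used for scale_pos_weight.
    scale_pos_weight = n_neg / n_pos
    return imbalance_factor, scale_pos_weight


# Stand-in labels with the counts from the question:
# 24 controls (label 0) and 153 diseased (label 1).
y = [0] * 24 + [1] * 153
imbalance_factor, spw = class_ratios(y)
print(f"imbalance factor = {imbalance_factor:.3f}, scale_pos_weight = {spw:.3f}")


def make_classifier(scale_pos_weight):
    """Build an XGBoost classifier that re-weights the positive class."""
    from xgboost import XGBClassifier  # lazy import; requires xgboost

    return XGBClassifier(scale_pos_weight=scale_pos_weight, eval_metric="auc")
```

With these counts the imbalance factor is below 10, which fits the point that imbalance is probably not the main issue here. Calling make_classifier(spw) would then give a model to fit on the training split only.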

ADD COMMENT • link 3.0 years ago by Mensur Dlakic ★ 27k


Tutorial:

https://machinelearningmastery.com/xgboost-for-imbalanced-classification/

ADD REPLY • link 3.0 years ago by Mensur Dlakic ★ 27k

Thank you very much for this tutorial. I used it, but in the first trial on my data the AUC was around 0.78, which looks like a poor fit, while on the next try the AUC was 1, which looks like overfitting. I tried many more times and it is still 1. Do you have any explanation, and how can I overcome this overfitting? Is it a good idea to use another model? If yes, what do you recommend? Or should I just be satisfied with the first result, 0.78?

ADD REPLY • link 3.0 years ago by mrashad ▴ 80

I have no idea what exactly you have done, so it is impossible to give you meaningful advice. Did you use scale_pos_weight? Did you try to vary its value? Did you perform parameter tuning? It is possible to get a classifier that legitimately has AUC=1, as your classes may be very different from each other and therefore easy to classify. So we don't even know with certainty that you are overfitting.

If you followed the tutorial, there are guidelines in it against overfitting. What would certainly help you is to collect more data of both classes, but especially for control cases.

ADD REPLY • link 3.0 years ago by Mensur Dlakic ★ 27k

I followed the tutorial and the results are the same, with the AUC nearly 1, but when I used this tutorial: https://www.kaggle.com/saxinou/imbalanced-data-xgboost-tunning , the AUC decreased. The difference seems to come mainly from changes in the XGBoost parameters.

ADD REPLY • link 3.0 years ago by mrashad ▴ 80

I would like to ask an extra question, please: should the random state used for splitting the data be equal to the random state of the classifier?

ADD REPLY • link 3.0 years ago by mrashad ▴ 80

score 0 · Answer 2 · 2021-04-21


3.0 years ago

Alexander ▴ 220

The task sounds pretty balanced; IMHO the problem is the very small sample size (24+153), which most probably causes overfitting. Boosting is not the first thing to try in that situation. It is better to try simpler linear models, drop low-predictive features at the preprocessing step, and maybe bin them, doing everything possible to simplify the features so as not to overfit.

As a check, generate completely random features, train your model on them several times, and look at the scores obtained. If the score of the real model is not far from the random ones (i.e. within 1-2 standard deviations), think about overfitting.
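The random-feature check could be sketched roughly as below. This is only an outline under some assumptions: train_and_score is a hypothetical callable you supply that takes a feature matrix and labels and returns a score (for example a cross-validated AUC), the noise is Gaussian, and the comparison is one-sided (the real score is flagged if it is at most the noise mean plus k standard deviations).

```python
import random
import statistics


def random_feature_scores(train_and_score, labels, n_features, n_repeats=20, seed=0):
    """Score the model several times on pure-noise features."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        # Fresh Gaussian noise: one row per sample, n_features columns.
        X_noise = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in labels]
        scores.append(train_and_score(X_noise, labels))
    return scores


def close_to_random(real_score, noise_scores, k=2.0):
    """True if the real score is no better than noise mean + k * std."""
    mean = statistics.mean(noise_scores)
    std = statistics.stdev(noise_scores)
    return real_score <= mean + k * std
```

If close_to_random returns True for the score obtained on the real features, the model is not clearly doing better than it does on noise, which is the warning sign described above.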

ADD COMMENT • link 3.0 years ago by Alexander ▴ 220

"The task sounds pretty balanced"

When one has a 51:49 ratio of classes, or 55:45, or even 60:40, we can call that pretty balanced. A 153:24 ratio is definitely imbalanced, though that is likely not the biggest problem.

ADD REPLY • link 3.0 years ago by Mensur Dlakic ★ 27k