I have run a series of simulations using Recursive Feature Elimination (RFE) on random forests (RF) and obtained some puzzling results. When I run RFE with cross-validation (CV), its highest F1 score is sometimes lower than the F1 score I obtain from running RFE first and then CV with the same number of folds.
For instance, RFE with CV says the optimal number of features is 90 and its F1 score is 75%, whereas if I independently run RFE selecting only the top 20 features, with the same CV its F1 score is 87%. Why would that happen?
Thank you in advance.
Thank you for your comments. I have edited the original post to make it more accessible.
To make sure we are on the same page: when you're talking about a scoring function, are you referring to these?
If so, in RFECV I have set scoring to "f1_macro". For RFE, I first run it, then create a subset of the top 20 features, and then train a random forest model on that subset, with the same CV as in RFECV and scoring set to "f1_macro". I checked, and all the other parameters are the same, hence my confusion.
Yes, I was referring to those scoring functions. Generally speaking, I would trust RFECV, because it selects the features while doing cross-validation, as opposed to cross-validating already-selected features. I don't know the reason for the difference, but I suspect that even though you think "all the other parameters are the same", they possibly are not.
Also, what happens with F1 if you take the 90 RFECV-selected features and train an RF model using the same seed for CV? That would be the only way to compare apples to apples.

I checked again, and indeed the parameters were the same (estimator = RandomForestClassifier, scoring, random state, and the type and seed of the cross-validation (CV)). I discovered why I was getting higher scores for RFE.
For RFECV, I did the following:
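A minimal sketch of that setup (the seed and fold count are illustrative; X and y stand for my feature matrix and labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# RFECV eliminates features recursively and cross-validates every feature count
rfecv = RFECV(estimator=rf, step=1, cv=cv, scoring="f1_macro")
rfecv.fit(X, y)
```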
And then I asked for the optimal number of features with rfecv.n_features_. Based on that number of features, I obtained the highest F1 score.
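A sketch of how I read off those two values, assuming a recent scikit-learn where cv_results_ has replaced the older grid_scores_ attribute:

```python
print(rfecv.n_features_)  # optimal number of features, e.g. 90

# Mean cross-validated F1 score for each candidate number of features;
# its maximum is the "highest F1 score" RFECV reports
mean_scores = rfecv.cv_results_["mean_test_score"]
print(mean_scores.max())
```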
For RFE, I did something similar:
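Again a sketch, reusing the same estimator but fixing the number of features at 20:

```python
from sklearn.feature_selection import RFE

rfe = RFE(estimator=rf, n_features_to_select=20, step=1)
rfe.fit(X, y)
```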
Then, the extra step for RFE was that I reduced X to the selected features.
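A sketch of that reduction step, using the fitted rfe object from above:

```python
# Keep only the selected columns (equivalently, X[:, rfe.support_] for an array)
X_reduced = rfe.transform(X)
```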
And then I obtained the F1 score.
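Something along these lines, with the same cv object and scoring as before:

```python
from sklearn.model_selection import cross_val_score

# Cross-validate the random forest on the reduced feature set,
# using the same folds and scoring as in RFECV
scores = cross_val_score(rf, X_reduced, y, cv=cv, scoring="f1_macro")
print(scores.mean())  # the F1 score I reported, e.g. 0.87
```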
Furthermore, I know it may seem obvious, but I also confirmed that the features selected by RFECV were consistent with the ones selected by RFE. For instance, if RFECV selected 25 features, and I asked RFE to give me 10 or 20, these were all a subset of the optimal RFECV features.
My questions now are: Why do I get a higher F1 score when I reduce X to the optimal features selected by RFECV than when I just fit RFECV to X and y? And would it be better to report the highest F1 score obtained by RFECV, or the one obtained after retraining my random forest model on X reduced to the selected RFECV features? If I train a random forest classifier on X reduced to the optimal features, I get a higher F1 score.
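For completeness, this is the apples-to-apples check suggested above, sketched with the fitted rfecv object and the same cv folds:

```python
# Reduce X to the features RFECV considered optimal, then cross-validate again
X_opt = rfecv.transform(X)
refit_scores = cross_val_score(rf, X_opt, y, cv=cv, scoring="f1_macro")
print(refit_scores.mean())  # comes out higher than RFECV's own best mean score
```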
Thank you in advance.