I have run a series of simulations using Recursive Feature Elimination (RFE) on random forests (RF) and obtained some puzzling results. When I run RFE with cross-validation (CV), its highest F1 score is sometimes lower than the F1 score I obtain from running RFE first and then CV with the same number of folds.
For instance, RFE with CV says the optimal number of features is 90 and its F1 score is 75%, whereas if I independently run RFE selecting only the top 20 features, with the same CV its F1 score is 87%. Why would that happen?
Thank you in advance.
Thank you for your comments. I have edited the original post to make it more accessible.
To make sure we are on the same page: when you're talking about a scoring function, are you referring to these?
If so, in RFECV I have set scoring to "f1_macro". For RFE, I first run it, then create a subset of the top 20 features, and then train a random forest model on that subset, with the same CV as in RFECV and scoring set to "f1_macro". I checked, and all the other parameters are the same, hence my confusion.
Yes, I was referring to those scoring functions. Generally speaking, I would trust RFECV, because it selects the features while doing cross-validation, as opposed to cross-validating already-selected features. I don't know the reason for the difference, but I suspect that even though you think "all the other parameters are the same", they possibly are not.
Also, what happens with F1 if you take the 90 RFECV-selected features and train an RF model using the same seed for CV? That would be the only way to compare apples to apples.

I checked again, and indeed the parameters were the same (estimator = RandomForestClassifier, scoring, random state, and the type and seed of the cross-validation (CV)). I discovered why I was getting higher scores for RFE.
For RFECV, I did the following:
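A minimal sketch of that setup (the seed and fold count are illustrative; X and y stand for my feature matrix and labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# RFECV eliminates features recursively and cross-validates every feature count
rfecv = RFECV(estimator=rf, step=1, cv=cv, scoring="f1_macro")
rfecv.fit(X, y)
```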
And then I asked for the optimal number of features with rfecv.n_features_. Based on that number of features, I obtained the highest F1 score.
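A sketch of how I read off those two values, assuming a recent scikit-learn where cv_results_ has replaced the older grid_scores_ attribute:

```python
print(rfecv.n_features_)  # optimal number of features, e.g. 90

# Mean cross-validated F1 score for each candidate number of features;
# its maximum is the "highest F1 score" RFECV reports
mean_scores = rfecv.cv_results_["mean_test_score"]
print(mean_scores.max())
```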
For RFE, I did something similar:
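Again a sketch, reusing the same estimator but fixing the number of features at 20:

```python
from sklearn.feature_selection import RFE

rfe = RFE(estimator=rf, n_features_to_select=20, step=1)
rfe.fit(X, y)
```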
Then, the extra step for RFE was that I reduced X to the selected features.
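A sketch of that reduction step, using the fitted rfe object from above:

```python
# Keep only the selected columns (equivalently, X[:, rfe.support_] for an array)
X_reduced = rfe.transform(X)
```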
And then I obtained the F1 score.
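Something along these lines, with the same cv object and scoring as before:

```python
from sklearn.model_selection import cross_val_score

# Cross-validate the random forest on the reduced feature set,
# using the same folds and scoring as in RFECV
scores = cross_val_score(rf, X_reduced, y, cv=cv, scoring="f1_macro")
print(scores.mean())  # the F1 score I reported, e.g. 0.87
```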
Furthermore, I know it may seem obvious, but I also confirmed that the features selected by RFECV were consistent with the ones selected by RFE. For instance, if RFECV selected 25 features, and I asked RFE to give me 10 or 20, these were all a subset of the optimal RFECV features.
My questions now are: Why do I get a higher F1 score when I reduce X to the optimal features selected by RFECV than when I just fit RFECV to X and y? And would it be better to report the highest F1 score obtained by RFECV, or the one obtained after retraining my random forest model on X reduced to the selected RFECV features? If I train a random forest classifier on X reduced to the optimal features, I get a higher F1 score.
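For completeness, this is the apples-to-apples check suggested above, sketched with the fitted rfecv object and the same cv folds:

```python
# Reduce X to the features RFECV considered optimal, then cross-validate again
X_opt = rfecv.transform(X)
refit_scores = cross_val_score(rf, X_opt, y, cv=cv, scoring="f1_macro")
print(refit_scores.mean())  # comes out higher than RFECV's own best mean score
```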
Thank you in advance.