How to implement k-fold CV with randomForest
25 days ago
pt.taklifi ▴ 60

Hi everyone,

I have high-dimensional data (800 samples labeled "cancer" or "healthy", with 10,000 features per sample), so I decided to use randomForest for classification and to report the most important features based on the randomForest importance measure. I am wondering how to implement k-fold cross-validation for this purpose. Should I

1) run k-fold CV on all my data and, for each feature, report the average importance that randomForest assigns across the folds, or
2) split the data into train and test sets, build the randomForest model, and then run k-fold CV on the training set only? In that case, how should I choose the most important features?

Thank you in advance.

Tags: k_fold_cv, R, randomForest
25 days ago
Mensur Dlakic ★ 11k

Your setup has what is known as the curse of dimensionality: the number of measured features is much greater than the number of recorded samples. It is also known as the p >> n problem (where p is the number of features and n the number of observations). In matrix terms, your data has too many columns and not enough rows. That means you either have to lengthen one axis of the matrix (collect more samples) or shorten the other (eliminate some features, in your case many of them).

You seem to have a good general idea, but for such underdetermined problems it is usually better to avoid complex methods (RFs qualify), as they tend to overfit. For an ill-posed problem like this (10,000 >> 800) there are numerous solutions with the same error on the training data, and it is harder for a complex model to find one that generalizes; the solution it finds is more likely to overfit unless extreme care is applied. If you want to work with decision trees / random forests, I suggest shallow trees with depth not exceeding 2-3, along with a large number of folds and a small number of iterations.
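To make the shallow-trees-plus-CV idea concrete, here is a minimal sketch in Python/scikit-learn (the same pattern applies to R's randomForest via the `maxnodes` argument); it implements option 1 from the question, averaging importances across folds. The synthetic data, fold count, and depth limit are all illustrative, not a prescription:

```python
# Sketch: k-fold CV with a shallow random forest, averaging feature
# importances across folds. Synthetic data stands in for the real
# 800 x 10,000 matrix; all parameter values are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
importances = np.zeros(X.shape[1])
accuracies = []

for train_idx, test_idx in kfold.split(X, y):
    # Shallow trees (max_depth=3) to limit overfitting on p >> n data
    rf = RandomForestClassifier(n_estimators=200, max_depth=3,
                                random_state=0, n_jobs=-1)
    rf.fit(X[train_idx], y[train_idx])
    importances += rf.feature_importances_
    accuracies.append(rf.score(X[test_idx], y[test_idx]))

importances /= kfold.get_n_splits()           # average over folds
top_features = np.argsort(importances)[::-1][:20]
print(f"mean CV accuracy: {np.mean(accuracies):.2f}")
```

Averaging the per-fold importance vectors (rather than fitting once on all data) gives feature rankings that are less tied to any single split.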

I think it may be more productive to work with simpler approaches, for example regularized generalized linear models (GLMs). Regularization adds a penalty term, a constant multiple of the size of the weight vector, to the loss being minimized. The L1 and L2 penalties differ in how they measure that size, but the idea is to shrink the regression coefficients toward zero; any feature whose coefficient is shrunk exactly to zero is effectively removed. To be fair, it is possible for simple models to over-regularize, in which case they may underperform. Still, given the shape of your dataset, overfitting appears the more likely danger.
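As a sketch of the L1-penalty idea for a binary classification target like cancer/healthy, scikit-learn's L1-penalized LogisticRegression shows the built-in feature selection (in R, glmnet plays the same role); the data and the regularization strength `C` are illustrative assumptions:

```python
# Sketch: L1-regularized logistic regression drives most coefficients
# to exactly zero, acting as built-in feature selection. Synthetic
# illustrative data; C (inverse regularization strength) is a knob.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable scales

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

coef = clf.coef_.ravel()
selected = np.flatnonzero(coef)        # features with nonzero weight
print(f"{len(selected)} of {len(coef)} features kept")
```

Standardizing first matters: the penalty treats all coefficients alike, so features on larger scales would otherwise be penalized unevenly.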

Alternatively, you can apply dimensionality reduction such as PCA and select the top 5-50 principal components, or however many components are needed to explain most of the variance. If your dataset is sparse, PCA won't work, but truncated singular value decomposition (SVD) will.
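A minimal sketch of the variance-threshold approach with scikit-learn's PCA (the 90% cutoff and data are illustrative assumptions; `TruncatedSVD` is the drop-in replacement for sparse input):

```python
# Sketch: keep the fewest principal components that explain ~90% of
# the variance, then use those as features downstream. The threshold
# and synthetic data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=500, random_state=0)

# Passing a float as n_components asks PCA to keep just enough
# components to reach that fraction of explained variance
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(f"{pca.n_components_} components explain "
      f"{pca.explained_variance_ratio_.sum():.0%} of the variance")
```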

After all the hand-waving, here is a concrete suggestion: try LassoCV. It will shrink most of the feature coefficients to zero (thus eliminating unimportant features), and everything is already neatly folded into the cross-validation procedure, so you don't have to worry about averaging anything. Simply select the number of folds (for your data I suggest at least 20, and 50 may be even better), the range of the alpha coefficient, and the number of alphas to test, and off you go. In fact, most defaults are fine; you only need to customize the number of folds.
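A minimal sketch of that suggestion with scikit-learn's LassoCV, using cv=20 as recommended above (the synthetic data is illustrative, and treating the 0/1 class labels as a regression target is an assumption of this sketch):

```python
# Sketch: LassoCV folds the alpha search into cross-validation, so no
# manual averaging is needed. cv=20 follows the suggestion above;
# the 0/1 labels are treated as a regression target here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=20, n_alphas=100, random_state=0).fit(X, y)

kept = np.flatnonzero(lasso.coef_)     # features that survive shrinkage
print(f"best alpha: {lasso.alpha_:.4f}, features kept: {len(kept)}")
```

The features with nonzero coefficients at the selected alpha are the "important" ones, which directly answers the feature-selection part of the question.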

You may also want to read this.

Thank you very much, I appreciate this complete response and guidance.
