How to select features for high-dimensional data?
17 months ago
pt.taklifi ▴ 60

Hello everyone. I have a data set of dimension 330 × 45,000 (330 samples and 45,000 features: reads in peaks). I am looking for a way to select the best features for binary classification. So far I have only kept features with covariance higher than 0.5 or lower than -0.5, which reduced the dimension to 14,000. I know I should reduce the dimension further, but I'm not sure whether I can use random forest at this stage. Do you have any suggestions or tips?

machine_learning high_dimension • 484 views
17 months ago
Mensur Dlakic ★ 20k

You really don't need to do feature selection if you plan to classify with tree-based methods such as random forest, except to save a little training time. The best features are selected automatically, because each split uses the feature that gives the best split among the candidates considered at that node. Given the small number of samples, I don't think running time will be a problem for you.
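As a rough illustration of this point, here is a minimal sketch that assumes scikit-learn and uses synthetic data as a stand-in for the real matrix (the sizes, `n_estimators`, and the top-20 cutoff are arbitrary choices, not values from this thread). It cross-validates a random forest without any prior feature selection and then inspects which features the trees relied on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real samples x features matrix (assumed sizes).
X, y = make_classification(n_samples=330, n_features=2000,
                           n_informative=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated accuracy with no prior feature selection.
scores = cross_val_score(rf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit once on all data to see which features the trees actually used.
rf.fit(X, y)
importances = rf.feature_importances_
top_features = np.argsort(importances)[::-1][:20]
print("Indices of the 20 most important features:", top_features)
```

Impurity-based importances can be noisy with this few samples, so the ranking is best treated as a hint rather than a final feature list.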

If you still want to go with it, there are many methods for feature selection. Whatever you choose, for such a low number of samples I suggest you couple it with cross-validation. If you want a fairly quick and robust method, linear models with an L1 penalty (a.k.a. lasso) will sparsify the features and give you some idea about their optimal number. Finally, I have had good experience with Boruta (also see here), which ranks the features, and you can select as many as you want.
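One way the L1 suggestion could look in practice is sketched below, assuming scikit-learn. The data are synthetic, and `C=0.1` is an illustrative value that would normally be tuned. The selector sits inside a pipeline so that it is refit within every cross-validation fold, which keeps the held-out samples from influencing which features are kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumed sizes).
X, y = make_classification(n_samples=330, n_features=2000,
                           n_informative=20, random_state=0)

# L1-penalized logistic regression drives most coefficients to exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# SelectFromModel keeps only the features with non-zero coefficients.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(l1_model),
    LogisticRegression(max_iter=1000),
)

# Selection is repeated inside each fold, so the score is not optimistic.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")

# Refit on all data to see how many features survive the L1 penalty.
pipe.fit(X, y)
n_selected = int(pipe.named_steps["selectfrommodel"].get_support().sum())
print("Features kept:", n_selected)
```

Smaller values of `C` mean a stronger penalty and fewer surviving features, so scanning `C` under cross-validation gives a rough idea of how many features are worth keeping.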

PS: I forgot to mention mutual information-based feature selection.
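A minimal sketch of mutual information-based selection, assuming scikit-learn's `SelectKBest` with `mutual_info_classif`. The data are synthetic and `k=50` is an arbitrary choice. Because this score uses the class labels, it would normally be fit inside cross-validation folds rather than on the full data set as shown here.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the real data (assumed sizes).
X, y = make_classification(n_samples=330, n_features=500,
                           n_informative=20, random_state=0)

# Fix the random state so the nearest-neighbor MI estimate is reproducible.
score_func = partial(mutual_info_classif, random_state=0)

# Keep the 50 features with the highest estimated mutual information with y.
selector = SelectKBest(score_func=score_func, k=50)
X_reduced = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)
print(X_reduced.shape)
```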


Thank you for your response; it was indeed very helpful. I have a question, though: for bigger datasets (more features), with about 500,000 features and 1,000 samples, what is the best preprocessing method for classification? I'm looking for a method like variance that doesn't look at the sample labels.


I don't know that anything will work well on half a million features. Using variance and correlation (see here and here) is likely to be most productive.
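A label-free filter along these lines might look like the sketch below, assuming NumPy and scikit-learn. The data are synthetic, and both thresholds (`1e-8` for variance, `0.95` for absolute correlation) are illustrative assumptions. Note that the full correlation matrix has one entry per feature pair, so at 500K features it would not fit in memory and would have to be computed in blocks or after a first variance cut.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)  # near-duplicate of feature 0
X[:, 2] = 1.0                                    # constant feature

# Step 1: drop (near-)constant features; no labels are needed.
vt = VarianceThreshold(threshold=1e-8)
X_var = vt.fit_transform(X)
kept = np.flatnonzero(vt.get_support())  # original indices still present

# Step 2: for each highly correlated pair, drop the later feature.
corr = np.abs(np.corrcoef(X_var, rowvar=False))
upper = np.triu(corr, k=1)  # each pair counted once, diagonal excluded
to_drop = np.flatnonzero((upper > 0.95).any(axis=0))

X_filtered = np.delete(X_var, to_drop, axis=1)
kept = np.delete(kept, to_drop)
print(X_filtered.shape)
```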

Multicollinearity between features can be determined by calculating the variance inflation factor (VIF), but that is also too slow for 500K features. I just did a quick simulation with 1000 samples and 500 features, and that still took 2 hours on a fast, 12-CPU computer. See a notebook for the VIF implementation, but I don't think that will really help you.
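For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below is not the linked notebook's implementation; it uses the identity that these values equal the diagonal of the inverse correlation matrix. That identity only applies when there are more samples than features and the correlation matrix is invertible, so it does not carry over to 1,000 samples with 500K features. The data and the cutoff of 10 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 50
X = rng.normal(size=(n_samples, n_features))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n_samples)  # collinear with feature 0

# Correlation matrix of the features (columns).
corr = np.corrcoef(X, rowvar=False)

# Diagonal of the inverse correlation matrix gives VIF_j = 1 / (1 - R_j^2).
vif = np.diag(np.linalg.inv(corr))

# A common rule of thumb flags VIF above 10 (assumed cutoff).
high_vif = np.flatnonzero(vif > 10)
print("Features with VIF > 10:", high_vif)
```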

Maybe try something like gradient boosting that is multithreaded and can handle large datasets. More than anything, I would suggest you re-think the strategy that gives you 500K features. In other words, work on reducing the number of features before attempting to classify. Whatever measurements you are making, make fewer than half a million.