Classifier choice is a little complicated because of the size of your input feature set.
- I'd start with a Naive Bayes classifier. Because you have binary inputs and binary outputs, it is dead easy to fit.
Naive Bayes doesn't attempt to model interactions among your features. On one hand this is good because it makes it possible to fit the classifier in O(MN) time (where M is the # of training examples and N is the # of features) but on the other hand if some of your features are redundant or there are strong interactions between them, Naive Bayes won't capture this. That said, Naive Bayes does seem to do well in a lot of situations in which you wouldn't expect it to.
I'd also suggest checking whether Logistic Regression in the glmnet R package can handle your dataset. We've run it with as many as 65,000 input features but it does get slow (and require a lot of memory). If you do go this route, you also might consider using the L1 (i.e. LASSO) regularization or elastic net regularization options in order to reduce the number of input features used by the classifier. However, because you seem to have ~100x as many training examples as features you may not have overfitting problems. Logistic regression accounts for linear interactions among your feature set, so it will, for example, be able to detect redundant features and downweight them appropriately.
Finally, see whether you can find a Random Forest implementation that works for you. In theory, your input feature set size shouldn't be a problem because the random forest learning algorithm selects small, random subsets of your features to fit its trees, however a lot depends on how the algorithm is implemented. Fans of random forests claim that it can detect arbitrary non-linear interactions among your input features -- but a lot depends on how you set the parameters (specifically # of trees and # of input features / tree).
There are other methods that you might consider but they are less commonly used and may not have implementations that scale well. For example, Tree-Augmented Naive Bayes (TANB) may be appropriate. Support Vector Machines are probably not going to work well in your setting.
All of my suggested methods have outputs that indicate their confidence in their classification. Naive Bayes, Logistic Regression and TANB have underlying probability models so their output can be interpreted directly as a probability of correct classification. That said, you should evaluate this yourself (see my last suggestion below). Random Forests have there own mechanism for assessing confidence in their classification.
I'd start by trying to find R implementations for these methods. The Weka package is also quite popular.
My last suggestion is very important. Before you start, set aside a good chunk of your dataset (10%-50%) for a test set and possibly another chunk for a validation set. The test set you only look at once to evaluate the prediction accuracy of your final classifier. The validation set you can look at multiple times (but not too many) to compare different classification algorithms. Some people forego the validation set and use cross-validation for this purpose but you have so much data that you might not need to do this.
9.3 years ago by
Qdjm ♦ 1.9k