Question: Which Bayesian Classifiers (Binary Data, Large-Scale) Are People Using?
gravatar for Joerg Kurt Wegner
7.7 years ago by
Joerg Kurt Wegner230 wrote:

Which Bayesian classifiers (binary data, large-scale) are people using?

Problem complexity: The data set contains binary Y values. The X vector is binary and contains thousands of features. The data set contains hundred thousand of instances.

The output should deliver prediction+-confidence in one form or another. Any suggestion with clear execution examples are more than welcome.

Any alternatives? I am not focussed on Bayesian models per-se.

data • 3.6k views
ADD COMMENTlink written 7.7 years ago by Joerg Kurt Wegner230
gravatar for Qdjm
7.7 years ago by
Qdjm1.9k wrote:

Classifier choice is a little complicated because of the size of your input feature set.

  • I'd start with a Naive Bayes classifier. Because you have binary inputs and binary outputs, it is dead easy to fit.

Naive Bayes doesn't attempt to model interactions among your features. On one hand this is good because it makes it possible to fit the classifier in O(MN) time (where M is the # of training examples and N is the # of features) but on the other hand if some of your features are redundant or there are strong interactions between them, Naive Bayes won't capture this. That said, Naive Bayes does seem to do well in a lot of situations in which you wouldn't expect it to.

  • I'd also suggest checking whether Logistic Regression in the glmnet R package can handle your dataset. We've run it with as many as 65,000 input features but it does get slow (and require a lot of memory). If you do go this route, you also might consider using the L1 (i.e. LASSO) regularization or elastic net regularization options in order to reduce the number of input features used by the classifier. However, because you seem to have ~100x as many training examples as features you may not have overfitting problems. Logistic regression accounts for linear interactions among your feature set, so it will, for example, be able to detect redundant features and downweight them appropriately.

  • Finally, see whether you can find a Random Forest implementation that works for you. In theory, your input feature set size shouldn't be a problem because the random forest learning algorithm selects small, random subsets of your features to fit its trees, however a lot depends on how the algorithm is implemented. Fans of random forests claim that it can detect arbitrary non-linear interactions among your input features -- but a lot depends on how you set the parameters (specifically # of trees and # of input features / tree).

There are other methods that you might consider but they are less commonly used and may not have implementations that scale well. For example, Tree-Augmented Naive Bayes (TANB) may be appropriate. Support Vector Machines are probably not going to work well in your setting.

All of my suggested methods have outputs that indicate their confidence in their classification. Naive Bayes, Logistic Regression and TANB have underlying probability models so their output can be interpreted directly as a probability of correct classification. That said, you should evaluate this yourself (see my last suggestion below). Random Forests have there own mechanism for assessing confidence in their classification.

I'd start by trying to find R implementations for these methods. The Weka package is also quite popular.

My last suggestion is very important. Before you start, set aside a good chunk of your dataset (10%-50%) for a test set and possibly another chunk for a validation set. The test set you only look at once to evaluate the prediction accuracy of your final classifier. The validation set you can look at multiple times (but not too many) to compare different classification algorithms. Some people forego the validation set and use cross-validation for this purpose but you have so much data that you might not need to do this.

ADD COMMENTlink written 7.7 years ago by Qdjm1.9k

Possible application of NBCs: small training sets. Implemented this for a word-based spamfilter once where it outperformed NNs/SVMs.

ADD REPLYlink written 7.7 years ago by Michael Schubert6.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 864 users visited in the last hour