Regarding Datamining/Knowledge Based Algorithm
9
13.9 years ago
Suk211 ★ 1.1k

I am quite fascinated by the field of data mining and its application to solving biological problems. I have read a lot of papers in the past few months where people have tried to answer a range of problems, like predicting protein-protein interaction sites, post-translational modification sites, disordered regions in a protein, etc. The more I read about these algorithms, the more doubts I have about the reliability of these predictions.

Most of these papers gave me the impression that people are using algorithms like SVM, Random Forest, ANN and many more as black boxes, where you feed in some discriminatory features as input and use evaluation measures like ROC curves, MCC, etc. to show that your algorithm works better than others. I have also read some papers that describe something called a "meta-predictor", in which they combine the results of various other predictors to arrive at their own prediction values.

I was interested to know: when you design a data-mining-based algorithm, how much importance do you give to the features and how much to the algorithm? Moreover, how do you decide which discriminative features will give you the best predictive results? Will a "meta-predictor" always give you a better result?

data algorithm prediction • 4.2k views

Interesting question and discussion. Thanks for asking!

6
13.9 years ago
Nathan Harmston ★ 1.1k

In general, ensemble approaches do tend to work better than any single approach (Random Forests seem to be incredibly good for these kinds of problems).

How well a given ML method performs really depends on the type of classification problem, the interdependencies between features, the number of features, and the types of features. The feature selection method you use also affects it.

It also depends on what you want out of the classification: do you just care about predictive performance, or do you want something more interpretable by humans? For example, PCA can be used as the basis of a classifier and in some settings can achieve good predictive performance, but the features it uses are funky linear combinations of the original ones, so it is hard to do things like interpret features or perform feature selection.

There is a complex interplay between the features you choose and the method you use. Probably the best bet is to try lots of different methods using cross-validation (3-fold / 5-fold) and see which one tends to come out best... or plug them into an ensemble method.
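
To make that concrete, here is a minimal sketch in Python with scikit-learn (the dataset and the particular models are made up for illustration, not taken from any real study):

    # Compare a few classifiers with 5-fold cross-validation on toy data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    models = {
        "SVM": SVC(),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

Whichever model wins here only wins on this data and this metric, which is exactly why trying several is worthwhile.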

HTH


Can you please expand "CV" in the last paragraph? Thanks

4
13.9 years ago

Meta-predictors usually outperform pure methodologies, but that is because most quality measures are overly simplistic. Meta-predictors thus end up being tuned too well to the quality measure itself, rather than to what that measure is supposed to capture.

A good example was the Netflix Prize, where the winning algorithm was a mixture of dozens of methods and hundreds of parameters. This was all driven by the contest choosing the root mean square error as the primary measure of quality. This is a measure that favors conservative predictions, because large errors end up being penalized severely.
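
A quick illustration of that last point (a toy Python example with made-up numbers, just to show the arithmetic):

    import numpy as np

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

    conservative = y_true + 0.5   # a small error on every prediction
    risky = y_true.copy()
    risky[0] += 2.5               # perfect except for one large miss

    print(rmse(y_true, conservative))  # 0.5
    print(rmse(y_true, risky))         # ~1.12

Both predictors have the same total absolute error, but RMSE punishes the single large miss far more, so hedged, conservative predictions win under this measure.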

In contrast, I think the best predictions for science are the risky ones, where there is a good chance of producing something blatantly incorrect, as long as there is also a chance of uncovering unknown phenomena. These scenarios are, in my opinion, better suited to methods where feature selection has a direct connection to the phenomena of interest.

4
13.9 years ago
Hanif Khalak ★ 1.3k

Most machine learning techniques involve some sort of optimization:

  1. identify the set of parameters of a model that yields the best fit to a set of training data
  2. possibly some [cross-]validation to select the most "robust" parameterization, i.e. one that best fits various [sub]sets of the training data.

The goal of (1) is accuracy - for (2) it's generalizability - see this nice discussion by Jeremy Purvis.
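
As a rough sketch of how (1) and (2) fit together in practice (Python/scikit-learn, with toy data; the parameter grid is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # Step (1): each candidate parameterization is fitted to the training folds.
    # Step (2): 5-fold cross-validation picks the one that generalizes best.
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)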

Also very relevant is the conclusion of the No Free Lunch Theorem that "bias-free learning is futile". Something similar holds for hill-climbing algorithms, which try to find the maximum of some function: they can get stuck in a local optimum instead of the global one. Often the absolute optimum is not known, or not knowable in finite time.
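
To illustrate the local-vs-global issue (a toy Python sketch; the function and step sizes are arbitrary):

    import math
    import random

    def hill_climb(f, x, step=0.1, iters=1000):
        # Greedy hill climbing: move only when a random neighbor improves f.
        for _ in range(iters):
            candidate = x + random.uniform(-step, step)
            if f(candidate) > f(x):
                x = candidate
        return x

    # Two peaks: a local one near x=0 (height ~1), the global one near x=3 (height ~2).
    f = lambda x: math.exp(-x ** 2) + 2 * math.exp(-(x - 3) ** 2)

    # A single run can get stuck on the wrong peak; random restarts hedge that risk.
    best = max((hill_climb(f, random.uniform(-2, 5)) for _ in range(20)), key=f)
    print(best, f(best))  # usually ends up near x = 3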

Ensemble or "meta" algorithms would seem to try to balance the various model assumptions and parameterization schemes to get the best "global" result - this makes sense. However, this doesn't mean you get the best possible answer - just the best answer given the way you're asking the question using the tools you happen to have around.

I guess it's like preparing the best meal given whatever happens to be in the house at the time. :-)

2
13.9 years ago
Nicojo ★ 1.1k

This is really my two cents to something I feel may be way over my head... But here goes.

I've recently been interested in transmembrane (TM) predictors. There are those that use hydrophobicity or HMM or ANN or SVM or a consensus of different approaches.

The consensus method is what I'd call your "meta-predictor" (please correct me if I'm mistaken).

Now most of these prediction methods are trained on a limited set of curated data. Consequently many of these methods are trained on (more or less) the same datasets.

Also, there are actually a limited number of groups that design these algorithms. Historically, these methods are often improved upon, therefore generating newer algorithms that should "replace" the old. But it is common to find that consensus methods will use the old and the new alike.

Obviously, if you investigate proteins that are typical of the datasets used for training, you'll probably get a straightforward answer... But that was not my case.

For those reasons in this particular case of TM predictors, I'm not sure meta-predictors are the way to go.

2
13.9 years ago

I have a strong preference for machine learning methods that are interpretable. PCA may give great results, but it's very hard to pull features out of those principal components in a way that lets me understand which biological phenomena are driving the system. That's ultimately what I'm after.

That's not to say that some of the more "black box" algorithms aren't useful in some situations. They're just usually not the place that I start.
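
To show the kind of interpretability I mean, here is a minimal Python/scikit-learn sketch; the feature names are hypothetical, just to make the point:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    names = ["hydrophobicity", "charge", "length", "gc_content", "conservation"]  # hypothetical

    model = LogisticRegression().fit(X, y)
    # Each coefficient attaches to one named feature, so you can read off
    # which features push a prediction up or down.
    for name, coef in zip(names, model.coef_[0]):
        print(f"{name}: {coef:+.3f}")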


PCA has at least a loading plot... please explain what is wrong with that, from your perspective, and what you like to do instead.
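
For what it's worth, those loadings can also be pulled out programmatically; a small sketch (Python/scikit-learn, on the iris data purely for illustration):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    data = load_iris()
    pca = PCA(n_components=2).fit(data.data)

    # components_ holds the loadings: the weight of each original feature
    # in each principal component.
    for pc, loadings in zip(("PC1", "PC2"), pca.components_):
        print(pc, dict(zip(data.feature_names, loadings.round(2))))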

2
13.9 years ago

Here are my thoughts:

Features are the backbone of a prediction algorithm; they can make or break it. A generic way to select features is to perform some analysis on the dataset, assess the general trends, and design features accordingly. In a recent work we used the "information gain" method to rank the features and identify the top 10; this is generally termed feature selection in machine learning. If you are interested in understanding the contribution of individual features, you should try feature selection. Another interesting approach we tried recently was creating "hybrid features" by combining two or more features. Our analysis shows that this class of hybrid features is more powerful with SVM and Random Forest. Another possibility (which I am hoping to implement in the future) is to scan the list of all features, rank them, and use the top 10 or 50 features in the prediction system.
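
A rough sketch of that ranking step (Python/scikit-learn; mutual information stands in here for information gain, to which it is closely related, and the data are synthetic):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                               random_state=0)

    scores = mutual_info_classif(X, y, random_state=0)
    top10 = np.argsort(scores)[::-1][:10]   # indices of the 10 best-ranked features
    print(top10, scores[top10].round(3))

    # A "hybrid" feature in the spirit of the post: combine two raw features.
    hybrid = X[:, 0] * X[:, 1]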

Meta-predictors usually derive data/algorithms/results from multiple predictors designed for a problem and provide a consensus result using a new statistical method. They generally show better results because the final prediction is based on a consensus of multiple methods.
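
The simplest version of such a consensus is a majority vote across predictors; a minimal sketch (Python/scikit-learn, toy data, arbitrary component models):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    meta = VotingClassifier(
        estimators=[("svm", SVC()),
                    ("rf", RandomForestClassifier(random_state=0)),
                    ("nb", GaussianNB())],
        voting="hard",  # hard = simple majority vote across the predictors
    )
    print(cross_val_score(meta, X, y, cv=5).mean())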

I am not sure which aspect makes you call them "black boxes"; extensive literature about the background and theory is available for methods like SVM, Random Forest, and ANN. IMHO, most of these learning algorithms were designed for non-biological problems and later adapted to biology because of their generic nature, and because biological problems lend themselves to features based on prior knowledge.


'Black box' suggests that it's hard to get a meaningful explanation, in a human-interpretable way, of why the classifier made the predictions it did. This is more true of some classification methods than others. But biologists like meaningful explanations. "It's a transmembrane protein because the machine said so" often won't wash.


IMHO, it's not a black-box concept. The machine predicts according to the features it saw during the training step; to assess the predictions, one can test with data not seen by the machine during training and see how it performs. This step can be used for "independent validation" of a learning algorithm, which is very important. Learning algorithms mostly take numerical input: when a machine says "it's a TM protein", it is because the algorithm developer taught it to classify a protein sequence with such-and-such features as "1" and otherwise as "0".


The algorithm developer then converts this output into "TM" or "non-TM". As I mentioned, if you need more insight into the mathematical explanation of the algorithm, there is always a key reference that describes the method in detail.
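
A bare-bones sketch of that train / independent-test workflow (Python/scikit-learn, synthetic data, hypothetical label mapping):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=15, random_state=0)

    # Hold out data the machine never sees in training ("independent validation").
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    model = SVC().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))

    labels = {1: "TM", 0: "non-TM"}  # the developer maps numbers back to classes
    print(labels[int(model.predict(X_test[:1])[0])])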

2
13.8 years ago

Although I largely share your view, I am personally not too worried about methods being a black box. The reason is not that I don't care about how the final predictor makes its predictions - quite the contrary - but that I find it is usually not too difficult to shine some light into the black boxes. You can assess the relative importance of, and correlations among, the input features in a variety of ways: linearization of the classifier, statistical analysis of the individual input features, leave-one-feature-out analysis to see how much performance decreases when a feature is removed, etc.
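
For example, leave-one-feature-out can be done in a few lines (a sketch in Python/scikit-learn; the data and model are arbitrary stand-ins):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    def cv_accuracy(X, y):
        return cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

    baseline = cv_accuracy(X, y)
    for i in range(X.shape[1]):
        X_drop = np.delete(X, i, axis=1)  # remove feature i
        print(f"feature {i}: accuracy change {cv_accuracy(X_drop, y) - baseline:+.3f}")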

In terms of what is most important, my experience is that the choice of input features is more important than the choice of machine learning algorithm. However, the quality of the dataset is even more important - in my experience one is usually better off investing time in improving the data used for training than in experimenting with different algorithms.

0
13.9 years ago
Andrew Clegg ▴ 310

If you pick a decent set of features, even Naive Bayes often works well, despite its crude and often incorrect assumptions.

It's a good idea to baseline your selected features on something simple like that.
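
If it helps, "something simple like that" can be as little as this (Python/scikit-learn, toy data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # If a fancier model can't beat this baseline, the features - not the
    # algorithm - probably need work.
    print(cross_val_score(GaussianNB(), X, y, cv=5).mean())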

(Note that I'm speaking as a bit of an amateur here.)

0
13.1 years ago

I recently completed a data mining course with Eamonn Keogh.

I also did a course research project building a real-vs-random protein sequence classifier.

What I learned was:

  • Often the simplest algorithms work best
  • Good features are much, much more important than the type of algorithm employed