Question: Regarding Data Mining / Knowledge-Based Algorithms

Suk211 wrote:

I am quite fascinated by the field of data mining and its application to solving biological problems. I have read a lot of papers in the past few months in which people try to answer a range of problems: predicting protein-protein interaction sites, post-translational modification sites, disordered regions in proteins, and so on. The more I read about these algorithms, the more doubts I have about the reliability of their predictions.

Most of these papers gave me the impression that people are using algorithms like SVMs, Random Forests, ANNs, and many more as black boxes, where you feed in some discriminatory features as input and use evaluation measures like ROC curves, MCC, etc. to show that your algorithm works better than the others. I have also read some papers describing something called a "meta-predictor", which combines the results of various other predictors to arrive at its own prediction values.

I was interested to know: when you design a data-mining-based algorithm, how much importance do you give to the features and how much to the algorithm? Moreover, how do you decide which discriminative features will give you the best predictive result? Will a "meta-predictor" always give you a better result?


Khader Shameer commented:

Interesting question and discussion. Thanks for asking!

Nathan Harmston wrote:

In general, ensemble approaches do tend to work better than any single approach (Random Forests seem to be incredibly good for these kinds of problems).

How well a given ML method performs really depends on the type of classification problem, the interdependencies between features, and the number and type of features. The choice of feature selection method also affects it.

It also depends on what you want out of the classification: do you just care about predictive performance, or do you want something more interpretable by humans? For example, PCA can be used as the basis of a classifier and in some settings can achieve good predictive performance, but the features it uses are funky linear combinations, so it is hard to do things like interpreting features and feature selection.

There is a complex interplay between the features you choose and the method you use. Probably the best bet is to try lots of different methods using cross-validation (3-fold / 5-fold) and see which one tends to come out best... or plug them into an ensemble method.
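
A minimal sketch of that strategy, assuming a scikit-learn setup (the classifiers and the synthetic dataset are placeholders for whatever features you have extracted):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    # Synthetic stand-in for a real feature matrix and labels.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    candidates = {
        "SVM": SVC(),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
    }

    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")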

HTH


Khader Shameer commented:

Can you please expand "CV" in the last paragraph? Thanks.

Istvan Albert wrote:

Meta-predictors usually outperform pure methodologies, but that is because most quality measures are overly simplistic. Meta-predictors thus end up being tuned too well to the quality measure itself, rather than to what that measure is supposed to capture.

A good example was the Netflix Prize, where the winning algorithm was a mixture of dozens of methods and hundreds of parameters. This was all driven by the contest choosing root mean square error as the primary measure of quality. This measure favors conservative predictions, because large errors are penalized severely.
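
A toy numerical illustration (mine, not from the contest) of why squared errors reward caution:

    import numpy as np

    truth        = np.array([1.0, 1.0, 1.0, 1.0, 5.0])
    conservative = np.array([1.5, 1.5, 1.5, 1.5, 3.0])  # always close, never bold
    risky        = np.array([1.0, 1.0, 1.0, 1.0, 1.0])  # exact 4/5, badly wrong once

    def rmse(pred, truth):
        return np.sqrt(np.mean((pred - truth) ** 2))

    print(rmse(conservative, truth))  # 1.00 -> wins under RMSE
    print(rmse(risky, truth))         # 1.79 -> loses, despite four exact hits

The risky predictor is exactly right four times out of five, yet its single large miss, once squared, costs it more than the conservative predictor's five small misses combined.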

In contrast, I think the best predictions for science are the risky ones, where there is a good chance of producing something blatantly incorrect as long as there is also a chance of uncovering unknown phenomena. These scenarios are, in my opinion, better suited to methods where feature selection has a direct connection to the phenomena of interest.

Hanif Khalak wrote:

Most machine learning techniques involve some sort of optimization:

  1. identify the set of model parameters that yields the best fit to a set of training data
  2. possibly some [cross-]validation to select the most "robust" parameterization, i.e. one that best fits various [sub]sets of the training data.

The goal of (1) is accuracy - for (2) it's generalizability - see this nice discussion by Jeremy Purvis.
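
A minimal sketch of (1) and (2) working together, assuming a scikit-learn setup (the SVM and its parameter grid are illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Synthetic stand-in for a real training set.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    search = GridSearchCV(
        SVC(),
        param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
        cv=5,  # (2): score each parameterization on 5 held-out folds
    )
    search.fit(X, y)  # (1): fit every candidate model to the training folds
    print(search.best_params_, search.best_score_)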

Also very relevant is the conclusion of the No Free Lunch Theorem that "bias-free learning is futile". Something similar holds for hill-climbing algorithms, which try to find the maximum of some function: they can get stuck at a local optimum rather than the global one. Often the absolute optimal answer is not known or knowable, at least not in finite time.
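
A tiny sketch of that local-vs-global problem (the two-peak function is made up for illustration):

    import numpy as np

    def f(x):
        # Two peaks: a local one near x = -1, the global one near x = 2.
        return np.exp(-(x + 1) ** 2) + 2 * np.exp(-(x - 2) ** 2)

    def hill_climb(x, step=0.1, iters=1000):
        for _ in range(iters):
            best = max([x - step, x, x + step], key=f)
            if best == x:
                break  # no neighbor improves: a (possibly local) optimum
            x = best
        return x

    print(hill_climb(-2.0))  # stops at the local peak near -1
    print(hill_climb(1.0))   # reaches the global peak near 2

Which peak you end up on depends entirely on where you start, and the climber has no way of knowing that a better peak exists elsewhere.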

Ensemble or "meta" algorithms would seem to try to balance the various model assumptions and parameterization schemes to get the best "global" result - this makes sense. However, this doesn't mean you get the best possible answer - just the best answer given the way you're asking the question using the tools you happen to have around.

I guess it's like preparing the best meal given whatever happens to be in the house at the time. :-)

Nicojo wrote:

This is really my two cents on something I feel may be way over my head... but here goes.

I've recently been interested in transmembrane (TM) predictors. There are those that use hydrophobicity, HMMs, ANNs, SVMs, or a consensus of different approaches.

The consensus method is what I'd call your "meta-predictor" (please correct me if I'm mistaken).

Now most of these prediction methods are trained on a limited set of curated data. Consequently many of these methods are trained on (more or less) the same datasets.

Also, there are actually a limited number of groups that design these algorithms. Historically, these methods are often improved upon, thereby generating newer algorithms that should "replace" the old. But it is common to find that consensus methods use the old and the new alike.

Obviously, if you investigate proteins that are typical of the datasets used for training, you'll probably get a straightforward answer... but that was not my case.

For those reasons, in this particular case of TM predictors, I'm not sure meta-predictors are the way to go.

Chris Miller wrote:

I have a strong preference for machine learning methods that are interpretable. PCA may give great results, but it's very hard to pull features out of those principal components so that I can understand which biological phenomena are driving the system. That's ultimately what I'm after.
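
To make that concrete, a minimal sketch (scikit-learn, with the Iris data standing in for real biological features):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)
    pca = PCA(n_components=2).fit(X)

    # Each row is one component: a weighted mix of *all* original features,
    # so no component maps cleanly onto a single measurable quantity.
    print(pca.components_)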

That's not to say that some of the more "black box" algorithms aren't useful in some situations. They're just usually not the place that I start.


Egon Willighagen commented:

PCA at least has a loading plot... please explain what is wrong with that, from your perspective, and what you like to do instead.

Khader Shameer wrote:

Here are my thoughts:

Features are the backbone of a prediction algorithm; they can make or break it. A generic way to select features is to perform some analysis on the dataset, assess the general trends, and design features accordingly. In a recent work we used the "Information Gain" method to rank the features and identify the top 10; this is generally termed feature selection in machine learning. If you are interested in understanding the contribution of individual features, you should try feature selection. Another interesting approach we tried recently was creating "hybrid features" by combining two or more features; our analysis shows that this class of hybrid features is more powerful with SVM and Random Forest. Another possible way (which I am hoping to implement in the future) is to scan a list of all features, rank them, and use the top 10 or 50 features in the prediction system.
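
A minimal sketch of that kind of ranking, assuming a scikit-learn setup (mutual information stands in here for the Information Gain score; the dataset is synthetic):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Synthetic stand-in: 50 features, only 8 of them actually informative.
    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=8, random_state=0)

    # Score every feature against the labels and keep the 10 best.
    selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
    print("top-10 feature indices:", selector.get_support(indices=True))
    print("scores:", selector.scores_.round(3))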

Meta-predictors usually derive data / algorithms / results from multiple predictors designed for a problem and provide a consensus result using a new statistical method. They generally show better results, because the final prediction is based on a consensus of multiple methods.
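
A minimal sketch of such a consensus scheme, assuming simple majority voting over a few illustrative base methods (scikit-learn's VotingClassifier):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Default voting="hard": the meta-prediction is the majority vote
    # of the base predictors.
    meta = VotingClassifier([
        ("svm", SVC()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ])

    print(cross_val_score(meta, X, y, cv=5).mean())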

I am not sure which aspect makes you call them "black boxes": extensive literature on the background and theory is available for methods like SVM, Random Forest, and ANN. IMHO, most of these learning algorithms were designed for non-biological problems and were later adapted to biology because of their generic nature, and because biological problems lend themselves to features based on prior knowledge.

Andrew Clegg replied:

"Black box" suggests that it's hard to get a meaningful explanation, in a human-interpretable way, of why the classifier made the predictions it did. This is more true of some classification methods than others. But biologists like meaningful explanations: "it's a transmembrane protein coz the machine said so" often won't wash.

Khader Shameer replied:

IMHO, it's not a black-box concept. The machine predicts according to the features it saw during the training step; to assess the prediction, one can test with data not seen by the machine during training and see how it performs. This step can be used for "independent validation" of a learning algorithm, which is very important. Learning algorithms mostly work on numerical input: when a machine says "it's a TM protein", it is because the algorithm developer taught it that a protein sequence with such-and-such features should be classified as "1" and otherwise as "0".

Khader Shameer replied:

The algorithm developer then converts this output to "TM" or "non-TM". As I mentioned, if you need more insight into the mathematical workings of an algorithm, there is always a key reference that describes the method in detail.
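
A minimal sketch of that independent-validation step (scikit-learn; the dataset, the classifier, and the 1 -> "TM" mapping are all illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Hold out 30% of the data; the machine never sees it during training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    pred = clf.predict(X_test)  # numeric output: 1 or 0

    # The developer's mapping from numeric output back to biology.
    labels = ["TM" if p == 1 else "non-TM" for p in pred]
    print(matthews_corrcoef(y_test, pred), labels[:5])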

Lars Juhl Jensen wrote:

Although I largely share your view, I am personally not too worried about methods being a black box. The reason is not that I don't care about how the final predictor makes its predictions - quite the contrary - but that I find it is usually not too difficult to shine some light into the black boxes. You can assess the relative importance of, and correlations among, the input features in a variety of ways: linearization of the classifier, statistical analysis of the individual input features, leave-one-feature-out analysis to see how much performance decreases when a feature is removed, etc.
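
For instance, a minimal sketch of the leave-one-feature-out analysis (scikit-learn; the dataset and classifier are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=400, n_features=8,
                               n_informative=4, random_state=0)
    clf = RandomForestClassifier(random_state=0)

    baseline = cross_val_score(clf, X, y, cv=5).mean()
    for i in range(X.shape[1]):
        X_drop = np.delete(X, i, axis=1)  # remove feature i
        score = cross_val_score(clf, X_drop, y, cv=5).mean()
        print(f"without feature {i}: {score:.3f} (drop {baseline - score:+.3f})")

Features whose removal causes a large drop are the ones the black box is actually leaning on.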

In terms of what is most important, my experience is that the choice of input features matters more than the choice of machine learning algorithm. However, the quality of the dataset is even more important: in my experience, one is usually better off investing time in improving the data used for training than in experimenting with different algorithms.

Andrew Clegg wrote:

If you pick a decent set of features, even Naive Bayes often works well, despite its crude and often incorrect assumptions.

It's a good idea to baseline your selected features on something simple like that.
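
A minimal sketch of that baseline, assuming a scikit-learn setup with a synthetic dataset standing in for your features:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    baseline = cross_val_score(GaussianNB(), X, y, cv=5).mean()
    print(f"Naive Bayes baseline accuracy: {baseline:.3f}")
    # Anything fancier should beat this number to justify its complexity.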

(Note that I'm speaking as a bit of an amateur here.)

Aleksandr Levchuk wrote:

I recently completed a data mining course with Eamonn Keogh.

I also did a course research project building a Real-vs-Random protein sequence classifier.

What I learned was:

  • Often the simplest algorithms work best
  • Good features are much, much more important than the type of algorithm employed