Forum: Feedback on my Python package to combine results from multiple classifiers
8 months ago
fred.s.kremer ▴ 100

A few years ago I developed an algorithm to combine the results from multiple classifiers (e.g., signal peptide predictors, beta-barrel predictors) into a consensus using an unsupervised approach. Briefly, the algorithm receives a matrix with the predictions generated by "n" classifiers for "m" proteins and then assigns a weight to each classifier based on how much its results were "confirmed" by the others. The weights are iteratively updated based on an "agreement" (weighted voting) metric until a stop condition is reached (maximum number of iterations or convergence).
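The update loop is roughly the following (a minimal sketch, not the package code; binary predictions and a weighted-majority consensus are assumed, and the function name is just illustrative):

```python
import numpy as np

def agreement_weights(P, max_iter=100, tol=1e-6):
    """Iteratively weight n classifiers by agreement with the
    weighted consensus over m proteins.

    P : (m, n) array of binary predictions, one column per classifier.
    Returns the final per-classifier weights and the consensus calls.
    """
    m, n = P.shape
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    for _ in range(max_iter):
        consensus = (P @ w) >= 0.5             # weighted majority vote per protein
        # new weight of each classifier = fraction of proteins on which
        # it agrees with the current consensus, normalized to sum to 1
        agreement = (P == consensus[:, None]).mean(axis=0)
        new_w = agreement / agreement.sum()
        converged = np.abs(new_w - w).max() < tol
        w = new_w
        if converged:                          # stop condition: convergence
            break
    return w, consensus
```

With a classifier that systematically disagrees with the others, its weight is driven towards zero while the agreeing classifiers share the remaining weight.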

It was developed in the context of a reverse vaccinology study, where we had to run multiple predictors for the same protein property and then combine them to rank the proteins most likely to be good vaccine targets. As no good reference database was available for some properties, we decided to create an unsupervised method.

The code we used in the paper has been refactored and can now be installed from the Python Package Index (PyPI).

CoVIRA (GitHub)

Can anyone give me some feedback on how to improve the method, or its implementation?

8 months ago
Mensur Dlakic ★ 10k

I have not looked through your code - only read the description and your GitHub example. Take that into account when considering my feedback.

Combining multiple classifiers - I will call it ensemble voting here - is an area of research with a long-standing tradition. What you are trying to do has been studied for a long time, and your approach appears rather simplistic compared to the state of the art. It is very easy to find lots of literature about ensemble classification, so I will not give any references here. I suggest you look for "blending classification models" as your initial search term.

There are at least two problems with your approach: 1) you don't appear to have a gold standard; 2) you are most likely overfitting because there is no out-of-sample data that is used for independent verification.

Without a gold standard, you are weighting based purely on the majority. That means that if your majority is wrong 10% of the time, your weights don't take that into account, but the result may still be OK. If your majority vote is wrong 30% of the time or more, your weights will be completely wrong and you will be pushing the wrong models to the top. Weights must be assigned in the context of how predictions relate to correct answers: you want to give a higher weight to a prediction because that prediction is correct, rather than because that prediction is in the majority.
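A toy example (entirely made-up numbers) shows the failure mode: when several models share a systematic error, agreement rewards exactly those models, while true accuracy ranks them the other way around:

```python
import numpy as np

# Toy setup: 10 proteins, true labels known only to us.
truth = np.array([1] * 5 + [0] * 5)

# Three classifiers share a systematic error on proteins 0-2;
# the fourth classifier is actually the most accurate one.
bad = truth.copy()
bad[:3] = 1 - bad[:3]                            # 70% accurate, but in the majority
preds = np.column_stack([bad, bad, bad, truth])  # columns = classifiers

consensus = preds.mean(axis=1) >= 0.5            # unweighted majority vote
agreement = (preds == consensus[:, None]).mean(axis=0)
accuracy = (preds == truth[:, None]).mean(axis=0)

# agreement ranks the three wrong-together models highest (1.0 vs 0.7),
# while accuracy ranks the fourth model highest (1.0 vs 0.7).
```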

You appear to have a small dataset, and with those there is always a potential for overfitting. It is relatively easy for any modern classifier to "learn" the data, such that it appears to be doing well on that particular subset of data but not necessarily on newly acquired data. I see people create classifiers all the time with 98-99% accuracy, and many of them completely crumble on new data. Now, ensemble voting helps with this problem by providing multiple "experts" that can potentially disagree. Still, it is impossible to verify the quality of any classifier by training on all the data. A subset of data must be set aside (a validation dataset), and it is used to monitor and adjust the training process. That goes for individual classifiers and ensemble classifiers. Your approach does not seem to have any kind of validation. Without getting into the weeds, I will suggest that you read about N-fold (or K-fold) validation and/or hold-out validation.
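For illustration, K-fold cross-validation takes only a couple of lines with scikit-learn (a sketch on toy data, not tied to your package):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels

# 5-fold cross-validation: each fold is held out once for scoring,
# so every accuracy estimate comes from out-of-sample data.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```

The same scheme applies whether you are validating an individual classifier or the ensemble that combines them.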

I have saved the part that is least likely to please you for the end - and please forgive my bluntness, because you have doubtless put lots of work into this package. There are literally hundreds of solutions for what you are trying to do, and they are likely much better than yours.


Hello Mensur!

Yeah, I know the method is very simplistic, but I could not find any method that can be applied to the same kind of problem. Usually ensemble methods (e.g., AdaBoost, gradient boosting) are applied to train ensembles of models, but in my case the "ensemble" is performed after the predictions are generated, usually from totally different sources. For example, we had predictions from HHOMP, BOMB, TMBB and a bunch of other outer-membrane protein predictors, generated for more than 3500 proteins, and had no way to combine them into a consensus, nor to measure the reliability of each tool for our organism (Leptospira). For this reason I understand that the method is closer to unsupervised learning than to supervised learning, as the model itself doesn't have information about the correct labels.

I still want to test the method using a collection of algorithms for a specific task (e.g., OMP prediction), compare the results of CoVIRA against simple voting, and analyze them with cross-validation against a well-established dataset, but apart from "voting", I don't know of other methods to put in the benchmark.


Yeah, I know the method is very simplistic, but I could not find any method that can be applied to the same kind of problem. Usually ensemble methods (e.g., AdaBoost, gradient boosting) are applied to train ensembles of models, but in my case the "ensemble" is performed after the predictions are generated, usually from totally different sources.

Literally any existing linear or nonlinear classifier can be used to blend the models (find the weights), as long as you are using data labels. You feed your 3 predictions as a new dataset with 3 features and with the same labels as used for the original classifications - then apply a classifier of your choice to that dataset while also doing cross-validation. When using a non-linear classifier such as a neural network, you will get a large number of weights that will not have a one-to-one relationship to your individual models, but who cares about that if the network properly weighs the models. Or you can use a linear regression, where the feature coefficients will literally be your weights (plus maybe an intercept).
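A sketch of that blending step (simulated base-model predictions, not real data; scikit-learn's `LogisticRegressionCV` handles the cross-validation internally):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)                 # known labels

# Stack the base models' predictions as features: one column per model.
# Model 1 is fairly accurate, model 2 is noisier, model 3 is pure noise.
m1 = np.where(rng.random(300) < 0.9, y, 1 - y)
m2 = np.where(rng.random(300) < 0.7, y, 1 - y)
m3 = rng.integers(0, 2, size=300)
X = np.column_stack([m1, m2, m3]).astype(float)

# Cross-validated logistic regression learns one weight per model
# (plus an intercept); better models get larger coefficients.
blender = LogisticRegressionCV(cv=5).fit(X, y)
print(blender.intercept_, blender.coef_)
```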

By the way, forcing the weights to be positive and add up to 1 is not necessarily the best solution. In many cases ensembles work better when some model weights are negative, which also means that their sum does not have to equal 1.

Here is a real-life example where logistic regression with cross-validation was used to find the weights, which are positive but do not sum to 1:

Your final model:
[-5.33889345] + ( 2.13666555 * model-1 ) + ( 1.59141858 * model-2 ) + ( 0.84422624 * model-3 )


Or another case where some weights are negative:

Your final model:
[-1.65566392] + ( 1.95763686 * model-1 ) + ( -0.24427485 * model-2 ) + ( 0.04680771 * model-3 ) + ( -0.12435304 * model-4 ) + ( 1.66103053 * model-5 )


Sure, but how to deal with cases where no labels are available? In my case, I only had the predictions generated by multiple classifiers and had to rank / filter the proteins based on a consensus. No gold standard was available.

Regarding the way the weights are defined, I'm thinking about using gradient descent instead and applying some techniques to penalize large weight values ... what do you think?


Sure, but how to deal with cases where no labels are available?

There are plenty of labels available. How were the original predictors made without labels? Of course there are numerous proteins for which it has been experimentally confirmed that they have signal peptides or whatever else. It's just that you didn't build your ensemble classifier on that data. You had multiple predictions for your proteins of interest and were trying to average them. That's OK as it will likely give better accuracy, but I would do a simple majority vote rather than trying to weigh them. It may work brilliantly for this dataset if your individual predictors are diverse and have decent individual accuracy. Still, I would not expect this to work as a general approach.
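For reference, a plain majority vote is a one-liner (sketch; binary predictions assumed):

```python
import numpy as np

def majority_vote(preds):
    """Unweighted majority vote across classifiers.
    preds : (m, n) binary array, one column per classifier."""
    return (preds.mean(axis=1) >= 0.5).astype(int)

votes = majority_vote(np.array([[1, 1, 0],
                                [0, 1, 0],
                                [1, 1, 1]]))
# votes per protein: [1, 0, 1]
```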


Well, I cannot say anything but thanks - it was very constructive feedback. This (simplistic) algorithm was created a few years ago, and I just re-wrote it today, refactoring it and making it installable through pip, with a CI pipeline and other things. Re-working on it has started to give me some ideas for further things to do, and it's OK if it is just a simple piece of code; at least it provided a good discussion.

Thanks for your time and feedback, Professor Mensur. I hope to have future conversations with you about bioinformatics and machine learning :D


what do you think?

I gave you a link to LR_CV, which is a self-contained solution for this problem when you have labels. I don't know how to do it without labels, but looking through the LR_CV code will likely help you.