How to test the robustness and the performance of a novel classification algorithm for gene expression data
9.9 years ago
fbrundu ▴ 350

Hi all,

suppose you have a new algorithm that you want to publish. Are there any best practices and methodologies you usually consider in order to test the robustness and performance of new methods?

The case is a novel classification algorithm for gene expression samples.

Thanks

algorithm robustness performance test • 3.9k views
9.9 years ago

At the very least, you should do cross-validation (such as leave-one-out cross-validation) on a dataset. You can also apply the algorithm to other publicly available datasets (if they have metadata for the characteristic that you are trying to predict), which I think is a better test.
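
As a rough illustration of leave-one-out cross-validation, here is a minimal sketch in plain Python. The nearest-centroid classifier is only a stand-in for the new algorithm, and the toy expression matrix, the labels, and all function names are made up for the example.

```python
import math

def nearest_centroid_fit(X, y):
    """Stand-in classifier: store the mean expression profile of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def nearest_centroid_predict(model, row):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], row))

def leave_one_out_accuracy(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

# Hypothetical toy data: 6 samples x 3 genes, two well-separated classes.
X = [[1.0, 0.2, 0.1], [0.9, 0.1, 0.3], [1.1, 0.3, 0.2],
     [0.1, 1.0, 0.9], [0.2, 0.8, 1.1], [0.3, 0.9, 1.0]]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

loo_acc = leave_one_out_accuracy(X, y, nearest_centroid_fit, nearest_centroid_predict)
print(loo_acc)  # 1.0 on this deliberately easy toy set
```

Any feature selection or parameter tuning would have to happen inside the loop, on the training part only, otherwise the held-out sample leaks into the model.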

In both cases, you can use something like the ROCR package to create an ROC plot showing the trade-offs between sensitivity and specificity. Creating a table with statistics like the positive predictive value and negative predictive value would also be nice. However, these are all relevant for binary variables; I'm not sure if that is what you are trying to predict.
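
To make those statistics concrete, here is a small sketch that computes ROC points, the area under the curve, and PPV/NPV by hand. It is plain Python rather than ROCR, and the labels, scores, and the 0.5 cut-off are invented for illustration.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when samples scoring >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and truth:
            tp += 1
        elif predicted_positive and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at every distinct score threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier output: 1 = positive class, higher score = more positive.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

roc_auc = auc(roc_points(y_true, scores))
tp, fp, tn, fn = confusion_counts(y_true, scores, threshold=0.5)
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(roc_auc, ppv, npv, sensitivity, specificity)
```

PPV and NPV depend on the class balance of the dataset, so it is worth reporting how many positives and negatives went into the table.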


Thanks for the suggestions. I will try them.

9.9 years ago

If you mean classification in a strict sense, i.e. supervised clusterisation of samples based on gene expression, then the most basic things to do are:

  1. Train your classifier with relatively large negative and positive sets. Report precision and recall using cross-validation.
  2. Select positive and negative validation sets (more is better), ensure that those samples were not used during training, and report the precision and recall of the trained classifier on those sets.

The second step is really critical to show that you're not over-fitting the data.
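
The two steps could look roughly like this in plain Python. The 1-nearest-neighbour classifier is just a placeholder for the real method, and the two-gene toy samples, the 3-fold split, and the function names are all assumptions made for the sketch.

```python
import math

def predict_1nn(train_X, train_y, row):
    """Placeholder classifier: label of the closest training sample."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], row))
    return train_y[nearest]

def precision_recall(y_true, y_pred, positive="tumor"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def cross_validated_predictions(X, y, k=3):
    """Step 1: predict every training sample from a model that never saw it."""
    preds = [None] * len(X)
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))  # interleaved folds
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        for i in test_idx:
            preds[i] = predict_1nn(train_X, train_y, X[i])
    return preds

def make_samples(centre, offsets):
    """Hypothetical two-gene profiles scattered around a class centre."""
    return [[centre + o, centre - o] for o in offsets]

train_offsets = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
held_out_offsets = [-0.15, 0.05, 0.25]  # samples never used for training
X_train = make_samples(0.0, train_offsets) + make_samples(2.0, train_offsets)
y_train = ["normal"] * 6 + ["tumor"] * 6
X_valid = make_samples(0.0, held_out_offsets) + make_samples(2.0, held_out_offsets)
y_valid = ["normal"] * 3 + ["tumor"] * 3

cv_precision, cv_recall = precision_recall(
    y_train, cross_validated_predictions(X_train, y_train))

# Step 2: train once on the full training set, then score the untouched validation set.
valid_pred = [predict_1nn(X_train, y_train, row) for row in X_valid]
valid_precision, valid_recall = precision_recall(y_valid, valid_pred)
print(cv_precision, cv_recall, valid_precision, valid_recall)
```

On real data the validation numbers are usually lower than the cross-validated ones; a large gap between the two is the over-fitting signal to look for.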


Thanks Mikhail. What do you mean by negative and positive sets?


I mean that if you have a binary classifier that tells, e.g., whether a sample comes from tumor or normal tissue, then the positive set will be the tumor expression datasets and the negative set will be the normal expression datasets. Of course, it all depends on what your classifier is meant to do.


Unfortunately it is not a binary classifier but an n-class one. Is there a related technique that is most commonly used?


The simplest way is to split the problem into several binary classification problems. The positive set will then be one sample type and the negative set will comprise all the other types. Note that each positive set should have a sufficient number of samples. Sample types represented by only a few samples are better left aside and then checked manually to see whether they are assigned to a reasonable cluster. For accuracy measures for n-class classification problems, have a look at http://rali.iro.umontreal.ca/rali/sites/default/files/publis/SokolovaLapalme-JIPM09.pdf
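
A minimal sketch of that one-vs-rest bookkeeping, in plain Python: each class in turn is treated as the positive set and everything else as the negative set, and the per-class values are then macro-averaged (one common way to summarise them, used here as an assumption). The subtype labels and predictions are invented.

```python
def one_vs_rest_metrics(y_true, y_pred):
    """Per-class precision and recall, treating each class in turn as 'positive'."""
    metrics = {}
    for cls in sorted(set(y_true)):
        true_bin = [t == cls for t in y_true]
        pred_bin = [p == cls for p in y_pred]
        tp = sum(t and p for t, p in zip(true_bin, pred_bin))
        fp = sum((not t) and p for t, p in zip(true_bin, pred_bin))
        fn = sum(t and (not p) for t, p in zip(true_bin, pred_bin))
        metrics[cls] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": sum(true_bin),  # number of true samples of this class
        }
    return metrics

def macro_average(metrics, key):
    """Unweighted mean over classes, so small classes count as much as large ones."""
    return sum(m[key] for m in metrics.values()) / len(metrics)

# Hypothetical true and predicted sample types for a 3-class problem.
y_true = ["subtype_A", "subtype_A", "subtype_A",
          "subtype_B", "subtype_B", "subtype_B",
          "subtype_C", "subtype_C"]
y_pred = ["subtype_A", "subtype_A", "subtype_B",
          "subtype_B", "subtype_B", "subtype_C",
          "subtype_C", "subtype_C"]

per_class = one_vs_rest_metrics(y_true, y_pred)
macro_precision = macro_average(per_class, "precision")
macro_recall = macro_average(per_class, "recall")
for cls, m in per_class.items():
    print(cls, m)
print(macro_precision, macro_recall)
```

Reporting the support next to each class makes it obvious which sample types have too few samples for their precision and recall to mean much.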

9.9 years ago
Christian ★ 3.0k

I stopped believing results from any classifier built from high-dimensional input data (like gene expression data sets) unless the results are shown to replicate on a completely independent data set, ideally by another research group. Cross-validation is the bare minimum, but even with it there is just too much data massaging and overfitting going on.

So if you have access to an independent data set, use it to assess the performance of your classifier before publishing, but be honest: don't cheat by tuning your classifier afterwards to improve the results. I know this sounds harsh, but the field has been plagued by unreproducible and non-replicable results for too long.


Thanks Christian, I will follow your advice.
