Question

Computing ROC curves without scores

0

5.9 years ago

Swimming bird ▴ 20

I am working with a gene prioritization program and I want to analyze its performance using ROC curves and the ROCR package for R. The problem is that the program does not give me a score for each gene; it only orders the genes.

ROCR only uses continuous data as predictions. Is it possible to assign to each gene a number in descending order? For example, I have these genes ordered:

Gene A
Gene B
Gene C

Could I assign these values?

Gene A: 3
Gene B: 2
Gene C: 1

gene prioritization roc auc R rocr • 2.7k views

ADD COMMENT • link updated 5.9 years ago by Jean-Karim Heriche 27k • written 5.9 years ago by Swimming bird ▴ 20

1

It's odd that the program doesn't give you a numerical value. Without one, you have no way of knowing whether two genes were tied in the prioritization, so you will likely have to assume there are no ties. I think the most explainable way is what you propose: construct the ROC curve from the rank of each gene in the list. The absolute value of a score doesn't really affect the ROC curve, since you could multiply all scores by some positive factor X and still get the same ROC curve.
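A minimal sketch of that idea, written in plain Python rather than with ROCR so that it has no dependencies. The gene list and labels are made up for illustration, and `roc_points` is a hypothetical helper, not part of any package. It turns ranks into descending scores and checks that rescaling them by a positive factor leaves the curve unchanged.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by moving a cutoff down the list.

    Assumes no tied scores, which is all a plain ranked list can tell us.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Visit genes from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical ranked output, best gene first; 1 marks a known true gene.
genes = ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E"]
labels = [1, 0, 1, 0, 0]

# Ranks as scores: the top gene gets the largest value (5, 4, ..., 1).
rank_scores = [len(genes) - i for i in range(len(genes))]
scaled_scores = [10.0 * s for s in rank_scores]  # any positive factor

curve = roc_points(rank_scores, labels)
same_curve = curve == roc_points(scaled_scores, labels)
print(curve)
print(same_curve)
```

Only the ordering enters the calculation, which is why the descending values 3, 2, 1 proposed in the question would work as input.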

Could you tell us what the gene prioritization method is? I would double check that the method really does not give you a numerical score, and that it is not doing a gene set analysis (where the order of the genes has no meaning) as opposed to prioritization.

ADD REPLY • link 5.9 years ago by Collin ▴ 1000

0

The program is DADA (http://compbio.case.edu/omics/software/dada/) and I think it does a real gene prioritization because it ranks your candidate genes file. Moreover, seed genes are usually at the top of the list. But you can take a look, I would be very grateful.

ADD REPLY • link 5.9 years ago by Swimming bird ▴ 20

1

Yep it is a network-based gene prioritization approach. Based on their paper, it does create a score that it uses to rank the genes. However, from their user documents they don't provide it as an output. So from a practical perspective, like Jean-Karim noted, you likely have to use the rank of the gene as a score so that packages like ROCR will generate a ROC curve for you.

ADD REPLY • link 5.9 years ago by Collin ▴ 1000

0

Perfect, thanks! Could you recommend any gene prioritization tool that provides scores in its output? I am using DADA because it allows you to load a protein-protein interaction network and is easy to use. It works well, but it would be useful to try other programs.

ADD REPLY • link 5.9 years ago by Swimming bird ▴ 20

score 1 · Answer 1 · 2018-06-02

1

5.9 years ago

Jean-Karim Heriche 27k

Although the ROCR package requires scores as input, these are only used to rank the instances. You don't need scores to compute a ROC curve; you can do it with ranks only. See the ROC curves section on mlwiki, in particular the practical method described there. You can also compute the AUC (area under the curve) using ranks only, since the AUC is equivalent to the Mann-Whitney U statistic (normalized by the number of positive-negative pairs).
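As a rough illustration of the rank-only AUC, here is a small Python sketch (stdlib only, not ROCR). The labels are invented, and both function names are hypothetical. It assumes a list ordered best first with no ties, computes the AUC from the rank sum of the positives, and cross-checks it against a direct count of correctly ordered positive/negative pairs.

```python
def auc_from_ranks(labels_best_first):
    """AUC via the Mann-Whitney U statistic, using ranks only (no ties)."""
    n = len(labels_best_first)
    n_pos = sum(labels_best_first)
    n_neg = n - n_pos
    # Ascending ranks: the last gene gets rank 1, the top gene gets rank n.
    rank_sum_pos = sum(
        n - position
        for position, label in enumerate(labels_best_first)
        if label
    )
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def auc_by_pair_counting(labels_best_first):
    """Fraction of (positive, negative) pairs where the positive ranks higher."""
    positives = [i for i, lab in enumerate(labels_best_first) if lab]
    negatives = [i for i, lab in enumerate(labels_best_first) if not lab]
    wins = sum(1 for p in positives for q in negatives if p < q)
    return wins / (len(positives) * len(negatives))


# Hypothetical ranking, best gene first; 1 marks a known true gene.
labels = [1, 0, 1, 0, 0]
auc_u = auc_from_ranks(labels)
auc_pairs = auc_by_pair_counting(labels)
print(auc_u, auc_pairs)
```

Both routes count the same thing, so they agree whenever the ranking has no ties.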

ADD COMMENT • link 5.9 years ago by Jean-Karim Heriche 27k

0

Thanks! If I want to compare this tool to another one that does report scores, would it be correct to generate that tool's curve from the scores? I suppose it doesn't matter, right?

ADD REPLY • link 5.9 years ago by Swimming bird ▴ 20

0

ROC curves (and AUC) may not always be the best way to compare or evaluate gene prioritization methods. For example, they don't give you information on the ranks of the true positives or, put another way, how many genes you need to take to get a given sensitivity value, which is often what you want when the genes are prioritized for further experiments. It depends on what is important in your context. For some discussion on this see this paper, and for some other considerations, in particular the problem of evaluation using "circular data" (i.e. the data used to train the algorithms, e.g. protein interactions, are the same data used to derive the annotations used for evaluation, like GO or pathways, resulting in overoptimistic results), see my paper here.

ADD REPLY • link 5.9 years ago by Jean-Karim Heriche 27k

1

I disagree with the point that ROC curves don't tell you "how many genes you need to take to get a given sensitivity value". You can in fact read off the corresponding number of genes from the threshold that produces X% sensitivity on the ROC curve (and see what the false positive rate is for that choice). The AUC doesn't give you a performance metric for the stated X% sensitivity goal, but rather a summarized assessment over all possible thresholds. The issue with picking thresholds for performance is that the choice can be fairly arbitrary and made in a way that seemingly increases a method's performance; the AUC prevents this. If you are interested in performance at a specific threshold, then in order not to mislead readers you should probably report both the sensitivity/specificity at that threshold AND the AUC.
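To make the threshold lookup concrete, here is a hedged Python sketch (stdlib only). The ranking is invented and `cutoff_for_sensitivity` is a hypothetical helper. It walks down a list ordered best first and reports the smallest number of top genes that reaches a target sensitivity, together with the false positive rate at that cutoff.

```python
def cutoff_for_sensitivity(labels_best_first, target):
    """Smallest top-k reaching the target sensitivity, plus the FPR there.

    Returns (k, fpr), or None if the target cannot be reached.
    """
    n_pos = sum(labels_best_first)
    n_neg = len(labels_best_first) - n_pos
    tp = 0
    for k, label in enumerate(labels_best_first, start=1):
        tp += label
        if tp / n_pos >= target:
            fp = k - tp  # negatives included in the top k
            return k, fp / n_neg
    return None


# Hypothetical ranking of 10 genes, best first; 1 marks a known true gene.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

k75, fpr75 = cutoff_for_sensitivity(labels, 0.75)
k100, fpr100 = cutoff_for_sensitivity(labels, 1.0)
print(k75, fpr75)
print(k100, fpr100)
```

With ranks as scores, the cutoff index k is itself the threshold, so the number of genes and the point on the ROC curve are two views of the same quantity.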

One of the major issues with a ROC curve that is usually cited (https://dl.acm.org/citation.cfm?id=1143874 and https://www.ncbi.nlm.nih.gov/pubmed/25738806) is that it doesn't handle class imbalance. Thus the ROC curve may not be the most meaningful measurement if the class of interest is substantially underrepresented. In this scenario the precision-recall curve (and its AUC) is a better alternative. However, the precision-recall curve is not perfect either: when the class of interest is actually the most common class, the precision-recall AUC is highly inflated.
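One way to see the imbalance point is the toy Python sketch below (stdlib only, invented labels, hypothetical function names). It takes a ranking, replaces every negative by ten negatives at the same position, and compares the ROC AUC with the precision at the cutoff that recovers all positives. This is an illustration under those assumptions, not a general proof.

```python
def roc_auc(labels_best_first):
    """Fraction of (positive, negative) pairs where the positive ranks higher."""
    n_pos = sum(labels_best_first)
    n_neg = len(labels_best_first) - n_pos
    positives_seen = 0
    concordant = 0
    for label in labels_best_first:
        if label:
            positives_seen += 1
        else:
            # Every positive seen so far outranks this negative.
            concordant += positives_seen
    return concordant / (n_pos * n_neg)


def precision_at_full_recall(labels_best_first):
    """Precision at the shortest top-k list that contains every positive."""
    n_pos = sum(labels_best_first)
    tp = 0
    for k, label in enumerate(labels_best_first, start=1):
        tp += label
        if tp == n_pos:
            return tp / k


# Hypothetical balanced ranking: 4 true genes, 4 false ones.
balanced = [1, 0, 1, 1, 0, 0, 1, 0]

# Same ranking, but each negative is replaced by 10 negatives in place.
imbalanced = []
for label in balanced:
    imbalanced.extend([1] if label else [0] * 10)

print(roc_auc(balanced), roc_auc(imbalanced))
print(precision_at_full_recall(balanced), precision_at_full_recall(imbalanced))
```

In this construction the ROC AUC is identical for both lists, while the precision drops sharply once negatives dominate, which is the behavior the precision-recall curve exposes.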

Regarding "circular data", no performance metric will be able to compensate for this. Only performing rigorous hold-out analysis or cross-study analysis may shed some light on it.

ADD REPLY • link 5.9 years ago by Collin ▴ 1000

0

From a practical point of view, what we are often interested in is: given that we have enough resources to test only 100 genes, how many true positives can we expect if we test the top 100 genes? ROC curves alone don't help to identify the best method for this particular use case. In practical settings, absolute thresholds are usually not hard to identify because they are imposed by the circumstances (availability of reagents, time, etc.), and for comparing methods a range of thresholds can be selected.

ADD REPLY • link 5.9 years ago by Jean-Karim Heriche 27k

1

The precision-recall curve basically assesses what you're saying. Precision is the fraction of predicted positives that are true positives (also known as the positive predictive value). You can always mark with a dot the precise point on the curve that corresponds to the top 100 genes. This would tell you how well you are predicting within the top 100 genes, in this example, and whether, given more resources, the next 100 would even be worthwhile to test.
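A small Python sketch of that reading of the curve (stdlib only). The labels are invented, `precision_recall_points` is a hypothetical helper, and the budget is scaled down from 100 genes to 3 to keep the example short. Each point gives the recall and precision obtained when the top k genes are called positive.

```python
def precision_recall_points(labels_best_first):
    """(recall, precision) for the top 1, top 2, ..., top n genes."""
    n_pos = sum(labels_best_first)
    tp = 0
    points = []
    for k, label in enumerate(labels_best_first, start=1):
        tp += label
        points.append((tp / n_pos, tp / k))
    return points


# Hypothetical ranking of 10 genes, best first; 1 marks a known true gene.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
points = precision_recall_points(labels)

budget = 3  # stand-in for "we can only test the top 100 genes"
recall_now, precision_now = points[budget - 1]
recall_next, precision_next = points[2 * budget - 1]

print(recall_now, precision_now)    # point to mark for the current budget
print(recall_next, precision_next)  # what doubling the budget would add
```

Comparing the two marked points shows how many additional true positives the next batch would be expected to contribute.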

Another point is that you do not necessarily need to test every gene above a given threshold. A completely random selection of genes above the established threshold could be used to estimate the validation rate for the group as a whole.
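The sampling idea could look roughly like the Python sketch below (stdlib only). Everything here is hypothetical: the gene names, the `validate` callback standing in for an experimental result, and the set of "true" genes used to simulate it.

```python
import random


def estimate_validation_rate(genes_above_threshold, validate, sample_size, seed=0):
    """Estimate the validation rate of a gene group from a random subset.

    `validate` is a placeholder for whatever experiment confirms a gene.
    Returns (estimated_rate, sampled_genes).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(genes_above_threshold, sample_size)
    hits = sum(1 for gene in sample if validate(gene))
    return hits / sample_size, sample


# Simulated setting: 200 genes pass the threshold, every 4th one is real.
candidates = [f"gene_{i}" for i in range(200)]
truly_positive = {f"gene_{i}" for i in range(0, 200, 4)}

rate, tested = estimate_validation_rate(
    candidates, lambda gene: gene in truly_positive, sample_size=20
)
print(rate, len(tested))
```

The estimate carries sampling error that shrinks as the sample grows, so the sample size is a trade-off between cost and how precisely the group's validation rate is known.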

ADD REPLY • link 5.9 years ago by Collin ▴ 1000