Precision-recall curve to compare metagenome classifiers to a gold standard - Is this even a viable method?
11 weeks ago

Hi! I am benchmarking different metagenomic classifiers on in silico datasets and am looking for metrics to compare their classification accuracy. I found this article https://www.sciencedirect.com/science/article/pii/S0092867419307755 where the authors suggest using precision-recall curves and AUPRC as comparison metrics. I am currently trying to implement this, but I am not sure it is even a good metric for this analysis. My curves seem to be all over the place. I am not sure whether I should apply the thresholding to the reference dataset (gold standard) as well: without that my curves look [...], but if I threshold the reference too, it just [...]. I am not sure whether my calculation is wrong or the method itself is unfit for this kind of comparison. I am also lost as to how I should even calculate the area under the curve from this...

classification recall precision metagenome • 326 views
11 weeks ago

Precision-Recall (PR) curves are mainly useful when benchmarking imbalanced data sets for which true negatives vastly outnumber true positives. PR curves are a much better measure of the prediction quality in such cases than ROC curves. However, PR curves have disadvantages, and I am not sure they are the best choice for evaluating metagenomics tools for taxonomic classifications. Nevertheless, based on the article in your question, I wrote some Python code that might help you with your calculations. At least you will be able to compare it with your calculations.

import random
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# List of all species that are actually in the sample (ground truth)
TRUE_SPECIES = [f'species{i}' for i in range(100)]
# Your predictions: a dict where keys are predicted species and values
# are the corresponding scores (random placeholders here, in [0, 1))
MY_SPECIES = {f'species{i}': random.random() for i in range(200)}

labels = []
scores = []
for species in set(TRUE_SPECIES).union(MY_SPECIES):
    label = 1 if species in TRUE_SPECIES else 0
    score = MY_SPECIES.get(species, 0.0)
    labels.append(label)
    scores.append(score)

precisions, recalls, thresholds = metrics.precision_recall_curve(labels, scores)
auprc = metrics.auc(recalls, precisions)
plt.title('Precision Recall Curve')
plt.plot(recalls, precisions, 'b', label = f'AUPRC: {auprc:.3}')
plt.legend(loc = 'upper right')
plt.ylim([0, 1])
plt.margins(x=0)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.savefig('prc.png')
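
If you want to check the numbers by hand, here is a minimal pure-Python sketch of the same idea. As in the code above, the gold standard is not thresholded: it only decides whether a called species is a true or a false positive, and species the classifier missed get a score of 0.0. The function names (`pr_curve`, `average_precision`) and the toy data are made up for illustration, and the area is computed as a step-wise sum of precision times recall gain, which is an assumption on my side and not necessarily what the article does.

```python
def pr_curve(true_species, predicted_scores):
    """Return (threshold, precision, recall) tuples, one per distinct score.

    The gold standard is never thresholded; only the predicted scores are.
    """
    truth = set(true_species)
    # Species missed by the classifier get a score of 0.0 (assumption,
    # mirroring MY_SPECIES.get(species, 0.0) in the code above).
    scores = {s: predicted_scores.get(s, 0.0) for s in truth | set(predicted_scores)}
    points = []
    # Sweep thresholds from the highest score down to the lowest.
    for threshold in sorted(set(scores.values()), reverse=True):
        called = {s for s, score in scores.items() if score >= threshold}
        tp = len(called & truth)
        points.append((threshold, tp / len(called), tp / len(truth)))
    return points


def average_precision(points):
    """Step-wise area under the PR curve: sum of precision * recall gain."""
    area, previous_recall = 0.0, 0.0
    for _, precision, recall in points:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Hypothetical toy data: three species are truly present, 'C' was missed.
truth = ['A', 'B', 'C']
predicted = {'A': 0.9, 'X': 0.7, 'B': 0.4, 'Y': 0.1}

points = pr_curve(truth, predicted)
for threshold, precision, recall in points:
    print(f'threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}')
print(f'average precision: {average_precision(points):.3f}')
```

On the toy data the step-wise sum works out to 34/45, about 0.756. Note that `metrics.auc` in the code above uses the trapezoidal rule instead, so its value will generally differ a little from this sum.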

11 weeks ago
Mensur Dlakic ★ 20k

There is a Critical Assessment of Metagenome Interpretation (CAMI) every few years where different binning programs are assessed. Their papers and software could potentially be interesting.

Or see how they did it in a recent paper with new binning programs: