Precision-recall curve to compare metagenome classifiers to a gold standard - Is this even a viable method?
22 months ago

Hi! I am benchmarking different metagenomic classifiers on in silico datasets and am looking for metrics to compare their classification accuracy. I found this article, where the authors suggest precision-recall curves and AUPRC as comparison metrics. I am trying to implement this, but I am not sure it is even a good metric for this analysis. My curves seem all over the place. I am not sure whether I should also apply the thresholding to the reference dataset (gold standard): without it my curves look okay-ish (see picture), but if I threshold the reference too, it gets messy (other picture). I don't know whether my calculation is wrong or the method itself is unfit for this kind of comparison. I am also lost on how to calculate the area under the curve from this...

classification recall precision metagenome • 1.1k views

Hi, I'm trying to construct a similar curve from Kraken2 and Bracken report files. I'm trying to use sklearn.metrics.precision_recall_curve, but I can't figure out exactly what the input should be. I have gone through other articles too, and still have trouble working out what the input data is. How do I calculate the precision and recall values from the Kraken2 report? The only fields I see in the report are the percentage classified, the number of reads at clade level, the number of reads at taxon level, the taxon ID, the taxonomic rank and the name of the organism. Could you please share the exact input format and the steps you followed to get these plots? Thanks in advance!



Sorry for the slow answer. I am working on something very similar right now, so I looked into this a bit more, and I am not sure sklearn.metrics.precision_recall_curve() is the way to go. It requires a prediction score for each item, and while the relative abundance would seem fitting for this, I don't think it should be used. In a machine learning model, the probability score is the value that gets thresholded for each prediction; it shows how "sure" the model is of that prediction. The relative abundance, on the other hand, has nothing to do with probability (for non-ML-based classifiers); it just says: "Prevotella copri makes up 1/5 of this sample". It says nothing about how probable it is that the classified reads/k-mers/markers/proteins etc. actually belong to Prevotella copri. Moreover, the probability scores of ML models are independent of each other, whereas the relative abundances add up to one. I am currently comparing the scikit-learn method to my own implementation of the precision-recall curve, and they give vastly different values.

About the technical stuff: I take the percentage values from the Kraken2 report file (only the species-level rows in my case) and build a Python dictionary from them. I wrote a simple thresholding function that keeps a key in the dictionary only if its value is above the threshold. I compare the remaining keys to the ground truth data, count true positives, false positives and false negatives, and calculate precision and recall from them. I loop through different threshold values, save the precision-recall pair for each one, and plot them with matplotlib.
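
The steps above could be sketched roughly as follows. This is not my exact script, just a minimal illustration: it assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, taxon reads, rank code, taxon ID, name), and the function names, the toy report and the ground truth set are all made up for the example. Returning a precision of 1.0 when nothing passes the threshold is a convention, not a necessity.

```python
# Toy report text in the standard six-column, tab-separated Kraken2 layout.
# All taxa, read counts and taxon IDs below are made up for illustration.
REPORT = "\n".join([
    " 60.00\t600\t0\tG\t1\tGenus one",
    " 40.00\t400\t400\tS\t11\t  Species A",
    " 20.00\t200\t200\tS\t12\t  Species B",
    " 30.00\t300\t0\tG\t2\tGenus two",
    " 29.00\t290\t290\tS\t21\t  Species C",
    "  1.00\t10\t10\tS\t22\t  Species D",
])
# Hypothetical gold standard: species that are truly in the sample.
TRUTH = {"Species A", "Species B", "Species E"}


def parse_kraken2_species(report_text):
    """Return {species name: percentage of reads} for species-level rows."""
    abundances = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, _clade, _taxon, rank, _taxid, name = fields[:6]
        if rank.strip() == "S":  # species only; excludes S1, S2, ...
            abundances[name.strip()] = float(percent)
    return abundances


def precision_recall_at(abundances, truth, threshold):
    """Precision and recall of the species whose percentage exceeds threshold."""
    predicted = {sp for sp, pct in abundances.items() if pct > threshold}
    tp = len(predicted & truth)   # predicted and truly present
    fp = len(predicted - truth)   # predicted but absent
    fn = len(truth - predicted)   # present but missed
    # Convention: precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if predicted else 1.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall


abundances = parse_kraken2_species(REPORT)
for threshold in (0.0, 5.0, 25.0, 35.0, 50.0):
    precision, recall = precision_recall_at(abundances, TRUTH, threshold)
    print(f"threshold={threshold:5.1f}%  "
          f"precision={precision:.3f}  recall={recall:.3f}")
```

Note that the ground truth is not thresholded here; only the predictions are.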

I hope this helps!

22 months ago

Precision-Recall (PR) curves are mainly useful when benchmarking imbalanced data sets in which negatives vastly outnumber positives. In such cases, PR curves are a much better measure of prediction quality than ROC curves. However, PR curves have disadvantages, and I am not sure they are the best choice for evaluating metagenomics tools for taxonomic classification. Nevertheless, based on the article in your question, I wrote some Python code that might help you with your calculations. At the very least, you will be able to compare it with your own.


import random
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# List of all species that are actually in the sample (ground truth)
TRUE_SPECIES = [f'species{i}' for i in range(100)]
# Your predictions: a dict where keys are predicted species and values
# are the corresponding scores (random placeholders here)
MY_SPECIES = {f'species{i}': random.random() for i in range(200)}

# Build one label (1 = truly present, 0 = absent) and one score per species
labels = []
scores = []
for species in set(TRUE_SPECIES).union(MY_SPECIES):
    label = 1 if species in TRUE_SPECIES else 0
    score = MY_SPECIES.get(species, 0.0)
    labels.append(label)
    scores.append(score)

precisions, recalls, thresholds = metrics.precision_recall_curve(labels, scores)
auprc = metrics.auc(recalls, precisions)
plt.title('Precision Recall Curve')
plt.plot(recalls, precisions, 'b', label=f'AUPRC: {auprc:.3}')
plt.legend(loc='upper right')
plt.ylim([0, 1])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
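
If the precision-recall pairs come from a manual threshold sweep instead of precision_recall_curve, the area can be approximated with the trapezoidal rule, which is also what sklearn.metrics.auc does. Below is a minimal sketch; the function name and the example points are made up, and the points are assumed to be ordered from the highest threshold to the lowest, so that recall never decreases. Keep in mind that linear interpolation between PR points can give an optimistic estimate, and that the area only covers the recall range your thresholds actually reach.

```python
def auprc_trapezoid(points):
    """Area under (recall, precision) points using the trapezoidal rule.

    `points` must be ordered so that recall is non-decreasing, e.g. from
    the highest abundance threshold down to the lowest.
    """
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if r1 < r0:
            raise ValueError("recall must be non-decreasing")
        # Width of the recall step times the mean precision over that step.
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Made-up example: three (recall, precision) pairs from a threshold sweep.
example_points = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
example_area = auprc_trapezoid(example_points)
print(f"AUPRC: {example_area:.3f}")
```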
22 months ago
Mensur Dlakic ★ 27k

There is a Critical Assessment of Metagenome Interpretation (CAMI) every few years where different binning programs are assessed. Their papers and software could potentially be interesting.

Or see how they did it in a recent paper with new binning programs:

