Calculating Tn (True Negatives) For An Indel Detection Method
2
6
10.3 years ago
Pascal ★ 1.5k

This is a kind of follow-up inspired by the very good question/answers I read in "How to calculate sensitivity/selectivity of an algorithm that returns locations of possible matches?"

My goal is to evaluate the Sensitivity/Specificity of an indel detection method.

I have a "gold standard" VCF file (ref.vcf) that states exactly where the insertions and deletions are in my genome. And of course, my indel detection method produces its own VCF file (let's call it test.vcf).

To calculate the True Positives, I take the intersection of test.vcf and ref.vcf (I use exact intersection for now, for the sake of simplicity). The False Positives are the features in test.vcf that are not in ref.vcf, and the False Negatives are the features in ref.vcf that are not in test.vcf.
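The exact-intersection counting described above can be sketched in a few lines of Python. This is only a minimal illustration, not a full VCF parser: the helper names `load_indels` and `confusion_counts` are made up for this example, the records are invented toy data, and an indel is assumed to be any ALT allele whose length differs from REF (symbolic alleles such as `<DEL>` and left-normalization issues are ignored).

```python
def load_indels(vcf_lines):
    """Collect indel records as (chrom, pos, ref, alt) keys from VCF text lines."""
    calls = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            # Assumption: a length change between REF and ALT means an indel.
            if len(allele) != len(ref):
                calls.add((chrom, int(pos), ref, allele))
    return calls


def confusion_counts(ref_calls, test_calls):
    """Return (TP, FP, FN) using exact matching of call keys."""
    tp = len(ref_calls & test_calls)   # called and in the gold standard
    fp = len(test_calls - ref_calls)   # called but not in the gold standard
    fn = len(ref_calls - test_calls)   # in the gold standard but missed
    return tp, fp, fn


# Toy stand-ins for ref.vcf and test.vcf (invented records).
ref_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tAT\t.\tPASS\t.",
    "chr1\t200\t.\tGCC\tG\t.\tPASS\t.",
    "chr2\t50\t.\tT\tTAA\t.\tPASS\t.",
]
test_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tAT\t50\tPASS\t.",
    "chr1\t300\t.\tC\tCG\t20\tPASS\t.",
    "chr2\t50\t.\tT\tTAA\t40\tPASS\t.",
]

tp, fp, fn = confusion_counts(load_indels(ref_vcf), load_indels(test_vcf))
print(tp, fp, fn)
```

Note that nothing in this counting yields a TN value directly, which is exactly the problem raised next.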

But how would you calculate the True Negatives? I can't just use the number of positions left (that number is far too big!).

indel • 5.4k views
3

Why is the number too big? From my understanding, you have a number of positions that say "nope, no indel here," which is probably the majority of them. For these positions, if there really isn't an indel there, shouldn't that be a true negative? Assuming similar data, you should have mostly true negatives.

1

Pascal is correct: the total number of TN is so large (~3.3e9 for human) that the resulting figures will be misleading (and drown in rounding error). It is therefore common practice not to use the standard definition of specificity in this setting.

5
10.3 years ago

You can use the Positive Predictive Value (PPV) instead of the standard specificity (Sp) (thanks Casey for clearing the definition up).

PPV = TP/(TP + FP)

Sp = TN/(TN+FP)

This has been used in eukaryotic gene prediction, where you have a similar case: if you look for coding regions on a per-nucleotide basis, the vast majority of the genome is non-coding. PPV has the advantage of avoiding the extremely large TN values that push Sp close to 1 in most cases.
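To make the effect concrete, here is a small sketch comparing Sp and PPV for two callers. The genome size is the ~3.3e9 figure quoted in the comments; the TP/FP/FN counts are invented purely for illustration.

```python
def specificity(tn, fp):
    """Sp = TN / (TN + FP)"""
    return tn / (tn + fp)


def ppv(tp, fp):
    """PPV = TP / (TP + FP)"""
    return tp / (tp + fp)


GENOME_POSITIONS = 3_300_000_000  # ~3.3e9 positions, as quoted in the thread

# Hypothetical counts for a good and a poor indel caller (invented numbers).
callers = {
    "good": {"tp": 800, "fp": 200, "fn": 200},
    "poor": {"tp": 100, "fp": 900, "fn": 900},
}

results = {}
for name, c in callers.items():
    # Every position that is neither TP, FP nor FN counts as a true negative.
    tn = GENOME_POSITIONS - c["tp"] - c["fp"] - c["fn"]
    results[name] = (specificity(tn, c["fp"]), ppv(c["tp"], c["fp"]))
    print(f"{name}: Sp = {results[name][0]:.10f}, PPV = {results[name][1]:.2f}")
```

Under these assumed counts both callers get an Sp that is indistinguishable from 1, while PPV separates them clearly (0.8 versus 0.1).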

0

As you probably know, the genome-wide "specificity" value you refer to is more properly called positive predictive value (PPV http://en.wikipedia.org/wiki/Positive_predictive_value). The (mis)use of the term specificity for PPV causes no end of confusion among students (and researchers). I've found it is better to avoid using the terms sensitivity/specificity and use recall/precision instead, since they are not ambiguously defined.

0

So? Actually, I found this definition in the book 'Zvelebil, Understanding Bioinformatics'. I will look the definition up tomorrow and see whether they got it right, whether they are themselves a source of confusion, or whether I am. And 'causes no end of confusion': now you are exaggerating a bit, aren't you? But I will correct it and call it PPV then.

0

The reason this is important is that terms must be precise to have meaning. I wouldn't be surprised if Zvelebil is wrong on this; it happens in many places. FYI, see Wikipedia for the formal classification of performance-related terms: http://en.wikipedia.org/wiki/Sensitivity_and_specificity#Worked_example

0

Sorry if I sounded patronizing, that was not my intent. I could have been more direct, but I was afraid that would have sounded hostile.

0

I have now checked the text in the textbook "Understanding Bioinformatics", 1st edition (2007, maybe corrected by now?), by Zvelebil & Braun. On p. 365 they use the exact misnomer I was reproducing. They propose PPV and introduce it as specificity, while mentioning a standard definition of specificity (the same as Sp in my text) without giving references.

0
10.3 years ago
DG 7.2k

I agree with the comment above: that number really is your True Negative count. And yeah, it will be an absurdly large number depending on your dataset. What you will want to do is look beyond simply calculating sensitivity and specificity. In cases where you have an unbalanced number of entries per class (indel vs. no-indel in this case), you want to start looking at something like the F1-score or the Matthews Correlation Coefficient as a better summary statistic for your comparisons.
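As a rough sketch of those two summary statistics, using the standard textbook formulas: the counts below are invented, and the TN value simply assumes a ~3.3e9-position genome minus the other three cells.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); does not use TN at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts (invented for illustration).
tp, fp, fn = 800, 200, 200
tn = 3_300_000_000 - tp - fp - fn  # assumed genome size minus the other cells

f1 = f1_score(tp, fp, fn)
m = mcc(tp, fp, fn, tn)
print(f"F1 = {f1:.4f}, MCC = {m:.8f}")
```

With a TN count this large, the MCC ends up very close to the geometric mean of precision and recall, so in practice it carries much the same information as the TN-free F1-score here.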

Another way to analyze the data is to construct ROC or Precision-Recall curves so you can see how the specificity and sensitivity interact with one another.
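A Precision-Recall curve can be sketched without any TN count, whereas a ROC curve needs the false positive rate and therefore runs into the large-TN issue again. The example below assumes each call in test.vcf carries a score such as the QUAL field; the function name `pr_curve`, the call keys, and the scores are all invented for illustration.

```python
def pr_curve(scored_calls, truth):
    """Return (threshold, precision, recall) points, one per distinct score.

    scored_calls: list of (call_key, score) pairs from the method under test.
    truth: non-empty set of call keys from the gold standard.
    """
    points = []
    thresholds = sorted({score for _, score in scored_calls}, reverse=True)
    for threshold in thresholds:
        # Keep only calls at or above the current score cutoff.
        kept = {key for key, score in scored_calls if score >= threshold}
        tp = len(kept & truth)
        precision = tp / len(kept)   # kept is never empty: threshold is a real score
        recall = tp / len(truth)
        points.append((threshold, precision, recall))
    return points


# Toy data: three true indels, four calls with hypothetical QUAL-like scores.
truth = {"chr1:100", "chr1:200", "chr2:50"}
scored_calls = [("chr1:100", 50), ("chr1:300", 20), ("chr2:50", 40), ("chr3:7", 10)]

points = pr_curve(scored_calls, truth)
for threshold, precision, recall in points:
    print(f"score >= {threshold}: precision = {precision:.2f}, recall = {recall:.2f}")
```

Plotting recall against precision for these points gives the curve; lowering the threshold can only keep or raise recall, while precision may drop as weaker calls are admitted.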

0

Unfortunately, this is incorrect, according to the reasons detailed in the comment.

0

It is true that it might be good to look at other measures, but Sp and Se remain usable because there is a way around the large counts, and it is good to use them because they are so well established.