Question: Calculating Tn (True Negatives) For An Indel Detection Method
8.4 years ago by
Pascal1.5k wrote:

This is a kind of follow-up inspired by the very good question/answers I read in "How to calculate sensitivity/selectivity of an algorithm that returns locations of possible matches?"

My goal is to evaluate the Sensitivity/Specificity of an indel detection method.

I have a "gold standard" VCF file (ref.vcf) that states exactly where the insertions and deletions are in my genome. And of course, my indel detection method produces its own VCF file (let's call it test.vcf).

To calculate the True Positives, I take the intersection of test.vcf and ref.vcf (I use exact intersection for the sake of simplicity for now). The False Positives are the features in test.vcf that are not in ref.vcf, and the False Negatives are the features in ref.vcf that are not in test.vcf.
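The exact-intersection counting described above can be sketched with Python sets. This is only an illustration under stated assumptions: the helper names `read_indel_keys` and `confusion_counts` are hypothetical, a record is keyed by (CHROM, POS, REF, ALT), and an indel is taken to be any record whose REF and ALT differ in length. The example records are made up.

```python
def read_indel_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys for indel records in VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            # Assumption: an indel is any allele whose length differs from REF.
            if len(ref) != len(alt):
                keys.add((chrom, int(pos), ref, alt))
    return keys


def confusion_counts(ref_keys, test_keys):
    """Return (TP, FP, FN) under exact matching of record keys."""
    tp = len(ref_keys & test_keys)   # called and in the gold standard
    fp = len(test_keys - ref_keys)   # called but not in the gold standard
    fn = len(ref_keys - test_keys)   # in the gold standard but missed
    return tp, fp, fn


# Made-up records standing in for ref.vcf and test.vcf.
ref_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tAT\t.\t.\t.",
    "chr1\t200\t.\tGCC\tG\t.\t.\t.",
]
test_lines = [
    "chr1\t100\t.\tA\tAT\t.\t.\t.",
    "chr1\t300\t.\tT\tTA\t.\t.\t.",
]

tp, fp, fn = confusion_counts(read_indel_keys(ref_lines), read_indel_keys(test_lines))
print(tp, fp, fn)
```

Exact matching treats the same indel written with a different left/right alignment as a mismatch, so real comparisons usually normalize both files first.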

But how would you calculate the True Negatives? I can't just use the number of positions left (the number is far too big!).

indel • 4.2k views
written 8.4 years ago by Pascal1.5k

Why is the number too big? From my understanding, you have a number of positions that say "nope, no indel here," which is probably the majority of them. For these positions, if there really isn't an indel there, shouldn't that be a true negative? Assuming similar data, you should have mostly true negatives.

written 8.4 years ago by Fwip490

Pascal is correct: the total number of TNs is so large (~3.3e9 for human) that the figures will be misleading (and drown in rounding error). It is therefore common practice not to use the standard definition of specificity in this setting.

written 8.4 years ago by Michael Dondrup47k
8.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

You can use the Positive Predictive Value (thanks Casey for clearing the definition up).

PPV = TP/(TP + FP)

instead of the Specificity:

Sp = TN/(TN+FP)

This has been used in eukaryotic gene prediction, where you have a similar case: if you look for coding regions on a per-nucleotide basis, the vast majority of the genome is non-coding. PPV has the advantage of avoiding the extremely large TN values that push Sp close to 1 in most cases.
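A small numeric sketch of why Sp saturates while PPV stays informative. The counts below are hypothetical, and treating every one of ~3.3e9 genome positions as a candidate site is an assumption made only for illustration.

```python
def specificity(tn, fp):
    """Sp = TN / (TN + FP)"""
    return tn / (tn + fp)


def ppv(tp, fp):
    """PPV = TP / (TP + FP)"""
    return tp / (tp + fp)


# Hypothetical counts; every genome position is assumed to be one candidate site.
genome_positions = 3_300_000_000
tp, fp, fn = 900, 300, 100
tn = genome_positions - tp - fp - fn

sp = specificity(tn, fp)
precision = ppv(tp, fp)

print(f"Sp  = {sp:.10f}")        # indistinguishable from 1 at normal reporting precision
print(f"PPV = {precision:.2f}")  # one call in four is wrong
```

With these counts a quarter of the calls are false, yet Sp differs from 1 only around the eighth decimal place.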

modified 8.4 years ago • written 8.4 years ago by Michael Dondrup47k

As you probably know, the genome-wide "specificity" value you refer to is more properly called positive predictive value (PPV). The (mis)use of the term specificity for PPV causes no end of confusion among students (and researchers). I've found it is better to avoid the terms sensitivity/specificity and use recall/precision instead, since they are not ambiguously defined.

written 8.4 years ago by Casey Bergman18k

So? Actually, I found this definition in the book 'Zvelebil, Understanding Bioinformatics'. I will look it up tomorrow and see whether they got it right, whether they are themselves a source of confusion, or whether I am. And 'causes no end of confusion': now you are exaggerating a bit, aren't you? But I will correct it and call it PPV then.

written 8.4 years ago by Michael Dondrup47k

The reason this is important is that terms must be precise to have meaning. I wouldn't be surprised if Zvelebil is wrong on this; it happens in many places. FYI, see Wikipedia for the formal classification of performance-related terms:

modified 8 months ago by RamRS27k • written 8.4 years ago by Casey Bergman18k

Sorry if I sounded patronizing, that was not my intent. I could have been more direct, but I was afraid that would have sounded hostile.

written 8.4 years ago by Casey Bergman18k

I have now checked the text in the textbook "Understanding Bioinformatics", 1st edition (2007, maybe corrected by now?), by Zvelebil & Braun. On p. 365 they use the exact misnomer I was reproducing. They propose PPV and introduce it as specificity, while mentioning a standard definition of specificity (the same as Sp in my text) without giving references.

written 8.4 years ago by Michael Dondrup47k
8.4 years ago by
DG7.1k wrote:

I agree with the comment above: that number really is your True Negative count. And yeah, it will be an absurdly large number depending on your dataset. What you will want to do is look beyond simply calculating sensitivity and specificity. In cases where you have an unbalanced number of entries per class (indel vs. no-indel in this case), you want to start looking at something like the F1-score or the Matthews Correlation Coefficient as a better summary statistic for your comparisons.
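As a rough sketch of the two summary statistics mentioned here, using their standard definitions. The counts are hypothetical, and the TN value again assumes that every position of a ~3.3e9 bp genome counts as one candidate site.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; TN does not appear in it."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; uses all four cells of the table."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts; TN assumes one candidate site per genome position.
tp, fp, fn = 900, 300, 100
tn = 3_300_000_000 - tp - fp - fn

f1 = f1_score(tp, fp, fn)
mcc_value = mcc(tp, fp, fn, tn)
print(f"F1  = {f1:.3f}")
print(f"MCC = {mcc_value:.3f}")
```

F1 needs no TN at all, which sidesteps the problem in the question; MCC does need it, but with these counts it does not collapse toward 1 the way Sp does.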

Another way to analyze the data is to construct ROC or Precision-Recall curves so you can see how specificity and sensitivity interact with one another.

written 8.4 years ago by DG7.1k

Unfortunately, this is incorrect, according to the reasons detailed in the comment.

written 8.4 years ago by Michael Dondrup47k

It is true that it might be good to look at other measures, but it is also possible to work with Sp and Se, because there is a way around the large counts, and it is good to use these measures because they are so well established.

written 8.4 years ago by Michael Dondrup47k