How To Calculate Sensitivity/Selectivity Of An Algorithm That Returns Locations Of Possible Matches?
3
4
Entering edit mode
10.7 years ago
Mike ▴ 40

Hi,

I'm playing around with an exact matching algorithm that takes a long text T (which could be a genome) and a short pattern P (some query sequence) and generates a list of locations where the pattern occurs. I'm trying to extend it to generate locations of possible partial/inexact matches of the pattern. How do I then measure the performance of the algorithm when it comes to accuracy? How do I measure its sensitivity and selectivity?

• 8.7k views
1
Entering edit mode

Do you mean like BLAST?

0
Entering edit mode

do you have a gold standard to evaluate against?

0
Entering edit mode

Do you mean like grep/agrep? ;)

0
Entering edit mode

@niallhaslam: Similar to Blast, yes. @Casey: No, I'm actually looking for something like that, if it exists.

0
Entering edit mode

Mike, if you say similar to BLAST (a local aligner), do you want to do sequence alignment? Note that alignment is different from pattern matching; please specify precisely what you are trying to achieve, otherwise any implementation attempt will fail. If you are dealing with global/local alignments, then your 'gold standard' is Needleman-Wunsch (global) or Smith-Waterman (local).

0
Entering edit mode

Thanks for the comments, Michael. My end goal is local alignment. I'm aiming to use the fuzzy matching to essentially give me the initial locations where I can use some method to find good local alignments. (Local DP comes to mind.)
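For reference, the "local DP" step can be sketched as a minimal Smith-Waterman score computation. This is illustrative only: the scoring values are arbitrary assumptions, and a real implementation would restrict the DP to a band around each seed location and do traceback to recover the alignment.

```python
# Minimal Smith-Waterman local alignment score (sketch, not optimized).
# Scoring values (match=2, mismatch=-1, gap=-2) are arbitrary example choices.

def smith_waterman(t, p, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of pattern p against text t."""
    rows, cols = len(p) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if p[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment: never negative
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # gap in the text
                          H[i][j - 1] + gap)      # gap in the pattern
            best = max(best, H[i][j])
    return best
```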

0
Entering edit mode

Also, Michael, I'm concerned about S-W as a "gold standard". Yes, it is optimal, but I'd like to run the algorithm on a long (perhaps genome-sized) text with a small query.

I do hope to achieve faster (though less accurate) performance with my algorithm.

5
Entering edit mode
10.7 years ago
ALchEmiXt ★ 1.9k

A classical approach to assessing the accuracy of algorithms (and also of diagnostic tests) is to use ROC curves, which plot the rate of true positives against the rate of false positives. The further the curve lies from the 45-degree diagonal, the better; more precisely, it's the area under the curve (AUC) that matters, and bigger is usually better.

Image source: Wikipedia
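As a sketch of how the curve and its area can be computed from scored hits (the scores and labels below are made-up example data, not output from any real aligner):

```python
# Sketch: build ROC points from scored predictions, then compute the AUC
# with the trapezoidal rule. `scores` and `labels` are hypothetical data.

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, sweeping the threshold over every score."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)            # total real positives
    N = len(labels) - P        # total real negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # hypothetical match scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = true match, 0 = false hit
```

Note this simple sweep treats tied scores one at a time; a production version would advance through ties as a group.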

1
Entering edit mode

Nice answer. Heng Li has a summary of ROC curves for next-gen sequencing aligners for the original problem space, with details on the code to generate them: http://lh3lh3.users.sourceforge.net/alnROC.shtml

4
Entering edit mode
10.7 years ago
Neilfws 49k

First, a quick summary of sensitivity, specificity and accuracy; then some thoughts as to whether they apply in your case. I've linked here to Wikipedia entries, which are very good summaries of this topic.

To calculate sensitivity, specificity and accuracy, you require 4 things:

1. True positives (TP). A true positive is an observation, identified by your algorithm, which is a real instance of a feature.
2. False positives (FP). A false positive is when your algorithm identifies an observation as a real instance of the feature but in fact, it is not.
3. True negatives (TN). A true negative is the case where the observation is not a real instance of the feature and your algorithm identifies it as such.
4. False negatives (FN). A false negative is the case where your algorithm does not identify the observation as a real instance of the feature but in fact, it is.

Sensitivity is then:

TP / (TP + FN)


And specificity is:

TN / (TN + FP)


Accuracy does not have a single, concise definition - it's measured in many ways, using various combinations of TP, FP, TN and FN. One measure is:

(TP + TN) / (TP + FP + TN + FN)


Another metric is positive predictive value, defined as:

TP / (TP + FP)
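The four formulas above translate directly into code. Here is a minimal sketch, with made-up example counts:

```python
# Sensitivity, specificity, accuracy and PPV from the four basic counts.

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def ppv(tp, fp):
    return tp / (tp + fp)

# Hypothetical run: 90 real matches found, 10 missed,
# 5 false hits, 895 positions correctly rejected.
tp, fn, fp, tn = 90, 10, 5, 895
```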


Your problem, then, is to define what constitutes TP, FP, TN and FN. You can see that this is difficult to do when the algorithm is fuzzy pattern matching between strings. A perfect match to a known motif might be a TP, but what about a partial match - is that a TP? Similarly, what is a TN in this case? No matches when a known motif is not present? Clearly, the definition of a match changes depending on the query and target sequences. So I don't think that specificity and sensitivity are very useful measures in your case.

Since what you are doing sounds very similar to BLAST (and other pattern matching algorithms), I suggest you consult the literature and see how performance of those algorithms was evaluated.

0
Entering edit mode

Defining the TP, FP, etc. has been my problem because of the fuzziness. Thank you very much for this answer.

0
Entering edit mode

I would like to add that, in this case, there are many solutions that achieve TP rate = 1 and FP = 0. While it might be acceptable to have FP != 0 in a setting involving uncertainty (e.g. classification), in pattern matching and sequence alignment it is not.

2
Entering edit mode
10.7 years ago

In addition to what Neil wrote and Casey's comment: yes, there is an easy way of getting a gold standard, namely an established algorithm. For exact matching you are competing with naive matching (iterating over the reference and checking each position), Boyer-Moore, and Knuth-Morris-Pratt; for fuzzy or approximate matching, with Bitap variants such as Wu-Manber. These algorithms are readily accessible in, for example, the Unix tools grep (exact; I'm not sure which algorithm it uses, BM?) and agrep (fuzzy). These tools can help you verify that you found all matches and only those.
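A minimal sketch of this evaluation setup, using naive exact matching as the gold standard (the candidate hit list below is a made-up example, standing in for your algorithm's output):

```python
# Sketch: score a candidate matcher's hit positions against a naive
# exact-matching gold standard.

def naive_matches(text, pattern):
    """All start positions where pattern occurs exactly in text (gold standard)."""
    n, m = len(text), len(pattern)
    return {i for i in range(n - m + 1) if text[i:i + m] == pattern}

def score(candidate_hits, gold):
    """Return (TP, FP, FN) for a set of candidate hit positions."""
    tp = len(candidate_hits & gold)
    fp = len(candidate_hits - gold)
    fn = len(gold - candidate_hits)
    return tp, fp, fn

text = "ACGTACGTACGT"
gold = naive_matches(text, "ACGT")       # positions 0, 4, 8
tp, fp, fn = score({0, 4, 7}, gold)      # candidate missed 8, false hit at 7
```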

Now, the problem is that you cannot beat these algorithms in accuracy at all, because they all find all possible matches (TP rate = 100%) and only those (FP = 0), given that the implementation is correct, of course. Actually, you never want any FP in string matching (for exact matching at least, it is very easy to check), thus the specificity is always 1. What you could try with your own algorithm, for the purpose of practicing, is to compare space and time complexity with the established ones, or trade sensitivity for complexity.