Question: Error when fuzzy sequence matching using String::Approx in perl
2
gravatar for Daniel
5.8 years ago by
Daniel3.8k
Cardiff University
Daniel3.8k wrote:

This is a bit complicated but I'm hitting a wall, so please bear with me...

I have written a perl script for searching a reference database with a ~30 base probe. I'm trying to expand it to allow a defined number of mismatches and this led me to String::approx.

The modifiers in the 'aindex' command allow you to define numbers of Insertions, Deletions and Substitutions (Ix,Dx,Sx). If I have these as I0,D0,S0 then I get exact matching as expected but if I use I0,D0,S1 then it matches significantly more than one substitution and I cant understand why

Relevant code snippet:

while (my $seq = $in->next_seq()){
        my $ref = $seq->seq;
        my $new_id = $seq->display_id;

         foreach my $pr (@seqarray){
                my $index = aindex($pr, ["I0","D0","S1"], $ref);
                if ($index != -1){
                        my $match = substr($ref, $index, 30);
                        print "$new_id\t$index\t$match\n";
                        last;
                }
        }

Defining the modifiers as ["I0","D0","S0"] I get:

Species_header    match_position   matching_sequence

Cryptosporidium_15318   813     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_67285    820     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_80642    822     TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_86835    822     TAGGGACAGTTGGGGGCATTTGTATTTAAC

 

Defining the modifiers as ["I0","D0","S1"] I get:

Eimeria_10865  51              TGTGTTTATTAGGTACAAAACCAACCCACT
Hepatozoon_12528    866         TAGGGACAGTTGGGGGCATTTGTATTTAAC
Cryptosporidium_15318    809    TTAATAGGGACAGTTGGGGGCATTTGTATT
Cryptosporidium_34679    842    TATTTAACAGCCAGAGGTGAAATTCTTAGA
Hepatozoon_54015    844         TAGGGACAGTTGGGGGCATTTGTATTTAAC

 

These matching sequences are real, and in the reference fasta file, but I don't understand how allowing one substitution is causing this. If anyone has any experience with this I would greatly appreciate it! I have seen the solution that Kenosis gave to Perl :Fuzzy Matching Problem which I might try to implement, but I really want to work out why this is causing such unusual results.

Thanks

UPDATE:

So it looks like changing [S0] to [S1] is causing a 4 character offset. This is what I get when changing nothing but the [S1]:

I think it must be the index being incorrectly defined under substitution conditions.  Here's the data for Eimeria_10865  

---------Zero Substitutions [S0]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TAGGGACAGTTGGGGGCATTTGTATTTAACT

----------One Substitution [S1]
probe: TAGGGACAGTTGGGGGCATTTGTATTTAACT
match: TTAATAGGGACAGTTGGGGGCATTTGTATTT

Eukaryota;Alveolata;Apicomplexa;Coccidia;Eucoccidiorida;Eimeriorina;Eimeriidae;Eimeria;;Eimeria_10865:
TGGTTATTCTACGGCTAATATGTGCGCAATGGCCTCCTTCTCTGGAGGGGCTGTGTTTATTAGGTACAAAACCAACCCACTTTGGTGGACTCTTGGTGATTCATAGTAACCGAACAGATCGCAATTGGCTTGTTGTGGGCACGTTCGATTCCGGAGAGGGAAGCTGAGAAACGGCTACCACATCTAAGGAAGGCAGCAGGCGCGCAAATTACCCAATGAAAACAGTTTCGAGGTAGTGACGAGGAATAACAGTACAGGGCATTTTTTGCTCTGTAATTGGAGTAATGCGAATGRAAGACCCTTTCAGAGTAACAATTGGAGGGCACGTCTGGTGCCCGCAGCCGCTGTAGTTCCAGCTCCAATAGTGCACATTAGAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGTCGTGGTCATCCGGTCGCCCGTATGGGTGTGTGCCTGGTATGACCGCGGCTTTCCTCTGGTAGCCAGCCATCCGCGCTTAATTGCGTGTGTTGGTGTTGAACTAGGACTTTTACTTTGAGAAAAATAGACTGTTTCAAGCAGGCTGGTCGTCCTGAATACTTCAGCATGGAATATGAGGATAGAACCTCGGTTTTATTTTGTTGGTTTTTGGGACCAAGGTATTGATTAATAGGGACAGTTGGGGGCATTTGTATTTAACTGTCAGAGGTGAATTTCTTAGATTTGTTAAAGACGAACTACTGCGACACCATTTGCCAAGGAAGTTTTCATTAATCAAGAACGGTAGCAGGGGGTTCGAGGACGATTAGGCACCGTCGTAATCTCTACCATAAACTATGCGGACTACAGATAGGGAAATGCCTATCTTGGCTTCTCCTGCACCTCATGAGAAGTCGAAGTCTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAGTTGACGAAAGGGCACCACCAGGCCTGGAGCCTGCGGCTTAATTTGACTCCACACGCGGGAACTCACCAGGTCCAGACATGGGAAGGATCGACAGATTGATAGCTCTTTCTTGATTCTATGGGTGGTGGTGCATGGCAGTTTTTAGTTGGTGGAGTGACCTGTCTGGTGAATTTCGATAACGAACGAGACCTTAGCCTGCTTAATAGGATCAAGAACCTCGATTCTCGTATCACTTTATAGAGGGACTTTGCGTGTCCAACGCAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTCAGATGTTCTGTGCTGCACACGCGCTACATTGATGCATGCAACGAGCTTTTGACCTTGGCCGACAGGTCTAGGTAATCTTTTGAGTATGCATCGTGATGGGGGTAGAATATTGCAACTATTAATCT

fuzzy matching perl • 2.0k views
ADD COMMENTlink modified 5.7 years ago by SES8.3k • written 5.8 years ago by Daniel3.8k

Could you give a single reproducible example by giving the query string for Eimeria_10865 and $ref?

ADD REPLYlink written 5.7 years ago by Michael Dondrup47k

I've added to the main post, thanks for looking.

ADD REPLYlink written 5.7 years ago by Daniel3.8k
1

I have played around with this mostly as an exploration of subtle bugs that may exists in 13 or so year old libraries. The behavior is quite curious.

It is the substring TAGGGA of your pattern is what  seems to cause the trouble. If you remove the end of your pattern one character at a time you have to shorten it to 5 characters for it to correctly report the index of the match. If you however remove just one letter from the beginning (just the T) it will report the index correctly for the entire pattern... it is a real headscratcher

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Istvan Albert ♦♦ 83k
2
gravatar for Michael Dondrup
5.7 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

This was reported in 2010 already, status of this bug is still new. Maybe this package shouldn't be used as it is possibly not maintained.  

https://rt.cpan.org/Public/Bug/Display.html?id=57213

ADD COMMENTlink written 5.7 years ago by Michael Dondrup47k

reported by Jeremy Leipzig - that makes sense - bioinformaticians will find the corner cases whereas in natural language processing the usage is less stringent and exhaustive

ADD REPLYlink written 5.7 years ago by Istvan Albert ♦♦ 83k

Thanks, I hadn't seen that. I'll put this in the "Don't touch with a barge pole" pile. I'll stick with my brute force approach.

ADD REPLYlink written 5.7 years ago by Daniel3.8k
0
gravatar for SES
5.7 years ago by
SES8.3k
Vancouver, BC
SES8.3k wrote:

This behavior may be due to the sequence lengths, and the bug report linked to in the other post suggests the same. From the documentation, "...String::Approx is not well-suited for comparing strings of different length, in other words, if you want a "fuzzy eq", see above. String::Approx is more like regular expressions or index(), it finds substrings that are close matches." In other words, you are seeing the expected behavior, if you read the documentation :). Put another way, it is as if the documentation states you should not do a comparison of A vs. B with the module because it is not designed for that, and you file a bug report saying, "hey, I did a comparison of A vs. B and I didn't get what I expected!" 

Seriously though, no one reads the docs closely because hopefully a module will work as advertised (well, like you think it should). In cases like this where you are seeing strange results, that is a good time to read through the docs more closely. Also, you can see more appropriate modules referenced in the documentation for String::Approx.

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by SES8.3k

I am not sure that this explains it.

The original poster's use case appears to be what the docs say:  "String::Approx is more like regular expressions". The OP is trying to find substrings that are close matches just as the docs say.

I believe that the documentation tries to discourage use cases where one would want to compute the edit distance between two strings of wildly different lengths.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Istvan Albert ♦♦ 83k

I think this does explain it, at least in part. If you look at the bug report, those sequences are not even close in length. Also, OP says, "..searching a reference database with a ~30 base probe." Is the database composed of ~30 bp sequences? That part is not completely clear to me, but I think this is a relevant point for consideration until we know more.

ADD REPLYlink written 5.7 years ago by SES8.3k
1

The problem may have started in a more complicated program but it can be reduced to taking two strings and matching one onto the other.  The documentation actually states that this is how one should be using the library as a regular expression pattern matching, to find out if the pattern matches and where the match occurs.  I guess it is possible that correct results are only guaranteed if the strings are of equal lengths but that would be even worse prospect regarding its utility.

Just while we are at it stating that the library is not well suited for a use case should never  imply that the outcome will be incorrect, only that it may be inefficient, or maybe it can't find the best solution, or maybe it can't find the pattern at all. These are all fine outcomes.

Producing completely incorrect result that does not conform at all to the input parameters better be a bug.

 

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Istvan Albert ♦♦ 83k
I agree that these results are due to a bug and I edited my post. I was just trying to point out a likely cause that is explained in the documentation. I'll look at the code when I get a chance but I'm currently in the process of moving.
ADD REPLYlink written 5.7 years ago by SES8.3k

I see two scenarios here, either it is a bug, or it is the intended behaviour and  the package is not very useful (at least not in bioinformatics) and the documentation is not clear, it says "is not well-suited for comparing strings of different length" it doesn't say "will give you wrong coordinates if you use aindex" . Most relevant text matching applications that I can think of involve strings of different length. You note that this is a contradiction to the previous claim: "String::Approx is more like regular expressions or index(), it finds substrings that are close matches."?

Bug or not, the maintainer could have taken care of the report, if that doesn't happen I must assume  that the package is unmaintained, or not?

 

ADD REPLYlink written 5.7 years ago by Michael Dondrup47k
The wording is vague I agree, and it would clarify the issue if someone responded to the bug report and said this is not the intended use case, or this is a bug. However, I read those caveats as saying that if you want those matches, then use a more appropriate module. I didn't read it as just there are faster ways, that was not the implication that I took away.
ADD REPLYlink written 5.7 years ago by SES8.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1955 users visited in the last hour