Question: Perl :Fuzzy Matching Problem
2
gravatar for Neal
7.1 years ago by
Neal50
Norway
Neal50 wrote:

Hello all, I am looking to implement fuzzy matching in Perl using the module

use String::Approx

I need 3 mismatches. However, I am unable to implement the modifiers appropriately to get the desired outcome.

Input string:CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
Pattern ATTCTGGA
Output 6 7 26 27

So basically, I need the locations where the pattern matches input string approximately ( 3 mismatches)

My Code is as under:

use 5.014;
use warnings;
use String::Approx qw/amatch aindex/;


$_ = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "ATTCTGGA";
my $len_in = length $_;
my $len_pat = length $pat;


my $i = 0;
my @pattern;


while($i <= $len_in-$len_pat){ 
    my $substr = substr $_, $i, $len_pat;#Extract k-mer of length = $len_pat from position $i
    my $posn = index  $_,$substr, $i;#Extract start position of k-mer

    push @pattern, $substr if amatch ("$pat", ["S3"], $substr); #Push into @pattern if the extracted k-mer matches $pat by 3 substitutions 
    push @posn, $posn if amatch ("$pat", ["S3"], $substr); #Push into @posn the position, if the extracted k-mer matches $pat by 3 substitutions 

    $i++
}

say "@pattern";
say "@posn";

I would be grateful if it could be pointed out where I am going wrong. Right now, I get no output at all.

perl • 4.6k views
ADD COMMENTlink modified 18 months ago by Biostar ♦♦ 20 • written 7.1 years ago by Neal50
7
gravatar for Kenosis
7.1 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

You can calculate the "3 mismatches" between strings using Perl's exclusive-or operator (^) between those two strings, as it returns \x00 for each matching pair of characters, and a different value for non-matching characters. Given this, consider the following solution:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "ATTCTGGA";
my $len_in  = length $seq;
my $len_pat = length $pat;

my %posn_pattern;

for my $i ( 0 .. $len_in - $len_pat ) {

    #Extract k-mer of length = $len_pat from position $i
    my $substr = substr $seq, $i, $len_pat;

    #Save position/pattern if the extracted k-mer matches $pat by 3 or fewer substitutions
    $posn_pattern{$i} = $substr if strDiffMaxDelta( $pat, $substr, 3 );
}

say "$_ => $posn_pattern{$_}" for sort { $a <=> $b } keys %posn_pattern;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

Results:

6 => AATCCAGA
7 => ATCCAGAA
26 => ATTTCGGG
27 => TTTCGGGA

The strDiffMaxDelta subroutine counts the non-\x00 characters which indicates the number of character differences between the two sent strings, and then returns the result of the comparison of that number against the max delta (in this case 3). Of course, you can use $diffCount == $maxDelta if you want to return strings which have exactly three mismatches--in case I've misunderstood your desired results.

Hope this helps!

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Kenosis1.2k

Now that is a new one for me. Nice!

ADD REPLYlink written 7.1 years ago by Chris Fields2.1k

Many thanks Kenosis and exactly what was needed! This method is completely new for me! Would you be able to please explain this line of code my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;. Specifically what does () perform? I understand ^ is comparing $s1 and $s2. Also, it would be great if a resource explaining \x00 could be pointed out. I really would love to understand the logic. It's a brilliant code!

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Neal50
1

You're most welcome, nilan666! Am glad it worked for you.

Here's an explaination of that line:

my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
                 ^         ^            ^      ^
                 |         |            |      |
                 |         |            |      + - Do so globally
                 |         |            + - Match the XOR results for non-\x00 which indicates a character-pair difference
                 |         + - XOR the two strings against each other
                 + - 'Goatse' operator "=()="
  • \x00 is the hexidecimal value for the NULL character, and is the XOR operation's result when the two strings' characters are identical.
  • Goatse operator "=()=": right expression's evaluated in list context, w/results assigned to empty array "()="; that array is evaluated in scalar context "=()" w/results assigned to $diffCount
ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Kenosis1.2k
1

Erm, Goatse operator.... wow. DO NOT GOOGLE THIS TERM WITHOUT THE 'OPERATOR' TERM ON THE END. Images would be deemed NSFL (Not Safe For Life)

ADD REPLYlink written 7.1 years ago by Daniel3.8k

But this is an awesome bit of code, thank you.

ADD REPLYlink written 7.1 years ago by Daniel3.8k

So if I understand correctly, using ^ , if $s1 and $s2 are identical \x00 would be true and $diffCount would be 0; conversely when $s1 and $s2 are different, ^\x00 stores the the character difference which gets assigned to $diffcount. This is awesome stuff!

ADD REPLYlink written 7.1 years ago by Neal50
1

Hi nilan666.

Perl's XOR on strings, e.g., $s1 ^ $s2, returns \x00 for each identical character-pair at position n in those two strings. So, 'Perl' ^ 'Perl' would return '\x00\x00\x00\x00', since those two strings are identical. The regex /[^\x00]/g globally matches only those characters in the XOR results which are not \x00. The result of globally matching the non-\x00 characters in the XOR results is assigned to $diffcount above, as it indicates the number of different characters when the two strings are compared character-by-character.

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Kenosis1.2k

Ah now I understand it much more clearly!! I reiterate that this code is sheer brilliance! By the way, I have used your subroutine in another code, but I am stuck on a particular logical challenge of that code. I would be grateful for your time if you could have a look here :Perl: Ignoring repetitive k-mers in hash construction

ADD REPLYlink written 7.1 years ago by Neal50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1117 users visited in the last hour