Question: Perl: Ignoring Repetitive K-Mers In Hash Construction
1
gravatar for Neal
5.9 years ago by
Neal40
Norway
Neal40 wrote:

Hello all,

I am trying to find the most frequent k-mers with "n" mismatches. But I'm running into a problem while creating a hash table of such k-mers.

Sample Input:ACGTTGCATGTCGCATGATGCATGAGAGCT
k-mer size:4
mismatch:1
Output: GATG ATGC ATGT

My code's logic is as follows:

1. Extract all kmers of given size
2. For each kmer, find other kmers with given mismatch range
3. Print kmers which have maximum number of fuzzy matches

I am specifically stuck on point 2 above. I am using a while loop to iterate over all kmers. Problem is I need to count a particular kmer only once eg :

ACGT CGTT GTTG TTGC TGCA GCAT CATG ATGT TGTC GTCG TCGC CGCA GCAT CATG **ATGA** TGAT GATG ATGC TGCA GCAT CATG **ATGA** TGAG GAGA AGAG GAGC AGCT

a kmer like ATGA should be counted just once (it appears twice as I've highlighted above with ** for ease of viewing). While constructing the hash, it gets counted twice and all the fuzzy matches also get doubled. Similarly GCAT occurs thrice above, and while constructing the hash, it gets counted thrice.

I need a way to skip if the kmer already exists in the hash ( kmer being the key).

The brilliant code for fuzzy matching via subroutine was provided by Kenosis here: Perl :Fuzzy Matching Problem

My code is as under:

use 5.014;
use warnings;
use List::Util qw/max/;

$_ = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
my $len_seq = length $_;
my $len_pat = 4;

my $i = 0;
my @all_kmer;

#Extract all kmers of given size
while($i <= $len_seq - $len_pat){
    my $substr = substr $_, $i, $len_pat;
    push @all_kmer, $substr;

    $i++
}


#Initialize hash of kmer counts    
my %counts;
my $j = 0;
while ($all_kmer[$j]){

foreach (@all_kmer){
        push( @{$counts{$all_kmer[$j]}},$_  ) if strDiffMaxDelta ($all_kmer[$j], $_, 1);

        }


    $j++;

    #Skip if key already exists #Need Proper Code for this
    if(exists $counts{$all_kmer[$j]} ){
        say "Key already exists for $all_kmer[$j]";
        $j=$j+3; # Need corrections / a way to skip to next element in @all_kmer
    }
 }

#Count fuzzy matches for each key in hash
my @fuzzy_kmer_counts;                                                                    
foreach my $key (keys %counts){
     push @fuzzy_kmer_counts, scalar @{$counts{$key}};  

}

#Determine maximum number of fuzzy matches
my $max = max(@fuzzy_kmer_counts);

#Print key(s) which have maximum number of fuzzy matches
foreach my $key (keys %counts){
     say "Key: $key\t" if scalar @{$counts{$key}} == $max;
     print OUT $key,"\t" if scalar @{$counts{$key}} == $max;
}


###########################
#Subroutine
###########################
sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}
perl • 2.9k views
ADD COMMENTlink modified 5.9 years ago by Michael Dondrup46k • written 5.9 years ago by Neal40
1

Just to be clear, is the expected output from your script the "GATG ATGC ATGT" above, or something different?

ADD REPLYlink written 5.9 years ago by Kenosis1.2k

Hi Kenosis, thanks for your comment. The expected output is indeed "GATG ATGC ATGT" for the given sample input of "ACGTTGCATGTCGCATGATGCATGAGAGCT ", Each of these kmers,with 1 mismatch. has 5 fuzzy matches. Other kmers have fewer fuzzy matches.

ADD REPLYlink written 5.9 years ago by Neal40

Hi Kenosis, Michael's code snippet solves the logic where I was stuck. I had to check the existence of the key before assigning the values in arrays.

ADD REPLYlink written 5.9 years ago by Neal40
1
gravatar for Michael Dondrup
5.9 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Hm, maybe something like this, replacing your two nested loops?

 I: foreach my $i (@all_kmer) {
         next I if exists $counts{$i};
    J:  foreach my $j (@all_kmer) {
                    push( @{$counts{$i}, $j  ) if strDiffMaxDelta ($i, $j, 1);

         }
  }
ADD COMMENTlink written 5.9 years ago by Michael Dondrup46k

Hi Michael! Thanks for your input! It works perfectly! Just one small query, would it be possible to explain what the I: and J: mean and would the loops falter without them?

ADD REPLYlink written 5.9 years ago by Neal40
1

I: and J: are labels, they allow to e.g. jump out of the inner loop with next I; or out of any loop level. The construct works fine without, using just next, but I have started to label all my loops with sensible names for annotation, like in next LINE if /^#/.

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Michael Dondrup46k
1

I just noticed that the reason for the duplication is that you compute strDiffMaxDelta for all A:B and also for B:A. Then a solution as outlined here A: Comparing alignments in an iterative process with Perl is more efficient because it avoids the symmetric case.

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Michael Dondrup46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 980 users visited in the last hour