Question: Perl: Ignoring Repetitive K-Mers In Hash Construction
1
Neal50 wrote:

Hello all,

I am trying to find the most frequent k-mers with "n" mismatches. But I'm running into a problem while creating a hash table of such k-mers.

Sample Input:ACGTTGCATGTCGCATGATGCATGAGAGCT
k-mer size:4
mismatch:1
Output: GATG ATGC ATGT

My code's logic is as follows:

``````1. Extract all kmers of given size
2. For each kmer, find other kmers with given mismatch range
3. Print kmers which have maximum number of fuzzy matches
``````

I am specifically stuck on point 2 above. I am using a `while` loop to iterate over all kmers. Problem is I need to count a particular kmer only once eg :

``````ACGT CGTT GTTG TTGC TGCA GCAT CATG ATGT TGTC GTCG TCGC CGCA GCAT CATG **ATGA** TGAT GATG ATGC TGCA GCAT CATG **ATGA** TGAG GAGA AGAG GAGC AGCT
``````

a kmer like ATGA should be counted just once (it appears twice as I've highlighted above with ** for ease of viewing). While constructing the hash, it gets counted twice and all the fuzzy matches also get doubled. Similarly GCAT occurs thrice above, and while constructing the hash, it gets counted thrice.

I need a way to `skip if the kmer already exists in the hash ( kmer being the key).`

The brilliant code for fuzzy matching via subroutine was provided by Kenosis here: Perl :Fuzzy Matching Problem

My code is as under:

``````use 5.014;
use warnings;
use List::Util qw/max/;

\$_ = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
my \$len_seq = length \$_;
my \$len_pat = 4;

my \$i = 0;
my @all_kmer;

#Extract all kmers of given size
while(\$i <= \$len_seq - \$len_pat){
my \$substr = substr \$_, \$i, \$len_pat;
push @all_kmer, \$substr;

\$i++
}

#Initialize hash of kmer counts
my %counts;
my \$j = 0;
while (\$all_kmer[\$j]){

foreach (@all_kmer){
push( @{\$counts{\$all_kmer[\$j]}},\$_  ) if strDiffMaxDelta (\$all_kmer[\$j], \$_, 1);

}

\$j++;

#Skip if key already exists #Need Proper Code for this
if(exists \$counts{\$all_kmer[\$j]} ){
say "Key already exists for \$all_kmer[\$j]";
\$j=\$j+3; # Need corrections / a way to skip to next element in @all_kmer
}
}

#Count fuzzy matches for each key in hash
my @fuzzy_kmer_counts;
foreach my \$key (keys %counts){
push @fuzzy_kmer_counts, scalar @{\$counts{\$key}};

}

#Determine maximum number of fuzzy matches
my \$max = max(@fuzzy_kmer_counts);

#Print key(s) which have maximum number of fuzzy matches
foreach my \$key (keys %counts){
say "Key: \$key\t" if scalar @{\$counts{\$key}} == \$max;
print OUT \$key,"\t" if scalar @{\$counts{\$key}} == \$max;
}

###########################
#Subroutine
###########################
sub strDiffMaxDelta {
my ( \$s1, \$s2, \$maxDelta ) = @_;
my \$diffCount = () = ( \$s1 ^ \$s2 ) =~ /[^\x00]/g;
return \$diffCount <= \$maxDelta;
}
``````
perl • 3.2k views
modified 7.1 years ago by Michael Dondrup48k • written 7.1 years ago by Neal50
1

Just to be clear, is the expected output from your script the "GATG ATGC ATGT" above, or something different?

Hi Kenosis, thanks for your comment. The expected output is indeed "GATG ATGC ATGT" for the given sample input of "ACGTTGCATGTCGCATGATGCATGAGAGCT ", Each of these kmers,with 1 mismatch. has 5 fuzzy matches. Other kmers have fewer fuzzy matches.

Hi Kenosis, Michael's code snippet solves the logic where I was stuck. I had to check the existence of the key before assigning the values in arrays.

1
Michael Dondrup48k wrote:

Hm, maybe something like this, replacing your two nested loops?

`````` I: foreach my \$i (@all_kmer) {
next I if exists \$counts{\$i};
J:  foreach my \$j (@all_kmer) {
push( @{\$counts{\$i}, \$j  ) if strDiffMaxDelta (\$i, \$j, 1);

}
}
``````

Hi Michael! Thanks for your input! It works perfectly! Just one small query, would it be possible to explain what the `I:` and `J:` mean and would the loops falter without them?

1

I: and J: are labels, they allow to e.g. jump out of the inner loop with `next I;` or out of any loop level. The construct works fine without, using just `next`, but I have started to label all my loops with sensible names for annotation, like in `next LINE if /^#/`.

1

I just noticed that the reason for the duplication is that you compute strDiffMaxDelta for all A:B and also for B:A. Then a solution as outlined here A: Comparing alignments in an iterative process with Perl is more efficient because it avoids the symmetric case.