Question

Finding All Occurances Of A Substring (Motif) In A Sequence Using The Perl Index Function

1

Entering edit mode

10.9 years ago

aswathyseb ▴ 30

I am trying to find all occurrence of a substring (motif) in a sequence using index function. It works fine if I hard code the substring and sequence in the program. But when I am trying to read the same from a file (where 1st line is sequence and 2nd line is substring) it returns -1. Does anyone know why this is happening? I assume some file handling problem?

Here is my code:

#!/usr/bin/perl
use strict;
use warnings;

open(FH,$ARGV[0]);
my $string = <FH>; #'ATGACAGATCCCTTACACCAGACAGATCGACAGATCGACAGATTCAGACAGATGTAGAACGACAGATTCTCGACAGATGACAGATATGACAGATAGCTTAAGACAGATGGTACCCGACAGATGACAGATATTGACAGATA';
my $char =  <FH>;
#'GACAGATGA';
$string=~s/\n//g;
my $offset = 0;

my $result = index($string, $char, $offset);
while ($result != -1) {
    print "Found $char at $result\n";
    $offset = $result + 1;
    $result = index($string, $char, $offset);
}

perl • 9.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 10.9 years ago by aswathyseb ▴ 30

score 3 · Answer 1 · 2013-05-19

Perl's index() will return -1 if there was no match, even though your code is actually fine. There are couple of bad things you are doing with the file you are trying to open, and that is all I can tell that is wrong. Here are couple of pointers you should always keep in mind.

Test if you got the arguments you expect.
Test if you could open the file (and then close it when you are done).
Don't use bare filehandles.

You don't even have to remember these things because here is a pragma called autodie that will save you a lot of typing. Though, it may not be available on every system.

The code below works for me:

#!/usr/bin/env perl                                                                                                                     
use 5.010;
use strict;
use warnings;
use autodie qw(open);

my $usage = "$0 string_patt_file\n"; 
my $infile = shift or die $usage;

open my $fh, '<', $infile; # same as open ... or die $!
my $string = <$fh>;
my $char =  <$fh>;
close $fh;
my $offset = 0;

my $result = index($string, $char, $offset);
while ($result != -1) {
    say "Found $char at $result";
    $offset = $result + 1;
    $result = index($string, $char, $offset);
}

The output:

$ perl biostars72099.pl string_patt_file
Found GACAGATGA at 71
Found GACAGATGA at 115

Note that this only gives you the start of the match, so you'll have to add a little code if you want the full match range. If you use a regular expression approach instead of index() then you can use some of Perl's built-in magic variables to get the match range. It is hard to compare approaches with this simple example, but depending on the size of the string, number of patterns, what information about the matches is desired, etc. using a regular expression may be faster and give you more control. In my test, a regex was slightly faster. I used two different engines, Perl's native engine and RE2, and Perl's regex engine was a little faster than RE2 but that will likely not be true for a larger string. You may want to keep both approaches in mind depending on what you are trying to accomplish.

score 0 · Answer 2 · 2013-05-19

0

Entering edit mode

10.9 years ago

Pascal ▴ 250

You remove the new line of the first line, but not of the second. Add $char=~s/\n//g; to remove it and everything should work.

ADD COMMENT • link 10.9 years ago by Pascal ▴ 250

0

Entering edit mode

agree with that

ADD REPLY • link 10.9 years ago by ewre ▴ 250

score 0 · Answer 3 · 2013-05-19

Here's another option:

use strict;
use warnings;

chomp( my @strings = <> );
my ( $subStrLen, @found ) = length $strings[1];

push @found, pos( $strings[0] ) - $subStrLen
  while $strings[0] =~ /$strings[1]/g;

print "Seq: $strings[0]\n\nSubstring: $strings[1]\n\n";

if (@found) {
    local $" = ', ';
    print 'Substring found ' . @found . " time(s) at: @found\n";
}
else {
    print "Substring not found.\n";
}

Usage: perl script.pl dataFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

Seq: ATGACAGATCCCTTACACCAGACAGATCGACAGATCGACAGATTCAGACAGATGTAGAACGACAGATTCTCGACAGATGACAGATATGACAGATAGCTTAAGACAGATGGTACCCGACAGATGACAGATATTGACAGATA

Substring: GACAGATGA

Substring found 2 time(s) at: 71, 115

Since you're sending the file that contains the sequence and substring to the script via the command line, let Perl handle the file i/o. This script reads the two strings into an array, then uses a global match and pos() to calculate substring positons, if any. Starting locations are push ed onto @found and then later displayed.

Hope this helps!