Finding All Occurances Of A Substring (Motif) In A Sequence Using The Perl Index Function
3
1
Entering edit mode
8.7 years ago
aswathyseb ▴ 30

I am trying to find all occurance of a substring (motif) in a sequence using index function. It works fine if I hard code the substring and sequence in the program. But when I am trying to read the same from a file (where 1st line is sequence and 2nd line is substring) it returns -1. Does anyone know why this is happening? I assume some file handling problem?

Here is my code:

#!/usr/bin/perl
use strict;
use warnings;

open(FH,$ARGV[0]); my$string = <FH>; #'ATGACAGATCCCTTACACCAGACAGATCGACAGATCGACAGATTCAGACAGATGTAGAACGACAGATTCTCGACAGATGACAGATATGACAGATAGCTTAAGACAGATGGTACCCGACAGATGACAGATATTGACAGATA';
my $char = <FH>; #'GACAGATGA';$string=~s/\n//g;
my $offset = 0; my$result = index($string,$char, $offset); while ($result != -1) {
print "Found $char at$result\n";
$offset =$result + 1;
$result = index($string, $char,$offset);
}

perl bioinformatics • 8.6k views
3
Entering edit mode
8.7 years ago
SES 8.5k

Perl's index() will return -1 if there was no match, even though your code is actually fine. There are couple of bad things you are doing with the file you are trying to open, and that is all I can tell that is wrong. Here are couple of pointers you should always keep in mind.

1. Test if you got the arguments you expect.
2. Test if you could open the file (and then close it when you are done).
3. Don't use bare filehandles.

You don't even have to remember these things because here is a pragma called autodie that will save you a lot of typing. Though, it may not be available on every system.

The code below works for me:

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie qw(open);

my $usage = "$0 string_patt_file\n";
my $infile = shift or die$usage;

open my $fh, '<',$infile; # same as open ... or die $! my$string = <$fh>; my$char =  <$fh>; close$fh;
my $offset = 0; my$result = index($string,$char, $offset); while ($result != -1) {
say "Found $char at$result";
$offset =$result + 1;
$result = index($string, $char,$offset);
}


The output:

$perl biostars72099.pl string_patt_file Found GACAGATGA at 71 Found GACAGATGA at 115  Note that this only gives you the start of the match, so you'll have to add a little code if you want the full match range. If you use a regular expression approach instead of index() then you can use some of Perl's built-in magic variables to get the match range. It is hard to compare approaches with this simple example, but depending on the size of the string, number of patterns, what information about the matches is desired, etc. using a regular expression may be faster and give you more control. In my test, a regex was slightly faster. I used two different engines, Perl's native engine and RE2, and Perl's regex engine was a little faster than RE2 but that will likely not be true for a larger string. You may want to keep both approaches in mind depending on what you are trying to accomplish. ADD COMMENT 0 Entering edit mode Thank you very much... That was very helpful and informative...:) ADD REPLY 0 Entering edit mode Nice work +1. Consider chomp ing (at least) the substring, in case there's a newline at its end. ADD REPLY 0 Entering edit mode 8.7 years ago Pascal ▴ 250 You remove the new line of the first line, but not of the second. Add$char=~s/\n//g; to remove it and everything should work.

0
Entering edit mode

agree with that

0
Entering edit mode
8.7 years ago
Kenosis ★ 1.3k

Here's another option:

use strict;
use warnings;

chomp( my @strings = <> );
my ( $subStrLen, @found ) = length$strings[1];

push @found, pos( $strings[0] ) -$subStrLen
while $strings[0] =~ /$strings[1]/g;

print "Seq: $strings[0]\n\nSubstring:$strings[1]\n\n";

if (@found) {
local \$" = ', ';
print 'Substring found ' . @found . " time(s) at: @found\n";
}
else {
}


Usage: perl script.pl dataFile [>outFile]

The last, optional parameter directs output to a file.

Seq: ATGACAGATCCCTTACACCAGACAGATCGACAGATCGACAGATTCAGACAGATGTAGAACGACAGATTCTCGACAGATGACAGATATGACAGATAGCTTAAGACAGATGGTACCCGACAGATGACAGATATTGACAGATA

Substring: GACAGATGA

Substring found 2 time(s) at: 71, 115


Since you're sending the file that contains the sequence and substring to the script via the command line, let Perl handle the file i/o. This script reads the two strings into an array, then uses a global match and pos() to calculate substring positons, if any. Starting locations are push ed onto @found and then later displayed.

Hope this helps!