Question: Perl script running slow with big files
cvu wrote, 4.9 years ago (India):

Hi all,

I've written a Perl script that compares two VCF files and writes the output to a third file. It runs fine with small files, but when I feed it big files of several GB, my system becomes slow and eventually hangs.

I'm using the dbSNP VCF file as input, which is around 9 GB.

Please suggest something that will make my Perl script run faster. This is my first Perl script and I'm new to Perl.

Any help would be appreciated!

Thank you!

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use List::MoreUtils;

open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n";
my @array = (<FILE>);

my @CHR;
my @location;
my @rs;
my @ref_n;
my @alt_n;

foreach (@array) {
    chomp;
    my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split(/\t/, $_);
    push @CHR,      $chrom;
    push @location, $pos;
    push @rs,       $id;
    push @ref_n,    $ref;
    push @alt_n,    $alt;
}

open (FILE1, "<", "trial_rs.vcf") or die "failed to open: $!\n";
my @array1 = (<FILE1>);

open (OUT, ">trial_output.vcf");
my @columns;

foreach (@array1) {
    chomp;
    @columns = split(/\t/, $_);
    for (my $i = 0; $i < @array; $i++) {
        if (    $columns[0] eq $CHR[$i]
            and $columns[1] eq $location[$i]
            and $columns[3] eq $ref_n[$i]
            and $columns[4] eq $alt_n[$i] )
        {
            $columns[2] = $rs[$i];
        }
    }
    print OUT join("\t", @columns), "\n";
}


snp alignment assembly perl • 7.3k views
modified 4.9 years ago by ole.tange • written 4.9 years ago by cvu

"Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl."

It seems obvious that one needs to post your script for this purpose, doesn't it?

written 4.9 years ago by Michael Dondrup

This is the script

... moved ...

modified 4.9 years ago by Michael Dondrup • written 4.9 years ago by cvu
Michael Dondrup wrote, 4.9 years ago (Bergen, Norway):

The problem is this:

my @array=(<FILE>); ## slurp a whole Terabyte into RAM

Do not read a file like this unless it is very small or your memory is huge; instead use:

while (<FILE>) {    # read each line, then forget about it
    chomp;
    my @fields = split /\t/;
    # ... work with @fields here ...
}

In addition, your code is extremely inefficient:

You are iterating over all entries of a huge array for every single line of the other file. Hashes in Perl provide constant-time access to elements given the hash key. Use a hash of hashes to store your filter table.

You seem to be filtering a large file by a small file, therefore:

  • process the small file first (using the while construct) and parse it into a hash, using the location as the key of a nested hash
  • then process the large file as above and look up each of its entries in the hash created before

As a result you will only need as much memory as is required to store the small file (see the sketch below).
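
A minimal sketch of that approach, reusing the filenames from the question (the variable names, the nested-hash layout, and the header handling are my own choices, not code from this thread):

#!/usr/bin/env perl
use strict;
use warnings;

# Read the SMALL file into memory, remembering each row so it can be
# patched and written back out in its original order.
my (@rows, %wanted);

open my $small, '<', 'trial_rs.vcf' or die "failed to open: $!\n";
while (<$small>) {
    chomp;
    my @cols = split /\t/;
    push @rows, \@cols;
    next if /^#/;    # keep header lines in @rows, but do not index them
    # nested hash: chromosome -> position -> details of this site
    $wanted{ $cols[0] }{ $cols[1] } =
      { ref => $cols[3], alt => $cols[4], row => $#rows };
}
close $small;

# Stream the LARGE dbSNP file line by line; only hash lookups per line,
# and nothing from it is kept in memory.
open my $big, '<', 'dbSNP_in.vcf' or die "failed to open: $!\n";
while (<$big>) {
    next if /^#/;    # skip the VCF header
    my ($chrom, $pos, $id, $ref, $alt) = split /\t/, $_, 6;
    my $site = $wanted{$chrom}{$pos} or next;
    if ( $site->{ref} eq $ref and $site->{alt} eq $alt ) {
        $rows[ $site->{row} ][2] = $id;    # copy the rs ID into column 3
    }
}
close $big;

open my $out, '>', 'trial_output.vcf' or die "failed to open: $!\n";
print {$out} join( "\t", @$_ ), "\n" for @rows;
close $out;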

modified 4.9 years ago • written 4.9 years ago by Michael Dondrup

I guess I should have refreshed the page before answering, since my answer just echoes what you already wrote!

written 4.9 years ago by Devon Ryan

OK, I'll try it this way.

Thanks a lot, Michael Dondrup!

written 4.9 years ago by cvu
Kenosis wrote, 4.9 years ago:

Consider the following refactoring:

use strict;
use warnings;
use autodie;

my ( @array, %hash );

open my $FILE, "<", "dbSNP_in.vcf";

while (<$FILE>) {
    $hash{"@array[0, 1, 3, 4]"} = $array[2] if @array = split /\t/, $_, 6;
}

close $FILE;

open my $OUT,   ">", "trial_output.vcf";
open my $FILE1, "<", "trial_rs.vcf";

while (<$FILE1>) {
    my @columns = split /\t/;

    $columns[2] = $hash{"@columns[0, 1, 3, 4]"}
      if exists $hash{"@columns[0, 1, 3, 4]"};

    print $OUT join "\t", @columns;
}

close $OUT;
close $FILE1;

  • autodie is used to trap all I/O errors.
  • chomp is not needed within either loop.
  • Splitting only as many fields as are needed is faster than splitting them all.
  • An interpolated array slice ("@array[0, 1, 3, 4]", which produces a single string built from elements 0, 1, 3 and 4) is used as a hash key, with the corresponding value being element 2, the rs ID, for use later (see the short example below).
  • The same kind of interpolated slice of @columns is tested for existence as a hash key, and $columns[2] is set to that key's value if it exists. Doing this instead of four eq comparisons should speed up the file processing.
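
As a short illustration of how such a slice key looks (the values are made up; interpolating a slice inside double quotes joins the elements with a space, the default value of $"):

use strict;
use warnings;

my @array = ( 'chr1', '12345', 'rs123', 'A', 'G' );

# Elements 0, 1, 3 and 4 are joined with single spaces into one string.
print "@array[0, 1, 3, 4]", "\n";    # prints: chr1 12345 A G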

Hope this helps!

modified 4.9 years ago • written 4.9 years ago by Kenosis

Thanks, it helped me a lot!

written 4.9 years ago by cvu

Are you sure chomp is not needed? Way back in time I learned that it is always good and safe to chomp. What sometimes happens to me when I forget to chomp is that I get the newline character in the last column, and this can yield hard-to-debug errors if you use a string from that column.

written 4.9 years ago by Michael Dondrup

My apologies for not explaining my claim...

In the first while loop, only the first five fields are used; the rest are discarded, including the last field, which is the one carrying the record separator that chomp would remove. In the second while loop, the record separator is retained and used later when the fields are joined and printed.
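
A tiny demonstration of both points, using a made-up VCF line:

my $line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=10\n";

# First loop: a limit of 6 leaves fields 0-4 clean; the newline ends up
# in the discarded sixth field, so there is nothing for chomp to do.
my @array = split /\t/, $line, 6;
# @array[0..4] are ('chr1', '12345', '.', 'A', 'G'); $array[5] is "50\tPASS\tDP=10\n"

# Second loop: no limit, so the newline stays attached to the last column
# and is written back out with it; no chomp and no extra "\n" are needed.
my @columns = split /\t/, $line;
print join "\t", @columns;    # output ends with the original newline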

written 4.9 years ago by Kenosis
Vivek wrote, 4.9 years ago (Denmark):

I'd suggest using Tabix to index the larger file and then using the Tabix Perl module to query it for each locus in the smaller VCF file. That's likely the best solution in terms of both time and memory.

http://samtools.sourceforge.net/tabix.shtml
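
A hedged sketch of that idea, shelling out to the tabix command-line tool through a pipe rather than using the Perl module (filenames are the ones from the question; it assumes the chromosome names in the two files match and that the dbSNP file has been compressed and indexed once beforehand):

#!/usr/bin/env perl
use strict;
use warnings;

# One-time preparation on the command line (not part of this script):
#   bgzip dbSNP_in.vcf             # -> dbSNP_in.vcf.gz
#   tabix -p vcf dbSNP_in.vcf.gz   # -> dbSNP_in.vcf.gz.tbi

open my $small, '<', 'trial_rs.vcf'     or die "failed to open: $!\n";
open my $out,   '>', 'trial_output.vcf' or die "failed to open: $!\n";

while (my $line = <$small>) {
    if ($line =~ /^#/) {               # pass VCF header lines through untouched
        print {$out} $line;
        next;
    }
    chomp $line;
    my @cols = split /\t/, $line;
    my ($chrom, $pos, $ref, $alt) = @cols[0, 1, 3, 4];

    # ask the index for only the dbSNP records covering this position
    open my $tabix, '-|', 'tabix', 'dbSNP_in.vcf.gz', "$chrom:$pos-$pos"
        or die "could not run tabix: $!\n";
    while (my $hit = <$tabix>) {
        my ($h_chrom, $h_pos, $h_id, $h_ref, $h_alt) = split /\t/, $hit, 6;
        if ($h_pos eq $pos and $h_ref eq $ref and $h_alt eq $alt) {
            $cols[2] = $h_id;          # take the rs ID from dbSNP
            last;
        }
    }
    close $tabix;

    print {$out} join("\t", @cols), "\n";
}
close $small;
close $out;

Note that this starts one tabix process per site in the small file, so it pays off when the small file really is small.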

modified 4.9 years ago • written 4.9 years ago by Vivek
ole.tange wrote, 4.9 years ago (Denmark):

There are better answers that address your specific issue, but they do not answer the general question: what do you do when your Perl script is slow?

You profile your code. NYTProf plus kcachegrind are good tools for figuring out which parts of your code need attention:

perl -d:NYTProf testme.pl; 
nytprofcg; 
kcachegrind nytprof.callgrind

When you cannot make that faster, see if you can somehow parallelize the code. Can you divide your input into chunks, compute each chunk individually and merge the output when you are done? This technique is called map-reduce.

For smaller jobs (requiring fewer than 1000 CPUs), GNU Parallel (Parallelize Serial Command Line Programs Without Changing Them) can often help with chunking the input and running the jobs; you then just need to merge the final output.

If you still need more speed, you will have to look at a compiled language such as C++. Again, use the profiler to figure out which parts of your code need the speedup, and instead of rewriting everything in C++ you can address just the small part of the code that is the bottleneck. The good part about this is that much of the error handling and the corner cases can often be left to Perl.
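
As one hedged illustration of that last point: the Inline::C module (Inline::CPP is the C++ flavour) lets a Perl script hand just a hot inner function to a C compiler while all of the surrounding I/O and error handling stays in Perl. The function below is made up for illustration and requires Inline::C plus a working C compiler:

#!/usr/bin/env perl
use strict;
use warnings;

# A made-up hot spot, compiled to C: counting tab characters in a string.
use Inline C => <<'END_C';
int count_tabs(char *s) {
    int n = 0;
    for (; *s; s++)
        if (*s == '\t')
            n++;
    return n;
}
END_C

# Everything around the hot spot stays in Perl.
print count_tabs("chr1\t12345\trs123\tA\tG\n"), "\n";    # prints 4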

 

modified 4.9 years ago • written 4.9 years ago by ole.tange