Perl script running slow with big files
4
2
Entering edit mode
6.8 years ago
cvu ▴ 180

Hi all,

I've written one perl script, which compares two vcf files and write output in third file. it is running good with small files, but when i am inputting big files in Gbs, my system becomes slow and it would hang.

I'm using dbSNP vcf file as input which is around 9GB.

Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl.

Any help would be appreciated !!

Thank you !!!!

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use List::MoreUtils;

open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n"; my @array=(<FILE>); my @CHR; my @location; my @rs; my @ref_n; my @alt_n; foreach (@array) { chomp; my ($chrom, $pos,$id, $ref,$alt, $qual,$filter, $info) = split(/\t/,$_);

push @CHR, $chrom; push @location,$pos;
push @rs, $id; push @ref_n,$ref;
push @alt_n, $alt; } open (FILE1, "<", "trial_rs.vcf") or die "failed to open:$!\n";
my @array1=(<FILE1>);

open (OUT,">trial_output.vcf");
my @columns;

foreach (@array1)
{
chomp;
@columns=split(/\t/, $_); my$i;
for ($i=0;$i<@array; $i++) { if (($columns[0] eq $CHR[$i]) and ($columns[1] eq$location[$i]) and ($columns[3] eq $ref_n[$i]) and ($columns[4] eq$alt_n[$i])) {$columns[2]=$rs[$i];

}

}

print OUT join("\t", @columns), "\n";
}


perl SNP snp Assembly alignment • 9.4k views
0
Entering edit mode

"Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl."

It seems obvious that one needs to post your script for this purpose, doesn't it?

0
Entering edit mode

This is the script

... moved ...

7
Entering edit mode
6.8 years ago

The problem is this:

my @array=(<FILE>); ## slurp a whole Terabyte into RAM

Do not read a file like this, unless it is very small or your memory is huge, instead use:

while (<FILE>) { # read each line, then forget about it

   chomp;

   split /t/;

   ....

}

You are iterating over all entries of a huge array for each line of a huge file. Hashes in Perl provide constant time access to elements given the hash key.  Use a hash of hashes to store your filter table.

You seem to filter a large file by a small file, therefore:

• process the small file first (using while construct), parse the small file into a hash using the location as key of a nested hash
• process the large file as above and look up each of the entries in the hash created before

As a result you will only need as much memory as is required to store the small file.

0
Entering edit mode

I guess I should have refreshed the page before answering, since my answer just echoes what you already wrote!

0
Entering edit mode

Ok i'll try this way.

Thanks a lot Michael Dondrup !!!!!!

3
Entering edit mode
6.7 years ago
Kenosis ★ 1.3k

Consider the following refactoring:

use strict; use warnings; use autodie;

my ( @array, %hash );

open my $FILE, "<", "dbSNP_in.vcf"; while (<$FILE>) {     $hash{"@array[0, 1, 3, 4]"} =$array[2] if @array = split /\t/, $_, 6; } close$FILE;

open my $OUT, ">", "trial_output.vcf"; open my$FILE1, "<", "trial_rs.vcf";

while (<$FILE1>) { my @columns = split /\t/; $columns[2] = $hash{"@columns[0, 1, 3, 4]"} if exists$hash{"@columns[0, 1, 3, 4]"};

    print $OUT join "\t", @columns; } close$OUT; close $FILE1; • autodie is used to trap all i/o errors. • chomp is not needed within either loop. • splitting only the amount needed is faster than splitting all. • An interpolated array slice ("@array[0, 1, 3, 4]", which produces a string comprised of elements 0, 1, 3, 4) is used as a hash key with a corresponding value of array element 2--for use later. • An interpolated array slice, as a hash key, is tested for existence, and $columns[2] is set to that key's corresponding value if that key exists.  Doing this, instead of using four eq statements, should speed up the file processing.

Hope this helps!

0
Entering edit mode

Thanks it helped me a lot !!

0
Entering edit mode

Are you sure chomp is not needed? Way back in time I learned that it is always good and safe to chomp. What sometimes happens to me when I forget to use chomp is that I get the newline character in the last column, this can yield hard to debug errors if you use a string from the last column.

0
Entering edit mode

My apologies for not explaining my claim...

In the first while loop, only the first five fields are used; the remaining are discarded, including the last field which contains the record separator that chomp would remove.  In the second while loop, the record separator is retained and used later when the fields are joined and printed.

2
Entering edit mode
6.8 years ago
Vivek ★ 2.5k

I'd suggest using Tabix to index the larger file and use the tabix perl module to query it for each locus in the smaller VCF file. That's likely the most optimal solution in terms of time and memory.

http://samtools.sourceforge.net/tabix.shtml

1
Entering edit mode
6.7 years ago
ole.tange ★ 4.0k

You profile your code. NYTProf + kcachegrind are good tool to figure out which parts of your code needs attention:

perl -d:NYTProf testme.pl;
nytprofcg;
kcachegrind nytprof.callgrind

When you cannot make that faster, see if you can somehow parallelize the code. Can you divide your input into chunks, compute each chunk individually and merge the output when you are done? This technique is called map-reduce.

For smaller jobs (requiring fewer than 1000 CPUs) Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them can often be of help in doing the chunking of input and running of jobs. You just need to merge the final output.

If you still need more speed, you will need to look at a compiled language such as C++. Again you will use the profiler to figure out which parts of your code need the speedup. And instead of rewriting everything into C++ you can address the small part of the code that is the bottleneck. The good part about this is that much of error handling and corner cases often can be left to Perl.

Traffic: 2525 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.