Question: Perl Script Dies When Processing Large Datafiles. Is It A Perl Buffering Issue?
0
7.7 years ago by
dannyjmh20
dannyjmh20 wrote:

Hey everyone! I think I'm having a buffering issue: I need to read and parse big text files (created by my own script in earlier lines of the code) and finally print the results to another file. At some point, after reading a file with 90855 lines, the script does not read a line of the next file completely. I counted the number of characters read before this happens (233467), so I tried to flush the buffer and sleep before reading the next line of the file. It doesn't work. Any suggestions, please? Thanks a lot. The relevant part of the code follows:

for my $o (0..1){
  if ($o==0){
    @files = reverse <*_SITES_3utr>;
  }else{
    @files = reverse <*_SITES_cds>;
  }
undef(%pita_sites_nu);undef(%pita_tot_score);my($comp_p);undef(%allowed_wobbles);#undef(%site_nu);
foreach $i(@files){
   my $buff=0;
  print "Analyzing $i\n";sleep(1);
  $program= $1 if $i=~ /(\w+)_SITES/;
  open(FIL, $i) or die "$!: $i\n";
  while(<FIL>){

    $buff += length($_); if ($buff >= 230000){$buff=0;sleep(1);select((select(FIL), $|=1)[0]);} #FLUSH THE BUFFER, NOT WORKING!!!

    undef($a);
    unless($.== 1){
      if ($o==0){
        if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(\S+)\t(.*)/){
          $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5; $comp_p= $6;$a= $7;$j= "${mirna}_${target}_${start}_$end";
          $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site\t$comp_p";#Store each site in a hash
        }else{die "$buff characters, in line $.:$_\n"} #DIES HERE!!!
      }else{
        if (/^\d+\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S+)\t(.*)/){
          $mirna= $1; $target= $2; $start= $3; $end= $4; $site= $5;$a= $6;$j= "${mirna}_${target}_${start}_$end";
          $site_nu{$j}= "$mirna\t$target\t$start\t$end\t$site";#Store each site in a hash
        }
      }

It dies at the "DIES HERE!!!" die, after reading 3413 characters of the second file. This happens because the regex fails to match, since only half of the line is in $_. Help please! Thanks again.

perl • 3.0k views
ADD COMMENTlink modified 5.5 years ago by Biostar ♦♦ 20 • written 7.7 years ago by dannyjmh20
1

Stupid question, but when you look at the line in the second file where your program is dying, does it have the right number of fields? Are they properly delimited? In my experience, Perl is able to handle files that have millions of lines without any special attention to buffering on my part, so I would be very skeptical that your issue lies there.

ADD REPLYlink written 7.7 years ago by Mitch Bekritsky1.2k

Hey Mitch, thanks. Not a stupid question at all; it's well received. Yes, all the data is in the file. In the end, I had to flush the filehandle I was using to write the files before I started parsing them. Maybe I got a buffering problem because I had to open and write to many files earlier in the script. I'm new to Perl, so... Thanks so much.
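A minimal sketch of the fix described here, assuming the files are written and then read back within the same script. The temp file and the sample record are hypothetical stand-ins for the real *_SITES_3utr data; the point is only that the writing handle is closed (which also flushes it) before the file is opened for reading.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical temp file standing in for one of the *_SITES_3utr files.
my ($out, $path) = tempfile(UNLINK => 1);

# Write one tab-delimited record. Until the handle is flushed or closed,
# the tail of the data may still sit in Perl's output buffer, not on disk.
print {$out} join("\t", 1, 'miR-1', 'geneA', 10, 17, '7mer', '0.5', 'rest'), "\n";

# Close the writer BEFORE reading the file back; close() flushes the buffer.
close($out) or die "close failed: $!";

# Now a separate read handle sees the complete line.
open(my $in, '<', $path) or die "$!: $path\n";
my $line = <$in>;
close($in);

chomp $line;
my @fields = split /\t/, $line;
print scalar(@fields), " fields read\n";
```

If the writing handle has to stay open, calling `$out->flush` (from IO::Handle) before reading should have the same effect on what the reader can see.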

ADD REPLYlink written 7.7 years ago by dannyjmh20
1

My pleasure Danny. I've worked with Perl on and off quite a bit, so I'm happy to help when I can. The only other thing to think about is maybe closing all the files you had open earlier in the script? As for stupid questions, usually when I encounter a programming bug that looks like a fault with the language (e.g. no more buffer), the answer is more likely to be a mistake that I made than exposing the shortcomings of a programming language. In my experience, when it's the people who wrote a programming language and/or cosmic rays magically changing output from a program vs my own mistakes, my own mistakes are always the cause ;)

ADD REPLYlink modified 7.7 years ago • written 7.7 years ago by Mitch Bekritsky1.2k
3
7.7 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

Your $buff variable is an integer, so it cannot cause any kind of memory overflow; that is not the problem in the least. There is no need to flush it. I am not even sure what the piece of code that you call flushing does, but I am almost certain it is not needed.
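For reference, the idiom `select((select(FIL), $|=1)[0])` from the question turns on autoflush for the handle FIL. Autoflush only affects output, so on a handle opened for reading it does nothing useful. A small sketch of what it does on a write handle, with a hypothetical temp file:

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

# Hypothetical temp file used only for this illustration.
my ($out, $path) = tempfile(UNLINK => 1);

# Older idiom: temporarily select $out, set $| (autoflush) on it,
# then restore the previously selected handle.
select((select($out), $| = 1)[0]);
# More readable equivalent from IO::Handle: $out->autoflush(1);

# No newline and no close(), yet the bytes are pushed to the file
# right after print because autoflush is on for this handle.
print {$out} "partial line without newline";

my $size = -s $path;
print "bytes on disk: $size\n";
```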

You should not exit a program with a "die" error just because a line does not match a regexp.

The correct solution is to split the line by tabs and then investigate the number of columns and their contents.
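A rough sketch of the split-and-check approach, assuming the 3'UTR files have at least eight tab-separated columns with numeric start and end positions, as the regex in the question suggests. The helper name `parse_site_line` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of columns, or nothing
# if the line does not have the expected shape.
sub parse_site_line {
    my ($line, $min_cols) = @_;
    chomp $line;
    my @cols = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    return if @cols < $min_cols;         # too few columns (e.g. truncated line)
    return unless $cols[3] =~ /^\d+$/ && $cols[4] =~ /^\d+$/;  # start, end
    return \@cols;
}

# Made-up sample lines: one complete record and one truncated record.
my @lines = (
    "1\tmiR-1\tgeneA\t10\t17\t7mer\t0.5\trest\n",
    "2\tmiR-1\tgeneB\t10",
);

for my $n (0 .. $#lines) {
    my $cols = parse_site_line($lines[$n], 8);
    if ($cols) {
        my ($mirna, $target, $start, $end) = @{$cols}[1 .. 4];
        print "ok: ${mirna}_${target}_${start}_$end\n";
    } else {
        # Report the bad line and carry on instead of dying.
        warn "skipping malformed line ", $n + 1, "\n";
    }
}
```

Reporting the column count of a rejected line makes it easy to tell a truncated line from a badly delimited one.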

ADD COMMENTlink written 7.7 years ago by Istvan Albert ♦♦ 85k

Thank you Istvan. I also tried splitting the line and the same thing happened. Yes, flushing the input buffer is silly. In the end, I flushed the output filehandle I had been using before I started parsing the files, and the problem was solved. Thank you so much for the help.

ADD REPLYlink written 7.7 years ago by dannyjmh20