Question: Is Zlib Library Failing To Process Properly Bgzip Files?
6
gravatar for Pablo Marin-Garcia
3.2 years ago by
Spain
Pablo Marin-Garcia1.7k wrote:

bgzip files are backward compatible with gzip, but I have issues when using bgzip compressed vcf files with snpeff (java) or perl scripts that uses IO::Uncompress::Gunzip (that I believe it uses zlib under the hood). In both cases the data is decompressed but truncated after a few hundred lines aprox. I could be totally wrong but I was wondering if zlib (or whatever gzip compatible library they are using) is getting confused with the bgzip bloks and only processing one or a few of them leaving the output incomplete.

perl code that does not work:

#!/usr/bin/env perl
use strict;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError) ;

my $infile = shift;
my $infh = IO::Uncompress::Gunzip->new( $infile ) or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";

This gives 419 lines

$ perl /home/pmg/tmp/test_zlib-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 419

but using open with gzip pipe works:

#!/usr/bin/env perl
use strict;

my $infile = shift;
open(my $infh , 'gzip -dc '.$infile.' |');  # I can use bgzip intead gzip
my $line_count = 0;
while (my $line=<$infh>){

    $line_count++
}
print "total lines read = $line_count\n";

Gives the expected number of lines

$ perl /home/pmg/tmp/test_gzip-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 652829

I googled about and I was unable to find quickly any relevant entry, but this is something that I am sure other people would have already faced. Do someone have a clue about why is this happening? I am using ubuntu 12.04.4 with perl 5.16

vcf • 2.4k views
ADD COMMENTlink modified 2.1 years ago by Andy Yates110 • written 3.2 years ago by Pablo Marin-Garcia1.7k
1
gravatar for Andy Yates
2.1 years ago by
Andy Yates110
Cambridge
Andy Yates110 wrote:

Hi. I have just encountered the same problem and it seems there is a pure-Perl solution surrounding this as shown below

my $status = gunzip("in.gz" => "out.gz", MultiStream => 1)
  or die "gunzip failed: $GunzipError\n";

It also seems Apache Commons Compress can provide similar support for Java. Hope this helps

ADD COMMENTlink written 2.1 years ago by Andy Yates110
3
gravatar for lh3
3.2 years ago by
lh329k
United States
lh329k wrote:

See the SAM spec about the bug/feature in the Java library - it only reads the first gzip block. There is also a related thread in BioStar. I guess Perl is the same. On Unix/Linux, I usually use:

open(FH, $ARGV[0] =~ /\.gz$/? "gzip -dc $ARGV[0] |" : $ARGV[0]) || die;
while (<FH>) {...}
ADD COMMENTlink written 3.2 years ago by lh329k

Thanks Heng, I have read many times the SAM specs, even the bgzip part but I forgot the footnote were it explain the kind of bug that has caught me twice this week using snpeff and VEP: "[2]It is worth noting that there is a known bug in the Java GZIPInputStream class that concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug."

ADD REPLYlink written 3.2 years ago by Pablo Marin-Garcia1.7k

Thanks for posting this, it will surely save others some time in the future. (The quote is from the bottom of page 12 in the SAM specs, in case anyone is wanting a reference).

ADD REPLYlink written 3.2 years ago by SES7.9k
1
gravatar for SES
3.2 years ago by
SES7.9k
Vancouver, BC
SES7.9k wrote:

I'm not sure this will fix the bug/feature causing the issue, but the best way to read this type of data in Perl would be to use the gzip IO layer.

use PerlIO::gzip;

open my $fh, "<:gzip", "file.gz" or die $!;
print while <$fh>; # And it will be uncompressed...

That isn't in Perl core but I think it's a favorite trick of Perl developers. If you don't have that, I usually just use zcat:

open my $fh, '-|', 'zcat', $file or die $!;
print while <$fh>; # And it will be uncompressed...

This list form allows you to pass arguments without metacharacters being interpreted by the shell, but I'm pretty sure this form is not allowed on Windows. If the solution you mentioned works, this second approach will certainly work.

ADD COMMENTlink written 3.2 years ago by SES7.9k

I have not used PerlIO::gzip but if it uses the zlib library probably would have the same problem. Calling gzip as a pipe works OK.

ADD REPLYlink written 3.2 years ago by Pablo Marin-Garcia1.7k
1

You are correct. I actually tested PerlIO::gzip with some 1000 genomes data and it doesn't work, though the second solution does work (and is probably safer). For general usage though, PerlIO::gzip is awesome because it much easier than typing out all that IO::Uncompress stuff or using pipes and it's also more Perlish. In terms of speed, there is no difference between using a pipe with "gzip -dc" or the second solution I posted above.

ADD REPLYlink written 3.2 years ago by SES7.9k

When I was pondering which gzip implementation to use, I read that PerlIO is faster but I decided for IO:: because is core and I have put the code in a method so I only needed write once all this verbosity. Regarding the pipe part, I always use the three parts open like yours. Is easy to write the code for reading or writting to the pipe BUT I was not able to find out how to write to the pipe AND redirect to a file like "open my $out, '| gzip -c > ' . $filename;", so I use both compress and decompress as two arguments open for symetry

ADD REPLYlink written 3.2 years ago by Pablo Marin-Garcia1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1418 users visited in the last hour