How can I get the uncompressed size of a BGZIP file?
3.4 years ago
sacha ★ 2.4k

Reading the uncompressed file size from a gzip file is easy: the value is stored in the last 4 bytes of the file. But this doesn't work with a bgzip file! Any idea how I can get this information without uncompressing the file?

bgzip gzip binary • 3.0k views
ADD COMMENT
3.4 years ago

Each block of a BGZF file is itself a valid GZIP file. So you can calculate the overall uncompressed size as the total of all those size entries recorded in the last four bytes of each GZIP member file.

It remains to find the end of each BGZF block, but fortunately each block's header contains a BC extra subfield that records the compressed length of the block. Hence you can calculate the total uncompressed size without decompressing anything and only reading about 22 bytes out of each ~64K as follows:

  • Read a BGZF header
  • Use the BC bsize field to seek to the end of the block
  • Read the last few bytes of the block, and add that isize field to your total
  • Repeat until EOF

You can find an implementation of this in C in my bgzfsize.c utility.
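The steps above can also be sketched in Bash. This is an illustration only, not the bgzfsize.c implementation mentioned above: the function names `read_le` and `bgzf_uncompressed_size` are made up for this example, and the code assumes a well-formed BGZF file in which every block carries a BC subfield. It does not verify CRCs or inflate anything.

```bash
#!/usr/bin/env bash
# Sum the ISIZE trailers of all BGZF blocks without inflating any data.

# Print COUNT bytes at OFFSET of FILE as one little-endian unsigned integer.
read_le() {  # usage: read_le FILE OFFSET COUNT
    local value=0 bits=0 byte
    for byte in $(od -An -v -t u1 -j "$2" -N "$3" "$1" 2>/dev/null); do
        value=$(( value + (byte << bits) ))
        bits=$(( bits + 8 ))
    done
    echo "$value"
}

bgzf_uncompressed_size() {  # usage: bgzf_uncompressed_size FILE
    local file="$1" pos=0 total=0 file_size xlen sub end bsize
    file_size=$(( $(wc -c < "$file") ))
    while [ "$pos" -lt "$file_size" ]; do
        # XLEN sits at bytes 10-11 of the gzip header; the extra field starts at byte 12.
        xlen=$(read_le "$file" $(( pos + 10 )) 2)
        sub=$(( pos + 12 ))
        end=$(( sub + xlen ))
        bsize=-1
        # Scan the extra subfields for "BC" (66 + 67*256 = 17218) with a 2-byte payload.
        while [ $(( sub + 6 )) -le "$end" ]; do
            if [ "$(read_le "$file" "$sub" 2)" -eq 17218 ] &&
               [ "$(read_le "$file" $(( sub + 2 )) 2)" -eq 2 ]; then
                bsize=$(read_le "$file" $(( sub + 4 )) 2)
                break
            fi
            sub=$(( sub + 4 + $(read_le "$file" $(( sub + 2 )) 2) ))
        done
        if [ "$bsize" -lt 0 ]; then
            echo "$file: no BC subfield at offset $pos, not a BGZF block" >&2
            return 1
        fi
        # BSIZE is the total block size minus 1; ISIZE is the block's last 4 bytes.
        pos=$(( pos + bsize + 1 ))
        if [ "$pos" -gt "$file_size" ]; then
            echo "$file: last block is truncated" >&2
            return 1
        fi
        total=$(( total + $(read_le "$file" $(( pos - 4 )) 4) ))
    done
    echo "$total"
}
```

Calling `bgzf_uncompressed_size file.gz` prints the total. Because it starts several `od` processes per block, this is only practical for small files; a compiled tool is the better choice for anything large.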

3.4 years ago

If you happen to have the *.gzi index file around (bgzip -i option when compressing), you could read the uncompressed offset of the last block from the index file. This way you'll only need to decompress the last block to find its uncompressed size.

In Bash you could do it for example like this.

last_offset=$( tail -c 8 file.gz.gzi | od --endian=little -An -t u8 )
last_size=$( bgzip -b $last_offset file.gz | wc -c )
echo $(( last_offset + last_size ))

Explained: tail reads the last 8 bytes of the index file, which contain the uncompressed offset of the last block, and od prints them as a little-endian unsigned 64-bit integer (the format of the index file is described in the bgzip(1) man page). bgzip then uncompresses from that offset onwards, and wc counts the bytes in the uncompressed output.
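Before trusting the last 8 bytes, it can help to check that the index file has the size its own header implies. The sketch below assumes the layout given in the bgzip(1) man page: a little-endian uint64 entry count, followed by that many pairs of uint64 compressed and uncompressed offsets. The function names `read_u64_le` and `gzi_last_uncompressed_offset` are hypothetical, chosen for this example.

```bash
#!/usr/bin/env bash
# Print the 8 bytes at OFFSET of FILE as a little-endian unsigned integer.
read_u64_le() {  # usage: read_u64_le FILE OFFSET
    local value=0 bits=0 byte
    for byte in $(od -An -v -t u1 -j "$2" -N 8 "$1" 2>/dev/null); do
        value=$(( value + (byte << bits) ))
        bits=$(( bits + 8 ))
    done
    echo "$value"
}

# Print the uncompressed offset stored in the last entry of a .gzi index.
gzi_last_uncompressed_offset() {  # usage: gzi_last_uncompressed_offset FILE.gzi
    local gzi="$1" size entries
    size=$(( $(wc -c < "$gzi") ))
    entries=$(read_u64_le "$gzi" 0)
    # Assumed layout: 8-byte count, then 16 bytes per entry.
    if [ "$size" -lt 8 ] || [ "$size" -ne $(( 8 + 16 * entries )) ]; then
        echo "$gzi: size does not match its entry count" >&2
        return 1
    fi
    if [ "$entries" -eq 0 ]; then
        echo 0    # no entries recorded: start from offset 0
    else
        read_u64_le "$gzi" $(( size - 8 ))
    fi
}
```

The printed value can be used in place of `last_offset` in the snippet above. Reading single bytes with `od -t u1` avoids depending on the `--endian` option, which not every `od` provides.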

If you don't happen to have the *.gzi index file, well, you could recreate it with bgzip -r for an existing bgzipped file, but that of course requires processing the whole (compressed) file...


Is a gzi index the same as a tbi index?


No, a gzi index file is in a different format than a tbi tabix index file. The gzi file is produced when running bgzip with the -i switch when compressing or with the -r switch when running on an existing bgzipped file.


OK! I am asking because tbi is more common than gzi, so I wonder whether I can get this information from that index too.


Unfortunately I am not that familiar with the tabix tbi file format. The format is documented here, and at a quick glance I would assume that in a tbi index for a bgzipped file all the offsets refer to the compressed file and not to the uncompressed content, so the answer is probably no.


You don't need to decompress the last chunk of data. You can read the size of the last chunk in the same way you read the last offset:

last_size=$(tail -c 32 $file.gz | od --endian=little -N 4 -An -t u4)

Explained: BGZIP files end with an EOF marker, which is just an empty block compressed with GZIP. It is always 28 bytes long. Also, the GZIP specification (RFC 1952) states that the last 4 bytes of a gzip member (here, a BGZF block) contain the uncompressed size of that member. So take the last 28+4 bytes of the compressed file and decode the first 4 of them to get the size of the last data block.

Or, using only od for reading the file:

last_size=$(od --endian=little -j $(($(stat -c %s $file.gz)-32)) -N 4 -An -t u4 $file.gz)
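Both one-liners rely on the file actually ending with the EOF marker; on a truncated file, or one written without the marker, they would decode unrelated bytes. A minimal sketch of a guard follows. The function names `has_bgzf_eof` and `last_block_size` are made up for this example, and the expected bytes are the standard 28-byte empty BGZF block.

```bash
#!/usr/bin/env bash
# Succeed only if FILE ends with the 28-byte BGZF EOF marker (an empty block).
has_bgzf_eof() {  # usage: has_bgzf_eof FILE
    # gzip header, BC subfield with BSIZE=27, empty deflate stream, CRC32=0, ISIZE=0
    local expected="1f8b0804""00000000""00ff""0600""42430200""1b00""0300""00000000""00000000"
    [ "$(tail -c 28 "$1" | od -An -v -t x1 | tr -d ' \n')" = "$expected" ]
}

# Print the ISIZE of the last data block: the 4 bytes just before the marker.
last_block_size() {  # usage: last_block_size FILE
    local file="$1"
    if ! has_bgzf_eof "$file"; then
        echo "$file: no BGZF EOF marker" >&2
        return 1
    fi
    # A file that holds only the marker has no data block at all.
    if [ "$(( $(wc -c < "$file") ))" -eq 28 ]; then
        echo 0
        return 0
    fi
    # Decode 4 little-endian bytes without relying on od's --endian option.
    set -- $(tail -c 32 "$file" | od -An -t u1 -N 4)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

`last_block_size file.gz` then prints the same value as the commands above, or fails with a message when the marker is missing.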