Each block of a BGZF file is itself a valid GZIP file. So you can calculate the overall uncompressed size as the sum of the ISIZE fields, i.e. the uncompressed sizes recorded in the last four bytes of each GZIP member.
It remains to find the end of each BGZF block, but fortunately each block's header contains a BC extra subfield that records the compressed length of the block. Hence you can calculate the total uncompressed size without decompressing anything and only reading about 22 bytes out of each ~64K as follows:
- Read a BGZF header
- Use the BC bsize field to seek to the end of the block
- Read the last few bytes of the block, and add that isize field to your total
- Repeat until EOF
You can find an implementation of this in C in my bgzfsize.c utility.
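For illustration only, the same loop can be roughly sketched in Bash. This is not the code from bgzfsize.c: the function names `read_le` and `bgzf_uncompressed_size` are made up for this example, and the sketch assumes that BC is the first extra subfield (as in files written by bgzip), so that BSIZE sits at bytes 16-17 of each block header.

```bash
# Print NBYTES bytes at OFFSET in FILE as one little-endian unsigned integer.
read_le() {
    local file=$1 offset=$2 nbytes=$3 value=0 bits=0 byte
    for byte in $(od -An -v -t u1 -j "$offset" -N "$nbytes" "$file"); do
        value=$(( value + (byte << bits) ))
        bits=$(( bits + 8 ))
    done
    echo "$value"
}

# Sum the ISIZE fields of all BGZF blocks in FILE without decompressing.
bgzf_uncompressed_size() {
    local file=$1 offset=0 total=0 file_size bsize isize
    file_size=$(( $(wc -c < "$file") ))
    while (( offset < file_size )); do
        # Assumption: the BC subfield comes first, so 'B' (66) and 'C' (67)
        # are at header bytes 12-13; 66 + 67 * 256 = 17218.
        if (( $(read_le "$file" $(( offset + 12 )) 2) != 17218 )); then
            echo "no BC subfield at offset $offset" >&2
            return 1
        fi
        # BSIZE is the total block size minus 1.
        bsize=$(read_le "$file" $(( offset + 16 )) 2)
        # ISIZE is stored in the last four bytes of the block.
        isize=$(read_le "$file" $(( offset + bsize + 1 - 4 )) 4)
        total=$(( total + isize ))
        offset=$(( offset + bsize + 1 ))
    done
    echo "$total"
}
```

Calling `bgzf_uncompressed_size file.gz` prints the total. The sketch starts several od processes per block, so it will be much slower on large files than a compiled program doing the same reads.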
If you happen to have the *.gzi index file around (created with the bgzip -i option when compressing), you could read the uncompressed offset of the last block from the index file. This way you'll only need to decompress the last block to find its uncompressed size.
In Bash you could do it for example like this.
    last_offset=$( tail -c 8 file.gz.gzi | od --endian=little -An -t u8 )
    last_size=$( bgzip -b $last_offset file.gz | wc -c )
    echo $(( $last_offset + $last_size ))
- tail reads the last 8 bytes of the index file, which contain the uncompressed offset of the last block,
- od prints them as a little-endian unsigned 64-bit integer (the format of the index file is described in the bgzip(1) man page),
- bgzip uncompresses from that offset onwards, and
- wc counts the bytes in the uncompressed output.
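Not every od supports the GNU `--endian` option. As a hedged alternative, the last 8 bytes of the index can be decoded by hand from single-byte values. The function name `gzi_last_offset` is hypothetical, and the sketch assumes the offset fits in a signed 64-bit shell integer.

```bash
# Print the little-endian unsigned integer stored in the last 8 bytes
# of a .gzi index, without relying on od --endian.
gzi_last_offset() {
    local index=$1 value=0 bits=0 byte
    for byte in $(tail -c 8 "$index" | od -An -v -t u1); do
        value=$(( value + (byte << bits) ))
        bits=$(( bits + 8 ))
    done
    echo "$value"
}
```

Its output could stand in for the tail/od pipeline, e.g. `last_offset=$( gzi_last_offset file.gz.gzi )`.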
If you don't happen to have the *.gzi index file, well, you could recreate it with bgzip -r for an existing bgzipped file, but that of course requires processing the whole (compressed) file...