Question

Biopython byte positions are not compatible with bgzip

3.1 years ago

b10hazard • 0

I have a blocked gzip (BGZF) file where the data I want lies between two byte positions, which I determined using the fh.tell() function of Biopython's BgzfReader. I can easily access this data using this code...

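from Bio.bgzf import BgzfReader

bgzip_path = "/path/to/bgzip"  # path to the BGZF-compressed file

# Positions as reported by BgzfReader.tell()
start = 75191629497
stop = 75191634445

with BgzfReader(bgzip_path) as fh_reads:
    fh_reads.seek(start)
    for line in fh_reads:
        # Stop once the reader has moved past the end position
        if fh_reads.tell() > stop:
            break
        print(line)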

The code above works perfectly and prints out the expected data.

My problem is that these offsets do not work with other htslib utilities. For example, the bgzip command-line utility has a -b option for the start byte offset and a -s option for the size of the data you want to decompress. Using the above example, the size would be 75191634445 - 75191629497 = 4948 bytes. So I tried the following:

bgzip -c -b 75191629497 -s 4948 /path/to/bgzip

This command doesn't work. I get a "Segmentation fault (core dumped)" error. My question is... Can the byte positions generated and used by biopython's BgzfReader be used with other htslib based applications? If so, how would I do this? Thanks.

htslib biopython bgzip • 803 views

updated 3.1 years ago by Ahill ★ 1.9k • written 3.1 years ago by b10hazard • 0

score 0 · Answer 1 · 2021-03-05

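From the bgzip man page, it looks like the command-line bgzip -b option expects a zero-based uncompressed offset. Bio.bgzf, however, uses virtual offsets, which are not the same thing: a virtual offset packs the compressed-file offset of the start of a BGZF block into its upper 48 bits and the position within that block's decompressed data into its lower 16 bits. The Bio.bgzf documentation looks like it may tell you how to convert between virtual and uncompressed offsets.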
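
To make the difference concrete, here is a minimal sketch of the conversion in plain Python. It is not taken from Biopython or htslib; the function names are my own, and it assumes you can supply a table of blocks as (raw_start, raw_length, data_start, data_length) tuples, which is the shape that Bio.bgzf.BgzfBlocks yields.

```python
def split_virtual_offset(virtual_offset):
    """Return (compressed block start, offset within the decompressed block)."""
    # Upper 48 bits: where the BGZF block starts in the compressed file.
    # Lower 16 bits: position inside that block's decompressed data.
    return virtual_offset >> 16, virtual_offset & 0xFFFF


def virtual_to_uncompressed(virtual_offset, blocks):
    """Map a BGZF virtual offset to a plain uncompressed offset.

    `blocks` is an iterable of (raw_start, raw_length, data_start, data_length)
    tuples (assumed to match what Bio.bgzf.BgzfBlocks yields).
    """
    block_start, within_block = split_virtual_offset(virtual_offset)
    for raw_start, _raw_length, data_start, _data_length in blocks:
        if raw_start == block_start:
            # data_start is the uncompressed offset of the block's first byte
            return data_start + within_block
    raise ValueError("no BGZF block starts at compressed offset %d" % block_start)


# The two positions from the question, split into their components
print(split_virtual_offset(75191629497))
print(split_virtual_offset(75191634445))
```

With Biopython installed, something like virtual_to_uncompressed(start, BgzfBlocks(open(bgzip_path, "rb"))) should give a value of the kind bgzip -b expects, at the cost of scanning every block of the file once. As far as I know, bgzip may also want a .gzi index next to the file for -b to work, so check the man page for your version.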