Biopython byte positions are not compatible with bgzip
8 months ago
b10hazard • 0

I have a blocked gzip (BGZF) file where the data I want lies between two byte indexes, which I determined using Biopython's BgzfReader fh.tell() function. I can easily access this data using this code:

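from Bio.bgzf import BgzfReader

# Offsets as reported by BgzfReader.tell()
start = 75191629497
stop = 75191634445

# The loop lines were lost from the post; this is an assumed reconstruction.
with BgzfReader("/path/to/bgzip") as fh:
    fh.seek(start)
    for line in fh:
        if fh.tell() > stop:
            break
        print(line, end="")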


The code above works perfectly and prints out the expected data.

My problem is that these offsets do not work for other htslib utilities. For example, the bgzip command line utility has a -b option for the start byte offset and a -s option for the size of the data you want to decompress. Using the above example the size would be 75191634445 - 75191629497 or 4948 bytes. So I tried the following:

bgzip -c -b 75191629497 -s 4948 /path/to/bgzip


This command doesn't work. I get a "Segmentation fault (core dumped)" error. My question is... Can the byte positions generated and used by biopython's BgzfReader be used with other htslib based applications? If so, how would I do this? Thanks.

htslib biopython bgzip • 289 views
8 months ago
Ahill ★ 1.9k

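From the bgzip man page, it looks like the command-line bgzip -b option expects a zero-based uncompressed offset. Bio.bgzf, however, uses virtual offsets, which are not the same thing: a virtual offset packs the compressed start of a BGZF block into its upper 48 bits and the position within that block's decompressed data into its lower 16 bits. The Bio.bgzf documentation looks like it may tell you how to convert between virtual and uncompressed offsets.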
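To make the difference concrete, here is a rough sketch of the conversion rather than a tested recipe. The two function names are my own. The first does the same bit arithmetic that I believe Bio.bgzf.split_virtual_offset performs. The second assumes you have a table of blocks as 4-tuples of (raw start, raw length, data start, data length), which is the shape I understand Bio.bgzf.BgzfBlocks yields when it scans a file.

```python
def split_virtual_offset(virtual_offset):
    """Split a BGZF virtual offset into (compressed block start, offset within block)."""
    block_start = virtual_offset >> 16       # upper 48 bits
    within_block = virtual_offset & 0xFFFF   # lower 16 bits
    return block_start, within_block


def virtual_to_uncompressed(virtual_offset, blocks):
    """Translate a virtual offset into a 0-based uncompressed offset.

    `blocks` is assumed to be an iterable of
    (raw_start, raw_length, data_start, data_length) tuples, one per BGZF block.
    """
    block_start, within_block = split_virtual_offset(virtual_offset)
    for raw_start, _raw_length, data_start, data_length in blocks:
        if raw_start == block_start:
            if within_block > data_length:
                raise ValueError("within-block offset exceeds the block's data length")
            return data_start + within_block
    raise ValueError("no block starts at compressed offset %d" % block_start)


# The two offsets from the question fall inside the same compressed block.
print(split_virtual_offset(75191629497))  # (1147333, 14009)
print(split_virtual_offset(75191634445))  # (1147333, 18957)

# Made-up two-block table, purely for illustration (not from the real file).
hypothetical_blocks = [(0, 100, 0, 65280), (100, 120, 65280, 65280)]
print(virtual_to_uncompressed((100 << 16) | 10, hypothetical_blocks))  # 65290
```

So the block holding your data starts at compressed byte 1147333, and your region covers bytes 14009 to 18957 of that block's decompressed contents. The size of 4948 happens to be right because both offsets are in the same block, but the value passed to -b would have to be the uncompressed offset, which needs the data start of that block from a scan or an index of the file.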