Implement Bgzip Into C Program
8.8 years ago
FGV ▴ 130

Dear all,

I'm currently writing a program that needs random access to large files. After looking around a bit I found BGZIP, which is widely used in SAMtools.

I tried to implement it into my program but I'm getting an error: "Error: invalid block header"

Also, if I set start_pos = 0 it works.

I've tried to decompress the file with bgzip (compiled from samtools 0.1.18) and it works fine! Here is the code I'm using:

    BGZF* in_glf_fh;

    // Define chunk start and end positions
    unsigned int start_pos = 2203 * 10000;
    unsigned int end_pos = start_pos + 10000 - 1;
    unsigned int chunk_size = end_pos - start_pos + 1;

    // Open input file
    in_glf_fh = bgzf_open(pars->in_glf, "rb");
    if( in_glf_fh == NULL )
        error("ERROR: cannot open GLF file!");

    // Search start position
    if( bgzf_seek(in_glf_fh, start_pos * pars->n_ind * 3 * sizeof(double), SEEK_SET) < 0 )
        error("ERROR: cannot seek GLF file!");

    for(unsigned int c = 0; c < chunk_size; c++) {
        if( (unsigned int) bytes_read != sizeof(double) * pars->n_ind * 3 )
            fprintf(stderr, "Error: %s\n", in_glf_fh->error);
    }

    bgzf_close(in_glf_fh);


FGV

samtools tabix c • 3.0k views

What are start_pos, end_pos and chunk_size? How do you know that your offset in bgzf_read is not out of bounds?


chunk_size is the amount of data I want to read (10000 in this case).

start_pos and end_pos define the interval I want to read from the BGZIP file.


cross-posted on the samtools-dev mailing list: http://sourceforge.net/mailarchive/message.php?msg_id=29974208

8.8 years ago
FGV ▴ 130

Thank you all for your replies. Heng Li pointed me in the right direction: use the RAZF library (also in SAMtools).

Indeed, it is much faster (~20x) than standard GZIP and also allows seeking by normal file offsets. BGZIP apparently only accepts virtual offsets, which have to be computed and stored beforehand using bgzf_tell().
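
To illustrate why a plain byte offset does not work with bgzf_seek(): as far as I understand the BGZF format, a virtual offset packs the file position of a compressed block's start into the upper 48 bits and the position inside that block's uncompressed data into the lower 16 bits. Below is a minimal sketch of that packing. The helper names are made up for illustration and are not part of bgzf.h.

```c
#include <stdint.h>

/* Hypothetical helpers (not part of bgzf.h) showing the layout of a
 * BGZF virtual offset:
 *   upper 48 bits: byte offset in the file where a compressed block starts
 *   lower 16 bits: byte offset inside that block's uncompressed data */
static inline uint64_t make_virtual_offset(uint64_t block_start,
                                           uint16_t within_block)
{
    return (block_start << 16) | within_block;
}

/* Recover the compressed-file position of the block. */
static inline uint64_t voffset_block_start(uint64_t voffset)
{
    return voffset >> 16;
}

/* Recover the position inside the uncompressed block. */
static inline uint16_t voffset_within_block(uint64_t voffset)
{
    return (uint16_t)(voffset & 0xFFFF);
}
```

Note that the only value that means the same thing as a plain offset and as a virtual offset is 0, which would be consistent with start_pos = 0 working in the question. Any other plain offset is read as "block at (offset >> 16)", and that position usually does not hold a block header.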

8.8 years ago

If you want random access (and you have the byte offsets for each of the compressed streams contained within an archive), you can just use bzip2 or gzip libraries with standard C I/O calls (fopen, fseek, etc.), instead of a third-party library built atop a third-party library. This is what I did for starch / unstarch in the BEDOPS suite. Code is available there if you want to take a look. Though I think bgzip is based on gzip, so the sample code on the zlib site should also help.


I need to perform lots of I/O operations on large files, so it has to be fast. I have actually already implemented GZIP, but it's just too slow: each read from the file takes ~10 seconds.


I'm not suggesting using gzip/gunzip command line programs, but rather using the sample code for extracting data as a starting point for your project. You would seek the desired byte offset within the file and then use zlib's inflate() function (assuming that the byte offset points to the start of one of the archive's multiple zip streams). I took a look at some bgzip code and I think it is mostly equivalent to this procedure, using zlib for backend compression.


Yes, bgzip concatenates gzip blocks as you said. By using bgzip, you just save yourself the effort of rolling your own implementation, which is not hard but will take a bit of time. Bgzip is implemented in two files (without remote reading). To use it, you put the two files in your source tree. You won't need end users to pre-install the "bgzip library"; actually, there is no way to do that.

BTW, I guess FGV was saying gzseek() from zlib is too slow. He was not using the gzip/gunzip command line programs.
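
For anyone rolling their own reader, here is a rough sketch of how a BGZF block header can be inspected, based on my reading of the format: each block is a gzip member with the FEXTRA flag set and a "BC" extra subfield whose 16-bit value is the total block size minus one. The function name peek_block_size is hypothetical and not part of bgzip. A check along these lines is presumably what fails with "invalid block header" when a seek lands in the middle of a block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of bgzip): given the first bytes of what
 * should be a BGZF block, return the total block size in bytes, or -1 if
 * the bytes do not look like a BGZF block header. */
static long peek_block_size(const uint8_t *buf, size_t len)
{
    if (len < 12)
        return -1;
    /* gzip magic (1f 8b), deflate method (8), FEXTRA flag (bit 2) set */
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 || !(buf[3] & 4))
        return -1;

    /* XLEN: little-endian length of the extra field, at bytes 10..11 */
    size_t xlen = buf[10] | ((size_t)buf[11] << 8);
    if (len < 12 + xlen)
        return -1;

    /* Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2 */
    size_t pos = 12, end = 12 + xlen;
    while (pos + 4 <= end) {
        size_t slen = buf[pos + 2] | ((size_t)buf[pos + 3] << 8);
        if (pos + 4 + slen > end)
            return -1;
        if (buf[pos] == 'B' && buf[pos + 1] == 'C' && slen == 2) {
            unsigned bsize = buf[pos + 4] | ((unsigned)buf[pos + 5] << 8);
            return (long)bsize + 1;   /* BSIZE stores block size minus 1 */
        }
        pos += 4 + slen;
    }
    return -1;
}
```

Starting at file offset 0 and repeatedly adding the returned size gives the start of every block, which is one way to build a table of block offsets without decompressing anything.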


@Alex, you're seeking on a gzipped file and getting good performance?


I know the offsets ahead of time, so I get O(1) (constant-time) access for my use case by simply jumping to where I know to go in the file stream. Are you seeking through raw bytes until you find a header?
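
The jump itself is just standard C I/O. A minimal sketch, assuming the byte offsets are already stored somewhere (the helper name read_at is made up for illustration, and the decompression step that would follow is left out):

```c
#include <stdio.h>

/* Hypothetical helper: jump straight to a known byte offset and read
 * n bytes into out. Returns 0 on success, -1 on a seek error or short read.
 * In a real reader the bytes read here would be handed to the decompressor. */
static int read_at(FILE *fp, long offset, void *out, size_t n)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    return fread(out, 1, n, fp) == n ? 0 : -1;
}
```

The cost of the seek does not depend on how far into the file the offset is, which is where the constant-time behaviour comes from. For files larger than 2 GB on platforms with a 32-bit long, fseeko() with off_t would be needed instead of fseek().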