Question: Implement Bgzip Into C Program
FGV wrote (7.1 years ago):

Dear all,

I'm currently writing a program that needs random access to large files. After looking around a bit I found BGZIP, which is widely used in SAMtools.

I tried to implement it into my program but I'm getting an error: "Error: invalid block header"

Also, if I set start_pos = 0 it works.

I've tried to decompress the file with bgzip (compiled from samtools 0.1.18) and it works fine! Here is the code I'm using:

    BGZF* in_glf_fh;
    unsigned int total_bytes_read = 0;

    // Define chunk start and end positions
    unsigned int start_pos = 2203 * 10000;
    unsigned int end_pos = start_pos + 10000 - 1;
    unsigned int chunk_size = end_pos - start_pos + 1;

    // Open input file
    in_glf_fh = bgzf_open(pars->in_glf, "rb");
    if( in_glf_fh == NULL )
        error("ERROR: cannot open GLF file!");

    // Search start position
    if( bgzf_seek(in_glf_fh, start_pos * pars->n_ind * 3 * sizeof(double), SEEK_SET) < 0 )
        error("ERROR: cannot seek GLF file!");

    // Read data from file
    for(unsigned int c = 0; c < chunk_size; c++) {
        int bytes_read = bgzf_read(in_glf_fh, chunk_data[c], sizeof(double) * pars->n_ind * 3);
        if( (unsigned int) bytes_read != sizeof(double) * pars->n_ind * 3 )
            fprintf(stderr, "Error: %s\n", in_glf_fh->error);
        total_bytes_read += bytes_read;
    }

    bgzf_close(in_glf_fh);

Thanks in advance,

FGV

Tags: samtools, tabix, C
written 7.1 years ago by FGV • modified 5.4 years ago by Biostar

What are start_pos, end_pos, and chunk_size? How do you know that your offset in bgzf_read is not "out of bounds"?

written 7.1 years ago by Pierre Lindenbaum

chunk_size is the amount of data I want to read (10000 in this case).

start_pos and end_pos delimit the interval I want to read from the BGZIP file.

modified 7.1 years ago • written 7.1 years ago by FGV
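
On the bounds question above, here is a minimal sketch of how the uncompressed byte range of a chunk could be computed and checked. It assumes, as in the question's code, that each site holds n_ind * 3 doubles. The helper names (site_bytes, site_offset, chunk_in_bounds) are hypothetical, and data_bytes, the total uncompressed size, is assumed to be known to the caller. The arithmetic is kept in uint64_t because 2203 * 10000 sites multiplied by a couple of hundred bytes per site no longer fits in a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of samtools. */

/* Bytes occupied by one site: n_ind individuals x 3 doubles each. */
uint64_t site_bytes(uint64_t n_ind)
{
    assert(n_ind > 0);
    return n_ind * 3u * (uint64_t)sizeof(double);
}

/* Uncompressed byte offset at which site number `site` starts. */
uint64_t site_offset(uint64_t site, uint64_t n_ind)
{
    return site * site_bytes(n_ind);
}

/* Returns 1 when sites start_pos..end_pos (inclusive) lie entirely inside
 * data_bytes bytes of uncompressed data, and 0 otherwise.
 * Overflow of the 64-bit products themselves is not checked here. */
int chunk_in_bounds(uint64_t start_pos, uint64_t end_pos,
                    uint64_t n_ind, uint64_t data_bytes)
{
    if (end_pos < start_pos)
        return 0;
    return site_offset(end_pos + 1, n_ind) <= data_bytes;
}
```

With the values from the question, site_offset(2203 * 10000, pars->n_ind) would give the uncompressed position of the first site in the chunk. That is still an ordinary offset, not something BGZF can seek to directly.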

cross-posted on the samtools-dev mailing list: http://sourceforge.net/mailarchive/message.php?msg_id=29974208

written 7.1 years ago by Pierre Lindenbaum
FGV wrote (7.1 years ago):

Thank you all for your replies. Heng Li pointed me in the right direction: use the RAZF library (also in SAMtools).

It is indeed much faster (~20x) than standard gzip and also allows seeking with ordinary file offsets. BGZIP apparently only accepts virtual offsets, which have to be computed and stored beforehand using bgzf_tell().

written 7.1 years ago by FGV
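
To illustrate the virtual offsets mentioned above: in BGZF, a virtual offset is a 64-bit value whose upper 48 bits hold the file offset at which a compressed block starts and whose lower 16 bits hold the position within that block's uncompressed data. The sketch below shows only that layout; the helper names are hypothetical and are not samtools functions. It also suggests why the question's code failed: a plain uncompressed byte offset handed to bgzf_seek() is split the same way, so the reader lands on a file position that is usually not the start of a block, whereas offset 0 is valid under both interpretations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of samtools. */

/* Pack a block's start position in the compressed file and an offset
 * inside that block's uncompressed data into one virtual offset. */
uint64_t make_voffset(uint64_t block_start, uint32_t within_block)
{
    assert(block_start < (UINT64_C(1) << 48)); /* upper part: 48 bits */
    assert(within_block < (1u << 16));         /* lower part: 16 bits */
    return (block_start << 16) | within_block;
}

/* File offset of the compressed block that a virtual offset points into. */
uint64_t voffset_block_start(uint64_t voffset)
{
    return voffset >> 16;
}

/* Position inside the uncompressed data of that block. */
uint32_t voffset_within_block(uint64_t voffset)
{
    return (uint32_t)(voffset & 0xFFFFu);
}
```

In practice the values are not built by hand: they are recorded with bgzf_tell() while reading or writing the file and later passed back to bgzf_seek().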
Alex Reynolds wrote (7.1 years ago):

If you want random access (and you have the byte offsets for each of the compressed streams contained within an archive), you can just use bzip2 or gzip libraries with standard C I/O calls (fopen, fseek, etc.), instead of a third-party library built atop a third-party library. This is what I did for starch / unstarch in the BEDOPS suite. Code is available there if you want to take a look. Though I think bgzip is based on gzip, so the sample code on the zlib site should also help.

modified 5.4 years ago • written 7.1 years ago by Alex Reynolds

I need to perform lots of I/O operations on large files, so I need it to be fast. I have actually already implemented GZIP, but it's just too slow: each read from the file takes around 10 seconds.

written 7.1 years ago by FGV

I'm not suggesting using gzip/gunzip command line programs, but rather using the sample code for extracting data as a starting point for your project. You would seek the desired byte offset within the file and then use zlib's inflate() function (assuming that the byte offset points to the start of one of the archive's multiple zip streams). I took a look at some bgzip code and I think it is mostly equivalent to this procedure, using zlib for backend compression.

modified 7.1 years ago • written 7.1 years ago by Alex Reynolds

Yes, bgzip concatenates gzip blocks as you said. By using bgzip, you just save the effort of rolling your own implementation, which is not hard but takes a bit of time. Bgzip is implemented in two files (without remote reading). To use it, you put the two files in your source tree. You won't need end users to pre-install the "bgzip library"; actually, there is no way to do that.

BTW, I guess FGV was saying gzseek() from zlib is too slow. He was not using the gzip/gunzip command line programs.

written 5.4 years ago by lh3
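
Since each BGZF block is a small gzip member of its own, the start of a block can be recognised from its first 18 bytes. The sketch below checks those bytes and returns the block's total size. It assumes the "BC" extra subfield comes first in the gzip extra field, as in files written by bgzip, and block_size_from_header is a hypothetical name. A position that does not begin with this pattern is presumably what produces an "invalid block header" error like the one in the question. The empty end-of-file block defined by the format is included as test data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 28-byte empty block that marks the end of a BGZF file. */
const uint8_t BGZF_EOF_MARKER[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper, not part of samtools.
 * Layout of the first 18 bytes of a block:
 *   0-1   gzip magic (0x1f 0x8b)      2   compression method (8 = deflate)
 *   3     flags (bit 2 = extra field) 4-9 mtime, extra flags, OS
 *   10-11 length of the extra field   12-13 subfield id 'B','C'
 *   14-15 subfield length (2)         16-17 BSIZE = block size - 1
 * Returns the block size in bytes, or 0 if buf is not a block header. */
uint32_t block_size_from_header(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 18)
        return 0;
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
        return 0;
    if ((buf[3] & 0x04) == 0)               /* no extra field present */
        return 0;
    if (buf[12] != 'B' || buf[13] != 'C')   /* not the BGZF subfield */
        return 0;
    if (buf[14] != 2 || buf[15] != 0)       /* subfield must be 2 bytes */
        return 0;
    return ((uint32_t)buf[16] | ((uint32_t)buf[17] << 8)) + 1u;
}
```

Walking a file block by block then amounts to reading 18 bytes, taking the returned size, and skipping ahead by that many bytes from the block start.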

@Alex, you're seeking on a gzipped file and getting good performance?

written 7.1 years ago by brentp

I know the offsets ahead of time, so I get O(1) or constant performance time (for my use case) by simply jumping to where I know to go in the file stream. Are you seeking through raw bytes until you find a header?

modified 5.4 years ago • written 7.1 years ago by Alex Reynolds
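
A minimal sketch of the kind of constant-time lookup described above, under one extra assumption that is not stated in the thread: every compressed stream holds the same number of uncompressed bytes, so the stream number is a simple division. All names (chunk_index, seek_target, locate_chunk) are hypothetical and are not taken from BEDOPS, zlib, or samtools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index: where each compressed stream starts in the file. */
typedef struct {
    uint64_t chunk_bytes;        /* uncompressed bytes per stream (fixed) */
    size_t n_chunks;             /* number of streams in the archive */
    const uint64_t *file_offset; /* file_offset[i] = start of stream i */
} chunk_index;

/* Where to fseek to, and how much decompressed data to discard after. */
typedef struct {
    uint64_t file_offset;
    uint64_t skip;
} seek_target;

/* Map an uncompressed position to a stream start plus a skip count.
 * Returns 0 on success, -1 if the position is past the last stream. */
int locate_chunk(const chunk_index *idx, uint64_t upos, seek_target *out)
{
    assert(idx != NULL && out != NULL && idx->chunk_bytes > 0);
    uint64_t i = upos / idx->chunk_bytes;
    if (i >= idx->n_chunks)
        return -1;
    out->file_offset = idx->file_offset[i];
    out->skip = upos % idx->chunk_bytes;
    return 0;
}
```

The caller would then fseek() to file_offset, start inflating from there, and throw away the first skip bytes of output. If the streams are not of equal uncompressed size, the division would have to be replaced by a search over stored uncompressed start positions.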