Implement Bgzip Into C Program
8.8 years ago
FGV ▴ 130

Dear all,

I'm currently writing a program that needs random access to large files. After looking around a bit, I found BGZIP, which is widely used in SAMTOOLS.

I tried to use it in my program, but I'm getting an error: "Error: invalid block header"

However, if I set start_pos = 0 it works.

I've tried decompressing the file with bgzip (compiled from samtools 1.18) and it works fine! Here is the code I'm using:

    BGZF* in_glf_fh;
    unsigned int total_bytes_read = 0;

    // Define chunk start and end positions
    unsigned int start_pos = 2203 * 10000;
    unsigned int end_pos = start_pos + 10000 - 1;
    unsigned int chunk_size = end_pos - start_pos + 1;

    // Open input file
    in_glf_fh = bgzf_open(pars->in_glf, "rb");
    if( in_glf_fh == NULL )
        error("ERROR: cannot open GLF file!");

    // Search start position
    if( bgzf_seek(in_glf_fh, start_pos * pars->n_ind * 3 * sizeof(double), SEEK_SET) < 0 )
        error("ERROR: cannot seek GLF file!");

    // Read data from file
    for(unsigned int c = 0; c < chunk_size; c++) {
        int bytes_read = bgzf_read(in_glf_fh, chunk_data[c], sizeof(double) * pars->n_ind * 3);
        if( (unsigned int) bytes_read != sizeof(double) * pars->n_ind * 3 )
            fprintf(stderr, "Error: %s\n", in_glf_fh->error);
        total_bytes_read += bytes_read;
    }

    bgzf_close(in_glf_fh);

Thanks in advance,

FGV

samtools tabix c • 3.0k views

What are start_pos, end_pos, and chunk_size? How do you know that your offset in bgzf_read is not "out of bounds"?


chunk_size is the amount of data I want to read (10000 in this case).

start_pos and end_pos define the interval I want to read from the BGZIP file.


Cross-posted on the samtools-dev mailing list: http://sourceforge.net/mailarchive/message.php?msg_id=29974208

8.8 years ago
FGV ▴ 130

Thank you all for your replies. Heng Li pointed me in the right direction: use the RAZF library (also in SAMTOOLS).

It is indeed much faster (~20x) than standard GZIP and also allows seeking by normal file offsets. BGZIP apparently only accepts virtual offsets, which have to be computed and stored beforehand using bgzf_tell().
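
To make the "virtual offset" point concrete, here is a minimal sketch of how such an offset is laid out, based on my reading of the BGZF format: the upper 48 bits are the file position where a compressed block starts, and the lower 16 bits are the position inside that block after decompression. The helper names are hypothetical and are not part of the BGZF API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a BGZF API call): pack a virtual offset.
   block_start  = byte position in the compressed file where a block begins
   within_block = byte position inside that block once it is decompressed */
static uint64_t make_virtual_offset(uint64_t block_start, uint16_t within_block)
{
    assert(block_start < (UINT64_C(1) << 48)); /* only 48 bits are available */
    return (block_start << 16) | within_block;
}

/* Hypothetical helper: compressed-file position of the block. */
static uint64_t voffset_block_start(uint64_t voffset)
{
    return voffset >> 16;
}

/* Hypothetical helper: position inside the decompressed block. */
static uint16_t voffset_within_block(uint64_t voffset)
{
    return (uint16_t)(voffset & 0xFFFF);
}
```

This would also explain the behaviour in the question: 0 is a valid virtual offset (first block, first byte), whereas a plain uncompressed byte count passed to bgzf_seek() is interpreted as block_start = value >> 16, which usually does not land on a block boundary, hence "invalid block header".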

8.8 years ago

If you want random access (and you have the byte offsets for each of the compressed streams contained within an archive), you can just use bzip2 or gzip libraries with standard C I/O calls (fopen, fseek, etc.), instead of a third-party library built atop a third-party library. This is what I did for starch / unstarch in the BEDOPS suite. Code is available there if you want to take a look. Though I think bgzip is based on gzip, so the sample code on the zlib site should also help.


I need to do a lot of I/O on large files, so it has to be fast. I have actually already implemented GZIP, but it's just too slow: each read from the file takes ~10 seconds.


I'm not suggesting using gzip/gunzip command line programs, but rather using the sample code for extracting data as a starting point for your project. You would seek the desired byte offset within the file and then use zlib's inflate() function (assuming that the byte offset points to the start of one of the archive's multiple zip streams). I took a look at some bgzip code and I think it is mostly equivalent to this procedure, using zlib for backend compression.


Yes, bgzip concatenates gzip blocks as you said. By using bgzip, you just save the effort of rolling your own implementation, which is not hard but takes a bit of time. Bgzip is implemented in two files (without remote reading). To use it, you put the two files in your source tree. You won't need end users to pre-install a "bgzip library"; there is actually no way to do that.

BTW, I guess FGV was saying gzseek() from zlib is too slow. He was not using the gzip/gunzip command line programs.
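
As a rough illustration of the "concatenated gzip blocks" layout, the sketch below checks whether a buffer starts with what looks like a BGZF block header and, if so, returns the block's total size. It is not taken from the bgzip sources; the function name is hypothetical, and it assumes the "BC" extra subfield comes first in the gzip extra field, which is my understanding of how bgzip writes its blocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of bgzip): return the total size in bytes
   of the BGZF block that starts at buf, or -1 if the bytes do not look
   like a BGZF block header.
   Assumption: the "BC" subfield is the first extra subfield. */
static long bgzf_block_size(const uint8_t *buf, size_t len)
{
    if (len < 18) return -1;                          /* header too short   */
    if (buf[0] != 0x1f || buf[1] != 0x8b) return -1;  /* gzip magic bytes   */
    if (buf[2] != 8) return -1;                       /* method: deflate    */
    if ((buf[3] & 4) == 0) return -1;                 /* FEXTRA flag is set */
    if (buf[12] != 'B' || buf[13] != 'C') return -1;  /* BGZF subfield id   */
    if (buf[14] != 2 || buf[15] != 0) return -1;      /* subfield length 2  */
    /* BSIZE is stored little-endian as (total block size - 1). */
    return (long)(buf[16] | (buf[17] << 8)) + 1;
}
```

Calling this at a true block boundary gives the distance to the next block; calling it at an arbitrary byte position will almost always return -1, which is the same kind of check that fails with "invalid block header" when a seek lands in the middle of a block.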


@Alex, you're seeking on a gzipped file and getting good performance?


I know the offsets ahead of time, so I get O(1) or constant performance time (for my use case) by simply jumping to where I know to go in the file stream. Are you seeking through raw bytes until you find a header?
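
A minimal sketch of that "jump straight to a known offset" idea, using only standard C I/O. The function name and the offset-table layout are assumptions made for illustration; in a real archive each record would be one compressed stream, and the bytes read here would then be handed to zlib's inflate().

```c
#include <stdio.h>

/* Hypothetical helper: read record number idx from fp.
   offsets[i] is the byte position where record i starts, and
   offsets[n_records] is the position just past the last record.
   Returns the number of bytes read, or -1 on error. */
static long read_record(FILE *fp, const long *offsets, size_t n_records,
                        size_t idx, char *out, size_t out_cap)
{
    if (idx >= n_records) return -1;

    long len = offsets[idx + 1] - offsets[idx];
    if (len < 0 || (size_t)len > out_cap) return -1;

    /* One seek, no scanning: the cost does not depend on idx. */
    if (fseek(fp, offsets[idx], SEEK_SET) != 0) return -1;

    if (fread(out, 1, (size_t)len, fp) != (size_t)len) return -1;
    return len;
}
```

The lookup is a single fseek() followed by a single fread(), so it stays constant-time as long as the offset table itself is kept in memory.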
