Random Access Of Lines In A Compressed File Having A Custom Tabulated Format?
3
3
Entering edit mode
10.1 years ago
tflutre ▴ 540

Hello, I am working with large files (eg. 10^7 lines) having a custom tabulated format (eg. geneid TAB SNPid TAB SNPcoord TAB measurex TAB measure_y TAB ...). And I have two wishes / requirements:

• compress the files to save disk space;
• quickly access some lines that are not in the order in which they appear in the file.

I am coding in C++. Up to now, I was using Gzstream to easily read gzipped files (I am too much of a beginner to try to use Zlib directly). But it doesn't allow random access. Thus, I am still using uncompressed files. Typically, I first go through the whole file and record the stream position of the lines I am interested in (using tellg). And then, I access these lines using the stream positions previously recorded (using getline). However, I would like to be able to do the same on compressed files.

From what I read (eg. mentioned here), BGZF allows to do exactly that. I could thus theoretically use this in my code. Has anyone tried to do it? As I am more a geneticist than a programmer, is there any code snippet somewhere I could try to reuse?

Otherwise, I can also design my own minimal binary format. Although it would be quick to implement, it is ad-hoc... Should I rather look into the HDF5 format (eg. here)? Using h5dump, it seems possible to access only to a subset of the data. But has anyone tried to use the functions directly from his C/C++ code?

tabix c • 3.5k views
1
Entering edit mode

When accessing the file, if you can order your seeks so that they are sequential by file order they will be much faster. If that's not possible, putting the data on an SSD which is very quick for seeks can speed things up.

6
Entering edit mode
10.1 years ago

The code for samtools contains a very simple API (open, seek, close...) for bgzf: http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/ (see bgzf.h and bgzf.c)

In C++, you can also simply wrap this code in a class to make it more OO oriented. For example (not tested)

compile:

 g++ -c xbgzf.cpp -I /path/to/samtools-0.1.18


#ifndef XBGZF_H_
#define XBGZF_H_

#include <stdint.h>
class BgzFile
{
private:
void* _ptr;
public:
BgzFile(int fd,const char* m);
BgzFile(const char* filename,const char* m);
virtual ~BgzFile();
virtual void close();
virtual int write(void* data,int length);
virtual int64_t tell();
virtual int64_t seek(int64_t pos, int where);
virtual void set_cache_size(int cache_size);
virtual int getc();
};


source:

/*
* xbgzf.cpp
*
*  Created on: Oct 28, 2011
*      Author: lindenb
*/
#include <sstream>
#include <cstring>
#include <cerrno>
#include <stdexcept>
#include <bgzf.h>
#include "xbgzf.h"
using namespace std;

#define CASTPTR ((BGZF *)(this->_ptr))

BgzFile::BgzFile(const char* filename,const char* mode):_ptr(NULL)
{
_ptr=::bgzf_open(filename,mode);
if(_ptr==NULL)
{
ostringstream os;
os << "Cannot open "<< filename << " " << strerror(errno);
throw runtime_error(os.str());
}
}
BgzFile::BgzFile(int fd,const char* mode):_ptr(NULL)
{
_ptr=::bgzf_fdopen(fd,mode);
if(_ptr==NULL)
{
ostringstream os;
os << "Cannot open fd " << strerror(errno);
throw runtime_error(os.str());
}
}
BgzFile::~BgzFile()
{
close();
}
void BgzFile::close()
{
if(_ptr!=NULL) ::bgzf_close(CASTPTR);
_ptr=NULL;
}
{
}

int BgzFile::write(void* data,int length)
{
return _ptr==NULL? -1:bgzf_write(CASTPTR,data,length);
}
int64_t BgzFile::tell()
{
return _ptr==NULL? -1:bgzf_tell(CASTPTR);
}
int64_t BgzFile::seek(int64_t pos, int where)
{
return _ptr==NULL? -1:bgzf_seek(CASTPTR,pos,where);
}
void BgzFile::set_cache_size(int cache_size)
{
if( _ptr!=NULL) ::bgzf_set_cache_size(CASTPTR,cache_size);
}
int BgzFile::getc()
{
return _ptr==NULL? -1:bgzf_getc(CASTPTR);
}


About HDF : as far as I've understand HDF, it is a standard way to store some binary data into a structured file. But it has no capabilities to search for a given value. You'll have to implement the logic for searching by yourself. One more word: HDF is a headache for programming (many resources have to be copied/released/etc...) even for doing simple things.

0
Entering edit mode

@Pierre Lindenbaum Thanks a lot! So first I need to go through the whole file using read, and recording the beginning of lines I'm interested in using tell. And then, in a later stage, I directly access those lines using seek, right? (Knowing that I need to use length properly to be sure I read the entire line, of course.)

0
Entering edit mode

yes, and you can also stores the indexes of the lines in a second binary file where all the records(=offsets in the 1st file) have the same length(sizeof(int64_t) ) (so you just have to seek(index*sizeof(int64_t)); )

2
Entering edit mode
10.0 years ago
tflutre ▴ 540

Thanks to everyone that proposed a solution! Finally, I decided to use the bgzip utility to compress my large input files, and then I implemented a bgzf_getline function that uses bgzf_getc().

1) to obtain and compile bgzip:

# download the source code from SAMTOOLS
wget -O bgzip.c http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/bgzip.c?revision=998
wget -O bgzf.h http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/bgzf.h?revision=998
wget -O bgzf.c http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/bgzf.c?revision=998
wget -O khash.h http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/khash.h?revision=998

# compile bgzip (require ZLIB)
gcc -Wall bgzf.c bgzip.c -o bgzip -lz
./bgzip


2) to test it:

# test it: zcat/gunzip can decompress a file compressed with bgzip (converse is wrong)
ls ~ > test.txt
./bgzip test.txt
zcat test.txt.gz
gunzip test.txt.gz
./bgzip test.txt
./bgzip -d test.txt.gz


3) here is my bgzf_getline function (I'm sure it's not the best way to code it, notably because it mixes C and C++ functions code, but it works):

void bgzf_getline (BGZF * stream, string & line)
{
line.erase();
int c = -1, i = 0;
char * p = (char *) malloc (1);
if (p == NULL)
{
fprintf (stderr, "ERROR: can't allocate memory\n");
exit (1);
}
while (true)
{
++i;
c = bgzf_getc (stream);
if (c == -2)
{
fprintf (stderr, "ERROR: can't read %i-th character\n", i);
exit (1);
}
if (c == -1)  // reach end of file
break;
if (c == 10)  // reach end of line
break;
//    printf ("byte #%i: %i %c\n", i, c, c);
sprintf (p, "%c", c);
line.append(p);
}
free (p);
}


4) finally, here is a gist of a dummy program showing how to use all this in practice:

https://gist.github.com/1434187

0
Entering edit mode

A more efficient implementation of a getline() is here: https://github.com/attractivechaos/klib/blob/master/bgzf.c at the end of the file. But I guess most would not care about this difference. This version of bgzf.c also kills the dependency to khash.h.

0
Entering edit mode

Thanks, I had a look at your bgzf_getline(), but what does the argument delim correspond to? And why the third argument is of type kstring_t and not char *? By any chance, would you have an example of using this function somewhere?

1
Entering edit mode
10.1 years ago
brentp 23k

You might use tabix directly (which uses bgzf), and save yourself some programming. You can index your file and then do queries on it quite easily from the command-line.

If you do use bgzf.h, I think you'll see it's quite simple, if you can open a file in C, you can use bgzf:

int bgzf_read(BGZF* fp, void* data, int length);
int bgzf_write(BGZF* fp, const void* data, int length);
#define bgzf_tell(fp) ((fp->block_address << 16) | (fp->block_offset & 0xFFFF))
int64_t bgzf_seek(BGZF* fp, int64_t pos, int where);


etc.

0
Entering edit mode

right, but note that tabix will only work if you have the columns chrom/start/end

0
Entering edit mode

It supports VCF, where supposedly, it just uses start == end. And either the data has a chromosome, or one could be faked. But, I'm lazy like that.

0
Entering edit mode

@brentp Thanks! Ok, indeed using the C code directly may be enough. I will start a small example to test it then. The only issue with read is that I need to specify length and it can be shorter or longer than an actual uncompressed line, right? So I could also use bgzf_getc and stop when I encounter 'n', no?

0
Entering edit mode

You can use bgzf_getc(). A little faster way is to use ti_readline() in index.c. But the latter is also a little more complicated.