Hello, I am working with large files (e.g. 10^7 lines) in a custom tab-delimited format (e.g. geneid TAB SNPid TAB SNPcoord TAB measure_x TAB measure_y TAB ...). I have two requirements:
- compress the files to save disk space;
- quickly access selected lines in an arbitrary order (random access), rather than in the order in which they appear in the file.
I am coding in C++. Up to now, I have been using Gzstream to read gzipped files easily (I am too much of a beginner to use zlib directly), but it doesn't allow random access, so I am still using uncompressed files. Typically, I first go through the whole file and record the stream position of each line I am interested in (using tellg). Then I jump back to these lines using the recorded stream positions (using seekg, then getline). I would like to be able to do the same on compressed files.
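To make the current approach concrete, here is a minimal sketch of the two passes on an uncompressed file. The function names `index_lines` and `read_line_at` are only illustrative, and the sketch indexes every line rather than only the lines of interest. It assumes the stream was opened in binary mode, so that the recorded positions are plain byte offsets.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Pass 1: scan the file once and record where each line starts.
// (Illustrative helper; a real program would keep only the lines it needs.)
std::vector<std::streampos> index_lines(std::ifstream& in) {
    std::vector<std::streampos> offsets;
    std::string line;
    in.clear();
    in.seekg(0, std::ios::beg);
    while (true) {
        std::streampos pos = in.tellg();   // position before reading the line
        if (!std::getline(in, line)) break;
        offsets.push_back(pos);
    }
    in.clear();  // reset eof/fail flags so that later seeks succeed
    return offsets;
}

// Pass 2: jump to a recorded position and read that single line.
std::string read_line_at(std::ifstream& in, std::streampos pos) {
    in.clear();
    in.seekg(pos);
    std::string line;
    if (!std::getline(in, line)) {
        throw std::runtime_error("could not read a line at the recorded position");
    }
    return line;
}
```

With `std::ifstream in("data.tsv", std::ios::binary)` (hypothetical file name), `auto offsets = index_lines(in);` followed by `read_line_at(in, offsets[k])` returns line k regardless of the order in which lines are requested. This is the behaviour I would like to keep once the file is compressed.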
From what I have read (e.g. mentioned here), BGZF allows exactly that, so in theory I could use it in my code. Has anyone tried this? As I am more of a geneticist than a programmer, is there a code snippet somewhere that I could reuse?
Otherwise, I could design my own minimal binary format. Although it would be quick to implement, it would be ad hoc. Should I look into the HDF5 format instead (e.g. here)? With h5dump, it seems possible to access only a subset of the data, but has anyone called the HDF5 library functions directly from their C/C++ code?
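To make the ad-hoc option concrete, a minimal binary format could use fixed-size records, so that record i starts at byte i * sizeof(record) and can be reached with a single seek. The sketch below is only an assumption about what such a format might look like: the `SnpRecord` layout, its field names, and the helpers `write_records` and `read_record` are all hypothetical. It writes the struct bytes as-is, so the file is only readable on machines with the same endianness and struct layout, and it provides no compression.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-size record: names are stored as integer indices
// into separate lookup tables (not shown) so every record has the same size.
struct SnpRecord {
    std::uint32_t gene_idx;
    std::uint32_t snp_idx;
    std::uint64_t snp_coord;
    double measure_x;
    double measure_y;
};
static_assert(std::is_trivially_copyable<SnpRecord>::value,
              "raw byte I/O requires a trivially copyable record");

// Write all records back to back, with no header.
void write_records(const std::string& path, const std::vector<SnpRecord>& records) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open output file");
    out.write(reinterpret_cast<const char*>(records.data()),
              static_cast<std::streamsize>(records.size() * sizeof(SnpRecord)));
    if (!out) throw std::runtime_error("write failed");
}

// Random access: record i starts at byte offset i * sizeof(SnpRecord).
SnpRecord read_record(std::ifstream& in, std::size_t i) {
    SnpRecord rec;
    in.clear();
    in.seekg(static_cast<std::streamoff>(i * sizeof(SnpRecord)), std::ios::beg);
    if (!in.read(reinterpret_cast<char*>(&rec), sizeof(rec))) {
        throw std::runtime_error("could not read the requested record");
    }
    return rec;
}
```

With this layout no index is needed at all, since the offset of any record follows from its number, but the saving in disk space would come only from the binary encoding, not from compression.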