As one of the people on the BioHDF project, I should probably chime in here.
For those who are unfamiliar, the simplest explanation of HDF5 is that it lets you store multidimensional arrays ('datasets') of data in a regular file. The arrays can be of one of our pre-defined types (which, unsurprisingly, closely match C types) or of a user-created type, which can itself be stored in the file. These arrays can be organized using a filesystem-like structure of what we call 'groups'. You can also annotate these groups, stored types, and datasets with what we call 'attributes', which are just small data elements (ints, strings, etc.). Everything is stored in a binary format which we publish. Nothing is stopping you from writing your own I/O library based on said format, but most people use our C library, either alone or wrapped in their favorite language, for access. I've occasionally heard HDF5 described as a 'binary file format construction kit', since the semantics of the groups, datasets, etc. are up to the user.
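To make the groups/datasets/attributes idea concrete, here is a minimal sketch using the h5py Python bindings. The group names, dataset name, and attribute names are all made up for illustration; they are not part of BioHDF or any real schema.

```python
import os
import tempfile

import h5py
import numpy as np


def write_and_read_demo(path):
    """Write a tiny HDF5 file, then reopen it and return what was stored."""
    with h5py.File(path, "w") as f:
        # Groups nest like directories (names here are hypothetical).
        sample = f.create_group("samples").create_group("sample_01")
        # A dataset is a typed, multidimensional array.
        scores = np.arange(12, dtype=np.uint8).reshape(3, 4)
        dset = sample.create_dataset("quality", data=scores)
        # Attributes are small annotations on groups and datasets.
        sample.attrs["platform"] = "hypothetical-sequencer"
        dset.attrs["encoding_offset"] = 33

    with h5py.File(path, "r") as f:
        dset = f["samples/sample_01/quality"]  # path-style lookup
        return {
            "shape": dset.shape,
            "dtype": str(dset.dtype),
            "second_row": dset[1, :].tolist(),  # reads only the requested slice
            "platform": f["samples/sample_01"].attrs["platform"],
            "encoding_offset": int(dset.attrs["encoding_offset"]),
        }


with tempfile.TemporaryDirectory() as tmp:
    result = write_and_read_demo(os.path.join(tmp, "demo.h5"))
    print(result)
```

What the group hierarchy and the attributes mean is entirely up to whoever designs the layout, which is the 'construction kit' point.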
The main reason that we stopped development on BioHDF was that the project ended and the ecosystem had already settled on BAM. My company (The HDF Group) is not a research lab and we are not biologists, so it would be difficult for us to push the project forward. The BioHDF code that exists is not completely finished and was basically a tech demo, so performance is not stellar. That said, HDF5 could conceivably be a useful storage medium for NGS data, with some caveats.
There would definitely be some benefits to HDF5 as an NGS data container:
- HDF5 supports MPI-IO, so it would be easier to write parallel programs.
- HDF5 has a built-in cache, which can make I/O more performant (depends on I/O pattern).
- HDF5 is in wide use so existing data analysis tools (Matlab, etc.) could read the files.
- HDF5 almost certainly scales better than any flat format.
- HDF5 is supported by a company, so there's a help desk you can call/email, etc.
- HDF5 supports flexible and heterogeneous data storage. For example, you could have a core set of datasets and groups that make up a 'schema' (HDF5 doesn't support formal schemas at this time) and individual vendors and labs could add their own data objects without disturbing queries based on the core. You could do this without coordinating with a central authority - any extra data would just be ignored by a reader that only understood the core schema. This also means that you could do nice things like creating your own indexes and storing them in the file.
- HDF5 is probably going to be around for a long time. We spend a LOT of time making sure our file format and tools are both backward- and (to a certain extent) forward-compatible. NASA is a huge supporter of us and they keep data forever.
- The compression scheme is flexible. We support compression plugins via an API and you can use different compression on different datasets.
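A couple of the points above (a core layout plus ignorable extras, and per-dataset compression) can be sketched with h5py. This is only an illustration under assumed names: the "core" and "vendor_x" groups and their datasets are hypothetical, not an existing NGS schema. The gzip filter ships with HDF5 and the lzf filter ships with h5py.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical "core schema": the only paths a core-aware reader looks at.
CORE_DATASETS = ("core/positions", "core/flags")


def write_file(path):
    with h5py.File(path, "w") as f:
        core = f.create_group("core")
        # Different datasets can use different compression filters.
        core.create_dataset("positions", data=np.arange(1000, dtype=np.int64),
                            chunks=(250,), compression="gzip", compression_opts=4)
        core.create_dataset("flags", data=np.zeros(1000, dtype=np.uint16),
                            chunks=(250,), compression="lzf")
        # A vendor adds its own objects (here a toy index) next to the core.
        vendor = f.create_group("vendor_x")
        vendor.create_dataset("position_index",
                              data=np.arange(0, 1000, 100, dtype=np.int64))


def read_core(path):
    """Read only the core datasets; anything else in the file is never touched."""
    with h5py.File(path, "r") as f:
        top_level = sorted(f.keys())
        core = {name: {"data": f[name][...], "compression": f[name].compression}
                for name in CORE_DATASETS}
    return top_level, core


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.h5")
    write_file(path)
    top_level, core = read_core(path)
    print(top_level)
    for name, info in core.items():
        print(name, info["compression"], info["data"].shape)
```

The reader works the same whether or not the vendor group is present, because it only asks for the paths it knows about.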
Some downsides of HDF5 as an NGS storage format (aside from the ecosystem thing):
- Variable-length string storage is not compressed. Regular arrays of fixed-length strings are compressed, but VL strings are stored as pointers into a different file structure that is not compressed in the current version of the library. For NGS reads that all have the same length, this is not a problem. Unfortunately, for other types of string data that are not so regular you either have to over-specify the string length or store the strings in a concatenated 1D dataset if you want compression (this is why the default format in BioHDF is so slow, btw). This is a fixable problem, however. We just haven't had the resources (time, $$$) to fix this.
- The C API is really close to the metal and has a steep learning curve. We do have a high-level API that's easier to use, as well as Java and Python bindings (h5py), though, and I would expect that an HDF5-backed NGS storage scheme would have its own, easier-to-use, API.
- Subsets of BAM files can be obtained using FTP's ability to download n bytes of data starting at a particular file offset, which is very helpful for NGS data browsers. You can't do this with HDF5 due to the file format's complexity. We do have an HDF5 server in the works, but that project will need some more development time before the server is robust enough for public Internet use.
- I get the impression that a lot of pipelines stream BAM files to SAM and then parse the output. If you are simply moving a linear sequence of bytes from point A to point B many of the features of the HDF5 file format and library are just overhead.
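The 'concatenated 1D dataset' workaround for variable-length strings mentioned above can be sketched roughly as follows with h5py. This is not BioHDF's actual layout. The "names" group, the "bytes" and "offsets" dataset names, and the example read names are assumptions for illustration, and the sketch assumes ASCII strings and at least one non-empty string.

```python
import os
import tempfile

import h5py
import numpy as np


def write_strings(path, strings):
    """Store strings as one flat, compressed byte array plus an offsets array."""
    encoded = [s.encode("ascii") for s in strings]
    # String i occupies flat[offsets[i]:offsets[i + 1]].
    offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(b) for b in encoded])
    flat = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    with h5py.File(path, "w") as f:
        grp = f.create_group("names")  # hypothetical group name
        # Both arrays are plain numeric datasets, so the gzip filter applies.
        grp.create_dataset("bytes", data=flat, compression="gzip")
        grp.create_dataset("offsets", data=offsets, compression="gzip")


def read_string(path, i):
    """Fetch string i by reading its two offsets and then only its byte range."""
    with h5py.File(path, "r") as f:
        start, stop = (int(x) for x in f["names/offsets"][i:i + 2])
        return f["names/bytes"][start:stop].tobytes().decode("ascii")


names = ["read_1", "read_22/extra", "r3"]  # made-up, irregular-length strings
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "strings.h5")
    write_strings(path, names)
    recovered = [read_string(path, i) for i in range(len(names))]
    print(recovered)
```

The trade-off is an extra lookup per string (offsets first, then bytes), in exchange for storage that goes through the normal chunked, compressed dataset path.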
So that's my two cents on using HDF5 as an NGS format. It's certainly doable and we'd love to work with anyone who has the clout to push such a thing forward (firstname.lastname@example.org). Breaking into the BAM ecosystem would be tough, but it could be done with enough outreach and resources, I think.
One last thing: I'd be remiss if I didn't mention SeqDB here. It's an HDF5-backed replacement for FASTQ files and shows that you can use HDF5 for genomic data with good performance.
Actually, there's one more thing I should mention: maintainers of binary formats occasionally switch to using us as the underlying storage format when they get sick of worrying about things like platform independence and scalability. netCDF is an example of this. They switched to using HDF5 in version 4 of the format. I asked Heng about this once, but he said that (at the time) most pipelines that he knew of simply streamed BAM files to SAM and didn't use the C API, making such a thing less useful.