Has anybody used the HDF5 API to store biological data (genotypes, etc.)? I know about references such as BioHDF, but I'm looking for source code I could browse to understand how to access data faster.
PS: hum, I'm a new user, so I'm not allowed to add the following tags: storage, database, hdf5, source-code.
In the GeneTrack software we have used HDF to store values for each genomic base. Its main advantage over other storage systems was that it was able to return consecutive values with minimal overhead.
For example, it is extremely fast (milliseconds) at retrieving, say, 100,000 consecutive values starting at a given index. We used the Python bindings to HDF. An added advantage of these bindings is that they return the data as numpy arrays (very fast numerical operations).
Here is the relevant code that deals with HDF only: hdf.py
The HDF schema is set up in a different module, but in the end it is simply something like:
class MySchema( IsDescription ):
    """Stores a triplet of float values for each index."""
    ix = IntCol  ( pos=1 )  # index
    wx = FloatCol( pos=2 )  # value on the W (forward) strand
    cx = FloatCol( pos=3 )  # value on the C (reverse) strand
    ax = FloatCol( pos=4 )  # weighted value on the combined W + C strands
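To illustrate the fast consecutive-value retrieval described above, here is a minimal PyTables sketch (not the actual GeneTrack code; the file and table names are invented, and the schema is repeated so the snippet is self-contained). A run of consecutive rows comes back as a single numpy structured array:

```python
import tables
from tables import IsDescription, IntCol, FloatCol

class MySchema(IsDescription):
    """Stores a triplet of float values for each index."""
    ix = IntCol(pos=1)    # index
    wx = FloatCol(pos=2)  # value on the W (forward) strand
    cx = FloatCol(pos=3)  # value on the C (reverse) strand
    ax = FloatCol(pos=4)  # weighted value on the combined W + C strands

with tables.open_file("tracks.h5", mode="w") as fh:
    table = fh.create_table("/", "chr1", MySchema)
    # Append one row per genomic base (toy values).
    row = table.row
    for i in range(100_000):
        row["ix"], row["wx"], row["cx"], row["ax"] = i, 1.0, 2.0, 3.0
        row.append()
    table.flush()
    # 10,000 consecutive values in one call, returned as a numpy array.
    chunk = table.read(start=5_000, stop=15_000)
    print(len(chunk), chunk["wx"].sum())
```

Because `chunk` is a numpy structured array, operations like `chunk["wx"].sum()` run at C speed rather than in a Python loop.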
What I do have is a netCDF-3 based Java application that I could show you.
NetCDF-3 is basically the same idea as HDF, but considerably more limited, as it cannot handle compound datatypes, among other restrictions.
I have been talking with the BioHDF guys and from what they tell me, their work will be centered around a number of command-line APIs, written in C, that will address some areas of usage which for now do not seem to overlap.
I have been talking with them to see if we can achieve an API for saving genotype data. Don't know yet where that will lead me.
If you are looking for something more versatile, you will probably have to delve into the official HDF5 C API ( http://www.hdfgroup.org/HDF5/Tutor/ ), which seems to be the only interface that offers all the functionality and goodies of that impressive storage system.
There is also a Perl binding to HDF5: PDL::IO::HDF5
This requires the Perl Data Language (PDL) package. The way data structures are handled, sub-ranges of data are defined, and data is manipulated is actually very elegant in PDL, so computational code can profit from PDL's vectorized style of writing expressions.
I've been talking a bit with one of the devs behind BioHDF (being at UIUC, up the road from The HDF Group doesn't hurt). I believe a publication is on the way describing it along with some implementation details.
Unfortunately I don't have any example to show you yet.
I don't know how to program in C/C++, so I have been looking at two HDF5 wrappers for Python, PyTables and h5py.
PyTables has a database-like approach in which HDF5 is used as a sort of hierarchical database, where a column can itself be a table, which allows you to store nested data. For example, you could have a table called 'SNPs' with two columns, 'id' and 'genotypes'; the column 'genotypes' contains a nested table with the columns 'individual' and 'genotype'; and so on.
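A hedged sketch of that nested layout in PyTables (all file, table, and field names are invented; note that a nested column holds a fixed group of fields per row rather than a variable-length sub-table):

```python
import tables
from tables import IsDescription, StringCol

class SNP(IsDescription):
    id = StringCol(16)
    class genotypes(IsDescription):  # nested column: a table within a column
        individual = StringCol(16)
        genotype = StringCol(2)

with tables.open_file("snps.h5", mode="w") as fh:
    table = fh.create_table("/", "SNPs", SNP)
    row = table.row
    row["id"] = b"rs123"
    row["genotypes/individual"] = b"NA12878"  # slash path into the nest
    row["genotypes/genotype"] = b"AT"
    row.append()
    table.flush()
    first = table[0]
    print(first["id"], first["genotypes"]["individual"])
```

Fields inside the nested column are addressed with a slash-separated path when writing, and read back as nested numpy records.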
h5py essentially exposes HDF5 datasets through numpy's array interface, so you can store and access arrays/matrices as you would with numpy (similar to arrays and matrices in MATLAB, R, and any other language with this data type), with the data stored in an HDF5 file so that access is fast.
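A minimal h5py sketch of that numpy-style access (the file and dataset names are invented; the genotype matrix is toy data coded 0/1/2). Slicing a dataset reads only the requested region from disk:

```python
import numpy as np
import h5py

# Write a toy genotype matrix: individuals x SNPs, coded 0/1/2.
with h5py.File("genotypes.h5", "w") as fh:
    data = np.random.randint(0, 3, size=(100, 1000), dtype=np.int8)
    fh.create_dataset("genotypes", data=data, chunks=True,
                      compression="gzip")

# Read back a sub-block with plain numpy-style slicing.
with h5py.File("genotypes.h5", "r") as fh:
    dset = fh["genotypes"]
    block = dset[10:20, 500:600]  # only this region is read from disk
    print(block.shape)
```

Chunking and compression are optional, but chunked storage is what makes partial reads of large matrices cheap.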
Our Genomedata system stores multiple tracks of 1-bp resolution genomic data in an HDF5 array. Documentation and full source code are available on that page. It has a Python (PyTables) interface for reading the data. For the initial load into HDF5, we wrote a C loader for added speed.