First, thank you for developing the excellent-looking GATB library. I'm excited to use it to test out some new ideas, but I have a few basic questions about using the de Bruijn graph. First, how do I retrieve a de Bruijn graph node given the k-mer to which it corresponds? I looked through the documentation but couldn't find any such ability. Does the de Bruijn graph in GATB support looking up a node by its k-mer? Second, the release notes for the new version mention that it is now possible to store arbitrary information at graph nodes, but I couldn't find how to do this in the documentation. Are there any pointers to this?
Thank you for your interest in GATB. Concerning your questions:
Our implementation of the de Bruijn graph means that you normally can't get a Node instance of the graph from an arbitrary kmer value. What you can do is (a) iterate over all the nodes of a given graph (from the information stored in an HDF5 file), and (b) get the neighbors of a given node N from a Bloom-filter-based structure in memory. For (b) you have to be sure that node N is in the graph, so it should have been obtained from the iteration over all nodes.
The one exception is when you are sure that some kmer value corresponds to a node of your graph; then you can use the Graph::buildNode method, which returns a Node instance from a kmer given as an ASCII string (see here). It is also possible to get a Node instance from the integer value of a kmer; there is no snippet showing this, but I can provide an example if you are interested.
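To make "the integer value of a kmer" concrete, here is a minimal sketch of packing an ASCII kmer into a 64-bit integer, two bits per nucleotide. It does not use the GATB API: the function names and the A=0, C=1, G=2, T=3 code are assumptions made for illustration, and GATB's internal encoding may differ, so don't mix values produced this way with values coming from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical 2-bit code per nucleotide (A=0, C=1, G=2, T=3).
// This is an assumption for illustration, not GATB's documented encoding.
inline uint64_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:
            assert(false && "unexpected nucleotide");
            return 0;
    }
}

// Pack a kmer of length <= 32 into a 64-bit integer,
// with the first nucleotide ending up in the highest used bits.
inline uint64_t encodeKmer(const std::string& kmer) {
    assert(kmer.size() <= 32);
    uint64_t value = 0;
    for (char c : kmer) {
        value = (value << 2) | nucleotideCode(c);
    }
    return value;
}

// Inverse of encodeKmer: rebuild the ASCII string of length k.
inline std::string decodeKmer(uint64_t value, std::size_t k) {
    static const char letters[] = {'A', 'C', 'G', 'T'};
    std::string kmer(k, 'A');
    for (std::size_t i = 0; i < k; ++i) {
        kmer[k - 1 - i] = letters[value & 3];  // lowest 2 bits = last nucleotide
        value >>= 2;
    }
    return kmer;
}
```

With this encoding, encodeKmer("ACGT") packs the codes 0, 1, 2, 3 into the bit pattern 00 01 10 11, and decodeKmer with k = 4 turns that value back into "ACGT".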
If you really want to know whether a kmer belongs to a graph, you have to do the lookup yourself in the kmers stored in the HDF5 file. Although this sounds tedious and time-consuming, note that kmers in the HDF5 file are stored in partitions (so you can go directly to the right partition for the query kmer) and that the kmers within a partition are sorted, which makes the lookup easy. I think I should add a snippet showing how to do this, because some people could really need it.
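Until there is an official snippet, the idea can be sketched without the GATB or HDF5 APIs: pick the partition for the query kmer, then binary-search the sorted kmers of that partition. Everything below is a stand-in for illustration. PartitionedKmers, buildPartitions and the modulo rule in partitionOf are hypothetical names and choices, the in-memory vectors take the place of data read from the HDF5 file, and the real partitioning rule used by GATB should be taken from the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// In-memory stand-in for the partitions stored in the HDF5 file.
// Invariant: each partition is sorted in ascending order.
struct PartitionedKmers {
    std::vector<std::vector<uint64_t>> partitions;

    // Hypothetical partitioning rule (plain modulo); the real rule
    // must match the one used when the graph was built.
    std::size_t partitionOf(uint64_t kmer) const {
        assert(!partitions.empty());
        return static_cast<std::size_t>(kmer % partitions.size());
    }

    // Membership test: go to the right partition, then binary search it.
    bool contains(uint64_t kmer) const {
        if (partitions.empty()) return false;
        const std::vector<uint64_t>& p = partitions[partitionOf(kmer)];
        return std::binary_search(p.begin(), p.end(), kmer);
    }
};

// Distribute kmers into nbPartitions buckets and sort each bucket.
inline PartitionedKmers buildPartitions(const std::vector<uint64_t>& kmers,
                                        std::size_t nbPartitions) {
    assert(nbPartitions > 0);
    PartitionedKmers result;
    result.partitions.resize(nbPartitions);
    for (uint64_t kmer : kmers) {
        result.partitions[result.partitionOf(kmer)].push_back(kmer);
    }
    for (auto& p : result.partitions) {
        std::sort(p.begin(), p.end());
    }
    return result;
}
```

A query costs one partition selection plus a binary search over that partition only, which is why the lookup stays cheap even for a large graph.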
You can indeed tag information for any node now; we use the very nice EMPHF library for this. The idea is to get a unique integer in the interval [0,|G|-1] (where |G| is the number of nodes of the graph) from the kmer value of a node; you can then allocate an array of size |G| of any type of your choice and use the EMPHF hash function to get the index into that array. Right now the documentation lacks examples; you can have a look at the Graph::queryAbundance method (file Graph.cpp), which shows how to get the coverage of an arbitrary node. Again, I think we should add a snippet about this topic. Note: if you want the EMPHF hash function to be built during graph construction, you have to use "-mphf emphf" (see here).
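Here is a rough, self-contained sketch of that idea: map each node's kmer to a unique index in [0,|G|-1] and use it to address an array of per-node values. It does not call EMPHF or GATB. ToyMphf and NodeTags are hypothetical names, and ToyMphf is a deliberately simple substitute that uses the rank of the kmer in a sorted list as its index; a real minimal perfect hash function gives the same kind of mapping without keeping the keys around. As with a real MPHF, only kmers that are known to be nodes of the graph may be queried.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in for a minimal perfect hash function (NOT the EMPHF API):
// maps each of n distinct keys to a unique index in [0, n-1] via its sorted rank.
class ToyMphf {
public:
    explicit ToyMphf(std::vector<uint64_t> keys) : sorted_(std::move(keys)) {
        std::sort(sorted_.begin(), sorted_.end());
        sorted_.erase(std::unique(sorted_.begin(), sorted_.end()), sorted_.end());
    }

    std::size_t size() const { return sorted_.size(); }

    // Precondition: key belongs to the set the function was built on.
    std::size_t index(uint64_t key) const {
        auto it = std::lower_bound(sorted_.begin(), sorted_.end(), key);
        assert(it != sorted_.end() && *it == key);
        return static_cast<std::size_t>(it - sorted_.begin());
    }

private:
    std::vector<uint64_t> sorted_;
};

// One value of type T per node, stored in an array of size |G|
// and addressed through the hash function.
template <typename T>
class NodeTags {
public:
    explicit NodeTags(const std::vector<uint64_t>& nodeKmers)
        : mphf_(nodeKmers), values_(mphf_.size()) {}

    T& operator[](uint64_t kmer) { return values_[mphf_.index(kmer)]; }
    const T& operator[](uint64_t kmer) const { return values_[mphf_.index(kmer)]; }

    std::size_t size() const { return values_.size(); }

private:
    ToyMphf mphf_;           // declared first so it is built before values_
    std::vector<T> values_;  // value-initialized, e.g. zeros for integers
};
```

For example, a NodeTags<int> built on the kmers of the graph can serve as a per-node counter: tags[kmer] += 1 updates the slot of that node and leaves every other slot untouched.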