Question: How to get the histogram of a graph
gravatar for matthias.zytnicki
20 months ago by
matthias.zytnicki10 wrote:

I would like to retrieve the abundance histogram of a graph that I am building from a set of sequences.

I have seen how I can iterate on the graph and get the distribution by myself.  However, it seems, from the options, that the distribution is computed while building the graph.  The Histogram class comes with all the bells and whistles, and I would like to use it.  Am I right?  Can I access it?

I have also perused your examples, but I cannot find what I am looking for:

  • Example kmer/kmer13.cpp shows how to build the histogram for a Bank.

  • Example debruijn/debruijn26.cpp shows how the get the abundance of a node.

  • Example storage/storage6.cpp shows how to get the distribution from a file generated by DSK.

Any pointer?

gatb • 868 views
ADD COMMENTlink modified 20 months ago • written 20 months ago by matthias.zytnicki10

Many thanks, guys!

I am really sorry that I did not make my question more clear: I wanted to program it using the gatb-core package.

My aim is gatb in my program, which needs this histogram (and could make used of the Histogram class).

ADD REPLYlink written 20 months ago by matthias.zytnicki10

You can retrieve the histogram information as a typed collection in the HDF5 file. This can be done with the gatb-core library this way:

#include <gatb/gatb_core.hpp>
using namespace std;

int main (int argc, char* argv[])
    // We load the HDF5 file, output of DSK
    Storage* dskoutput = StorageFactory(STORAGE_HDF5).load (argv[1]);
    LOCAL (dskoutput);

    // We get the histogram group
    Group&  histoGroup = dskoutput->getGroup("histogram");

    // We get the histogram collection (couples of [index, abundance])
    Collection<IHistogram::Entry>& collection = histoGroup.getCollection<IHistogram::Entry> ("histogram");

    // We iterate the histogram information.
    Iterator<IHistogram::Entry>* iter = collection.iterator();
    LOCAL (iter);
    for (iter->first(); !iter->isDone(); iter->next())
        cout << iter->item().index << " " << iter->item().abundance << endl;

Actually, the Histogram class is not able to load the HDF5 file and provide an API for getting the information ; note that such a feature could be added in a future version.

ADD REPLYlink modified 20 months ago • written 20 months ago by edrezen670

Cool! Exactly what I needed.  Sorry, I did not get that DSK was internally used for creating the HDF5 file.

ADD REPLYlink written 20 months ago by matthias.zytnicki10
gravatar for pierre.peterlongo
20 months ago by
pierre.peterlongo600 wrote:

Hi Matthias,

Here is a quick and dirty answer.

First extract the histogram from the h5 file:

gatb-core/bin/h5dump -y -d histogram/histogram mygraph.h5 | grep "^\ *[0-9]" | paste - - | sed s/\ *//g | sed s/\,// > histo_data

Then plot the distribution:

Rscript plot-kmer-histo.R histo_data

With plot-kmer-histo.R (from Claire Lemaitre)


#usage :  ./plot-kmer-histo.R  dsk.histo
# or      RScript plot-kmer-histo.R  dsk.histo
if (length(args)<1) {
cat("./plot-kmer-histo.R dsk.histo [output.png]\n")
cat("1 obligatory argument :\n   - kmer histo file (as output by DSK)\n")
cat("optional arguments :\n   - output file (default = input_file.png)\n")



if (length(args)>1) {
suppressWarnings (plot(tab$V1,tab$V2,type="h",log="y",xlab="kmer coverage",ylab="count",main=""))


ADD COMMENTlink written 20 months ago by pierre.peterlongo600

I accepted this answer, because it precisely answer the (ill formulated) question.

I also upvoted the comment that helped me solving my problem.

ADD REPLYlink written 20 months ago by matthias.zytnicki10
gravatar for edrezen
20 months ago by
edrezen670 wrote:


If you just want get the histogram without programming, you can use the dbgh5 binary provided by the gatb-core archive.

For example, you can run 'dbgh5 -in myreads.fa -bloom none' and you will get as output a HDF5 file 'myreads.h5' holding information about the kmers and their abundances ('-bloom none' is used to stop just after the kmers counting).

Then, you can use h5dump (also provided by gatb-core) to extract the histogram information from the h5 file with the following command :

h5dump -y -d histogram/histogram myreads.h5 | grep [0-9] | grep -v [A-Z].* | paste - -

=> each line gives a couple [x,y] where x is the kmer occurence and y how many distinct kmers have occurence number x.

If you want to get the kmers histogram while using the gatb-core library in your own C/C++ software, I can give you an explicit code sample.


ADD COMMENTlink written 20 months ago by edrezen670
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 972 users visited in the last hour