Question: How to get the histogram of a graph
gravatar for matthias.zytnicki
5.4 years ago by
matthias.zytnicki10 wrote:

I would like to retrieve the abundance histogram of a graph that I am building from a set of sequences.

I have seen how I can iterate on the graph and get the distribution by myself.  However, it seems, from the options, that the distribution is computed while building the graph.  The Histogram class comes with all the bells and whistles, and I would like to use it.  Am I right?  Can I access it?

I have also perused your examples, but I cannot find what I am looking for:

  • Example kmer/kmer13.cpp shows how to build the histogram for a Bank.

  • Example debruijn/debruijn26.cpp shows how the get the abundance of a node.

  • Example storage/storage6.cpp shows how to get the distribution from a file generated by DSK.

Any pointer?

gatb • 1.9k views
ADD COMMENTlink modified 5 months ago by Biostar ♦♦ 20 • written 5.4 years ago by matthias.zytnicki10

Many thanks, guys!

I am really sorry that I did not make my question more clear: I wanted to program it using the gatb-core package.

My aim is gatb in my program, which needs this histogram (and could make used of the Histogram class).

ADD REPLYlink written 5.4 years ago by matthias.zytnicki10

You can retrieve the histogram information as a typed collection in the HDF5 file. This can be done with the gatb-core library this way:

#include <gatb/gatb_core.hpp>
using namespace std;

int main (int argc, char* argv[])
    // We load the HDF5 file, output of DSK
    Storage* dskoutput = StorageFactory(STORAGE_HDF5).load (argv[1]);
    LOCAL (dskoutput);

    // We get the histogram group
    Group& histoGroup = dskoutput->getGroup("histogram");

    // We get the histogram collection (couples of [index, abundance])
    Collection<IHistogram::Entry>& collection = histoGroup.getCollection<IHistogram::Entry> ("histogram");

    // We iterate the histogram information.
    Iterator<IHistogram::Entry>* iter = collection.iterator();
    LOCAL (iter);
    for (iter->first(); !iter->isDone(); iter->next())
        cout << iter->item().index << " " << iter->item().abundance << endl;

Actually, the Histogram class is not able to load the HDF5 file and provide an API for getting the information ; note that such a feature could be added in a future version.

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.4 years ago by edrezen720

Cool! Exactly what I needed.  Sorry, I did not get that DSK was internally used for creating the HDF5 file.

ADD REPLYlink written 5.4 years ago by matthias.zytnicki10
gravatar for pierre.peterlongo
5.4 years ago by
pierre.peterlongo860 wrote:

Hi Matthias,

Here is a quick and dirty answer.

First extract the histogram from the h5 file:

gatb-core/bin/h5dump -y -d histogram/histogram mygraph.h5 | grep "^\ *[0-9]" | paste - - | sed s/\ *//g | sed s/\,// > histo_data

Then plot the distribution:

Rscript plot-kmer-histo.R histo_data

With plot-kmer-histo.R (from Claire Lemaitre)


#usage :  ./plot-kmer-histo.R  dsk.histo
# or      RScript plot-kmer-histo.R  dsk.histo
if (length(args)<1) {
cat("./plot-kmer-histo.R dsk.histo [output.png]\n")
cat("1 obligatory argument :\n   - kmer histo file (as output by DSK)\n")
cat("optional arguments :\n   - output file (default = input_file.png)\n")



if (length(args)>1) {
suppressWarnings (plot(tab$V1,tab$V2,type="h",log="y",xlab="kmer coverage",ylab="count",main=""))
ADD COMMENTlink modified 5 months ago by _r_am32k • written 5.4 years ago by pierre.peterlongo860

I accepted this answer, because it precisely answer the (ill formulated) question.

I also upvoted the comment that helped me solving my problem.

ADD REPLYlink written 5.4 years ago by matthias.zytnicki10
gravatar for edrezen
5.4 years ago by
edrezen720 wrote:


If you just want get the histogram without programming, you can use the dbgh5 binary provided by the gatb-core archive.

For example, you can run dbgh5 -in myreads.fa -bloom none and you will get as output a HDF5 file myreads.h5 holding information about the kmers and their abundances (-bloom none is used to stop just after the kmers counting).

Then, you can use h5dump (also provided by gatb-core) to extract the histogram information from the h5 file with the following command:

h5dump -y -d histogram/histogram myreads.h5 | grep [0-9] | grep -v [A-Z].* | paste - -

=> each line gives a couple [x,y] where x is the kmer occurence and y how many distinct kmers have occurence number x.

If you want to get the kmers histogram while using the gatb-core library in your own C/C++ software, I can give you an explicit code sample.

ADD COMMENTlink modified 5 months ago by _r_am32k • written 5.4 years ago by edrezen720
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2609 users visited in the last hour