Question: Does GATB support paired reads?
2.4 years ago by
cts (1.4k) wrote:

I've been looking at the snippets and examples in the source code but none of them mention using paired read information. Is it possible with GATB to create graphs that are decorated using paired read information?

gatb • 819 views
modified 2.4 years ago by edrezen (670) • written 2.4 years ago by cts (1.4k)
2.4 years ago by
edrezen (670) wrote:


The graph API of GATB-CORE does not directly support paired reads.

However, we have recently added the possibility to decorate the nodes of the graph with any kind of information, so it would be possible (after graph creation) to map information from the reads (paired reads in your case) to the nodes of the de Bruijn graph.

This new feature relies on EMPHF, a minimal perfect hash function library; such a hash function takes about 2.61 bits per node (on top of the roughly 8.6 bits per node needed to store the de Bruijn graph). You then of course have to add N bits per node, where N is the size of the information you want to decorate the nodes with.
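To make the memory cost concrete, here is a rough back-of-the-envelope sketch. It only uses the two per-node figures quoted above; the function name, the node count, and the 8-bit payload are made up for illustration.

```python
def graph_memory_bytes(num_nodes, payload_bits, graph_bits=8.6, mphf_bits=2.61):
    """Estimate total memory (bytes) for a de Bruijn graph with decorated nodes.

    graph_bits and mphf_bits are the approximate per-node costs quoted in the
    answer above; payload_bits is N, the size of the custom information.
    """
    total_bits = num_nodes * (graph_bits + mphf_bits + payload_bits)
    return total_bits / 8


# Hypothetical example: 3 billion nodes, 8 bits of custom information per node.
estimate = graph_memory_bytes(3_000_000_000, payload_bits=8)
print(f"{estimate / 1e9:.2f} GB")  # prints "7.20 GB"
```

So with a small payload, the graph and the hash function remain a significant part of the total; with a large payload (e.g. read IDs), the payload dominates.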

If you are interested, I could add some snippets showing how to decorate the nodes with information from paired reads (assuming that two consecutive reads make a pair).
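In the meantime, the idea can be sketched in plain Python. This is not the GATB-CORE API: a dictionary stands in for the minimal perfect hash function (it maps each distinct k-mer to a unique slot in an array), reverse complements are ignored, and all function names are hypothetical. The decoration stored per node is the number of read pairs that touch it, assuming reads 2i and 2i+1 form pair i.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(reads, k):
    """Stand-in for an MPHF: give each distinct k-mer a unique index in [0, n)."""
    index = {}
    for read in reads:
        for km in kmers(read, k):
            index.setdefault(km, len(index))
    return index


def decorate_with_pairs(reads, k):
    """Count, for every node (k-mer), how many read pairs contain it."""
    index = build_index(reads, k)
    pair_counts = [0] * len(index)  # the decoration: one slot per node
    for pair_id in range(len(reads) // 2):
        touched = set()
        # Assumption: two consecutive reads make a pair.
        for read in reads[2 * pair_id:2 * pair_id + 2]:
            for km in kmers(read, k):
                touched.add(index[km])
        for node in touched:
            pair_counts[node] += 1
    return index, pair_counts


reads = ["ACGTA", "CGTAC", "ACGTT", "GGGTA"]  # two pairs
index, pair_counts = decorate_with_pairs(reads, k=4)
print(pair_counts[index["ACGT"]])  # ACGT occurs in both pairs
```

A counter needs only a few bits per node. Storing the pair IDs themselves would need a variable-length list per node, which is where the memory cost grows.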


written 2.4 years ago by edrezen (670)

Erwan, let's make the MPHF announcement more "official"; it deserves more exposure. GATB-CORE release 1.0.5 from Sept 9 doesn't seem to include your latest patches, so I didn't want to talk about it yet. Let me know when 1.0.6 is released and I'll make a proper Biostar post.

written 2.4 years ago by Rayan Chikhi (1.1k)

cts asked a very good question. My take on it is that de Bruijn graphs are typically not decorated with reads, even unpaired ones, because naively storing a read-node association could be quite memory-expensive. (Naively, every k-mer from a read would have to be associated with the ID of that read; assuming a billion reads, 100 k-mers per read, and 4 bytes per read ID, that's at least 400 GB of memory just for this association.)
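The estimate is just a product of the three numbers, which is easy to re-run with your own dataset size. A tiny sketch (the function name is made up):

```python
def read_association_bytes(num_reads, kmers_per_read, bytes_per_read_id):
    """Memory needed to store one read ID for every k-mer occurrence."""
    return num_reads * kmers_per_read * bytes_per_read_id


total = read_association_bytes(1_000_000_000, 100, 4)
print(total / 1e9, "GB")  # prints "400.0 GB"
```

Note that 4 bytes is the minimum that can hold a billion distinct read IDs (2^32 is about 4.3 billion), so the figure cannot be reduced much without changing what is stored.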

That's essentially the reason why Velvet, which does store read info in the graph, is not really memory-efficient, and SOAPdenovo's main improvement upon it was to remove most of the read tracking from the in-memory graph.

There has been research to create dBGs that incorporate paired read information more cleverly, but to the best of my knowledge, they haven't scaled to large genomes.

That being said, many assemblers still manage to go back to the reads after having constructed the dBG. This is often done by mapping the reads to a condensed graph, where all simple paths are replaced by single nodes. GATB does not support a condensed graph data structure. (The BCALM/dbgfm suite might be better suited for this task, see this paper)
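To illustrate what "condensed" means here, below is a toy Python sketch that merges every maximal non-branching path of k-mers into one sequence. It is a simplification for illustration only: it works on a plain set of k-mers, ignores reverse complements, and is not how BCALM or any other tool implements compaction. The function name is hypothetical.

```python
def compact(kmer_set):
    """Merge maximal simple (non-branching) paths of k-mers into sequences."""

    def successors(km):
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kmer_set]

    def predecessors(km):
        return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmer_set]

    def next_on_path(km):
        # km merges with its successor only if the edge is unambiguous
        # on both sides (one way out of km, one way into the successor).
        succ = successors(km)
        if len(succ) == 1 and succ[0] != km and len(predecessors(succ[0])) == 1:
            return succ[0]
        return None

    def prev_on_path(km):
        pred = predecessors(km)
        if len(pred) == 1 and pred[0] != km and len(successors(pred[0])) == 1:
            return pred[0]
        return None

    sequences, visited = [], set()
    for km in sorted(kmer_set):
        if km in visited:
            continue
        # Walk left to the start of the path (guarding against cycles).
        start, seen = km, {km}
        while True:
            prev = prev_on_path(start)
            if prev is None or prev in seen:
                break
            start = prev
            seen.add(prev)
        # Walk right from the start, appending one base per merged k-mer.
        seq, cur = start, start
        visited.add(start)
        while True:
            nxt = next_on_path(cur)
            if nxt is None or nxt in visited:
                break
            seq += nxt[-1]
            visited.add(nxt)
            cur = nxt
        sequences.append(seq)
    return sequences


# AAC -> ACG is a simple path; ACG then branches to CGG and CGT.
print(compact({"AAC", "ACG", "CGG", "CGT"}))
```

On this input the four k-mers collapse into three nodes: the path AAC, ACG becomes the single sequence AACG, while the two branches stay separate.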

Anyhow, the GATB philosophy is that you can already achieve many tasks with a simple dBG, one that is not annotated with reads. If this is not enough, one can add a post-processing step that maps the reads back to sequences constructed from the dBG (similar to the idea in the previous paragraph). E.g., the discoSNP tool follows this idea (one module implements a dBG and outputs results, another module checks/annotates the results with the reads).
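As a rough idea of what such a post-processing step could look like, here is a minimal Python sketch that assigns each read to the output sequence sharing the most k-mers with it. This is an assumption-laden toy (exact k-mer matches only, no reverse complements, hypothetical function names), not the method discoSNP actually uses.

```python
from collections import Counter


def index_sequences(sequences, k):
    """Map each k-mer to the set of sequence IDs that contain it."""
    index = {}
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(seq_id)
    return index


def map_read(read, index, k):
    """Return the ID of the sequence sharing the most k-mers with read, or None."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for seq_id in index.get(read[i:i + k], ()):
            votes[seq_id] += 1
    return votes.most_common(1)[0][0] if votes else None


sequences = ["AACGTTGCA", "GGGCCCTTTAAA"]  # e.g. sequences output from the dBG
index = index_sequences(sequences, k=4)
print(map_read("CGTTG", index, k=4))   # shares k-mers only with sequence 0
print(map_read("CCCTTT", index, k=4))  # shares k-mers only with sequence 1
```

With paired reads, one would map both mates this way and then record which sequences are linked by a pair, without ever storing read IDs inside the graph.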

An alternative route is to decorate the GATB graph with custom information using EMPHF. That's a new feature that we'll announce shortly, but Erwan gave you a preview :) It's not a silver bullet, though: as I said above, storing a read-node association could be quite expensive.

modified 2.4 years ago • written 2.4 years ago by Rayan Chikhi (1.1k)

Thank you, Rayan, for your detailed comments.

written 2.4 years ago by cts (1.4k)

Yes, some snippets on decorating nodes with information would be much appreciated, thank you.

written 2.4 years ago by cts (1.4k)