3.3 years ago by

United States

On the other bioinfo Q&A, one of the best questions is about the difference between mathematical and bioinformatics de Bruijn graphs. That question made me think a lot. Here are some of my reflections.

According to wiki, in a k-order "mathematical" de Bruijn graph over alphabet {A,C,G,T}, each vertex is a k-mer; there is an edge between two k-mers if and only if they have a (k-1)-bp overlap. Each edge can be uniquely identified by a (k+1)-mer and vice versa. As such, this graph *always* has `4^k`

vertices and `4^{k+1}`

edges. In the following, I will call this a full de Bruijn graph.

In the context of sequence assembly, we have a set of reads as input. The k-order assembly de Buijn graph derived from the reads is strictly a subgraph of the full graph. It is either *edge-induced* from (k+1)-mers present in the input, or *vertex-induced* from k-mers. Vertex-induced and edge-induced graphs are different, though: vertex-induction requires at least (k-1)-bp overlap between reads while edge-induction requires at least k-bp overlap – edge-induced graph is smaller.

Here is an example. Say we have two 3-bp reads {ACC,CCG} and want to construct a 3-order assembly graph. If we use edge-induction, the graph is empty because there are no 4-mers in reads. If we use vertex-induction, the graph is "ACC=>CCG" because inducing the two vertices automatically induces the edge between them.

Back to the original question. Mathematically, a k-order assembly de Buijn graph is a subgraph of the full k-order de Bruijn graph. As such, in the assembly graph, vertices are k-mers. What vertices and edges are present in this subgraph depends on how the graph is constructed: by edge-induction or by vertex-induction. In implementation, we keep a list of (k+1)-mers as a succinct representation of an edge-induced graph; we keep a list of k-mers for a vertex-induced graph. Both ways work if we just want to derive unitigs. ~~I haven't implemented a de Bruijn graph assembler myself. I don't know which way is more convenient in practice. However, if we want to apply a Euler traversal, probably the edge-induced graph makes more sense. The extra edges in the vertex-induced graph ~~*might* fail the traversal (not so sure!).

**PS:** my reflections are broadly in line with Paul's answer, except that the definition of "order" is different. I am defining "order" as in the original de Bruijn graph; Paul is essentially taking "order" as the length of k-mers used to succinctly store the graph.

EDIT: according to wiki, a k-order full de Bruijn graph is the line graph of a (k-1)-order de Bruijn graph. I think this property still holds for assembly de Bruijn graphs. We note that a list of k-mers is a succinct representation of a vertex-induced k-order graph *AND* of an edge-induced (k-1)-order graph. It actually doesn't matter how we induce graph. The final assembly will be the same either way. This duality is also interesting in that if there is a Euler path in an edge-induced (k-1)-order graph, there will be a Hamilton path in the vertex-induced k-order graph.

•

link
modified 3.3 years ago
•
written
3.3 years ago by
lh3 ♦ **32k**
Another question: the DeBruijn Graph is NP-Hard or polynomial problem??

Thaks

30k• written 4.6 years ago by midox •260I don't understand the question? The dBG is a combinatorial structure, not a problem. Which problem are you asking about?

30k• written 4.6 years ago by Rob ♦4.2ksorry, I mean the assembly using de Bruijn graph ( find the eulerian path)?

260If there exists an Eulerian path in a given dBG, then such a path can be found in polynomial time. However, in real assembly graphs, such path usually do not exist (even in relatively simple genomes) see e.g. this paper. Furthermore, if such paths do exist, there can easily be an exponential number of them. This is why most dBG assemblers simply report the unambiguous contigs (after quite a bit of "fixing" and simplification). Anyway, the answer to your original question is that Eulerian Cycle and Eulerian Path are both in P.

30k• written 4.6 years ago by Rob ♦4.2k