13 months ago by

United States

On the other bioinfo Q&A, one of the best questions is about the difference between mathematical and bioinformatics de Bruijn graphs. That question made me think a lot. Here are some of my reflections.

According to wiki, in a k-order "mathematical" de Bruijn graph over alphabet {A,C,G,T}, each vertex is a k-mer; there is an edge between two k-mers if and only if they have a (k-1)-bp overlap. Each edge can be uniquely identified by a (k+1)-mer and vice versa. As such, this graph *always* has `4^k`

vertices and `4^{k+1}`

edges. In the following, I will call this a full de Bruijn graph.

In the context of sequence assembly, we have a set of reads as input. The k-order assembly de Buijn graph derived from the reads is strictly a subgraph of the full graph. It is either *edge-induced* from (k+1)-mers present in the input, or *vertex-induced* from k-mers. Vertex-induced and edge-induced graphs are different, though: vertex-induction requires at least (k-1)-bp overlap between reads while edge-induction requires at least k-bp overlap – edge-induced graph is smaller.

Here is an example. Say we have two 3-bp reads {ACC,CCG} and want to construct a 3-order assembly graph. If we use edge-induction, the graph is empty because there are no 4-mers in reads. If we use vertex-induction, the graph is "ACC=>CCG" because inducing the two vertices automatically induces the edge between them.

Back to the original question. Mathematically, a k-order assembly de Buijn graph is a subgraph of the full k-order de Bruijn graph. As such, in the assembly graph, vertices are k-mers. What vertices and edges are present in this subgraph depends on how the graph is constructed: by edge-induction or by vertex-induction. In implementation, we keep a list of (k+1)-mers as a succinct representation of an edge-induced graph; we keep a list of k-mers for a vertex-induced graph. Both ways work if we just want to derive unitigs. ~~I haven't implemented a de Bruijn graph assembler myself. I don't know which way is more convenient in practice. However, if we want to apply a Euler traversal, probably the edge-induced graph makes more sense. The extra edges in the vertex-induced graph ~~*might* fail the traversal (not so sure!).

**PS:** my reflections are broadly in line with Paul's answer, except that the definition of "order" is different. I am defining "order" as in the original de Bruijn graph; Paul is essentially taking "order" as the length of k-mers used to succinctly store the graph.

EDIT: according to wiki, a k-order full de Bruijn graph is the line graph of a (k-1)-order de Bruijn graph. I think this property still holds for assembly de Bruijn graphs. We note that a list of k-mers is a succinct representation of a vertex-induced k-order graph *AND* of an edge-induced (k-1)-order graph. It actually doesn't matter how we induce graph. The final assembly will be the same either way. This duality is also interesting in that if there is a Euler path in an edge-induced (k-1)-order graph, there will be a Hamilton path in the vertex-induced k-order graph.

•

link
modified 13 months ago
•
written
13 months ago by
lh3 ♦ **31k**