Question: confusion in the DeBruijn Graph
3
midox240 wrote:

hello,
in the DeBruijn Graph the nodes(vertices) are le k-mer or the (k-1)mer??
there are works which shows that the nodes(vertices) are k-mer and onther works shows that the nodes(vertices) are k-1 mer !!
it's ambiguous.

graph assembly debruijn • 2.3k views
modified 2.5 years ago by lh331k • written 3.8 years ago by midox240
5
Paul Medvedev60 wrote:

There is no standard definition of de Bruijn graphs (dbgs) in the bioinformatics literature. Multiple papers define it multiple ways. However, there is always an implicit set of equal length strings as input (denote this length by p). For example, if your input is a set of reads of length greater than p, then the implicit input is the set of all p-mers in the reads.

I think the definitions of dbgs in bioinformatics boil down to one of two versions: 1) The edge-centric de Bruijn graph of order p: The nodes are (p-1)-mers in the input, and the edges are all the p-mers in the input. In other words: if we take any two nodes x and y that overlap by p-2, we can merge them together to obtain a p-mer. If that p-mer is in the input, then there is an edge in the graph. Note in this case, there might be nodes that overlap by p - 2 but do not have an edge between them.

2) The node-centric de Bruijn graph of order p: The nodes are the p-mers in the input, and there is an edge between two nodes (i.e. p-mers) if and only if they overlap by p-1.

I am not sure if the terms node-centric and edge-centric are standard, but we've been using them in our papers because they make explicit the distinction between the two graphs.

There can also be confusion in the way people use the value k. For example, sometimes, the authors define a dbg with nodes as k-1 mers and edges for any k-2 overlap between nodes. This is really defining a node centric dbg with p=k-1. Or, people will define a dbg with nodes as k-mers and an edge between two nodes if they overlap by k-1 and their merging leads to a (k+1)-mer in the input. This is really the edge-centric dbg with p=k+1.

To distinguish which type of graph is being used, check if the definition of the edge-set relies strictly on the vertex labels (then, this is a node centric dbg) or if it is using information that is not contained in the node labels (then it is an edge-centric graph). Then, figure out how their value of k relates to the value of p above.

Finally, it should be noted that there is a third definition of dbgs that is not used in bioinformatics but comes from mathematics and predates the definitions above. In this definition, the dbg of order p is the node centric dbg of order p, defined on the universe of all p-mers. That is, all possible strings of length p are in the graph. For this third definition of a dbg, there is no input sequence.

4
Rob3.6k wrote:

Most typically, the de Bruijn graph is considered to be built with the k-mers as edges [clarification: As Heng Points out below, the k-mers actually represent an "edge" and both endpoints.  Technically, the graph is on (k-1)-mers (nodes), which share an edge (k-2)-mer, where a length (k-2) suffix of one is a length (k-2) prefix of the other].  Then, each (k-2)-mer edge links two vertices which represent the "left" and "right" (k-1)-mers of this k-mer.  This representation has a number of benefits.  For example, the set of k-mers in the input can act as a representation of the dBG (since it is, essentially, the edge list of the graph).  This, in turn, leads to efficient representations in terms of things like cascading Bloom filters.  The edge list becomes a sufficient representation of the the graph.  To navigate in this structure (given some initial (k-1)-mer), you can query to see which edges are present by extending this (k-1)-mer to the left or right by a single nucleotide.  This produces a k-mer which you can then use to query the edge set.  It's worth noting that I have seen implementations of the dBG in terms of the node set (i.e. the "node-centric" representation), but the edge-based representation is what I've seen occur most frequently in exposition and implementation.

4

Rob's answer above is valid, but is only one of the existing definitions of de Bruijn graphs in the literature. I actually disagree that it is the most typical. Consider the representations of Conway-Bromage, Minia and cascading Bloom filters, Bowe's succinct dBG. They all define nodes as k-mers, and edges as exact (k-1)-overlaps between nodes.

Anyway, this is only a matter of definitions. In the end, the important property of de Bruijn graphs is that nodes are x-mers for some x, and edges have the property that there is an exact (x-1)-overlap. Whether x=k or x=k-1 does not matter, as k is a parameter. In some definitions, such as Conway-Bromage's, edge (x+1)-mers (formed by overlapping sequences of two nodes in an edge) also need to be present in the data.

1

If we think dBG as an overlap graph, that is (k-1)-mer reads as vertices with (k-2)-mer overlaps as edges. Nonetheless, probably your view is easier for implementation. I don't have first-hand experiences on implementing dBG.

EDIT: actually the list of (k-1)-mer is also a sufficient representation. Given a starting (k-1)-mer, you can cut 1 base from one end and then test whether a 1bp extension from the other end leads to a present (k-1)-mer.

1

You're right — if we think in terms of the implied overlap, then it will be a (k-2)-mer.  The k-mer actually encodes the two vertices and the implied edge between them.

Generally speaking, in graphs, is my understanding correct: nodes hold nouns (data) and edges hold verbs (the satisfied condition that two nodes are indeed connected, which in this case is a k-2 overlap)?

If my understanding is correct, is the (k-2)mer not stored explicitly, just deduced from the two (k-1)mers it connects? If not, what do the nodes and edges store?

So to summarize, the K-mer are the edges and (k-1)mers are the nodes?

1

The nodes are (k-1)-mers and edges are overlaps of length (k-2).  Specifically, these overlaps have the property that the length (k-2) edge is both a suffix of the source node and a prefix of the target node.  Those are the "objects" in the graph.  What I discussed in addition to this in the answer above, is that the set of k-mers is often used as a representation of all of this information.  Given a k-mer, its left and right (k-1)-mers become nodes in the dBG.  Since these (k-1)-mers are drawn directly from the prefix and suffix of a given k-mer, they will always overlap in the prescribed manner (the length k-2 suffix of the left (k-1)-mer will be a length k-2 prefix of the right (k-1)-mer).

TLDR: The nodes of the dBG are (k-1)-mers and the edges are (k-2)-mers that link them.  The original set of k-mers contains the information in the resulting dBG, and is a useful representation of the graph.

3
lh331k wrote:

On the other bioinfo Q&A, one of the best questions is about the difference between mathematical and bioinformatics de Bruijn graphs. That question made me think a lot. Here are some of my reflections.

According to wiki, in a k-order "mathematical" de Bruijn graph over alphabet {A,C,G,T}, each vertex is a k-mer; there is an edge between two k-mers if and only if they have a (k-1)-bp overlap. Each edge can be uniquely identified by a (k+1)-mer and vice versa. As such, this graph always has `4^k` vertices and `4^{k+1}` edges. In the following, I will call this a full de Bruijn graph.

In the context of sequence assembly, we have a set of reads as input. The k-order assembly de Buijn graph derived from the reads is strictly a subgraph of the full graph. It is either edge-induced from (k+1)-mers present in the input, or vertex-induced from k-mers. Vertex-induced and edge-induced graphs are different, though: vertex-induction requires at least (k-1)-bp overlap between reads while edge-induction requires at least k-bp overlap – edge-induced graph is smaller.

Here is an example. Say we have two 3-bp reads {ACC,CCG} and want to construct a 3-order assembly graph. If we use edge-induction, the graph is empty because there are no 4-mers in reads. If we use vertex-induction, the graph is "ACC=>CCG" because inducing the two vertices automatically induces the edge between them.

Back to the original question. Mathematically, a k-order assembly de Buijn graph is a subgraph of the full k-order de Bruijn graph. As such, in the assembly graph, vertices are k-mers. What vertices and edges are present in this subgraph depends on how the graph is constructed: by edge-induction or by vertex-induction. In implementation, we keep a list of (k+1)-mers as a succinct representation of an edge-induced graph; we keep a list of k-mers for a vertex-induced graph. Both ways work if we just want to derive unitigs. I haven't implemented a de Bruijn graph assembler myself. I don't know which way is more convenient in practice. However, if we want to apply a Euler traversal, probably the edge-induced graph makes more sense. The extra edges in the vertex-induced graph might fail the traversal (not so sure!).

PS: my reflections are broadly in line with Paul's answer, except that the definition of "order" is different. I am defining "order" as in the original de Bruijn graph; Paul is essentially taking "order" as the length of k-mers used to succinctly store the graph.

EDIT: according to wiki, a k-order full de Bruijn graph is the line graph of a (k-1)-order de Bruijn graph. I think this property still holds for assembly de Bruijn graphs. We note that a list of k-mers is a succinct representation of a vertex-induced k-order graph AND of an edge-induced (k-1)-order graph. It actually doesn't matter how we induce graph. The final assembly will be the same either way. This duality is also interesting in that if there is a Euler path in an edge-induced (k-1)-order graph, there will be a Hamilton path in the vertex-induced k-order graph.

1
midox240 wrote:

Another question:
the DeBruijn Graph is NP-Hard or polynomial problem??
Thaks

1

I don't understand the question?  The dBG is a combinatorial structure, not a problem.  Which problem are you asking about?

1

sorry, I mean the assembly using de Bruijn graph ( find the eulerian path)?

2

If there exists an Eulerian path in a given dBG, then such a path can be found in polynomial time.  However, in real assembly graphs, such path usually do not exist (even in relatively simple genomes) see e.g. this paper.  Furthermore, if such paths do exist, there can easily be an exponential number of them.  This is why most dBG assemblers simply report the unambiguous contigs (after quite a bit of "fixing" and simplification).  Anyway, the answer to your original question is that Eulerian Cycle and Eulerian Path are both in P.

0
RamRS24k wrote:

Each node is a k-mer. Adjacent k-mers overlap over a length of k-1. This means that navigating along each path, you'll move one base at a time with your reading frame size remaining at k.

2

No, Ram, you were correct also.

1

Oh. I was worried that my knowledge of dBGs had gone totally rusty :)

1

thank you for your explanation but in:

Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotech 29 (11):987-991. doi: 10.1038/nbt.2023

`Briefly, construct a graph B (the original graph called a de Bruijn graph) for which every possible (k – 1)-mer is assigned to a node; connect one (k – 1)-mer by a directed edge to a second (k – 1)-mer if there is some k-mer whose prefix is the former and whose suffix is the latter (Fig. 2). Edges of the de Bruijn graph represent all possible k-mers, and thus an Eulerian cycle in B represents a shortest (cyclic) superstring that contains each k-mer exactly once.`

i think there are a confusions.

1

This is the edge centric de Bruijn graph of order k -1 (See my answer below, with p = k -1)