Question: Implementing a naive javascript-based de Bruijn graph
3
gravatar for Pierre Lindenbaum
2.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

I'm trying to implement a naive implementation of a de Bruijn graph.

For now I wonder how the reverse-complement sequences should be handled ( related: How to handle Reverse complement in De-Bruijn Graph ) . In my code I insert the kmers of a read as well as its' reverse-complement kmers. But I don't think that duplicating the graph is a correct way to handle things.

Here is my current javascript source . It may be just wrong.

Thanks,

Pierre

ADD COMMENTlink written 2.8 years ago by Pierre Lindenbaum118k
2

It's typical to store a canonical version of a kmer, to reduce memory usage. For example, compare a kmer to its reverse-complement and store only the one that is lesser; and upon lookup, only look for the canonical version. But storing both is simpler and sometimes faster, so there's nothing wrong with that. Also, I suggest discarding all the kmers that contain Ns; they cause headaches.

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

compare a kmer to its reverse-complement and store only the one that is lesser

so in TTT/AAA I would only store the AAA : Then, when inserting a new node in the graph, i must take care of inserting on 5' or 3'.

But storing both is simpler and sometimes faster

but the graph contains the same information twice isn't it ?

ADD REPLYlink written 2.8 years ago by Pierre Lindenbaum118k
2

Yes, storing twice as much information has disadvantages. But if you put all the kmers in a hashtable, and you want to test for the presence of a kmer, it's easy - just do a hash lookup. Whereas if you only store the canonical copy, you have to reverse-complement, compare, and then do a hash lookup. This can be slow depending on how efficient your reverse-complement is, particularly if you are representing kmers as Strings.

Normally, when building a DeBruijn graph from reads, you have to do 8 lookup operations per kmer - one for each possible next letter in each direction - so avoiding expensive reverse-complements during that phase can be useful. That said, I prefer to store kmers in a single canonical form.

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k
1

The list of kmers contains the same information twice, yes, but any given graph would be end up being the same.

ADD REPLYlink written 2.8 years ago by Devon Ryan88k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1104 users visited in the last hour