Entering edit mode
7.8 years ago
Pierre Lindenbaum
160k
I'm trying to implement a naive implementation of a de Bruijn graph.
For now I wonder how the reverse-complement sequences should be handled ( related: How to handle Reverse complement in De-Bruijn Graph ) . In my code I insert the kmers of a read as well as its' reverse-complement kmers. But I don't think that duplicating the graph is a correct way to handle things.
Here is my current javascript source . It may be just wrong.
Thanks,
Pierre
It's typical to store a canonical version of a kmer, to reduce memory usage. For example, compare a kmer to its reverse-complement and store only the one that is lesser; and upon lookup, only look for the canonical version. But storing both is simpler and sometimes faster, so there's nothing wrong with that. Also, I suggest discarding all the kmers that contain Ns; they cause headaches.
so in TTT/AAA I would only store the AAA : Then, when inserting a new node in the graph, i must take care of inserting on 5' or 3'.
but the graph contains the same information twice isn't it ?
Yes, storing twice as much information has disadvantages. But if you put all the kmers in a hashtable, and you want to test for the presence of a kmer, it's easy - just do a hash lookup. Whereas if you only store the canonical copy, you have to reverse-complement, compare, and then do a hash lookup. This can be slow depending on how efficient your reverse-complement is, particularly if you are representing kmers as Strings.
Normally, when building a DeBruijn graph from reads, you have to do 8 lookup operations per kmer - one for each possible next letter in each direction - so avoiding expensive reverse-complements during that phase can be useful. That said, I prefer to store kmers in a single canonical form.
The list of kmers contains the same information twice, yes, but any given graph would be end up being the same.