Where does the choice of k = 51 for de Bruijn graphs of large genomes come from?
21 months ago

From hearsay I know that de Bruijn graphs of large genomes (e.g. human) are usually constructed with k = 51, or that k = 51 is at least a good initial choice.

However, I am unable to find any source for this. Does anyone know where it comes from?

de-bruijn-graph • 758 views
21 months ago

For which application and what sequencing technology?

For efficient alignment, k = 51 is clearly too big. For genome assembly, a k of 51 is still within a reasonable range, but already on the high side. You can run KmerGenie to estimate the best k for assembling a given genome, including its repeats. However, the larger the k, the fewer reads cover each k-mer, so assemblies with a large k-mer size start to resemble the greedy algorithm. Since memory is much less of a concern on today's hardware than it was in the 1990s, one can be a bit more permissive, but something in the range of 31-35 might do well for most assemblies nonetheless, in particular if your base call error rate isn't 0.
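To make the coverage argument concrete, here is a rough sketch of how k-mer coverage shrinks as k grows. It assumes the usual back-of-the-envelope model: a read of length L contains L - k + 1 k-mers, and with an independent per-base error rate e a k-mer is error-free with probability (1 - e)^k. The function name and the example numbers (30x coverage, 150 bp reads, 1% error) are made up for illustration and are not taken from any particular assembler.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int,
                  error_rate: float = 0.0) -> float:
    """Approximate coverage of error-free k-mers.

    Assumes uniform read length and independent per-base errors,
    which is a simplification of real sequencing data.
    """
    if k > read_length:
        # A read shorter than k contains no k-mers at all.
        return 0.0
    # Fraction of a read's positions that start a full k-mer.
    kmer_fraction = (read_length - k + 1) / read_length
    # Probability that all k bases of a k-mer were called correctly.
    error_free = (1.0 - error_rate) ** k
    return base_coverage * kmer_fraction * error_free


if __name__ == "__main__":
    # Hypothetical data set: 30x base coverage, 150 bp reads, 1% error rate.
    for k in (21, 31, 35, 51, 71):
        print(k, round(kmer_coverage(30.0, 150, k, error_rate=0.01), 2))
```

Under this model both factors work against a large k: fewer k-mers fit into each read, and each k-mer is more likely to contain at least one error.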

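What is correct, however, is that odd k-mer sizes are usually preferable. With an even k, a k-mer can be a DNA palindrome, i.e. identical to its own reverse complement, which creates ambiguity in the de Bruijn graph. With an odd k this cannot happen, because the middle base would have to be its own complement.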
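If you want to convince yourself of the odd/even point, a small brute-force check does the job. This is only a sketch: it enumerates all k-mers over A, C, G, T for small k and counts those that equal their own reverse complement. The helper names are my own.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k: int) -> int:
    """Count k-mers that are identical to their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


if __name__ == "__main__":
    # Brute force is only feasible for small k (4**k k-mers are enumerated).
    for k in range(1, 7):
        print(k, count_self_reverse_complement(k))
```

For odd k the count is always zero, while for even k the first half of the k-mer can be chosen freely and the second half is then forced, so there are 4^(k/2) such palindromes.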


Thanks for the detailed answer! The application would be genome assembly of short reads. Well actually, what we are doing is storing a k-mer set in small space, so the question would be very general about any kind of k-mer based method. Then it is probably hard to answer though.


Well, unfortunately, the nitty-gritty details required for algorithm design escape me. But I would recommend taking a look at:

In general, though, high-quality genome assemblies nowadays use a combination of short reads and long reads or Hi-C data. No amount of k-mer size optimization is going to give you gains in assembly quality comparable to those from incorporating this additional information.
