I don't know of any easy way to choose optimal values for K and k; probably the most practical thing to do is assemble with a range of values and see what gives the best results.
Here is an explanation of the paired de Bruijn graph idea, though, which may help with your intuition. Consider an example k-mer size of 8, e.g.:
With ABySS's paired de Bruijn graph mode, the parameters K=3 (individual k-mer size) and k=8 (k-mer pair span) would give you a paired k-mer like this:
where the N's are "wildcard" positions (in other words, they can match any base). The intuition behind the paired de Bruijn graph is that, in most cases, the standard 8-mer and the paired k-mer will match the same places in the genome, so it doesn't matter that we are "throwing away" those bases in the middle.
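To make the wildcard idea concrete, here is a small Python sketch. The function name `paired_kmer` and the example sequence are made up for illustration, and I am assuming the span k counts both K-mers plus the gap between them; this is not ABySS code.

```python
def paired_kmer(window, K):
    """Keep K bases at each end of a window of length k; mask the middle with N."""
    k = len(window)  # the k-mer pair span
    if 2 * K > k:
        raise ValueError("the two K-mers must fit inside the span (2*K <= k)")
    return window[:K] + "N" * (k - 2 * K) + window[k - K:]


# Hypothetical 8-mer, chosen only for the example.
print(paired_kmer("ACGTTGCA", 3))  # ACGNNGCA
```

With K=3 and k=8, only the two middle bases are masked. Larger gaps mask more bases but keep the same overall span.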
By that reasoning, 'k' (k-mer pair span) is generally a more important parameter than 'K' (individual k-mer size). So when you are doing your parameter sweep, it is probably best to fix 'K' at some value (e.g. 32) and assemble with a range of 'k' values. Then, when you find a good 'k', you can try adjusting 'K'.
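A sweep like that could be scripted roughly as follows. This sketch only prints command lines and does not run anything. The `abyss-pe name=... K=... k=... in=...` form is my understanding of how paired de Bruijn graph mode is invoked, so check it against the documentation for your ABySS version; the run name, read file names, and the range of k values are placeholders.

```python
def sweep_commands(name, reads, K=32, k_values=range(64, 129, 16)):
    """Build one abyss-pe command line per k value, holding K fixed."""
    commands = []
    for k in k_values:
        if k < 2 * K:
            # Assumption: the span must be able to hold both K-mers.
            continue
        run = f"{name}-K{K}-k{k}"  # distinct name so output files don't collide
        commands.append(f"abyss-pe name={run} K={K} k={k} in='{reads}'")
    return commands


# Placeholder inputs; substitute your own read files.
for cmd in sweep_commands("asm", "reads_1.fq reads_2.fq"):
    print(cmd)
```

Once one of the k values looks good, the same helper can be reused with `k_values=[best_k]` and a few different values of K.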
The paired de Bruijn graph is constructed by creating a node for each paired k-mer and an edge between nodes that have *paired* overlaps, i.e. where both halves of the pair overlap.
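Here is a toy Python sketch of that construction, under my own simplifying assumptions: a single error-free sequence, one strand only, and a "paired overlap" taken to mean that both K-mers overlap the next node's K-mers by K-1 bases. The function names are made up, and a real assembler does far more than this.

```python
from collections import defaultdict


def paired_kmers(seq, K, k):
    """Yield (left, right) K-mer pairs taken from each window of span k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + K], seq[i + k - K:i + k]


def paired_de_bruijn(seq, K, k):
    """Return (nodes, edges); an edge u -> v requires both halves to overlap."""
    nodes = set(paired_kmers(seq, K, k))
    # Index nodes by the (K-1)-prefixes of both halves for quick lookup.
    by_prefix = defaultdict(list)
    for left, right in nodes:
        by_prefix[(left[:-1], right[:-1])].append((left, right))
    edges = set()
    for u in nodes:
        suffixes = (u[0][1:], u[1][1:])
        for v in by_prefix.get(suffixes, []):
            edges.add((u, v))
    return nodes, edges


# Hypothetical sequence, chosen only for the example.
nodes, edges = paired_de_bruijn("ACGTTGCATG", K=3, k=8)
for u, v in sorted(edges):
    print(u, "->", v)
```

Requiring both halves to overlap is what lets the paired graph separate paths that a plain K-mer graph would merge at a repeat.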
The paired de Bruijn graph idea was described in:
Medvedev, Paul, et al. "Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers." Journal of Computational Biology 18.11 (2011): 1625-1634. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3216098/
Also, this book has some nice diagrams explaining the idea:
Jones, Neil C., and Pavel Pevzner. An introduction to bioinformatics algorithms. MIT press, 2004. Website: http://bioinformaticsalgorithms.com/index.htm
Pavel Pevzner et al. also had an online course that explained the paired de Bruijn graph, but I can't seem to find it anymore. Maybe you will have better luck.
4.6 years ago by
benv • 710