Question: gap size between k-mer pairs
5.5 years ago by
balaani30 wrote:


I've been trying to run ABySS on PE data (~90GB) from a plant genomic sample with a k-mer size of 70 (k=70) on an HPC system. Because of memory concerns, I would like to use the paired de Bruijn graph approach; however, I am not sure how to pick an appropriate individual k-mer size (K) for a k-mer pair span (k) of 70 (or any K and k values that would give a result close to a single k=70). I suppose a larger gap between pairs means less memory, but is there a way to estimate K for a reasonable gap size?

[For those who may have some more spare time:

In fact, I don't think I get the idea of k-mer pairs at all. How can the gap be larger than the K value? (In the README, there is an E. coli example with K=16 and k=64, making the size of the gap 32.) In this mode, does the graph again look for K-1 overlaps? Any explanations or sources where I can read more would be of great help.]

Many thanks in advance, any help is greatly appreciated.


abyss assembly • 2.6k views
5.5 years ago by
benv720 wrote:

Hi Ani,

I don't know of any easy way to choose optimal values for K and k; probably the most practical thing to do is to assemble with a range of values and see what gives the best results.

Here is an explanation of the paired de Bruijn graph idea, though, which may help with your intuition. Consider an example k-mer size of 8, e.g.:


With the ABySS paired de Bruijn graph, the parameters K=3 (individual k-mer size) and k=8 (k-mer pair span) would give you a paired k-mer like this:


where the N's are "wildcard" positions (in other words, they can match any base). The intuition behind the paired de Bruijn graph is that, in most cases, the standard 8-mer and the paired k-mer will match the same places in the genome, so it doesn't matter that we are "throwing away" those bases in the middle.
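To make the K=3, k=8 case concrete, here is a minimal Python sketch. The 8-mer in it is made up for illustration (it is not taken from any real data), and the check that k is at least 2*K is an assumption of the sketch, not a documented ABySS rule. The gap is simply k - 2*K, which is why it can be larger than K itself.

```python
def gap_size(K, k):
    """Number of wildcard positions between the two K-mers of a pair spanning k bases."""
    gap = k - 2 * K
    if gap < 0:
        # Assumption of this sketch: the two K-mers must not overlap.
        raise ValueError("k must be at least 2*K")
    return gap


def paired_kmer(window, K):
    """Mask the middle of a k-length window with N's, keeping K bases at each end."""
    k = len(window)
    return window[:K] + "N" * gap_size(K, k) + window[k - K:]


window = "ACGTTGCA"             # hypothetical 8-mer, for illustration only
print(paired_kmer(window, 3))   # ACGNNGCA
print(gap_size(16, 64))         # 32, the K=16, k=64 case from the question
print(gap_size(32, 70))         # 6
```

So for a span of k=70, choosing K=32 leaves 6 wildcard positions, while K=16 would leave 38.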

By that reasoning, 'k' (k-mer pair span) is generally a more important parameter than 'K' (individual k-mer size). So when you are doing your parameter sweep, it is probably best to fix 'K' at some value (e.g. 32) and assemble with a range of 'k' values. Then, when you find a good 'k', you can try adjusting 'K'.
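A sweep like that could be scripted roughly as follows. This is only a sketch: it builds abyss-pe command lines in the same form as the README's paired example (name=, K=, k=, in=) without running anything. The read file names, the output name prefix, and the range of k values are placeholders, and skipping spans shorter than 2*K is an assumption of the sketch.

```python
def sweep_commands(K, k_values, reads, prefix="asm"):
    """Build one abyss-pe command line per k-mer pair span, holding K fixed."""
    commands = []
    for k in k_values:
        if k < 2 * K:
            continue  # assumption: skip spans too short to hold two K-mers
        commands.append(
            f"abyss-pe name={prefix}-K{K}-k{k} K={K} k={k} in='{reads}'"
        )
    return commands


# Placeholder inputs: fix K=32 and try spans from 64 to 96 in steps of 8.
for cmd in sweep_commands(K=32, k_values=range(64, 97, 8),
                          reads="reads1.fa reads2.fa"):
    print(cmd)
```

Giving each run its own name keeps the output files of different assemblies apart, so contiguity statistics can be compared afterwards.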

The paired de Bruijn graph is constructed by creating a node for each paired k-mer and an edge between nodes that have *paired* overlaps.
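As a toy illustration of that construction, the sketch below treats a "paired overlap" as both halves of the pair overlapping by K-1 bases at the same time. That reading is my assumption for the sake of the example, and the code is not how ABySS itself is implemented: it ignores reverse complements, coverage, and error handling, and the input sequence is made up.

```python
from collections import defaultdict


def paired_kmers(seq, K, k):
    """Yield (left, right) K-mer pairs for every k-length window of seq."""
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        yield window[:K], window[k - K:]


def build_paired_graph(seqs, K, k):
    """Map each paired k-mer (node) to the set of paired k-mers that can follow it."""
    nodes = set()
    for seq in seqs:
        nodes.update(paired_kmers(seq, K, k))
    # Index nodes by the (K-1)-base prefixes of both halves for fast lookup.
    by_prefix = defaultdict(set)
    for left, right in nodes:
        by_prefix[(left[:-1], right[:-1])].add((left, right))
    # Assumed "paired overlap": the (K-1)-base suffixes of both halves of one
    # node equal the (K-1)-base prefixes of both halves of the next node.
    graph = {}
    for left, right in nodes:
        graph[(left, right)] = by_prefix.get((left[1:], right[1:]), set())
    return graph


graph = build_paired_graph(["ACGTTGCATG"], K=3, k=8)  # hypothetical read
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Under this reading the answer to "does the graph again look for K-1 overlaps" is yes, except that the overlap has to hold for the left and right K-mers simultaneously.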

The paired de Bruijn graph idea was described in:

Medvedev, Paul, et al. "Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers." Journal of Computational Biology 18.11 (2011): 1625-1634. URL:

Also, this book has some nice diagrams explaining the idea:

Jones, Neil C., and Pavel Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. Website:

Pavel Pevzner et al. also had an online course that explained the paired de Bruijn graph, but I can't seem to find it anymore. Maybe you will have better luck.


Dear Ben,

Many thanks for your response; the whole concept is much clearer to me now.


written 5.5 years ago by balaani30