Question

gap size between k-mer pairs

8.7 years ago

balaani ▴ 30

Hi,

I've been trying to run ABySS on PE data (~90GB) from a plant genomic sample with a k-mer size of 70 (k=70) on an HPC system. Because of memory concerns, I would like to use the paired de Bruijn graph approach; however, I am not sure how to pick an appropriate individual k-mer size (K) for a k-mer pair span (k) of 70 (or any K and k values that give a result close to a single k=70). I suppose a larger gap between pairs means less memory, but is there a way to estimate K for a reasonable gap size?

[For those who may have some more spare time:

In fact, I don't think I understand the idea of k-mer pairs at all. How can the gap be larger than the K value? (In the README, there is an E. coli example with K=16 and k=64, making the size of the gap 32.) In this mode, does the graph again look for K-1 overlaps? Any explanations, or sources where I can read more, would be of great help.]

Many thanks in advance, any help is greatly appreciated.

Best,
Ani

abyss assembly • 3.3k views

updated 19 months ago by Ram 43k • written 8.7 years ago by balaani ▴ 30

Ram · Accepted Answer · 2015-09-08

Hi Ani,

I don't know of any easy way to choose optimal values for K and k; probably the most practical thing to do is assemble with a range of values and see what gives the best results.

Here is an explanation of the paired de Bruijn graph idea, though, which may help with your intuition. Consider an example k-mer of size 8, e.g.:

ACGTACGT

In ABySS's paired de Bruijn graph mode, the parameters K=3 (individual k-mer size) and k=8 (k-mer pair span) would give you a paired k-mer like this:

ACGNNCGT

where the N's are "wildcard" positions (in other words, they can match any base). The intuition behind the paired de Bruijn graph is that, in most cases, the standard 8-mer and the paired k-mer will match the same places in the genome, so it doesn't matter that we are "throwing away" those bases in the middle.
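To make the arithmetic concrete, here is a minimal Python sketch of how a paired k-mer relates to its full-length window. The function names `paired_kmer` and `gap_size` are made up for this illustration and are not part of ABySS. It also shows why the gap (k minus 2K) can be larger than K, as in the README example with K=16 and k=64.

```python
def gap_size(K, k):
    """Number of wildcard bases between the two K-mers of a pair with span k."""
    return k - 2 * K


def paired_kmer(seq, start, K, k):
    """Return the paired k-mer of span k starting at `start`, gap masked with N.

    Illustration only: the gap is written out as N's here for display.
    """
    if k < 2 * K:
        raise ValueError("span k must be at least 2*K")
    window = seq[start:start + k]
    if len(window) < k:
        raise ValueError("window runs past the end of the sequence")
    # Keep the first K and last K bases, mask everything in between.
    return window[:K] + "N" * gap_size(K, k) + window[k - K:]


print(paired_kmer("ACGTACGT", 0, K=3, k=8))  # ACGNNCGT
print(gap_size(16, 64))                      # 32
```

Only the two K-mers at the ends of the window carry sequence, so nothing requires the gap to be shorter than K.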

By that reasoning, 'k' (k-mer pair span) is generally a more important parameter than 'K' (individual k-mer size). So when you are doing your parameter sweep, it is probably best to fix 'K' at some value (e.g. 32) and assemble with a range of 'k' values. Then, when you find a good 'k', you can try adjusting 'K'.
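As a rough sketch of that sweep, the Python snippet below only builds the command lines for a fixed K and a range of k values; it does not run anything. The `K=` and `k=` parameters and the one-directory-per-run layout with `-C` follow the ABySS README. The assembly name, the read paths, and the helper `sweep_commands` are hypothetical, and skipping spans shorter than 2*K is an assumption of this sketch.

```python
def sweep_commands(name, reads, K, spans):
    """Build one abyss-pe command per k-mer pair span, with K held fixed.

    `reads` should use absolute paths, because each run happens inside
    its own output directory (hypothetical layout: K<K>-k<k>).
    """
    commands = []
    for k in spans:
        if k < 2 * K:
            # Assumption of this sketch: the span must hold two K-mers.
            continue
        outdir = f"K{K}-k{k}"
        commands.append(
            f"mkdir -p {outdir} && "
            f"abyss-pe -C {outdir} name={name} K={K} k={k} in='{reads}'"
        )
    return commands


# Hypothetical read files; substitute your own.
for cmd in sweep_commands("plant", "/data/reads_1.fq /data/reads_2.fq", 32, range(64, 97, 8)):
    print(cmd)
```

Once a good k is found, the same helper can be called again with that k fixed and K varied.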

The paired de Bruijn graph is constructed by creating a node for each paired k-mer and an edge between nodes that have paired overlaps.
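The construction can be sketched in a few lines of Python. This is a toy illustration, not ABySS's actual data structure: it assumes that a "paired overlap" means the last K-1 bases of both halves of one node equal the first K-1 bases of both halves of the next, and it ignores reverse complements. The function names are made up for this example.

```python
from collections import defaultdict


def paired_kmers(seq, K, k):
    """All (left, right) K-mer pairs of span k in seq, in order of position."""
    return [(seq[i:i + K], seq[i + k - K:i + k]) for i in range(len(seq) - k + 1)]


def build_paired_graph(seqs, K, k):
    """Map each paired k-mer to the sorted list of paired k-mers it overlaps."""
    nodes = set()
    for s in seqs:
        nodes.update(paired_kmers(s, K, k))

    # Index nodes by the (K-1)-base prefixes of both halves.
    by_prefix = defaultdict(list)
    for left, right in nodes:
        by_prefix[(left[:-1], right[:-1])].append((left, right))

    # An edge u -> v exists when the (K-1)-base suffixes of both halves of u
    # match the (K-1)-base prefixes of both halves of v.
    edges = {}
    for left, right in nodes:
        edges[(left, right)] = sorted(by_prefix.get((left[1:], right[1:]), []))
    return edges


graph = build_paired_graph(["ACGTACGTA"], K=3, k=8)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Under this assumption, each half is still matched on K-1 bases as in an ordinary de Bruijn graph, but an edge requires both halves to agree at once.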

The paired de Bruijn graph idea was described in:

Medvedev, Paul, et al. "Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers." Journal of Computational Biology 18.11 (2011): 1625-1634. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3216098/

Also, this book has some nice diagrams explaining the idea:

Jones, Neil C., and Pavel Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. Website: http://bioinformaticsalgorithms.com/index.htm

Pavel Pevzner et al. also had an online course that explained the paired de Bruijn graph, but I can't seem to find it anymore. Maybe you will have better luck.