Question: gap size between k-mer pairs
balaani (Turkey) wrote, 3.6 years ago:

Hi,

I've been trying to run ABySS on paired-end (PE) data (~90 GB) from a plant genomic sample with a k-mer size of 70 (k=70) on an HPC system. Because of memory concerns, I would like to use the paired de Bruijn graph approach; however, I am not sure how to pick an appropriate individual k-mer size (K) for a k-mer pair span (k) of 70 (or any K and k values that give a result close to a single k=70). I suppose a larger gap between the pairs means less memory, but is there a way to estimate K for a reasonable gap size?

[For those who may have some more spare time:

In fact, I don't think I understand the idea of k-mer pairs at all. How can the gap be larger than the K value? (In the README, there is an E. coli example with K=16 and k=64, making the gap size 32.) In this mode, does the graph again look for K-1 overlaps? Any explanations, or sources where I can read more, would be a great help.]

Many thanks in advance, any help is greatly appreciated.
Best,
Ani

benv (Canada) wrote, 3.5 years ago:

Hi Ani,

I don't know of any easy way to choose optimal values for K and k; probably the most practical thing to do is to assemble with a range of values and see what gives the best results.

Here is an explanation of the paired de Bruijn graph idea, though, which may help with your intuition. Consider an example k-mer of size 8, e.g.:

ACGTACGT

With the ABySS paired de Bruijn graph, the parameters K=3 (individual k-mer size) and k=8 (k-mer pair span) would give you a paired k-mer like this:

ACGNNCGT

where the N's are "wildcard" positions (in other words, they can match any base). The number of wildcard positions is k - 2K, here 8 - 2*3 = 2. The intuition behind the paired de Bruijn graph is that, in most cases, the standard 8-mer and the paired k-mer will match the same places in the genome, so it doesn't matter that we are "throwing away" those bases in the middle.
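To make the masking concrete, here is a minimal Python sketch of it. The function names (gap_size, paired_kmer, paired_kmers) are made up for illustration and are not part of ABySS; the sketch also assumes the two K-mers do not overlap, i.e. k >= 2K.

```python
def gap_size(K, k):
    """Number of wildcard positions between the two K-mers of a pair with span k."""
    gap = k - 2 * K
    if K <= 0 or gap < 0:
        # Assumption of this sketch: the two halves must not overlap.
        raise ValueError("need K > 0 and k >= 2*K")
    return gap


def paired_kmer(window, K):
    """Keep K bases at each end of a k-length window and mask the middle with N."""
    k = len(window)
    return window[:K] + "N" * gap_size(K, k) + window[k - K:]


def paired_kmers(seq, K, k):
    """Paired k-mers for every k-length window of seq, in order."""
    return [paired_kmer(seq[i:i + k], K) for i in range(len(seq) - k + 1)]


print(paired_kmer("ACGTACGT", K=3))  # ACGNNCGT
print(gap_size(K=16, k=64))          # 32
```

The second call reproduces the arithmetic from the question: with K=16 and k=64, the gap is 64 - 2*16 = 32, so nothing prevents the gap from being larger than K.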

By that reasoning, 'k' (k-mer pair span) is generally a more important parameter than 'K' (individual k-mer size). So when you are doing your parameter sweep, it is probably best to fix 'K' at some value (e.g. 32) and assemble with a range of 'k' values. Then, when you find a good 'k', you can try adjusting 'K'.
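A sweep like that could be scripted roughly as follows. This is only a sketch that prints candidate command lines and does not run anything: the name=, K=, k= and in= parameters are the ones used in the ABySS README, but the helper name sweep_commands, the "asm" prefix, the read file names and the particular k values are placeholders I made up.

```python
def sweep_commands(reads, K=32, k_values=(64, 70, 76, 82), prefix="asm"):
    """Build one abyss-pe command line per k, with K held fixed.

    reads    -- space-separated read files (placeholder names in this example)
    K        -- individual k-mer size, fixed for the whole sweep
    k_values -- k-mer pair spans to try (arbitrary illustrative values)
    """
    commands = []
    for k in k_values:
        name = f"{prefix}_K{K}_k{k}"  # distinct output prefix per run
        commands.append(f"abyss-pe name={name} K={K} k={k} in='{reads}'")
    return commands


for cmd in sweep_commands("reads1.fa reads2.fa"):
    print(cmd)
```

Once a good k is found, the same helper can be called with a single-element k_values and different K values to do the second half of the sweep.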

The paired de Bruijn graph is constructed by creating a node for each paired k-mer and an edge between nodes that have *paired* overlaps, meaning that both the left K-mers and the right K-mers overlap by K-1 bases.
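Here is a toy Python sketch of that construction, following the definition in the Medvedev et al. paper rather than the actual ABySS code (which I am not reproducing here). The function names are made up for illustration, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def paired_nodes(seq, K, k):
    """(left K-mer, right K-mer) for every k-length window of seq."""
    return [(seq[i:i + K], seq[i + k - K:i + k])
            for i in range(len(seq) - k + 1)]


def build_paired_graph(seqs, K, k):
    """Return (nodes, edges) of a toy paired de Bruijn graph.

    An edge a -> b exists when the left K-mers overlap by K-1 bases
    AND the right K-mers overlap by K-1 bases.
    """
    nodes = set()
    for seq in seqs:
        nodes.update(paired_nodes(seq, K, k))

    edges = defaultdict(set)
    for left, right in nodes:
        # Try all 16 ways of extending both halves by one base.
        for x in "ACGT":
            for y in "ACGT":
                successor = (left[1:] + x, right[1:] + y)
                if successor in nodes:
                    edges[(left, right)].add(successor)
    return nodes, dict(edges)


nodes, edges = build_paired_graph(["ACGTACGTA"], K=3, k=8)
print(edges)  # {('ACG', 'CGT'): {('CGT', 'GTA')}}
```

So the answer to "does the graph again look for K-1 overlaps" is yes in this formulation, but on both halves at once: a node whose left K-mer overlaps while its right K-mer does not is not connected.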

The paired de Bruijn graph idea was described in:

Medvedev, Paul, et al. "Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers." Journal of Computational Biology 18.11 (2011): 1625-1634. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3216098/

Also, this book has some nice diagrams explaining the idea:

Jones, Neil C., and Pavel Pevzner. An introduction to bioinformatics algorithms. MIT press, 2004.  Website: http://bioinformaticsalgorithms.com/index.htm

Pavel Pevzner et al. also had an online course that explained the paired de Bruijn graph, but I can't seem to find it anymore. Maybe you will have better luck.


Dear Ben,

Many thanks for your response; the whole concept is much clearer to me now.

Best,

(reply written 3.5 years ago by balaani)