Question

How does kmer length affect total assembly size?

5

9.0 years ago

StarCute ▴ 110

How does the kmer length affect the total assembly size and the N50 statistic with single-end reads, and why?

Assembly (velvet) • 15k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 9.0 years ago by StarCute ▴ 110

3

9.0 years ago

Antonio R. Franco ★ 5.1k

I would like to add something. I have assembled small genomes such as that of E. coli many times, and compared my assemblies (built with different kmer values) against trusted reference genomes using programs such as Mauve.

In my hands:

1. The kmer value has a direct effect on both the N50 you get and the final quality of your assemblies. Everybody knows that.
2. However, I have not seen a clear and direct correlation between the quality of the assembly and the N50 value. I mean that a longer N50 (something a user is almost desperately looking for) does not necessarily mean you got a better assembly.
3. I have used programs that try to guess in advance the best kmer to use, such as kmergenie. However, at least in my hands, this does not work as well as I need it to.
4. However, I believe that programs working with de Bruijn graphs are better.
5. Because of all this, I don't trust draft genomes very much.
6. I hope that one day some nice company ends up developing a nice sequencer able to deliver long sequences with good quality and high coverage, because the current sequencers fail in one or all of these, and this is actually causing a nightmare.
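
For reference, here is a minimal sketch of how N50 is usually computed, assuming the common definition (the contig length at which the cumulative sum of length-sorted contigs reaches half the assembly size). The function name `n50` and the two contig-length lists are made up for illustration. It shows why a longer N50 on its own says nothing about correctness: one long, possibly misjoined contig dominates the statistic.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Two hypothetical assemblies with the same total size (100 kb).
fragmented = [10_000] * 10    # ten equal contigs
one_long = [90_000, 10_000]   # one long contig, which could be a misjoin

print("fragmented N50:", n50(fragmented))
print("one_long N50:  ", n50(one_long))
```

Both assemblies have the same size, but the second has a much larger N50 whether or not its long contig is correct, so it is worth checking against a reference (for example with Mauve) rather than relying on N50 alone.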

ADD COMMENT • link updated 21 months ago by Ram 43k • written 9.0 years ago by Antonio R. Franco ★ 5.1k

2

On 4, for kb-long reads, an overlap graph is usually better. On 6, with PacBio, you often get one contig for a whole bacterial genome. A recent preprint shows that you can achieve something similar with Oxford Nanopore.

ADD REPLY • link 9.0 years ago by lh3 33k

0

PacBio is terrific with bacterial genomes, that is true. For higher organisms the data are still poor and very expensive. And I must confess I am starting to lose hope with Oxford.

Time will tell.

ADD REPLY • link 9.0 years ago by Antonio R. Franco ★ 5.1k

score 6 · Accepted Answer · 2015-04-24

Very generally speaking:

With a larger kmer size there is a better chance of avoiding ambiguities in the graph between similar regions (repeats, paralogs, ...). Ambiguities occur if a kmer occurs multiple times within the genome. (Unresolvable) ambiguities terminate contigs, hence larger kmer sizes in theory increase N50. However, large kmer sizes are much more sensitive to sequencing errors, heterozygosity and coverage.
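
To make the repeat argument concrete, here is a small sketch (not what velvet does internally) that counts how many distinct kmers occur more than once in a toy genome. The helper names `repeated_kmers` and `rand_dna`, the 30 bp repeat and the flank lengths are all made up for illustration. Every kmer that occurs more than once is a potential ambiguity in the graph.

```python
from collections import Counter
import random


def repeated_kmers(seq, k):
    """Number of distinct kmers that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c > 1)


def rand_dna(n, rng):
    """Random DNA string of length n (toy data only)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


# Toy genome: the same 30 bp repeat placed twice between random flanks.
rng = random.Random(0)
repeat = rand_dna(30, rng)
genome = (rand_dna(200, rng) + repeat + rand_dna(200, rng)
          + repeat + rand_dna(200, rng))

for k in (11, 21, 31, 41):
    print(f"k={k}: {repeated_kmers(genome, k)} repeated kmers")
```

In a toy like this the count should generally drop toward zero once k exceeds the repeat length, because the kmers then reach into the unique flanks. Exact numbers depend on the random sequence, so none are quoted here.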

Assembly size depends on how the assembler handles small ambiguities (bubbles) in the graph and how it handles low-coverage paths. With small kmer sizes and some sensitivity to bubbles, you are more likely to generate a single contig for a slightly noisy region. This might be good in the case of SNPs, or bad if you merge repeats. This effect leads to a smaller assembly size for small kmers. With large kmers you are more likely to generate separate fragments for existing variants, which increases assembly size. However, if these fragments fail, for example, internal coverage or length cutoffs, the final assembly may actually be smaller.
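
As a rough illustration of the bubble effect, the sketch below compares the kmer sets of two haplotypes that differ by a single SNP. The sequences and the helper names `kmers` and `allele_specific_kmers` are hypothetical, and this only counts kmers; it does not build or simplify a graph the way an assembler would.

```python
def kmers(seq, k):
    """Set of all kmers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def allele_specific_kmers(ref, alt, k):
    """Count kmers private to each haplotype: (only in ref, only in alt)."""
    ref_set, alt_set = kmers(ref, k), kmers(alt, k)
    return len(ref_set - alt_set), len(alt_set - ref_set)


# Two toy haplotypes differing by one SNP (A -> C at 0-based position 6).
ref = "GATTACAGCTTGA"
alt = "GATTACCGCTTGA"

for k in (3, 5):
    only_ref, only_alt = allele_specific_kmers(ref, alt, k)
    print(f"k={k}: {only_ref} kmers only in ref, {only_alt} only in alt")
```

Each allele contributes up to k private kmers (every window that overlaps the SNP), so the two branches of the bubble get longer as k grows. Longer branches are more likely to survive as separate fragments, unless a coverage or length cutoff removes them.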