5 months ago by
The graphs produced by KmerGenie are essentially specificity vs. sensitivity comparisons: for each k-mer size, they show how many distinct (unique) k-mers you get.
If you have too few unique sequences, you can't extend your contigs. If the sequences aren't unique enough, you end up with a messy assembly in which unrelated sequences are able to overlap.
So you will always end up choosing between long contigs and high-quality contigs.
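To make the effect of k concrete, here is a minimal toy sketch that counts distinct vs. total k-mers in a single sequence. It is only an illustration: the function name `distinct_vs_total` and the toy sequence are made up for this example, and KmerGenie itself works from k-mer abundance histograms of the reads rather than from exact counting like this.

```python
def distinct_vs_total(seq, k):
    """Return (number of distinct k-mers, total number of k-mers) in seq."""
    total = max(len(seq) - k + 1, 0)
    kmers = {seq[i:i + k] for i in range(total)}
    return len(kmers), total


# Hypothetical toy sequence containing a short repeat.
toy = "ACGTACGTAC"
for k in (2, 4, 9):
    distinct, total = distinct_vs_total(toy, k)
    print(f"k={k}: {distinct} distinct out of {total} k-mers")
```

With a small k the repeat collapses into a handful of k-mers that occur several times (ambiguous overlaps), while with a large k every k-mer is distinct but there are far fewer of them to build on.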
Every assembler that uses de Bruijn graphs handles this differently. So, to answer your questions:
A) No, a single value from KmerGenie does not guarantee the best assembly. It's best to try several k-mer sizes and compare statistics such as N50/L50, assembly size, read coverage and maybe gene annotations between each run/assembler. Then pick the measurements you think are most important to your project and use them to define what the best assembly is.
B) If you get multiple peaks, keep in mind that the y-axis doesn't run from 0 to 10; it is often on the order of 10^7, so differences that look small on the plot can be pretty big if you look at the numbers. Secondly, the reported optimum is simply the highest point on the curve and should not be treated as the definitive answer. This comes back to sensitivity vs. specificity: with a higher k, the k-mers become more specific (unique), which might result in higher-quality, although shorter, contigs. So the choice depends on what you need/want from your assembly.
Are you building a de novo reference genome? Try a range of higher k-mer sizes to get somewhat higher-quality contigs.
Want to do some basic GWAS analysis that doesn't require complete chromosomes? Try some of the lower peaks to get longer contigs and thus more data to mess around with.
But in the end, the differences will probably lie more in the choice of assembler than in the choice of k-mer size.
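For the comparison suggested under A), a small sketch for computing N50/L50 and total assembly size from contig lengths might look like this. The function name `n50_l50` and the contig lengths per k are invented for illustration; in practice you would read the lengths from each assembler's FASTA output.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths (longest first) reaches half of the assembly size;
    L50 is how many contigs were needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty assembly


# Hypothetical contig lengths from runs with different k-mer sizes.
runs = {
    21: [900, 700, 400, 300, 200],
    41: [600, 500, 450, 400, 300, 150],
}
for k, lengths in sorted(runs.items()):
    n50, l50 = n50_l50(lengths)
    print(f"k={k}: size={sum(lengths)} contigs={len(lengths)} N50={n50} L50={l50}")
```

Putting the runs side by side like this makes it easier to see what you gain or lose per k-mer size, but N50 alone says nothing about correctness, so combine it with read coverage or annotation checks.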