Question

Is there a consensus on which k-mers should be counted in a histogram of k-mer multiplicity vs. frequency to estimate the genome size?

0

8.7 years ago

Tom ▴ 20

I'm trying to develop my own algorithm to count the true k-mers that are a part of the genome and to exclude the unique/singleton k-mers that result from sequencing errors (or SNPs). The only question is where the most accurate cutoff would be. I want to start counting at the first local minimum of the camel-hump graph; however, some of the k-mers at and just above that threshold are still considered "noise" k-mers.

Also, is there a term to distinguish k-mers that should be counted as part of the genome? I've just been calling them true k-mers, but I haven't been able to find a formal term for them yet.
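For concreteness, the cutoff described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the histogram is taken to be a dict mapping multiplicity to the number of distinct k-mers with that multiplicity (for example, parsed from a two-column histogram file), the function names are hypothetical, and the genome size is approximated as the total number of retained k-mer occurrences divided by the multiplicity of the main peak. It assumes a single dominant coverage peak and does nothing about the residual noise k-mers above the cutoff.

```python
def first_local_minimum(hist):
    """Return the smallest multiplicity whose count is a local minimum.

    hist maps multiplicity -> number of distinct k-mers seen that many times.
    """
    mults = sorted(hist)
    for prev, cur, nxt in zip(mults, mults[1:], mults[2:]):
        if hist[cur] <= hist[prev] and hist[cur] < hist[nxt]:
            return cur
    raise ValueError("no local minimum found in histogram")


def estimate_genome_size(hist):
    """Estimate genome size from k-mers at or above the first local minimum.

    Returns (cutoff, peak multiplicity, estimated genome size).
    Assumes one dominant coverage peak to the right of the cutoff.
    """
    cutoff = first_local_minimum(hist)
    kept = {m: n for m, n in hist.items() if m >= cutoff}
    peak = max(kept, key=kept.get)                       # main coverage peak
    total_occurrences = sum(m * n for m, n in kept.items())
    return cutoff, peak, total_occurrences / peak


# Toy histogram (made-up numbers): error spike at 1, coverage peak at 6.
toy = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 150, 6: 200, 7: 150, 8: 80, 9: 30}
print(estimate_genome_size(toy))  # (3, 6, 730.0)
```

On real data the histogram is usually noisier than this toy example, so smoothing the counts before searching for the minimum may be needed.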

Reference: http://pritchardlab.stanford.edu/publications/pdfs/Melsted11.pdf

kmergenie kmer jellyfish genome • 2.5k views

updated 2.1 years ago by Ram 44k • written 8.7 years ago by Tom ▴ 20

2

I call them "genomic kmers" and "error kmers". Whether there is a consensus or not, the best place to draw the line depends on your specific dataset, and the relative impact of false-positives versus false-negatives for your particular purpose. The concept of drawing a line at some specific threshold already forces a lot of assumptions on the data.

updated 4.7 years ago by Ram 44k • written 8.7 years ago by Brian Bushnell 20k

Ram · Answer 1 · 2016-01-10

1

8.6 years ago

trausch ★ 1.9k

Instead of using a cutoff, you may want to model the k-mer count distribution as a mixture of Poisson distributions for genomic k-mers and artificial k-mers, as proposed by sga preqc; the preprint is here: http://arxiv.org/pdf/1307.8026v1.pdf. The preprint also discusses selection strategies for a suitable k.
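To illustrate the general idea (not the exact model from the preprint), here is a minimal sketch of a generic two-component Poisson mixture fitted by expectation-maximization to a k-mer histogram. The function names, starting values, and iteration count are assumptions made for this example. It also ignores the fact that k-mers with multiplicity 0 are never observed, and it models only a single genomic peak (no repeats or heterozygosity).

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k given mean lam (computed in log space)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_two_poisson(hist, lam_err=1.0, lam_gen=20.0, w_err=0.5, iters=200):
    """Fit an error/genomic Poisson mixture to a histogram with EM.

    hist maps multiplicity -> number of distinct k-mers with that multiplicity.
    Returns (w_err, lam_err, lam_gen): error weight and the two Poisson means.
    """
    for _ in range(iters):
        n_err = n_gen = s_err = s_gen = 0.0
        for m, n in hist.items():
            # E-step: responsibility of the error component for multiplicity m.
            p_err = w_err * poisson_pmf(m, lam_err)
            p_gen = (1.0 - w_err) * poisson_pmf(m, lam_gen)
            total = p_err + p_gen
            if total == 0.0:
                continue  # both densities underflowed; skip this bin
            r = p_err / total
            n_err += n * r
            n_gen += n * (1.0 - r)
            s_err += n * r * m
            s_gen += n * (1.0 - r) * m
        if n_err == 0.0 or n_gen == 0.0:
            break  # one component collapsed; keep the last valid estimate
        # M-step: update the mixing weight and the component means.
        w_err = n_err / (n_err + n_gen)
        lam_err = s_err / n_err
        lam_gen = s_gen / n_gen
    return w_err, lam_err, lam_gen


def posterior_genomic(m, w_err, lam_err, lam_gen):
    """Posterior probability that a k-mer with multiplicity m is genomic."""
    p_err = w_err * poisson_pmf(m, lam_err)
    p_gen = (1.0 - w_err) * poisson_pmf(m, lam_gen)
    return p_gen / (p_err + p_gen)
```

With a fitted model, each multiplicity gets a probability of being genomic rather than a hard yes/no, so any threshold can be chosen according to how costly false positives and false negatives are for the application.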

ADD COMMENT • link updated 4.7 years ago by Ram 44k • written 8.6 years ago by trausch ★ 1.9k