Question: Is there a consensus on which k-mers should be counted in a histogram of k-mer multiplicity vs. frequency to estimate genome size?
Asked 3.6 years ago by Tom (United States):

I'm trying to develop my own algorithm to count the true k-mers that are part of the genome and exclude the unique/singleton k-mers that result from sequencing errors (or SNPs). The only question is where the most accurate cutoff would be. I want to start counting at the first local minimum of the camel-hump graph; however, some of the k-mers at that threshold are still considered "noise" k-mers.
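To make the cutoff idea concrete, here is a minimal Python sketch of it. It assumes the histogram is a list of (multiplicity, number of distinct k-mers) pairs sorted by multiplicity, for example parsed from a two-column histogram file. The function names are hypothetical, and the genome-size formula (total k-mer occurrences at or above the cutoff, divided by the multiplicity of the main peak) is one commonly used heuristic, not something established in this thread.

```python
def first_local_minimum(hist):
    """Return the multiplicity at the first local minimum of the histogram.

    hist: list of (multiplicity, n_distinct_kmers), sorted by multiplicity.
    """
    for i in range(1, len(hist) - 1):
        prev_n, cur_n, next_n = hist[i - 1][1], hist[i][1], hist[i + 1][1]
        if cur_n <= prev_n and cur_n < next_n:
            return hist[i][0]
    raise ValueError("no local minimum found; the histogram may have a single hump")


def estimate_genome_size(hist, cutoff):
    """Heuristic estimate: k-mer occurrences at or above the cutoff / peak multiplicity."""
    kept = [(m, n) for m, n in hist if m >= cutoff]
    if not kept:
        raise ValueError("cutoff excludes every bin")
    total_occurrences = sum(m * n for m, n in kept)
    # The main peak is taken as the bin with the most distinct k-mers above the cutoff.
    peak_multiplicity = max(kept, key=lambda mn: mn[1])[0]
    return total_occurrences / peak_multiplicity


# Made-up toy histogram: an error spike at multiplicity 1 and a hump around 6.
toy_hist = [(1, 1000), (2, 200), (3, 50), (4, 80), (5, 150),
            (6, 200), (7, 150), (8, 80), (9, 30)]
cutoff = first_local_minimum(toy_hist)
print(cutoff, estimate_genome_size(toy_hist, cutoff))
```

Everything below the cutoff is discarded, including any genomic k-mers that happen to have low coverage, and error k-mers above it are kept. That hard boundary is where the "noise" at the threshold comes from.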


Also, is there a term to distinguish k-mers that should be counted as part of the genome? I've just been calling them "true k-mers", but I haven't been able to find a formal term for them yet.



Tags: jellyfish, kmer, kmergenie, genome

I call them "genomic kmers" and "error kmers".  Whether there is a consensus or not, the best place to draw the line depends on your specific dataset, and the relative impact of false-positives versus false-negatives for your particular purpose.  The concept of drawing a line at some specific threshold already forces a lot of assumptions on the data.

Comment by Brian Bushnell, written 3.6 years ago
Answer by trausch, written 3.5 years ago:

Instead of using a cutoff, you may want to model the k-mer count distribution as a mixture of Poisson distributions for genomic k-mers and artificial k-mers, as proposed in the sga preqc preprint. The preprint also discusses strategies for selecting a suitable k.
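As a rough illustration of the mixture idea, the sketch below fits a two-component Poisson mixture to a k-mer histogram with plain EM. It is not the sga preqc implementation. The function names and initial guesses are hypothetical, and it makes simplifying assumptions: exactly two components (error and genomic), no separate components for repeats or heterozygosity, and no correction for the fact that k-mers with multiplicity 0 are never observed.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_two_poisson_mixture(hist, lam_err, lam_gen, w_err=0.5, n_iter=200):
    """EM fit of w_err * Pois(lam_err) + (1 - w_err) * Pois(lam_gen).

    hist: list of (multiplicity, n_distinct_kmers).
    lam_err, lam_gen: initial guesses for the two means (assumed to be user supplied).
    """
    for _ in range(n_iter):
        # E-step: responsibility of the error component for each multiplicity bin.
        resp = []
        for m, _n in hist:
            p_err = w_err * poisson_pmf(m, lam_err)
            p_gen = (1.0 - w_err) * poisson_pmf(m, lam_gen)
            total = p_err + p_gen
            resp.append(p_err / total if total > 0 else 0.0)
        # M-step: re-estimate the weight and both means from weighted bin counts.
        n_err = sum(r * n for r, (_m, n) in zip(resp, hist))
        n_gen = sum((1.0 - r) * n for r, (_m, n) in zip(resp, hist))
        if n_err <= 0 or n_gen <= 0:
            break  # one component collapsed; keep the last valid estimates
        lam_err = max(sum(r * n * m for r, (m, n) in zip(resp, hist)) / n_err, 1e-9)
        lam_gen = max(sum((1.0 - r) * n * m for r, (m, n) in zip(resp, hist)) / n_gen, 1e-9)
        w_err = n_err / (n_err + n_gen)
    return lam_err, lam_gen, w_err


def posterior_genomic(m, lam_err, lam_gen, w_err):
    """Posterior probability that a k-mer seen m times is genomic."""
    p_err = w_err * poisson_pmf(m, lam_err)
    p_gen = (1.0 - w_err) * poisson_pmf(m, lam_gen)
    total = p_err + p_gen
    return p_gen / total if total > 0 else 1.0
```

With a fit like this, each multiplicity gets a probability of being genomic rather than a yes/no label, so k-mers near the valley of the histogram can be weighted instead of being cut off.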

