Question: Is there a consensus on which k-mers should be counted in a histogram of k-mer multiplicity vs. frequency to estimate genome size?
21 months ago by
United States
Tom20 wrote:

I'm trying to develop my own algorithm to count the true k-mers that are part of the genome and to exclude the unique/singleton k-mers that result from sequencing errors (or SNPs). My only question is where the most accurate cutoff would be. I want to start counting at the first local minimum of the "camel hump" histogram; however, some of the k-mers at that threshold are still considered "noise" k-mers.


Also, is there a term to distinguish kmers that should be counted as part of the genome? I've just been calling them true-kmers, but am not able to find a formalized term for it yet.
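
The valley-based cutoff described above can be sketched as follows. This is only a minimal illustration, not a recommended method: the function names and the toy histogram layout (a dict mapping multiplicity to the number of distinct k-mers) are assumptions made for the example, and the genome size is approximated as the total number of retained k-mer occurrences divided by the multiplicity of the main peak.

```python
def first_local_minimum(hist):
    """Return the first multiplicity whose count is a local minimum.

    hist: dict mapping multiplicity -> number of distinct k-mers
    (hypothetical layout chosen for this sketch).
    """
    mults = sorted(hist)
    for prev, cur, nxt in zip(mults, mults[1:], mults[2:]):
        if hist[cur] <= hist[prev] and hist[cur] < hist[nxt]:
            return cur
    raise ValueError("no valley found in histogram")


def estimate_genome_size(hist, cutoff=None):
    """Estimate genome size from k-mers with multiplicity >= cutoff."""
    if cutoff is None:
        cutoff = first_local_minimum(hist)
    kept = {m: n for m, n in hist.items() if m >= cutoff}
    # Multiplicity of the main peak, taken as the k-mer coverage depth.
    peak = max(kept, key=kept.get)
    # Total k-mer occurrences among the retained k-mers.
    total_kmers = sum(m * n for m, n in kept.items())
    return total_kmers / peak


# Toy histogram: an error spike at low multiplicity, then a peak at 6.
toy_hist = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 150,
            6: 200, 7: 150, 8: 80, 9: 30}
```

On this toy histogram the valley is at multiplicity 3 and the estimate is 730.0; moving the cutoff to 4 gives 705.0, which shows how directly the result depends on where the line is drawn.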



jellyfish kmer kmergenie genome

I call them "genomic kmers" and "error kmers".  Whether there is a consensus or not, the best place to draw the line depends on your specific dataset, and the relative impact of false-positives versus false-negatives for your particular purpose.  The concept of drawing a line at some specific threshold already forces a lot of assumptions on the data.

written 21 months ago by Brian Bushnell
20 months ago by
trausch730 wrote:

Instead of using a cutoff, you may want to model the k-mer count distribution as a mixture of Poisson distributions for genomic k-mers and artificial k-mers, as proposed in the sga preqc preprint. The preprint also discusses selection strategies for a suitable k.
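
To make the mixture idea concrete, here is a rough sketch that fits a two-component Poisson mixture to a k-mer histogram with EM. It is a simplified illustration and not the sga preqc implementation: the function names, the starting values, and the histogram layout (a dict mapping multiplicity to the number of distinct k-mers) are assumptions made for the example. It also ignores that k-mers with a count of zero are never observed, and it assumes both components keep a nonzero weight.

```python
import math


def genomic_posterior(m, w, lam_err, lam_gen):
    """Posterior probability that a k-mer seen m times is genomic.

    w is the mixture weight of the genomic component. The m! terms of
    the two Poisson densities cancel in the log-odds.
    """
    log_odds_err = (math.log(1 - w) - math.log(w)
                    + m * (math.log(lam_err) - math.log(lam_gen))
                    - (lam_err - lam_gen))
    if log_odds_err > 700:  # avoid overflow in exp()
        return 0.0
    return 1.0 / (1.0 + math.exp(log_odds_err))


def fit_poisson_mixture(hist, lam_err=1.0, lam_gen=None, w=0.5, iters=200):
    """EM fit; hist maps multiplicity -> number of distinct k-mers."""
    total = sum(hist.values())
    if lam_gen is None:
        # Assumed starting point: occurrence-weighted mean multiplicity.
        lam_gen = (sum(m * m * n for m, n in hist.items())
                   / sum(m * n for m, n in hist.items()))
    for _ in range(iters):
        # E-step: expected number of genomic k-mers at each multiplicity.
        gen = {m: n * genomic_posterior(m, w, lam_err, lam_gen)
               for m, n in hist.items()}
        gen_total = sum(gen.values())
        err_total = total - gen_total
        # M-step: update the weight and both Poisson means.
        w = min(max(gen_total / total, 1e-9), 1 - 1e-9)
        lam_gen = sum(m * g for m, g in gen.items()) / gen_total
        lam_err = sum(m * (hist[m] - gen[m]) for m in hist) / err_total
    return w, lam_err, lam_gen
```

With the fitted parameters, `genomic_posterior(m, w, lam_err, lam_gen)` gives a soft classification per multiplicity, so each k-mer can be weighted by its probability of being genomic instead of being kept or dropped at a hard threshold.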

