Is there a consensus on which k-mers should be counted in a histogram of k-mer multiplicity vs. frequency to estimate the genome size?
8.4 years ago
Tom ▴ 20

I'm trying to develop my own algorithm to count the true k-mers that are part of the genome, and to exclude the unique/singleton k-mers that result from sequencing errors (or SNPs). The only question is where the most accurate cutoff would be. I want to start counting at the first local minimum of the camel-hump graph; however, some of the k-mers at and just above that threshold are still considered "noise" k-mers.

Also, is there a term to distinguish k-mers that should be counted as part of the genome? I've just been calling them "true k-mers", but I haven't been able to find a formal term for them yet.
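To make the cutoff idea concrete, here is a minimal sketch of what I mean, assuming the histogram is available as a dict mapping multiplicity to the number of distinct k-mers with that multiplicity (for example, parsed from a two-column histogram file). The function names and the toy numbers are made up for illustration, and dividing total k-mer occurrences by the peak multiplicity is only one common way to turn the counts into a size estimate.

```python
def first_local_minimum(hist):
    """Return the first multiplicity whose count is lower than its left
    neighbour and not higher than its right neighbour, or None."""
    mults = sorted(hist)
    for prev, cur, nxt in zip(mults, mults[1:], mults[2:]):
        if hist[cur] < hist[prev] and hist[cur] <= hist[nxt]:
            return cur
    return None


def estimate_genome_size(hist, cutoff):
    """Estimate genome size from k-mers with multiplicity >= cutoff.

    Assumption: size ~ (total k-mer occurrences kept) / (peak multiplicity),
    where the peak is the most populated multiplicity at or above the cutoff.
    """
    kept = {m: n for m, n in hist.items() if m >= cutoff}
    peak = max(kept, key=kept.get)
    total_occurrences = sum(m * n for m, n in kept.items())
    return total_occurrences / peak


# Toy histogram (invented numbers): an error spike at 1 and a hump around 6.
toy_hist = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 150, 6: 200, 7: 150, 8: 80, 9: 30}

cutoff = first_local_minimum(toy_hist)
print(cutoff, estimate_genome_size(toy_hist, cutoff))
```

On the toy data the cutoff lands at multiplicity 3 and the estimate is 730.0. The problem I describe above is visible here too: the k-mers at multiplicities 3 and 4 are all counted as genomic, even though some of them probably belong to the tail of the error spike.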


kmergenie kmer jellyfish genome • 2.4k views

I call them "genomic kmers" and "error kmers". Whether there is a consensus or not, the best place to draw the line depends on your specific dataset, and the relative impact of false-positives versus false-negatives for your particular purpose. The concept of drawing a line at some specific threshold already forces a lot of assumptions on the data.

8.3 years ago
trausch ★ 1.9k

Instead of using a cutoff, you may want to model the k-mer count distribution as a mixture of Poisson distributions for genomic k-mers and artificial (error) k-mers, as proposed by sga preqc and described in its preprint. The preprint also discusses selection strategies for a suitable k.
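As a rough illustration of the mixture idea (this is not the sga preqc implementation), the sketch below fits a two-component Poisson mixture to a k-mer histogram with a plain EM loop. The function names, starting values, and toy histogram are assumptions made for the example, and it ignores refinements such as zero-truncation or separate components for repeats and heterozygosity.

```python
import math


def poisson_pmf(k, lam):
    # Evaluated in log space so large k does not overflow.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_two_poisson(hist, lam_err=1.0, lam_gen=10.0, weight_err=0.5, iters=200):
    """EM fit of an error + genomic Poisson mixture.

    hist maps multiplicity -> number of distinct k-mers.
    Returns (weight_err, lam_err, lam_gen, resp), where resp[m] is the
    probability (from the last E-step) that a k-mer seen m times is an error.
    """
    total = sum(hist.values())
    resp = {}
    for _ in range(iters):
        # E-step: responsibility of the error component per multiplicity.
        for m in hist:
            p_err = weight_err * poisson_pmf(m, lam_err)
            p_gen = (1.0 - weight_err) * poisson_pmf(m, lam_gen)
            denom = p_err + p_gen
            resp[m] = p_err / denom if denom > 0 else 0.5
        # M-step: re-estimate the mixing weight and both means.
        n_err = sum(n * resp[m] for m, n in hist.items())
        n_gen = total - n_err
        if n_err <= 0 or n_gen <= 0:
            break  # one component collapsed; keep the last estimates
        lam_err = sum(n * resp[m] * m for m, n in hist.items()) / n_err
        lam_gen = sum(n * (1.0 - resp[m]) * m for m, n in hist.items()) / n_gen
        weight_err = n_err / total
    return weight_err, lam_err, lam_gen, resp


# Toy histogram (invented numbers): an error spike at 1 and a hump around 6.
toy_hist = {1: 1000, 2: 200, 3: 50, 4: 80, 5: 150, 6: 200, 7: 150, 8: 80, 9: 30}

weight_err, lam_err, lam_gen, resp = fit_two_poisson(toy_hist)
# Expected number of genomic distinct k-mers under the fitted model.
genomic_kmers = sum(n * (1.0 - resp[m]) for m, n in toy_hist.items())
print(weight_err, lam_err, lam_gen, genomic_kmers)
```

The point of the soft assignment is that a k-mer seen three or four times is counted fractionally as genomic instead of being kept or dropped outright by a hard threshold.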

