Is there a consensus on which k-mers should be counted in a histogram of k-mer multiplicity vs. frequency to estimate genome size?
8.9 years ago
Tom ▴ 20

I'm trying to develop my own algorithm to count the true k-mers that are part of the genome and to exclude the unique/singleton k-mers that result from sequencing errors (or SNPs). My only question is where the most accurate cutoff would be. I want to start counting at the first local minimum of the camel-hump graph; however, some of the k-mers at or just above that threshold are still considered "noise" k-mers.
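For concreteness, the cutoff idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the histogram is a plain list where `hist[i]` is the number of distinct k-mers seen exactly `i + 1` times, the function names `first_valley` and `estimate_genome_size` are made up for this sketch, and the genome size is taken as the total k-mer count at or above the cutoff divided by the multiplicity of the main peak. Real histograms are noisy, so some smoothing before the valley search may be needed.

```python
def first_valley(hist):
    """Return the multiplicity at the first local minimum of the histogram.

    hist[i] is the number of distinct k-mers observed exactly i + 1 times
    (assumed layout for this sketch).
    """
    for i in range(1, len(hist) - 1):
        if hist[i] <= hist[i - 1] and hist[i] < hist[i + 1]:
            return i + 1  # convert list index to multiplicity
    return None


def estimate_genome_size(hist):
    """Estimate genome size from k-mers at or above the first valley."""
    cutoff = first_valley(hist)
    if cutoff is None:
        raise ValueError("no local minimum found; histogram has no clear valley")
    mults = range(cutoff, len(hist) + 1)
    # Multiplicity of the main (genomic) peak, used as the depth estimate.
    peak = max(mults, key=lambda m: hist[m - 1])
    # Total number of k-mer occurrences counted as genomic.
    total = sum(m * hist[m - 1] for m in mults)
    return cutoff, peak, total / peak


# Toy histogram (invented numbers): error spike at 1x, genomic peak at 8x.
toy_hist = [1000, 200, 50, 20, 40, 90, 150, 200, 150, 90, 40, 10]
cutoff, peak, size = estimate_genome_size(toy_hist)  # (4, 8, 785.0)
```

Note that a hard cutoff like this discards any genomic k-mers below the valley and keeps any error k-mers above it, which is exactly the ambiguity raised in the question.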

Also, is there a term to distinguish k-mers that should be counted as part of the genome? I've just been calling them "true k-mers", but I haven't been able to find a formal term for them yet.

Reference: http://pritchardlab.stanford.edu/publications/pdfs/Melsted11.pdf

kmergenie kmer jellyfish genome • 2.5k views

I call them "genomic kmers" and "error kmers". Whether there is a consensus or not, the best place to draw the line depends on your specific dataset, and the relative impact of false-positives versus false-negatives for your particular purpose. The concept of drawing a line at some specific threshold already forces a lot of assumptions on the data.

8.8 years ago
trausch ★ 1.9k

Instead of using a cutoff, you may want to model the k-mer count distribution as a mixture of Poisson distributions for genomic k-mers and artificial k-mers, as proposed by sga preqc; the preprint is here: http://arxiv.org/pdf/1307.8026v1.pdf. The preprint also discusses strategies for selecting a suitable k.
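To give a feel for the mixture idea, here is a minimal sketch that fits a two-component Poisson mixture to a k-mer histogram with EM. It is a deliberately simplified illustration and not the model from the preprint: it assumes just one error component and one genomic component, ignores repeats and heterozygosity, and ignores the fact that multiplicity 0 is never observed. The function names and the dict layout (`multiplicity -> number of distinct k-mers`) are assumptions made for this example.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_two_poisson(hist, lam_err=1.0, iters=200):
    """Fit an error + genomic Poisson mixture to hist with EM.

    hist maps multiplicity -> number of distinct k-mers (assumed layout).
    Returns (w_err, lam_err, lam_gen).
    """
    n_total = sum(hist.values())
    # Start the genomic mean at the overall mean multiplicity.
    lam_gen = sum(k * n for k, n in hist.items()) / n_total
    w_err = 0.5
    for _ in range(iters):
        n_err = s_err = n_gen = s_gen = 0.0
        for k, n in hist.items():
            p_err = w_err * poisson_pmf(k, lam_err)
            p_gen = (1.0 - w_err) * poisson_pmf(k, lam_gen)
            if p_err + p_gen == 0.0:
                continue  # both components underflow; skip this bin
            r = p_err / (p_err + p_gen)  # E-step: P(error | multiplicity k)
            n_err += n * r
            s_err += n * r * k
            n_gen += n * (1.0 - r)
            s_gen += n * (1.0 - r) * k
        # M-step: update mixing weight and component means.
        w_err = n_err / (n_err + n_gen)
        lam_err = s_err / n_err
        lam_gen = s_gen / n_gen
    return w_err, lam_err, lam_gen


def prob_error(k, w_err, lam_err, lam_gen):
    """Posterior probability that a k-mer seen k times is an error k-mer."""
    p_err = w_err * poisson_pmf(k, lam_err)
    p_gen = (1.0 - w_err) * poisson_pmf(k, lam_gen)
    return p_err / (p_err + p_gen)
```

With a fit like this, each multiplicity gets a probability of being an error k-mer rather than a yes/no label, so the trade-off between false positives and false negatives can be chosen explicitly instead of being fixed by a single threshold.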
