Question: Is there a way to calculate the maximum length at which the frequencies of k-mers can be accurately estimated?


Joseph Hughes wrote:

A reviewer has asked us to calculate the theoretical range of optimal k-mer sizes for a given database of viral sequences, stating: "The optimal range lies between: the minimum size for which a maximum number of different features can be found in the string (the viral genome); and the maximum length at which the frequencies of k-mers can be accurately estimated."

I have been trawling the web for an answer to this question and have read this paper by Sims et al. (2009); however, it is not entirely clear to me how, from an eclectic set of unaligned sequences in my database, I can calculate these theoretical minimum and maximum k-mer sizes.

Presumably the limitation on 'accurate' estimation isn't actually about accuracy, and is more like computability? In principle you should be able to count exactly the number of kmers of any size, assuming the scaling hasn't rendered it impractical to compute.

Could you perhaps just plot a distribution for varying kmer sizes and extrapolate to the point where it begins to plateau, i.e. where no new information is (or no new kmers are) added? (If I understand the question.)
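To make the "you can count them exactly" point concrete, here is a minimal sketch of exact k-mer counting over a set of unaligned sequences. The function name `count_kmers` and the toy sequences are hypothetical, and skipping windows that contain ambiguity codes (such as N) is an assumption rather than anything the reviewer or Sims et al. prescribe.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence shorter than k yields an empty range and adds nothing.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: ignore windows containing anything other than A, C, G, T.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Toy input standing in for a real sequence database.
toy = ["ACGTACGTAC", "ACGTTTGA"]
counts = count_kmers(toy, 3)
for kmer, n in counts.most_common(3):
    print(kmer, n)
```

Memory grows with the number of distinct k-mers, so for large databases and large k a dedicated k-mer counter would probably be the more practical route.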

What do you mean by information? Entropy?

No, sorry for the confusion. I just mean until the distribution essentially begins to fall away (in the most extreme case, the longest kmer you can get is the whole genome, with an occurrence of 1), so the longer your kmers become, the less frequent they must be. This should give you a negative gradient on a graph of kmer length vs occurrence, and extrapolating from that to find a maximum 'meaningful' length might be enough to placate your reviewer?

(Mostly thinking aloud, rather than providing a qualified answer in any real way!)
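The kmer length vs occurrence curve described above could be tabulated along these lines. This is only a sketch: `kmer_profile` and its output fields are hypothetical names, and using mean occurrence and the fraction of kmers seen exactly once as the summary statistics is an assumption about what "falling away" should mean, not the criterion from Sims et al. (2009).

```python
from collections import Counter


def kmer_profile(sequence, k_values):
    """For each k, summarise how often distinct k-mers recur in one sequence."""
    rows = []
    for k in k_values:
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        total = sum(counts.values())
        if total == 0:
            # k is longer than the sequence, so there is nothing to report.
            continue
        distinct = len(counts)
        singletons = sum(1 for c in counts.values() if c == 1)
        rows.append({
            "k": k,
            "distinct": distinct,
            # Average number of times each distinct k-mer occurs.
            "mean_occurrence": total / distinct,
            # Share of distinct k-mers seen exactly once.
            "singleton_fraction": singletons / distinct,
        })
    return rows


# Toy genome; a real analysis would loop over every sequence in the database.
for row in kmer_profile("ACGTACGTAC", range(1, 11)):
    print(row)
```

Once the singleton fraction approaches 1, nearly every kmer is unique and its frequency can no longer be estimated from repeated observations, which is one plausible way to read the reviewer's upper bound.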
