Question: Is there a way to calculate the maximum length at which the frequencies of k-mers can be accurately estimated?
Joseph Hughes (2.7k), Scotland, UK, wrote 8 weeks ago:

A reviewer has asked us to calculate the theoretical range of optimal k-mer sizes for a given database of viral sequences, stating: "The optimal range lies between: the minimum size for which a maximum number of different features can be found in the string (the viral genome); and the maximum length at which the frequencies of k-mers can be accurately estimated."

I have been trawling the web for an answer to this question and have read the paper by Sims et al. (2009). However, it is not entirely clear to me how I can calculate these theoretical minimum and maximum k-mer sizes from the eclectic set of unaligned sequences in my database.

Tags: frequency, k-mer

Presumably the limitation on 'accurate' estimation isn't really about accuracy and is more about computability? In principle you should be able to count exactly the number of kmers of any size, assuming the scaling hasn't made it impractical to compute.

Could you perhaps just plot the distribution for varying kmer sizes and extrapolate to the point where it begins to plateau, i.e. where no new kmers (no new information) are added? (If I understand the question.)
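As a rough illustration of that idea, here is a minimal Python sketch that counts the number of distinct kmers across a set of sequences for a range of k. The function name `distinct_kmer_counts` and the toy sequences are made up for this example, and the sketch makes simplifying assumptions: only the forward strand is counted, and ambiguous bases are treated like any other character.

```python
from typing import Dict, Iterable, List


def distinct_kmer_counts(sequences: List[str], k_values: Iterable[int]) -> Dict[int, int]:
    """Return {k: number of distinct kmers of length k} over all sequences.

    Hypothetical helper: forward strand only, no special handling of
    ambiguous bases such as N.
    """
    counts = {}
    for k in k_values:
        seen = set()
        for seq in sequences:
            seq = seq.upper()
            # Slide a window of width k along the sequence.
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        counts[k] = len(seen)
    return counts


if __name__ == "__main__":
    # Toy sequences standing in for a viral database (invented for illustration).
    toy = ["ACGTACGTGACG", "TTGACGTACCA"]
    for k, n in distinct_kmer_counts(toy, range(1, 9)).items():
        print(k, n)
```

Plotting the returned counts against k should show where the number of distinct kmers stops growing. For real genomes a dedicated kmer counter would likely be more practical than this in-memory set.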

written 8 weeks ago by jrj.healey (13k)

What do you mean by information? Entropy?

written 8 weeks ago by Joseph Hughes (2.7k)

No, sorry for the confusion. I just mean until the distribution essentially begins to fall away. In the most extreme case, the longest kmer you can get is the whole genome, with an occurrence of 1, so the longer your kmers become, the less frequent they must be. This should give you a negative gradient on a graph of kmer length vs occurrence, and extrapolating from that to find a maximum 'meaningful' length might be enough to placate your reviewer?

(Mostly thinking aloud, rather than providing a qualified answer in any real way!)
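In the same thinking-aloud spirit, a minimal sketch of the kmer length vs occurrence curve: it computes the mean number of occurrences per distinct kmer for a single sequence. The name `mean_kmer_occurrence` is hypothetical, and using the mean as the summary of 'occurrence' is an assumption made for this example, not something taken from the reviewer or from Sims et al.

```python
from collections import Counter


def mean_kmer_occurrence(sequence: str, k: int) -> float:
    """Mean number of occurrences per distinct kmer of length k.

    Hypothetical helper: forward strand only. Returns 0.0 when k is
    longer than the sequence, since no kmers exist in that case.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if not counts:
        return 0.0
    # Total kmer positions divided by the number of distinct kmers.
    return sum(counts.values()) / len(counts)


if __name__ == "__main__":
    genome = "ACGTACGTGACGTTGACGTACCA"  # toy sequence, invented for illustration
    for k in range(1, len(genome) + 1):
        print(k, round(mean_kmer_occurrence(genome, k), 3))
```

Once the mean approaches 1, nearly every kmer is seen only once, which is one possible (assumed) way to flag lengths at which frequencies can no longer be estimated meaningfully.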

modified 8 weeks ago • written 8 weeks ago by jrj.healey (13k)