I'm trying to figure out how KmerGenie estimates the number of good k-mers in each k-mer frequency histogram.

Supposedly the script that counts all the good k-mers is scripts/est-genomic-kmers.r. However, I don't have a good background in R, so I'm trying to get a more basic mathematical understanding of the estimation method so that I can replicate it in Python. My end goal is to take the histograms from Jellyfish (bash command: jellyfish histo mer_counts.jf) and develop a script to count the good k-mers from them.
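For what it's worth, `jellyfish histo` prints two whitespace-separated columns (multiplicity, number of distinct k-mers with that multiplicity). A minimal Python sketch to load that into lists, assuming that plain two-column format (the function name `read_histo` is just my own), could be:

```python
# Parse `jellyfish histo` output: column 1 = k-mer multiplicity,
# column 2 = number of distinct k-mers with that multiplicity.
def read_histo(path):
    multiplicities, counts = [], []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip any malformed or empty lines
            multiplicities.append(int(fields[0]))
            counts.append(int(fields[1]))
    return multiplicities, counts
```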

The other big issue I'm running into is how to handle histogram frequency curves that DO NOT follow the characteristic camel-hump shape. This is especially true when there is a low number of reads: some of these histograms simply look like decay functions with no local minima or maxima. I don't quite know the theory on how to approach those cases.

I believe the pattern you are seeing is due to low genome coverage and/or repeats, and I'm sure others can comment on that point. My experience is that these k-mer-frequency-based genome size estimation methods require high coverage, especially with repetitive genomes like plants; otherwise you don't get the expected clean peaks.

It is in fact extremely low-coverage data. I am trying to build a curve of genome size estimates from a low number of reads to a high number of reads, to see after how many reads my estimate finally levels off at exactly my genome size value.

I don't think it is possible to get genome size from k-mers from extremely low coverage data. I spent a lot of time working on this myself and I wish I could say there is a solution but I never got close to real genome size estimates, nor could I get a confident answer about this approach from the authors of different k-mer tools. I did work out a method that uses low coverage data and is accurate, but it does not involve k-mers. The method is described in this paper and makes use of transcriptome data, and it isn't necessary to have transcriptome data from each species.

The idea isn't so much to determine genome size given a low number of reads; in fact, we expect a low total yield for a low number of reads. The idea is to take reads at several different subsample sizes, plot the resulting estimates as a curve, extrapolate that curve, and determine what percentage of the genome has probably been covered.

This is a bit too abstract for me to follow but I think I can grasp what you mean. It seems like you want to know the coverage at a particular point on a curve? This is inherently tied to knowing (or calculating) genome size, so again, I don't think there is a practical solution using k-mers but I'd be interested to see an attempt. If this is not what you meant then I'm not clear what you are trying to express.

That's pretty close to what I mean. I'm taking high-coverage reads of a known organism. I construct a curve that relates how many genomic k-mers exist to how many reads are given. The graph will look like a half-dome with an asymptote equal to the genome size. Based on the curvature, you can tell how much of the organism's total genome your reads cover. It gives a scale of how complete your reads are for a given organism.

Here's an applied example. Say I have reads that I SUSPECT cover only half the genome of an unknown organism. I apply the above methodology and get a curve. If the curve seems to stop right in the middle, or does not level off, I can say that the sequencing is incomplete and I will need to go back to the lab to re-sequence. You get the idea: the curvature (i.e. the slope) correlates with coverage completeness.
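To make that concrete, the saturation curve could be fitted with something like the sketch below. The exponential saturation model G * (1 - exp(-c * reads)) is my own illustrative assumption (a Lander-Waterman-style form), not a method from KmerGenie or any other tool, and the function names are hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch: fit distinct genomic k-mers observed at each
# subsample size to a curve whose asymptote G is the (k-mer) genome size.
def saturation(reads, G, c):
    # G: asymptote (total distinct genomic k-mers), c: discovery rate
    return G * (1.0 - np.exp(-c * reads))

def estimate_genome_size(read_counts, distinct_kmers):
    read_counts = np.asarray(read_counts, dtype=float)
    distinct_kmers = np.asarray(distinct_kmers, dtype=float)
    # rough starting guesses for the optimizer
    p0 = [distinct_kmers.max() * 1.5, 1.0 / read_counts.max()]
    (G, c), _ = curve_fit(saturation, read_counts, distinct_kmers, p0=p0)
    return G, c
```

If the fitted G comes out far above the largest observed k-mer count, the curve has not leveled off yet and more sequencing is needed, which matches the re-sequencing decision described above.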

KmerGenie relies on the fact that the histogram has two components: a distribution of abundances corresponding to erroneous k-mers, and a distribution of abundances corresponding to correct k-mers. It models those components with known probability distributions (e.g. correct k-mers are assumed to follow a Gaussian distribution) and finds their parameters using optimization.
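A toy version of that two-component idea (NOT KmerGenie's actual model, which is more sophisticated) can be sketched with SciPy: model the histogram as an exponential-decay error component plus a Gaussian genomic component, and fit all five parameters by least squares. The exponential error shape and the function names are my own assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy mixture: exponential decay for erroneous k-mers plus a Gaussian
# hump for correct (genomic) k-mers.
def mixture(x, a, r, g, mu, sigma):
    errors = a * np.exp(-r * x)
    genomic = g * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return errors + genomic

def fit_histogram(multiplicity, count):
    multiplicity = np.asarray(multiplicity, dtype=float)
    count = np.asarray(count, dtype=float)
    peak = np.argmax(count[1:]) + 1  # crude guess at the genomic hump
    p0 = [count[0], 1.0, count[peak], multiplicity[peak], 5.0]
    params, _ = curve_fit(mixture, multiplicity, count, p0=p0, maxfev=20000)
    return params  # a, r, g, mu, sigma
```

On clean synthetic data this recovers the hump position mu and height g; a real histogram would need the error component's shape matched to the data, which is exactly what breaks down at low coverage.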

Informally, in low coverage situations, it is difficult to find parameters of the correct kmers component, because in terms of abundance, they look like errors. Low-abundance kmers are either erroneous or correct, but both cases are statistically equally likely in this context. Thus, unless you know more about the error rate in your data (using e.g. a reference genome or some other means), taking exactly the same approach as kmergenie is likely to fail for low-coverage data (confirmed by experiments). If you do have prior information about the erroneous kmer abundance distribution (i.e. its shape and the total number of erroneous kmers), then with some modifications the kmergenie model could work, I think.

Also: jellyfish is fine, just letting you know that the DSK kmer counter can also produce histograms ;)

Thank you for your reply.

I'm interested in how KmerGenie is able to sum the area under the curve from the minimum onward using the Gaussian distribution. I'm not well versed in R, so it has been difficult for me to reverse-engineer the algorithm and re-implement it in Python. Could you direct me to a source where I can learn how to do this?

The way I'm currently doing it is simply to find the local minimum and then sum all k-mers after that point. I don't think that's a very accurate method, as I am losing a lot of signal (genomic/solid/true k-mers) that is hidden underneath the error region.
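Roughly, the naive approach I just described looks like this (a simplified sketch of my own method, with a hypothetical function name):

```python
# Find the first local minimum of the histogram (the "valley" between
# the error and genomic humps) and sum all distinct k-mers after it.
def count_genomic_kmers(counts):
    # counts[i] = number of distinct k-mers with multiplicity i+1
    valley = next(
        (i for i in range(1, len(counts) - 1)
         if counts[i - 1] > counts[i] <= counts[i + 1]),
        None,
    )
    if valley is None:
        return None  # decay-only histogram: no valley, method fails
    return sum(counts[valley:])
```

This also makes the low-coverage failure mode explicit: a pure-decay histogram has no valley, so the function returns nothing.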

You could have a look at Quake. It is the model KmerGenie is based on, so it may be a simpler point of entry, though it is also in R. Since the source code of both Quake and KmerGenie is available, learning the basics of R might help you understand what the code is doing.
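Regarding the area-under-the-Gaussian part of your question: once the Gaussian component's fitted height g and width sigma are known (however you obtain them), its total area is analytic, so there is no need to sum bins numerically past the minimum. A one-liner sketch (parameter names are illustrative):

```python
import math

# Total area of a Gaussian component g * exp(-(x - mu)^2 / (2 sigma^2)),
# including the part buried under the error peak near the origin.
def genomic_kmer_total(g, sigma):
    return g * sigma * math.sqrt(2.0 * math.pi)
```

This is what lets a mixture-model approach recover the genomic k-mers hidden under the error region, instead of discarding them the way a valley-and-sum method does.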