I'm trying to figure out how KmerGenie decides which k-mers count as "good" in each k-mer frequency histogram.
Supposedly the script that counts the good k-mers is scripts/est-genomic-kmers.r. However, I don't have a strong background in R, so I'm trying to get a more basic mathematical understanding of this estimation method so that I can replicate it in Python. My end goal is to do the same thing with the histograms from jellyfish (bash command: jellyfish histo mer_counts.jf) by developing a script that counts the good k-mers.
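To make the goal concrete, here is a minimal sketch of the counting step, assuming the usual two-column output of jellyfish histo (multiplicity, then number of distinct k-mers with that multiplicity). The function names and the hard `cutoff` parameter are hypothetical; a fixed threshold is a simplification and is not a reimplementation of scripts/est-genomic-kmers.r.

```python
def parse_histo(text):
    """Parse histogram text into {multiplicity: number_of_distinct_kmers}.

    Assumes one whitespace-separated 'multiplicity count' pair per line.
    """
    histo = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histo[int(fields[0])] = int(fields[1])
    return histo


def count_kmers_at_or_above(histo, cutoff):
    """Count distinct k-mers whose multiplicity is >= cutoff.

    `cutoff` is an assumed error threshold: k-mers seen fewer times
    than this are treated as sequencing errors and ignored.
    """
    return sum(n for mult, n in histo.items() if mult >= cutoff)


# Made-up toy histogram, for illustration only.
example = """1 1000
2 200
3 50
4 80
5 120
6 90
7 30
"""

histo = parse_histo(example)
print(count_kmers_at_or_above(histo, cutoff=3))
```

For a real run, the histogram could be saved with `jellyfish histo mer_counts.jf > mer_counts.histo` and its contents passed to `parse_histo`. How to choose the cutoff is the open part of the question.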
The other big issue I'm running into is how to handle frequency histograms that DO NOT follow the characteristic camel-hump curve. This is especially common when there is a low number of reads. Some of these histograms simply look like decay functions with no local minima or maxima. I don't know the theory for approaching those cases.
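One way to at least detect the problematic case is to check whether the histogram ever rises again after the initial drop. The sketch below is a simple heuristic under stated assumptions, not KmerGenie's method: `counts[i]` is taken to be the number of distinct k-mers seen `i + 1` times, and the moving-average `window` is an arbitrary, hypothetical smoothing choice. It returns `None` for a pure decay curve, which flags the histogram as having no usable valley rather than solving the estimation problem.

```python
def smooth(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def find_valley_and_peak(counts, window=3):
    """Return (valley_multiplicity, peak_multiplicity), or None.

    The valley is the last point before the smoothed curve first rises;
    the peak is the highest smoothed point from the valley onward.
    None means the smoothed curve never rises (decay-only shape).
    """
    s = smooth(counts, window)
    valley = None
    for i in range(1, len(s)):
        if s[i] > s[i - 1]:
            valley = i - 1
            break
    if valley is None:
        return None
    peak = max(range(valley, len(s)), key=lambda i: s[i])
    # Convert 0-based indices back to multiplicities (which start at 1).
    return valley + 1, peak + 1


# Made-up toy histograms, for illustration only.
humped = [1000, 200, 50, 80, 120, 90, 30]
decaying = [1000, 500, 250, 100]

print(find_valley_and_peak(humped, window=1))
print(find_valley_and_peak(decaying))
```

Smoothing reduces false valleys caused by noise in the tail, at the cost of shifting the detected valley slightly. For decay-only histograms a hard cutoff has nothing to anchor on, so some kind of model fit to the whole curve would presumably be needed instead.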