Question: Where does KmerGenie get its y-axis numbers from? Is it supposed to be the local maximum of each plot?
Tom20 (United States) wrote 22 months ago:

This may sound rudimentary, but where does KmerGenie get its y-axis values (number of genomic k-mers) when it draws the final plot in the HTML output?

I assumed it was taking the local maximum of each k's histogram curve and plotting that number in the k-mer plot. However, my numbers aren't lining up with the histograms' numbers, and I have no idea what's going on. Where should I be looking to calculate the y-value for each k in the final graph?

myposts kmer kmergenie • 549 views
written 22 months ago by Tom20

If you mean the first graph, then no, it's not a local maximum. I believe it's the area under the graph, but there may be a cutoff, like x > 1.

written 22 months ago by apelin20460

Can you make an educated guess as to where it's integrating from and to?

written 22 months ago by Tom20

Send an email to the author; he usually replies pretty quickly.

written 22 months ago by apelin20460

For some reason I don't get biostars email notifications anymore.

Adrian (I'm assuming it's him) is right: it is indeed an area under the histogram curve. The count at each abundance is weighted by the probability that a k-mer at that abundance is genomic rather than erroneous.

E.g. if you have 10 k-mers of abundance 1, but the model thinks that the density of erroneous k-mers at abundance 1 is 0.7, KmerGenie will predict (1-0.7)*10 = 3 genomic k-mers for this abundance. KmerGenie then sums the predicted number of genomic k-mers over all abundances.

This is for the haploid model; for the diploid model, it's slightly more advanced, to take into account heterozygous kmers (and divide their contribution by two).
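The haploid calculation described above can be sketched in a few lines of Python. This is only an illustration of the weighted sum, not KmerGenie's actual code: the function name, the dictionary layout, and the error probabilities below are made up for the example, whereas KmerGenie fits those probabilities from its own model.

```python
def estimate_genomic_kmers(histogram, p_erroneous):
    """Sum the k-mer counts over all abundances, each weighted by the
    probability that a k-mer at that abundance is genomic (not an error).

    histogram:   abundance -> number of distinct k-mers seen that many times
    p_erroneous: abundance -> probability that a k-mer at that abundance is
                 erroneous (hypothetical values here, not fitted ones)
    """
    total = 0.0
    for abundance, count in histogram.items():
        # Abundances missing from p_erroneous are treated as error-free.
        p_err = p_erroneous.get(abundance, 0.0)
        total += (1.0 - p_err) * count
    return total


# Toy data: the abundance-1 bin reproduces the (1 - 0.7) * 10 = 3 example.
histogram = {1: 10, 2: 4, 30: 100}
p_erroneous = {1: 0.7, 2: 0.5, 30: 0.0}
estimate = estimate_genomic_kmers(histogram, p_erroneous)
print(round(estimate, 2))
```

With these toy numbers the three bins contribute roughly 3, 2 and 100 genomic k-mers, so the estimate is about 105.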

modified 22 months ago • written 22 months ago by Rayan Chikhi1.2k

Thank you for your input.


So, consider the camel-hump graph that's generated for, say, k = 35. All data before the first "dip" should be ignored. Then I should take the area under the curve from that first local minimum all the way to the right, to the tail end of the graph. That should, in theory, yield the number of bases equal to my genome size. Am I doing this right?

written 22 months ago by Tom20

I would say the "eyeball" estimate runs from the first minimum to the end of the last peak, but I would trust KmerGenie's model more, as it's more robust than that.

written 22 months ago by apelin20460

Right, you can eyeball the genome size this way. 
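For what it's worth, the eyeball approach could look like the Python sketch below. It assumes a histogram that is already smooth enough for the first dip to be a simple local minimum, and it sums all the way to the tail as Tom describes; the function names and the toy histogram are hypothetical, and this is not what KmerGenie itself does.

```python
def first_local_minimum(counts):
    """Index of the first point that is lower than its left neighbour and
    not higher than its right neighbour, or None if there is no such dip."""
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]:
            return i
    return None


def eyeball_genomic_kmers(abundances, counts):
    """Return (cutoff abundance, number of k-mers at or above the cutoff)."""
    i = first_local_minimum(counts)
    if i is None:
        raise ValueError("no dip found in the histogram")
    # Everything left of the dip is treated as sequencing errors and dropped.
    return abundances[i], sum(counts[i:])


# Toy histogram: an error spike at low abundance, then one genomic peak.
abundances = [1, 2, 3, 4, 5, 6, 7, 8]
counts = [1000, 200, 50, 80, 300, 500, 300, 60]
cutoff, total = eyeball_genomic_kmers(abundances, counts)
print(cutoff, total)
```

On a real, noisy histogram the first local minimum can be a spurious wiggle, so some smoothing would probably be needed before trusting the cutoff.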

Actually, this part of KmerGenie is written in R, so you could have a look yourself if you're familiar with R. It's in scripts/est-genomic-kmers.r. Line 29 holds the abundance histogram, line 31 holds the probability that a k-mer is correct at a given abundance, and line 38 computes the number of genomic k-mers (line numbers might change in future releases).

written 22 months ago by Rayan Chikhi1.2k

Yes, Adrian is I. Thanks for the answer! Always nice to learn more.

written 22 months ago by apelin20460