Entering edit mode

8.5 years ago

Tom
▴
20

This may sound rudimentary, but where is KmerGenie getting it's y-axis (number of genomic k-mers) numbers from when it goes to plot the final plot in the HTML output?

I was under the assumption that it was taking the local maximum of each k histogram's curve, and taking that number and putting it into the Kmerplot. However, my numbers aren't lining up with the histogram's numbers; I have no idea what's going on. What happens when my where should I be looking to calculate the y-value for each kmer in the final graph?

If you mean the first graph, then no. It's not a local maxima. I believe it's the area under the graph, but there my be a cutoff, like x > 1.

Can you make an educated guess as to where its integrating from and to?

Send an email to the author, he usually replies pretty quickly.

For some reason I don't get biostars email notifications anymore.

Adrian (-- I'm assuming it's him) is right, it's indeed an area under the histogram curve. It is weighted by the probability that, for a given abundance, a kmer is erroneous or not.

E.g. if you have 10 kmers of abundance 1, but the model thinks that the density of erroneous kmers at abudance 1 is 0.7, kmergenie will predict (1-0.7)*10 = 3 genomic kmers for this abundance. Then kmergenie sums the predicted number of genomic kmers over all abundances.

This is for the haploid model; for the diploid model, it's slightly more advanced, to take into account heterozygous kmers (and divide their contribution by two).

Thank you for your input.

So considering that camel hump graph that's generated for say, k = 35. All data after the first "dip" should be ignored. Then, I should take the area under the curve from that first local minimum, all the way to the right to the tail end of the graph. That should in theory yield the number of bases equal to my genome size. Am I doing this right?

I would say the "eyeball" estimation would be from the first minima to the end of the peak of the last maxima, but I would trust the model of kmergenie more, as it's more roboust than that.

Right, you can eyeball the genome size this way.

Actually, this part of Kmergenie is in the R language, so you could have a look yourself if you're familiar with R. It's in scripts/est-genomic-kmers.r. Line 29 holds the abundance histogram, line 31 holds probability that a kmer is correct at this abundance, and line 38 computes the number of genomic kmers (line numbers might change in future releases).

Yes, Adrian is I. Thanks for the answer! always nice to learn more.