I'm working in a non-model organism for which we do have a reference genome. I was recently learning about the k-mer sprectrum and learned that it could be used to estimate the length of a genome following a pretty straightforward tutorial.
This method however requires to set a length k for the k-mers, and according to some authors, a proper measure should itself depend on the genome length (Chor et la 2009), which is reasonable.
The conflict comes with the different outputs. With k=10 the spectrum looks great, as a typical unimodal distribution. The peak of the distribution is also close to the real coverage, so again this is great news. But the genome length estimate is about 3-4Mb!! if I use k=14 the opposite happens: the distribution is all skewed to the left, but the estimate is roughly 128Mb, which is not weird given the organism.
How would you double check which of these outcomes is correct? Another question is, if I wouldn't have this previous idea on the genome length (~120Mb) how would one deal with this problem?
It is not simpler to get a VCF file with base pair resolution and count the number of total positions? isn't this adequate? how do you measure the genome length in practice?