So I am trying to estimate the size of a genome and I keep getting confusing, inconsistent results. So far I have:
- assembly size. All assemblies tend towards a cumulative size of 600-700 Mbp, depending on the assembler and which sequence data is used, with the higher numbers coming from the assemblies considered better quality. Mapping back reads, DNA, fosmids, etc. shows that the assembly is essentially complete, the N50 is good, and there are no obvious problems.
- k-mer estimates on the inbred strain used in the assembly. Using the jellyfish recipe here: https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/, I get an estimate of 1 Gbp (see the first command sketch after this list).
- Edit: k-mer estimate on pooled wild strains. With a k-mer coverage of 55x, the estimated size is slightly above 1 Gbp, consistent with the inbred strain.
- mapping estimates. Calculating per-base coverage with samtools depth, integrating (summing coverage times count) and dividing by the modal coverage gives between 800 and 900 Mbp, with the larger libraries tending towards the lower end (see the second sketch after this list).
- lab methods (not sure which¹, but not based on sequencing) consistently report a genome size of 1.6 Gbp.
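For reference, the k-mer step looks roughly like this (a minimal sketch following the tutorial linked above; the file names, k=21, the hash size, and the low-coverage cutoff of 5 are illustrative placeholders, not my exact settings):

```bash
# Count canonical 21-mers across all read files (adjust -s/-t to the data)
jellyfish count -C -m 21 -s 5G -t 8 -o reads.jf reads_*.fastq

# Build the k-mer coverage histogram (two columns: coverage, count)
jellyfish histo -o reads.histo reads.jf

# Estimate genome size as total k-mers / peak k-mer coverage,
# skipping the low-coverage error tail (coverage < 5 here, as an example)
awk '$1 >= 5 {total += $1 * $2; if ($2 > best) {best = $2; peak = $1}}
     END {printf "peak=%d  estimated size=%.0f bp\n", peak, total / peak}' reads.histo
```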
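And the mapping-based estimate, for concreteness (again a sketch, not my exact commands; aln.bam is a placeholder, and the modal coverage is taken naively as the most frequent non-zero depth):

```bash
# Per-base depth over the assembly; -a includes zero-coverage positions
samtools depth -a aln.bam \
  | awk '{hist[$3]++}
     END {
       for (c in hist) {
         total += c * hist[c]                      # integrate coverage * count
         if (c + 0 > 0 && hist[c] > best) {best = hist[c]; mode = c}
       }
       printf "modal coverage=%d  estimated size=%.0f bp\n", mode, total / mode
     }'
```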
Questions: What am I doing wrong here? Is this kind of discrepancy to be expected (i.e., are all these methods close to worthless)? Are there other methods I can (easily) use to get more estimates?
¹ Edit: I checked. We have used staining densitometry and flow cytometry, both of which apparently give a size of 1.5-1.6 Gbp, using human and chicken as controls.