I assembled PacBio CCS reads with hifiasm/v12. The assembly is 1.9 Gb, whereas I expected a genome of 1.3 Gb, which suggests a high heterozygosity rate (this is a plant genome, so that is plausible).
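As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes (my assumption, not something I have verified) that 1.3 Gb is the haploid size and that all of the excess comes from regions assembled twice as separate haplotypes. The function name is just for illustration.

```python
def duplicated_fraction(assembly_size_gb, haploid_size_gb):
    """Fraction of the haploid genome that would have to be present twice
    in the assembly to explain its excess size.

    Assumes the excess is made up only of uncollapsed haplotypes.
    """
    return assembly_size_gb / haploid_size_gb - 1.0


# Sizes taken from the post: 1.9 Gb assembled, 1.3 Gb expected.
excess = duplicated_fraction(1.9, 1.3)
print(f"Fraction of the genome represented twice: {excess:.0%}")
```

Under that assumption, a bit less than half of the genome would be present as two copies in the raw assembly.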

To check this, I ran GenomeScope and got this k-mer distribution:


We can see two peaks, which correspond to a diploid genome, and a heterozygosity rate of 3.13%, which is pretty high.

Then I used the purge_dups tool to remove the heterozygous contigs from my assembly. I got a purged assembly of 1.2 Gb, which is close to the expected size. I also checked the k-mer distribution:


We can see a very high peak, but also a small hump at 35X coverage. Does this hump correspond to the diploid peak, meaning purge_dups didn't work well on my assembly? Or is it an artefact, and the single high peak means my assembly is now haploid, purged of the heterozygous sequences?
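One way I could compare the hump with the main peak more objectively is to locate the peaks in the histogram and look at the ratio of their coverages. The following is only a sketch: it assumes a whitespace-separated two-column histogram file (coverage, number of k-mers), and the function names, the `window` and `min_cov` parameters, and their default values are all my own illustrative choices. A ratio close to 2 would be consistent with a heterozygous/homozygous pair of peaks, but it would not prove it.

```python
def read_histogram(path):
    """Parse a whitespace-separated two-column histogram file
    into a list of (coverage, count) tuples."""
    hist = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                hist.append((int(fields[0]), int(fields[1])))
    return hist


def local_peaks(hist, window=3, min_cov=5):
    """Return the (coverage, count) pairs that are strict local maxima.

    hist: list of (coverage, count) tuples sorted by coverage.
    window: number of neighbours on each side that a peak must exceed.
    min_cov: coverages below this are skipped (likely error k-mers).
    """
    peaks = []
    for i, (cov, count) in enumerate(hist):
        if cov < min_cov:
            continue
        lo = max(0, i - window)
        hi = min(len(hist), i + window + 1)
        neighbours = [hist[j][1] for j in range(lo, hi) if j != i]
        if neighbours and all(count > c for c in neighbours):
            peaks.append((cov, count))
    return peaks


def peak_ratio(peaks):
    """Coverage ratio (higher / lower) between the two tallest peaks,
    or None if fewer than two peaks were found."""
    if len(peaks) < 2:
        return None
    top_two = sorted(peaks, key=lambda p: p[1], reverse=True)[:2]
    low_cov, high_cov = sorted(cov for cov, _ in top_two)
    return high_cov / low_cov
```

For example, `peak_ratio(local_peaks(read_histogram("kmers.histo")))` would give the ratio for a file called `kmers.histo` (a placeholder name). Real histograms are noisy, so `window` may need to be increased to avoid picking up spurious maxima.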


