We've just completed the de novo assembly of an insect species. We utilized PacBio HiFi reads with approximately 50X coverage, estimated based on the genome size of a closely related species. The k-mer genome size estimation using Illumina reads indicated about 250 Mb.
The sample was pooled from a population of about 50 individuals, so it is expected to be highly heterozygous.
1) First, we assembled with Hifiasm, followed by 2 rounds of purging with purge_dups. The main genome assembly statistics are: a) contigs: 362 b) Total Length: 357 Mb c) N50: 8.3 Mb d) L50: 12 e) N90: 548 kb f) BUSCO metazoa: C:95.6%[S:94.8%,D:0.8%],F:0.6%,M:3.8%.
2) Then we used NextDenovo without purging. The statistics we got: a) contigs: 426 b) Total Length: 224 Mb c) N50: 19.1 Mb d) L50: 5 e) N90: 125 kb f) BUSCO metazoa: C:97.5%[S:96.8%,D:0.7%],F:0.7%,M:1.8%.
So, apart from its numerous small contigs, the NextDenovo assembly looks better. Nonetheless, the metrics of the Hifiasm assembly are not too bad, and I'm confused by the roughly 130 Mb difference in total assembly size (357 Mb vs 224 Mb).
While I lean towards choosing the NextDenovo assembly for subsequent analyses such as annotation and FISH mapping, I can't dismiss the significant difference in size between two relatively good assemblies.
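To put the two totals next to the k-mer estimate, here is a minimal sketch of the arithmetic. It uses only the figures quoted above; the function name `size_vs_estimate` is just an illustrative helper, not part of any tool.

```python
# Figures quoted in this post, in Mb. The k-mer value is the Illumina-based estimate.
KMER_ESTIMATE_MB = 250
ASSEMBLIES_MB = {"Hifiasm + purge_dups": 357, "NextDenovo": 224}


def size_vs_estimate(assembly_mb, estimate_mb):
    """Return (difference in Mb, difference as a percentage of the estimate)."""
    diff = assembly_mb - estimate_mb
    return diff, 100.0 * diff / estimate_mb


for name, size in ASSEMBLIES_MB.items():
    diff, pct = size_vs_estimate(size, KMER_ESTIMATE_MB)
    print(f"{name}: {size} Mb, {diff:+d} Mb ({pct:+.1f}%) vs k-mer estimate")
```

By this comparison the Hifiasm assembly is about 107 Mb (roughly 43%) above the k-mer estimate, while the NextDenovo assembly is about 26 Mb (roughly 10%) below it.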
Are there any short read datasets available in SRA? Can you try and see how many leftover reads remain after you align them to both of your assemblies? What kind of coverage do you get?
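If it helps, below is a rough sketch of how the leftover (unmapped) reads could be counted from plain SAM text, once per assembly. It assumes you already have uncompressed SAM output from whichever short-read aligner you use; the function name `mapping_summary` and the toy records are made up for illustration. It relies only on the SAM FLAG bits: 0x4 (read unmapped), 0x100 (secondary) and 0x800 (supplementary).

```python
def mapping_summary(sam_lines):
    """Count mapped and unmapped primary records in an iterable of SAM lines."""
    mapped = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        # Ignore secondary (0x100) and supplementary (0x800) records
        # so that each read (or mate) is counted once.
        if flag & 0x900:
            continue
        if flag & 0x4:  # 0x4 = read unmapped
            unmapped += 1
        else:
            mapped += 1
    total = mapped + unmapped
    fraction = unmapped / total if total else 0.0
    return {"mapped": mapped, "unmapped": unmapped, "unmapped_fraction": fraction}


# Toy records (hypothetical read and contig names), not real data.
toy_sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t0\tcontig1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tcontig1\t500\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tcontig1\t90\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(mapping_summary(toy_sam))
```

For a real file you could pass an open file handle, e.g. `mapping_summary(open("reads_vs_assembly.sam"))` (the file name is a placeholder). Comparing the unmapped fraction between the two assemblies gives a first hint as to whether the smaller one is missing sequence.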