Question

Discrepancy in assembly sizes

5 months ago

hpapoli ▴ 170

Hello,

The haploid genome size estimated for my diploid plant is about 480 Mb. This is consistent with the estimate from k-mer analysis using GenomeScope2. However, my assembly, produced from PacBio HiFi reads, is about 780 Mb.

My first guess was that my assembly contains uncollapsed haplotypes. I looked at the coverage distribution of one of my BAM files (Illumina reads mapped to the genome assembly; see the attached image). I observe one peak around 1.5, which would be a coverage of about 30X, which is what I would expect for my sample. If I had uncollapsed haplotypes, I would expect to see a peak around 15X as well, but I don't. I wonder where this discrepancy could come from and what other tests I could do to check this.
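As a rough check of that expectation, the arithmetic can be sketched as follows. This assumes a deliberately simple model in which all of the excess assembly length comes from uncollapsed haplotypes; the function name and the model are illustrative assumptions, not something taken from an assembler or from GenomeScope2.

```python
def uncollapsed_expectation(haploid_mb, assembly_mb):
    """Split an assembly into full- and half-coverage length.

    Assumed model: every excess base is an uncollapsed haplotype, so a region
    of haploid length x appears twice in the assembly, each copy at roughly
    half the expected read depth.
    """
    excess = assembly_mb - haploid_mb    # haploid length that is assembled twice
    half_cov = 2 * excess                # both copies sit at about half coverage
    full_cov = assembly_mb - half_cov    # collapsed remainder at full coverage
    return {
        "excess_mb": excess,
        "half_coverage_mb": half_cov,
        "full_coverage_mb": full_cov,
    }


# Sizes quoted in the post: 480 Mb haploid estimate, 780 Mb assembly.
expected = uncollapsed_expectation(haploid_mb=480, assembly_mb=780)
print(expected)
```

Under this model, about 600 Mb of the assembly would sit near 15X and only about 180 Mb near 30X, so a missing 15X peak argues against uncollapsed haplotypes as the main explanation.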

Just some numbers: the total length of scaffolds with coverage between 10X and 20X is only about 23 Mb, and the total length of scaffolds with no reads mapped to them is only about 6 Mb. The genome is repeat-rich: using RepeatModeler and RepeatMasker, about 50% of the genome was masked as repeats.
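Totals like these can be reproduced by summing scaffold lengths per coverage bin. The sketch below assumes a per-scaffold table of length and mean depth is already available (for instance from a tool such as `samtools coverage`); the function name, the bin boundaries, and the toy scaffolds are made up for illustration and are not data from this assembly.

```python
def length_by_coverage_bin(scaffolds, bins):
    """Sum scaffold lengths (bp) per coverage bin.

    scaffolds: iterable of (name, length_bp, mean_depth) tuples
    bins: dict mapping a label to a half-open (low, high) depth interval
    """
    totals = {label: 0 for label in bins}
    for _name, length_bp, mean_depth in scaffolds:
        for label, (low, high) in bins.items():
            if low <= mean_depth < high:
                totals[label] += length_bp
                break  # each scaffold is counted in exactly one bin
    return totals


# Toy input only, not real values from the post.
toy_scaffolds = [
    ("scaf1", 5_000_000, 31.2),
    ("scaf2", 2_000_000, 14.8),
    ("scaf3", 1_000_000, 0.0),
    ("scaf4", 3_000_000, 29.5),
]
coverage_bins = {
    "0-10X": (0, 10),
    "10-20X": (10, 20),
    "20-40X": (20, 40),
    ">=40X": (40, float("inf")),
}

totals = length_by_coverage_bin(toy_scaffolds, coverage_bins)
for label, total_bp in totals.items():
    print(f"{label}\t{total_bp / 1e6:.1f} Mb")
```

Binning by scaffold mean depth hides within-scaffold variation, so a windowed depth profile may be a useful complement for long scaffolds.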

Thank you!

Pacbio assembly Kmer • 735 views

updated 5 months ago by shelkmike ★ 1.8k • written 5 months ago by hpapoli ▴ 170

score 2 · Answer 1 · 2025-05-28

This looks like contamination or symbionts. When I assemble plant genomes, I often see low-coverage contigs from fungi, bacteria, insects, or other organisms. However, in most cases, the total length of low-coverage contigs is smaller than what you have.
My current preferred method for detecting and removing contaminant and symbiont sequences is to align all contigs to NCBI nt with Megablast and analyze the top 5 matches; see "How can I remove contaminants from an assembled genome?"
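One way the "top 5 matches" step might be summarized is sketched below. This is not the procedure from the linked post: it assumes the Megablast hits were already exported as tab-separated text with a hypothetical four-column layout (contig, subject, bit score, taxonomic label), and it simply takes a majority vote among the best-scoring hits of each contig.

```python
from collections import Counter, defaultdict


def top_hit_labels(lines, top_n=5):
    """Return the majority label among the top_n best-scoring hits per contig.

    Assumed (hypothetical) tab-separated columns:
    contig, subject, bitscore, label
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        contig, _subject, bitscore, label = line.split("\t")[:4]
        hits[contig].append((float(bitscore), label))

    calls = {}
    for contig, contig_hits in hits.items():
        # Keep the top_n hits by bit score, then take the most frequent label.
        best = sorted(contig_hits, key=lambda h: h[0], reverse=True)[:top_n]
        counts = Counter(label for _score, label in best)
        calls[contig] = counts.most_common(1)[0][0]
    return calls


# Toy hits for illustration only.
toy_hits = [
    "ctg1\tsubjA\t900\tViridiplantae",
    "ctg1\tsubjB\t850\tViridiplantae",
    "ctg2\tsubjC\t400\tFungi",
    "ctg2\tsubjD\t390\tFungi",
    "ctg2\tsubjE\t100\tViridiplantae",
]
calls = top_hit_labels(toy_hits)
print(calls)
```

Contigs whose majority label is not a plant would be candidates for manual inspection rather than automatic removal, since database hits can be misleading for short or repetitive matches.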