k-mer distribution to estimate the heterozygosity of my assembly
17 months ago
pablo ▴ 210

Hello,

I have PacBio CCS reads that I assembled with hifiasm/v12. The assembly is 1.9Gb, whereas I expected a genome of about 1.3Gb, which suggests a high heterozygosity rate (this is a plant genome, so that is plausible).

To check that, I used GenomeScope and got this k-mer distribution:

[Figure: k-mer distribution before purging]

We can see two peaks, which corresponds to a diploid genome/assembly, and a heterozygosity rate of 3.13%, which is pretty high.
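
For anyone unfamiliar with this kind of plot: a k-mer distribution is a histogram of how many distinct k-mers occur at each multiplicity in the reads. Below is a minimal Python sketch on toy strings. The function names, the toy reads and the tiny k are illustrative assumptions; real data would normally go through a dedicated k-mer counter before GenomeScope.

```python
from collections import Counter

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = kmer.translate(comp)[::-1]
    return min(kmer, rc)

def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())

# Toy input (made up for illustration, not real sequencing data).
toy_reads = ["ACGTAC", "ACGTTC"]
spectrum = kmer_spectrum(toy_reads, k=3)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a diploid read set, k-mers unique to one haplotype pile up at roughly half the coverage of k-mers shared by both haplotypes, which is what produces the two peaks.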

Then I used purge_dups to remove the heterozygous contigs from my assembly. I got a purged assembly of 1.2Gb, which is close to the expected size. I also checked the k-mer distribution:

[Figure: out-fn]

We can see a very high peak but also a small hump at 35X coverage. Does this hump correspond to the diploid peak, which would mean purge_dups didn't work well on my assembly? Or is it an artefact, and the single high peak means my assembly is now haploid, purged of the heterozygous sequences?
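
One way to reason about a hump like this is to locate the local maxima of the histogram and compare their coverages: peaks at about half or about double the main peak's coverage are the usual candidates for heterozygous or duplicated content. A rough Python sketch follows; the histogram values are invented for illustration and are not the values from the plot above, and `find_peaks` with its `min_cov` cutoff is a hypothetical helper, not part of GenomeScope or purge_dups.

```python
def find_peaks(hist, min_cov=5):
    """Return coverages that are strict local maxima of the histogram.

    hist maps coverage -> number of distinct k-mers; bins below min_cov
    (the sequencing-error tail) are ignored.
    """
    covs = sorted(c for c in hist if c >= min_cov)
    peaks = []
    for prev, cur, nxt in zip(covs, covs[1:], covs[2:]):
        if hist[cur] > hist[prev] and hist[cur] > hist[nxt]:
            peaks.append(cur)
    return peaks

# Invented toy histogram: a tall peak and a small hump at twice its coverage.
toy_hist = {5: 900, 10: 300, 15: 800, 20: 2000, 25: 700,
            30: 200, 35: 250, 40: 400, 45: 220, 50: 100}

peaks = find_peaks(toy_hist)
main = max(peaks, key=toy_hist.get)               # tallest peak
ratios = {p: p / main for p in peaks if p != main}  # other peaks relative to it
print(peaks, main, ratios)
```

On real data the histogram is noisy, so some smoothing before peak calling would probably be needed.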

Best

kmer Assembly pacbio purge_dups • 1.1k views
12 months ago
kamiljaron ▴ 200

This is actually a really hard question: your expected haploid genome size does not correspond well to the GenomeScope model.

It might be that your genome has recent TE expansions or other features that caused GenomeScope to underestimate the genome size; I have seen this when I tried to estimate the genome size of the marbled crayfish. If this is the case and your haploid genome size is indeed ~1.2G, it seems purge_dups mostly did work, and the k-mers in the ~4n peak might be real genomic duplications.
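
To illustrate how repeats can pull a k-mer-based size estimate down, here is a back-of-the-envelope Python sketch. It is not the GenomeScope model (which fits a mixture of distributions); it only uses the rough rule that haploid genome size is about the total number of k-mer occurrences divided by the homozygous peak coverage. The function name, the `max_cov` cap and the toy histogram are all assumptions made for illustration.

```python
def estimate_haploid_size(hist, homozygous_peak_cov, error_cutoff, max_cov=None):
    """Rough haploid size: total k-mer occurrences / homozygous peak coverage.

    hist maps coverage -> number of distinct k-mers. Bins below error_cutoff
    are treated as sequencing errors; bins above max_cov (if given) are
    dropped, mimicking a truncated histogram.
    """
    total = 0
    for cov, n_distinct in hist.items():
        if cov < error_cutoff:
            continue
        if max_cov is not None and cov > max_cov:
            continue
        total += cov * n_distinct
    return total / homozygous_peak_cov

# Invented toy histogram: errors at 1x, heterozygous k-mers at 10x,
# homozygous at 20x, two-copy at 40x, and a few high-copy repeat k-mers at 2000x.
toy_hist = {1: 1000, 10: 200, 20: 800, 40: 50, 2000: 10}

full = estimate_haploid_size(toy_hist, homozygous_peak_cov=20, error_cutoff=5)
capped = estimate_haploid_size(toy_hist, homozygous_peak_cov=20, error_cutoff=5,
                               max_cov=1000)
print(full, capped)
```

In this toy case, dropping the high-copy bins halves the estimate, which is one way a repeat-rich genome could end up with an estimate well below its real size.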

If the real haploid genome size is ~750Mbp, you have about half a gigabase of sequence that is probably wrong. However, my best guess is that the genome size estimate is the problem. Also, for comparing k-mer spectra and genome assemblies, I recommend KAT: you can directly visualize the k-mer spectrum of the reads, split by how often each k-mer occurs in the genome assembly.
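
The idea behind that kind of comparison can be sketched in a few lines of Python. This is not KAT's implementation, only a toy version of the same table: for each read multiplicity, how many distinct k-mers are absent from the assembly, present once, present twice, and so on. The function names and sequences are made up, and reverse complements are ignored for brevity.

```python
from collections import Counter, defaultdict

def count_kmers(seqs, k):
    """Count every k-mer occurrence across a list of sequences (forward strand only)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def spectra_cn(reads, assembly, k):
    """table[read_multiplicity][assembly_copy_number] = number of distinct k-mers."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    table = defaultdict(Counter)
    for kmer, mult in read_counts.items():
        table[mult][asm_counts.get(kmer, 0)] += 1
    return table

# Toy data (invented): three short reads and a one-contig assembly.
toy_reads = ["AAAC", "AAAC", "AAAG"]
toy_assembly = ["AAAC"]

table = spectra_cn(toy_reads, toy_assembly, k=3)
for mult in sorted(table):
    print(mult, dict(table[mult]))
```

On real data, k-mers around the heterozygous coverage that are missing from a purged assembly are expected (one haplotype was removed), whereas k-mers around the homozygous coverage that appear twice in the assembly would point to duplications that were retained.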
