K-mer spectra of genomic reads are usually modeled as a poisson mixture process where each of the components represents the variety of ploidies of a genome. So a k-mer spectra that looks like the following can be interpreted as showing a poisson component representing heterozygous regions (1N) and a second poisson component representing homozygous regions.
I've been looking at the "read coverage" spectra of assembled contigs recently. This is done by:
- aligning my reads (treated everything as single end) to the assembled contigs (assembled with SGA, not a de bruijn based assembler) with bwa mem
- Read coverage of each contig was determined by sum length of reads aligning / length of contig
- Only contigs larger than 500 bp was used for fitting mixture distributions. But all contigs were used in mapping.
I noticed that a mixture poisson doesn't fit all that well to the coverage data. The distribution of the components have a sloping shoulder to the left of the peaks.
There are obvious big differences between k-mer and read-coverage spectras. I am thinking the reason why poisson doesn't fit well to this is because assembled contigs have potential overlapping ends that were not assembled together due to ambiguity in the OLC graph. These overlapping ends result in multi-mapping reads getting discarded leading to lower read coverages.
Has anyone encountered this before and what other reasons do you think can be causing this? Is this a technical issue in the way I mapped/assembled the reads? Or the way I calculated coverage?