In this post ZK noted that Poisson is a better approximation at low coverage than at high, something I have also observed. Does anyone have an explanation (or hypothesis) for this? If stochasticity in PCR amplification is a (the?) major source of overdispersion, wouldn't one expect those effects to average out at high coverage?
The negative binomial essentially models the case where you merge lots of different Poisson distributions, all with slightly different means (formally, a Poisson whose mean is itself gamma-distributed). So my hypothesis is that it's not stochasticity, but rather subtle systematic biases that affect the likelihood of amplifying or sequencing specific regions.
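To make the "merged Poissons" idea concrete, here is a minimal sketch. It assumes each region is Poisson with its own mean and uses the law of total variance (variance of the pooled counts = average mean + variance of the means). The function name and the two-level bias profile are made up for illustration, not taken from any real data set.

```python
from statistics import fmean, pvariance

def mixture_fano(mean_coverage, region_multipliers):
    """Variance-to-mean ratio of counts pooled across regions.

    Each region is assumed Poisson with mean = mean_coverage * multiplier.
    Law of total variance: Var(count) = E[lambda] + Var[lambda].
    A pure Poisson gives a ratio of exactly 1.
    """
    lambdas = [mean_coverage * m for m in region_multipliers]
    mu = fmean(lambdas)
    return (mu + pvariance(lambdas)) / mu

# Hypothetical bias profile: half the regions 5% low, half 5% high.
multipliers = [0.95, 1.05]

for depth in (5, 50, 1000):
    print(depth, mixture_fano(depth, multipliers))
```

With this toy profile the ratio is about 1.01 at 5x (indistinguishable from Poisson) and 3.5 at 1000x, because the spread of the means grows with the square of the depth while the Poisson variance grows only linearly.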
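Say there's a 5% smaller chance of amplifying a given sequence. With coverage of 5x, the counts are going to be very similar to what you'd expect given no bias at all, because the sampling noise (roughly the square root of the mean) dwarfs a 5% shift. If you have coverage of 1000x, though, the relative noise shrinks to a few percent and that bias becomes much more apparent.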
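The arithmetic behind that comparison can be sketched as follows. This assumes plain Poisson noise at the unbiased depth; the function name and the 5% default are just the numbers from the example above.

```python
import math

def bias_in_sigmas(depth, bias=0.05):
    """Expected read deficit from a fractional bias, in Poisson standard deviations."""
    shortfall = bias * depth        # reads lost to the bias, on average
    poisson_sd = math.sqrt(depth)   # sd of an unbiased Poisson count at this depth
    return shortfall / poisson_sd

for depth in (5, 1000):
    print(
        f"{depth}x: deficit {0.05 * depth:.2f} reads, "
        f"Poisson sd {math.sqrt(depth):.2f}, "
        f"ratio {bias_in_sigmas(depth):.2f}"
    )
```

At 5x the expected deficit is a quarter of a read against a standard deviation of more than two reads. At 1000x it is 50 reads against a standard deviation of about 32, so the same bias is already larger than the noise at a single site.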
I think that the rate of return on coverage is not linear. My hypothesis is that it behaves like a decay function: each additional unit of depth covers fewer new sites. Perhaps ascertainment bias drives this trend. I could imagine there is a fraction of sites across the genome that won't be covered no matter how deep you sequence.
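One way to picture that hypothesis is the sketch below. It assumes reads land uniformly at random on the reachable sites (so a reachable site is missed with probability exp(-depth)) and that a fixed fraction of sites is never reachable. The 2% figure and the function name are invented for illustration.

```python
import math

def fraction_covered(depth, unreachable=0.0):
    """Expected fraction of all sites with at least one read.

    depth       : mean depth over the reachable sites
    unreachable : hypothetical fraction of sites that can never be covered
    """
    return (1.0 - unreachable) * (1.0 - math.exp(-depth))

for depth in (1, 2, 5, 10, 30):
    print(depth, round(fraction_covered(depth, unreachable=0.02), 4))
```

Under these assumptions the gain from each extra unit of depth shrinks exponentially, and the curve flattens out at 1 minus the unreachable fraction rather than at 100%.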
Library prep is another potential source of this trend. Some sections of the genome will, by chance, not make it into the library that gets sequenced.
The crazy idea: a given individual may not contain some sections of the reference genome. The human genome reference sequence is changing and will continue to change. By aligning each human genome to the reference, we are assuming that each individual contains the same sequence.
INDELs are a perfect example. Deletions relative to the reference will always have 0 coverage (at least when homozygous). Also, insertions will have coverage where the reference has no corresponding sequence.
CNVs will affect the balance of coverage and depth when aligned to the reference.