Question

Why Is Overdispersion Worse At Higher Coverage?

1

12.1 years ago

Stan Letovsky ▴ 140

In this post ZK noted that Poisson is a better approximation at low coverage than at high, something I have also observed. Does anyone have an explanation (or hypothesis) for this? If stochasticity in PCR amplification is a (the?) major source of overdispersion, wouldn't one expect those effects to average out at high coverage?

coverage • 2.2k views

Updated 12.1 years ago by Chris Miller 22k • written 12.1 years ago by Stan Letovsky ▴ 140

score 2 · Answer 1 · 2012-02-29

The negative binomial essentially models the case where you merge lots of different Poisson distributions, all with slightly different means (formally, a Poisson whose rate is itself gamma-distributed). So my hypothesis is that it's not stochasticity, but rather subtle biases that affect the likelihood of amplifying or sequencing specific regions.

Say there's a 5% smaller chance of amplifying a given sequence. With coverage of 5x, the counts are going to be very similar to what you'd expect given no bias at all. If you have coverage of 1000x, though, that bias becomes much more apparent.
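A rough way to put numbers on this, as a sketch rather than a definitive model: assume each region's expected depth varies around the genome-wide mean with a relative standard deviation of 5% (my reading of the "5%" above, which is an assumption), and that counts are Poisson given that expected depth. The law of total variance then gives a variance-to-mean ratio of 1 + cv² × mean, so the ratio grows linearly with coverage. The function names below are hypothetical.

```python
import math


def dispersion_index(mean_coverage, bias_cv):
    """Variance-to-mean ratio of counts when the Poisson mean itself varies.

    Assumes the per-region mean has relative standard deviation `bias_cv`.
    Law of total variance: Var = E[Var | mean] + Var(E | mean)
                               = mu + (bias_cv * mu)^2
    """
    variance = mean_coverage + (bias_cv * mean_coverage) ** 2
    return variance / mean_coverage


def bias_vs_noise(mean_coverage, bias):
    """Shift caused by a fixed fractional bias, in units of the Poisson sd."""
    return bias * mean_coverage / math.sqrt(mean_coverage)


for depth in (5, 50, 1000):
    print(depth, dispersion_index(depth, 0.05), bias_vs_noise(depth, 0.05))
```

Under these assumptions the ratio is about 1.01 at 5x (indistinguishable from Poisson) and 3.5 at 1000x, and a 5% bias is about 0.1 Poisson standard deviations at 5x but about 1.6 at 1000x.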

score 1 · Answer 2 · 2012-02-29

1

12.1 years ago

Zev.Kronenberg 12k

I think that the rate of return on coverage is not linear. My hypothesis is that it behaves like a decay function. Perhaps ascertainment bias drives this trend. I could imagine there is a fraction of sites across the genome that won't be covered no matter how deeply you sequence.
Library prep is another potential source of this trend. Some sections of the genome will randomly not be included for sequencing.
The crazy one: an individual may not contain that section of the reference genome. The human genome reference sequence is changing and will continue to change. By aligning each human genome to the reference, we are assuming that each individual contains the same sequence.
Indels are a perfect example. Deletions relative to the reference will always have 0 coverage, and insertions will have coverage where the reference has no sequence.
CNVs will affect the balance of coverage and depth when aligned to the reference.

0

All, thanks for the replies. I think we need to distinguish two kinds of overdispersion: (1) the distribution of coverage across a genome is wider than Poisson, probably due to sequence-content-dependent selection and amplification biases, and (2) sequencing replicates show more-than-Poisson variance in coverage at identical positions. The former sets the mean for the latter, but the latter determines sensitivity for t-like tests of differential coverage in CNV detection, RNA-seq differential expression, etc. Overdispersion of the second kind must be a process effect.
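To make the distinction concrete, here is a small simulation sketch. It assumes, purely for illustration, that each position has a fixed expected depth drawn once from a gamma distribution (the sequence-dependent bias) and that every replicate then samples a Poisson count around that fixed depth with no extra process noise. All names and parameter values are hypothetical, and the Poisson sampler is a simple textbook method chosen to keep the example dependency-free.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) sample by Knuth's multiplication method.

    Only suitable for moderate lam (up to a few hundred), since it relies
    on exp(-lam) being representable and loops about lam times.
    """
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def vmr(counts):
    """Variance-to-mean ratio; close to 1 for Poisson counts."""
    return statistics.variance(counts) / statistics.mean(counts)


def simulate(n_positions=2000, n_replicates=2000,
             mean_depth=100.0, bias_cv=0.2, seed=1):
    rng = random.Random(seed)
    shape = 1.0 / bias_cv ** 2
    # Fixed per-position expected depth: mean mean_depth, relative sd bias_cv.
    position_means = [rng.gammavariate(shape, mean_depth / shape)
                      for _ in range(n_positions)]
    # Kind 1: one replicate, counts compared across positions.
    across_positions = [poisson(rng, m) for m in position_means]
    # Kind 2: one position, counts compared across replicates.
    across_replicates = [poisson(rng, position_means[0])
                         for _ in range(n_replicates)]
    return vmr(across_positions), vmr(across_replicates)


kind1, kind2 = simulate()
print("across positions:", kind1)
print("across replicates:", kind2)
```

In this toy model the across-position ratio comes out well above 1 while the across-replicate ratio stays near 1, which is the sense in which fixed sequence bias alone cannot produce the second kind: something has to vary between runs.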