Why Is Overdispersion Worse At Higher Coverage?
12.1 years ago
Stan Letovsky ▴ 140

In this post ZK noted that Poisson is a better approximation at low coverage than at high, something I have also observed. Does anyone have an explanation (or hypothesis) for this? If stochasticity in PCR amplification is a (the?) major source of overdispersion, wouldn't one expect those effects to average out at high coverage?

coverage • 2.2k views
12.1 years ago

The negative binomial essentially models the case where you merge many different Poisson distributions, all with slightly different means. So my hypothesis is that the cause is not stochasticity but rather subtle biases that affect the likelihood of amplifying or sequencing specific regions.
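To make the mixture argument concrete, here is a short derivation sketch. It assumes the per-region Poisson rate is gamma-distributed with mean \mu and coefficient of variation c; both symbols are introduced here for illustration and are not from the thread.

```latex
\begin{align}
\lambda &\sim \mathrm{Gamma}(\text{mean } \mu,\ \text{CV } c), \qquad X \mid \lambda \sim \mathrm{Poisson}(\lambda) \\
\mathrm{Var}(X) &= \mathbb{E}[\mathrm{Var}(X \mid \lambda)] + \mathrm{Var}(\mathbb{E}[X \mid \lambda]) = \mu + c^2 \mu^2 \\
\frac{\mathrm{Var}(X)}{\mathbb{E}[X]} &= 1 + c^2 \mu
\end{align}
```

Under this assumption the variance-to-mean ratio grows linearly with coverage. With an illustrative c = 0.1, the ratio is 1.05 at \mu = 5 (nearly Poisson) but 11 at \mu = 1000, which would be consistent with Poisson looking acceptable only at low coverage.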

Say there's a 5% smaller chance of amplifying a given sequence. With coverage of 5x, the counts are going to be very similar to what you'd expect given no bias at all. If you have coverage of 1000x, though, that bias becomes much more apparent.
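The arithmetic behind this can be sketched as follows: express the shift in expected count caused by the bias in units of the Poisson standard deviation. The function name `bias_z_score` and the extra 50x coverage value are illustrative choices, not something from the thread.

```python
import math

def bias_z_score(coverage, bias=0.05):
    """Shift in expected count due to a relative bias, in Poisson SD units.

    Assumes counts at unbiased sites are Poisson with mean `coverage`,
    so the standard deviation is sqrt(coverage).
    """
    shift = bias * coverage            # expected change in read count
    poisson_sd = math.sqrt(coverage)   # sampling noise at this depth
    return shift / poisson_sd

for cov in (5, 50, 1000):
    print(cov, round(bias_z_score(cov), 2))
```

The shift grows in proportion to coverage while the Poisson noise grows only as its square root, so the ratio scales as 0.05 × sqrt(coverage): about 0.11 at 5x, 0.35 at 50x, and 1.58 at 1000x.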

12.1 years ago
  • I think that the rate of return on coverage is not linear. My hypothesis is that it behaves like a decay function. Perhaps ascertainment bias drives this trend. I could imagine there is a fraction of sites across the genome that won't be covered no matter how deeply you sequence.

  • Library prep is another potential source for this trend. Some sections of the genome will randomly not be included for sequencing.

  • The crazy one: an individual might simply not contain a given section of the reference genome. The human genome reference sequence is changing and will continue to change. By aligning each human genome to the reference, we are assuming that each individual contains the same sequence.

  • INDELs are a perfect example. Deletions relative to the reference will always have 0 coverage, and insertions will have coverage where the reference has no sequence.

  • CNVs will affect the balance of coverage and depth when aligned to the reference.


All, thanks for the replies. I think we need to distinguish two kinds of overdispersion: (1) the distribution of coverage across a genome is wider than Poisson, probably due to sequence-content-dependent selection and amplification biases; and (2) sequencing replicates show more-than-Poisson variance in coverage at identical positions. The former sets the mean for the latter, but the latter determines sensitivity for t-like tests of differential coverage in CNV detection, RNA-seq differential expression, etc. Overdispersion of the second kind must be a process effect.
