Question: Peak Finding Of Chip-Seq (Or Clip-Seq) And Poisson Distribution, Am I Right?
6
gravatar for Puriney
6.2 years ago by
Puriney330
New York City
Puriney330 wrote:

Hi, Check my understanding towards how Poisson distribution is employed when finding peaks of CHIP-seq and CLIP-seq. It is well known that the number of times for a base is sequenced follows a Poisson distribution. Just like people going to supermarket using a particular entrance in a given period of time. Poisson distribution can plotted as following:

enter image description here

Here the average is 7 (lambda=7) plotted in red line. And green line denotes the edge of probabiliy is 0.975.

From the genome wide scale, coverage = Read Length (nt) * Total Reads Number * / *Genome Length (nt). Indicating the average number of reads that hit a base. So assign the coverage as the lambda (or mean) of Poisson distribution, let's say also 7 here. x-axis means the reads number for a base, and y-axis means the probability of a given reads number.

Thus, we can know when the probability is 0.975, the reads number is <= 13. If the reads number detected in real CHIP-seq/CLIP-seq is larger than 14, we will know it is almost impossible, as long as the reads follow poisson distribution. However, in the real experiment, we detect for a position, the reads number is, let's say 20. Thus, how to explain this result? It is because here is the enrichment induced by chromatin binding protein (in CHIP-seq). Those reads are not random distributed. Thus, it can get a p-value for reads number = 20 by calling ppois(20-1,lambda=7,low.tail=FALSE) in R.

Am I right? 'Cause I don't want to get wrong understanding.

chip-seq • 6.4k views
ADD COMMENTlink modified 6.2 years ago by KCC3.9k • written 6.2 years ago by Puriney330
5
gravatar for KCC
6.2 years ago by
KCC3.9k
Cambridge, MA
KCC3.9k wrote:

I am somewhat new to this field, but I have been reading about the topic of peak calling for a few months now and I think I might have some perspective to add here.

Your analysis sounds like a good description of the most basic theoretical approach to a problem like this. If I was teaching a class on ChIP-seq, your description would be a good first description of a simple peak caller. If you actually tried this approach with real data however, you would immediately see some problems. Let me try to describe a few.

It is well known that the number of times for a base is sequenced follows a Poisson distribution.

People will often use Poisson distributions in sequencing data. One of the reasons for this is there is only one parameter to fit, which makes it easier. However, from what I understand, you will frequently observe more variance in the actual data than is assumed in the Poisson distribution. This is called overdispersion. Therefore, it's unlikely that the Poisson distribution is an accurate description of most real data. The negative binomial is often a more accurate model. However, you need more data to fit a negative binomial. (There is an interesting discussion on the tension between using the Poisson and the negative binomial in the paper introducing the Deseq tool.)

In truth, using one distribution for the whole genome is unlikely to give a satisfying level of accuracy for peak calling. Often, you will need to fit a distribution to smaller regions on the genome using either a sliding window or discrete windows.

If you look at the paper on MACS, you will get a really nice use of multiple Poisson distributions.

If the reads number detected in real CHIP-seq/CLIP-seq is larger than 14, we will know it is almost impossible, as long as the reads follow poisson distribution.

Typically in ChIP-seq you are looking at the whole genome, so if something has a 2.5% probability of occuring, that's still millions of sites (depending on the size of the genome). More than likely for real data, your cutoff will be much, much lower than 2.5%

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by KCC3.9k
1
gravatar for Istvan Albert
6.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Make sure not to overemphasize the role of statistics when it comes to interpretation.

That data that you see is almost always greatly burdened by substantial biases arising from each step of sample extraction, library preparation, sequencing and alignment. Your reads are never randomly distributed the only question is this: Do the biases in my data completely invalidate the statistical method that I am trying to apply or only just partially and some can be salvaged.

An other way to say this is that the statistical methods as the ones above that you cite above are at best a very rough and very idealistic approximation. We use them in lieu of anything better and not because they are a ideal way to model the data.

ADD COMMENTlink written 6.2 years ago by Istvan Albert ♦♦ 80k
1

To add to what Istvan says, the distribution of reads in a sequencing experiment is usually NOT Poisson, because of any number of biases (GC-content, mapability, others we don't fully understand). It's usually better approximated by a negative binomial distribution, which is essentially an overdispersed Poisson.

ADD REPLYlink written 6.2 years ago by Chris Miller20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1497 users visited in the last hour