Question: Peak Finding Of Chip-Seq (Or Clip-Seq) And Poisson Distribution, Am I Right?
6
Puriney330 wrote:

Hi, Check my understanding towards how Poisson distribution is employed when finding peaks of CHIP-seq and CLIP-seq. It is well known that the number of times for a base is sequenced follows a Poisson distribution. Just like people going to supermarket using a particular entrance in a given period of time. Poisson distribution can plotted as following: Here the average is 7 (lambda=7) plotted in red line. And green line denotes the edge of probabiliy is 0.975.

From the genome wide scale, coverage = Read Length (nt) * Total Reads Number * / *Genome Length (nt). Indicating the average number of reads that hit a base. So assign the coverage as the lambda (or mean) of Poisson distribution, let's say also 7 here. x-axis means the reads number for a base, and y-axis means the probability of a given reads number.

Thus, we can know when the probability is 0.975, the reads number is <= 13. If the reads number detected in real CHIP-seq/CLIP-seq is larger than 14, we will know it is almost impossible, as long as the reads follow poisson distribution. However, in the real experiment, we detect for a position, the reads number is, let's say 20. Thus, how to explain this result? It is because here is the enrichment induced by chromatin binding protein (in CHIP-seq). Those reads are not random distributed. Thus, it can get a p-value for reads number = 20 by calling ppois(20-1,lambda=7,low.tail=FALSE) in R.

Am I right? 'Cause I don't want to get wrong understanding.

chip-seq • 7.0k views
modified 7.1 years ago by KCC4.0k • written 7.1 years ago by Puriney330
5
KCC4.0k wrote:

I am somewhat new to this field, but I have been reading about the topic of peak calling for a few months now and I think I might have some perspective to add here.

Your analysis sounds like a good description of the most basic theoretical approach to a problem like this. If I was teaching a class on ChIP-seq, your description would be a good first description of a simple peak caller. If you actually tried this approach with real data however, you would immediately see some problems. Let me try to describe a few.

It is well known that the number of times for a base is sequenced follows a Poisson distribution.

People will often use Poisson distributions in sequencing data. One of the reasons for this is there is only one parameter to fit, which makes it easier. However, from what I understand, you will frequently observe more variance in the actual data than is assumed in the Poisson distribution. This is called overdispersion. Therefore, it's unlikely that the Poisson distribution is an accurate description of most real data. The negative binomial is often a more accurate model. However, you need more data to fit a negative binomial. (There is an interesting discussion on the tension between using the Poisson and the negative binomial in the paper introducing the Deseq tool.)

In truth, using one distribution for the whole genome is unlikely to give a satisfying level of accuracy for peak calling. Often, you will need to fit a distribution to smaller regions on the genome using either a sliding window or discrete windows.

If you look at the paper on MACS, you will get a really nice use of multiple Poisson distributions.

If the reads number detected in real CHIP-seq/CLIP-seq is larger than 14, we will know it is almost impossible, as long as the reads follow poisson distribution.

Typically in ChIP-seq you are looking at the whole genome, so if something has a 2.5% probability of occuring, that's still millions of sites (depending on the size of the genome). More than likely for real data, your cutoff will be much, much lower than 2.5%

1
Istvan Albert ♦♦ 83k wrote:

Make sure not to overemphasize the role of statistics when it comes to interpretation.

That data that you see is almost always greatly burdened by substantial biases arising from each step of sample extraction, library preparation, sequencing and alignment. Your reads are never randomly distributed the only question is this: Do the biases in my data completely invalidate the statistical method that I am trying to apply or only just partially and some can be salvaged.

An other way to say this is that the statistical methods as the ones above that you cite above are at best a very rough and very idealistic approximation. We use them in lieu of anything better and not because they are a ideal way to model the data.

1

To add to what Istvan says, the distribution of reads in a sequencing experiment is usually NOT Poisson, because of any number of biases (GC-content, mapability, others we don't fully understand). It's usually better approximated by a negative binomial distribution, which is essentially an overdispersed Poisson.