Question: Why Does Rna-Seq Read Count Fit Poisson Distribution?
gravatar for sckinta
6.7 years ago by
United States
sckinta580 wrote:

Although R packages like "edgeR" "DEseq" have been frequently used for RNA-seq expression study, I am still confused that why RNA-seq read count fits Poisson Distribution or more dispersed Negative Binomial Distribution.

As far as I know, Poisson distribution is used for the number of events in specified intervals such as time period, distance, area or volume. So I can understand number of mutated sites over a fragment of DNA fit Poisson. But how does count number of reads for a given gene across samples fit this model ?

By the way, I find it hard to fit certain data to an appropriate model. It is too abstract for me. It is particularly true when it comes to the biological data like sequences. Do you have any suggestion to help me overcome “disability of imagination on model" ? Any book or journal to recommend ?

Thank you.

modeling rna-seq • 29k views
ADD COMMENTlink modified 13 months ago by lffu_003250 • written 6.7 years ago by sckinta580
2 this paper has the answer in great detail, can get a bit technical but in my opinion is written very well.

ADD REPLYlink written 6.7 years ago by dfernan690
gravatar for Chris Miller
6.7 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

Let's forget the RNAseq complication for a minute and just picture a process whereby you take the genome and choose a location at random to produce a read. This is a Poisson process. If you plot the depth of sequence along this theoretical genome, it will be a poisson distribution. [1]

The same is true for RNAseq, only instead of a genome, you're choosing a read from the transcriptome. This is made a bit more complex by the fact that some transcripts are present at higher abundance than others, but at it's core, it's still a poisson process. In fact, you can think of a negative binomial distribution as a mixture of poisson distributions with different means, which is a reasonable model for sampling from RNA transcripts with different abundances.

[1] Data from real genomes aren't exactly poisson, due to biases related to the composition of the genome, chemistry of sequencing, assembly errors in the reference genome, and inability to map to repetitive regions.

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Chris Miller21k

That is really helpful. Thank you, Chris :P

ADD REPLYlink written 6.7 years ago by sckinta580

Awesome answer. I hope you are a teacher or mentor to someone out there :)

ADD REPLYlink written 5.4 years ago by gaelgarcia180
gravatar for Michele Busby
6.7 years ago by
Michele Busby2.1k
United States
Michele Busby2.1k wrote:

This is a good question and one you really need to understand to interpret your data.

The way I think about it is this:

If you are sampling something by a count, say the selection of blue marbles out of a bag of different colored marbles, the count for any given color of marble will fluctuate around the true value.

For example, if the true proportion of marbles that are blue in the bag is 4% and you select 100 marbles sometimes you will get 4, sometimes you will get 5 or 3. If you do that an infinite number of times and the bag is big enough the distribution you will get is a Poisson distribution. (This is the Poisson approximation to the binomial).

We use a Poisson distribution to describe RNA Seq data because we are selecting species of RNA out of a pool of RNA molecules. So instead of a blue and red marbles you have SUMO2 and BRCA1 RNA.

A poisson distribution is a one parameter distribution. The mean of the distribution is always equal to the variance. It can only describe the counting noise and nothing else.

However, counting noise won't be the only source of variance in your data. You will also have biological noise and technical artifact. Some genes may fluctuate more or less because of their inherent nature (e.g. heat shock genes have high variance in biological replicates because they bounce around if someone puts the heat up). If you measure the expression of these genes (and really all genes') in multiple replicates the variance of those measurements will be higher than the mean. So you don't want to use a Poisson distribution to describe that.

So what to do?

A negative binomial distribution can be thought of in lots of different ways, but one thing that happens is that if you sample from a gamma distribution using counting you get a negative binomial distribution. A gamma distribution kind of in the middle of a lognormal and a normal distribution.

So using a negative binomial distribution assumes that the noise from biological and technical variance is roughly described by a gamma distribution but then also accounts for the sampling noise.

I wrote this thing here that explains variance a little more and how to think about it. Maybe you'll find it useful.

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Michele Busby2.1k

I think the marble metaphor should follow the hypergeometric distribution rather than Poisson distribution, isn't it? Maybe that is the reason you explained as " This is the Poisson approximation to the binomial". Chris above gave a more acceptable example "you take the genome and choose a location at random to produce a read", that meets the Poisson definition "number of events in specified intervals such as time period, distance, area or volume".

The variance analysis in the link is very helpful. Thank you very much. I should keep in mind that "Poisson variance is the inherent uncertainty that is present in any measurement made where something is sampled and counted".

One more thing to ask, you mentioned "sample from a gamma distribution using counting", what follows a gamma distribution ? I thought in NB distribution, the mean/variance of Poisson model follows the gamma distribution. Here you said "biological and technical variance is roughly described by a gamma distribution ". There are intersection among Poission Variance, biological variance and technical variance ??

ADD REPLYlink written 6.7 years ago by sckinta580

The sampling from the marble bag is hypogeometric if you don’t have replacement but is approximately binomial if the number of blue marbles is small relative to the size of the bag. The counts approximate a Poisson distribution but the Poisson approximation to the binomial (see Mathematical Statisicts by Miller and Miller, p 186). So I guess that should say Poisson approximation to the binomial approximation to the hypogeometric. For RNA Seq the number of reads you sequence relative to the number of RNA molecules in the sample is usually miniscule and the approximations are pretty robust when you are getting into sample sizes of millions. You can try it by throwing random numbers in R or Matlab and seeing how much you have to sample before they break. Empirically, the Poisson model fits very well with RNA Seq data and in very good complex replicates on the same sequencing run the variance you see in scatter plots almost follows the expectation of a Poisson sampling (i.e. there isn’t much extra technical variance and the main problem is sampling. These assumptions, however, do break down wildly in over-sequenced libraries because the PCR amplification does strange things and then the Poisson model becomes a pretty crappy model. So you have to look at your data. Plotting the Poisson on top of the technical replicates is the easiest way to see if something is amiss. I was over generalizing when I said counts ALWAYS follow Poisson by they usually follow Poisson so if you are building a model that is usually a good place to start. As for the gamma, I think what they are modeling there is the expression of a single gene in biological replicates. This distribution (the true expression on a continuous scale of measurement) is traditionally considered to be lognormal (from microarray and qPCR data) though this is difficult to confirm empirically. Theoretically, a mixture of a lognormal and a Poisson is a Poisson lognormal, but using that require integration which makes things run slow without really improving performance over approximating the lognormal with the gamma and using the negative binomial distribution.
Anyway, when I was working on this stuff I just assumed a normal distribution and in simulated data it worked pretty much as well as the negative binomial. In RNA Seq there are some biological constraints where you don’t really get genes expressed with really long tails. There is a wide dynamic range over all the genes, but most genes live in a pretty small portion of that dynamic range. Non-parametric statistics are the most theoretically correct way to go anyway.

ADD REPLYlink written 6.7 years ago by Michele Busby2.1k
gravatar for Ann
6.0 years ago by
Concord NC USA
Ann2.3k wrote:

Hi all,

This may help:

Videos by Rafael Irizarry from his EdX class on data analysis for genomics:


The first one explains why RNA-seq is like Poisson, using an example from playing the lottery. The second follows up on that idea.



ADD COMMENTlink written 6.0 years ago by Ann2.3k

These videos really helped. Thanks Ann! 

ADD REPLYlink written 5.8 years ago by jollier.liu0

Rafa Izarry <3

ADD REPLYlink written 5.4 years ago by gaelgarcia180

Which edx course is that?

ADD REPLYlink written 4.9 years ago by SmallChess510
gravatar for Jeremy Leipzig
6.7 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

I think revisiting Mark Robinson's answer from What Makes One Probability Distribution Better For Rna-Seq Than Another? is fitting here (no pun intended). He implies that negative binominal fits better because variability between biological replicates is more than Poisson can account for.

ADD COMMENTlink written 6.7 years ago by Jeremy Leipzig19k

+1 for negative "binominal" :

ADD REPLYlink written 6.7 years ago by Zev.Kronenberg11k

Yes, I read that post before I posted this question. I could understand why NB model fits better than Poisson. I am just confused by models---How to choose a model to fit our data. I think it is more general than that post.

ADD REPLYlink written 6.7 years ago by sckinta580

It's a dark art.

ADD REPLYlink written 6.7 years ago by Michele Busby2.1k
gravatar for lffu_0032
13 months ago by
lffu_003250 wrote:

if you think a read as a point, then in theory, for a region L, the probability for a read fall into region L is: L/G, here G is the genome size. if the depth is evenly distributed, which says that the the genomes of a batch of cells were randomly broken and the sequencing process was uniform, then the expected reads num fall in this region L is: L/G * N, N is the total reads num. based on this assumption, then you can model the distribution of Read count, RC, as a Possion probability distribution. hope this helps.

ADD COMMENTlink written 13 months ago by lffu_003250
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1571 users visited in the last hour