Question: What Makes One Probability Distribution Better For RNA-Seq Than Another?
9.0 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

Many experts claim that the negative binomial distribution is better than the Poisson distribution for modeling discrete RNA-Seq data.

Is this because

  • The expression profiles of organisms resemble the negative binomial distribution - i.e., if I binned real gene transcription counts from an RNA-Seq experiment, the histogram would look like the plot below.


  • The sampling error of gene expression is such that the true population mean of a gene looks like the negative binomial - i.e., the true mean expression level of that gene is probably higher (because of the skew) than the mean expression level of a sample of reads of that gene drawn from replicates.

or are these two concepts the same thing?

alt text

rna • 22k views
ADD COMMENTlink modified 9.0 years ago by Stan Letovsky140 • written 9.0 years ago by Jeremy Leipzig19k

Moved to separate question per suggestion

ADD REPLYlink modified 5 months ago by RamRS25k • written 8.0 years ago by Stan Letovsky140

Please don't append questions to other questions, I think your questions would stand quite well on their own as a separate topic. Please ask a separate question.

ADD REPLYlink written 8.0 years ago by Daniel Swan13k
9.0 years ago by
Mark Robinson500 wrote:

Hi Jeremy.

It depends what you mean by "modeling discrete RNA-Seq data". If differential expression (DE) is your game, you need to keep in mind that we/others model the distribution of counts for a given gene across replicates; we don't model the distribution of counts across genes within an individual sample, as you suggest in your first point above. So for DE, we are not (explicitly) concerned with the shape of the distribution of counts for the whole organism at all.

Assuming DE is your interest, the reason we choose NB over Poisson is that, in real biological applications, there is simply more variability than Poisson can explain. Poisson is a single-parameter distribution, with mean = variance. That assumption, which is really an approximation to the binomial, is suitable only for the variability associated with repeatedly sampling the same DNA population (e.g. if you sequence multiple lanes of the same DNA, and assume no lane-specific effects, etc.). But if there is variation between your replicates (e.g. lab mice, people, etc.), the Poisson assumption will tend to underestimate the variance, and the significance of any differences you observe (e.g. when testing the null hypothesis that two groups have the same mean expression) will be overstated.
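To make the variance argument concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption for illustration: the mean of 50 reads, the dispersion of 0.2, the gamma model for replicate-to-replicate variation, and the helper names `poisson_draw` and `simulate_counts`. It is not how edgeR estimates anything; it only compares the variance/mean ratio of one gene's counts with and without variation between replicates.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method (fine for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(rng, mean, n_reps, dispersion):
    """Simulate read counts for one gene across replicates.

    dispersion == 0: every replicate shares the same true mean (technical replicates).
    dispersion > 0: each replicate's true mean is gamma-distributed around `mean`
    (an assumed stand-in for biological variation between replicates).
    """
    counts = []
    for _ in range(n_reps):
        if dispersion > 0:
            # shape * scale = mean, so the average true expression level is unchanged
            lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
        else:
            lam = mean
        counts.append(poisson_draw(rng, lam))
    return counts


rng = random.Random(0)  # fixed seed so the sketch is reproducible
technical = simulate_counts(rng, mean=50.0, n_reps=5000, dispersion=0.0)
biological = simulate_counts(rng, mean=50.0, n_reps=5000, dispersion=0.2)

tech_ratio = statistics.variance(technical) / statistics.mean(technical)
bio_ratio = statistics.variance(biological) / statistics.mean(biological)

print(f"technical replicates:  variance/mean = {tech_ratio:.2f}")
print(f"biological replicates: variance/mean = {bio_ratio:.2f}")
```

With technical replicates the ratio should sit near 1, as Poisson assumes. With variation between replicates it should come out well above 1, and that excess is exactly what a Poisson-based test would ignore.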

If DE is not your interest, then you'll have to explain what your end goals are (at least to me).

HTH, Mark (co-author of edgeR)

ADD COMMENTlink written 9.0 years ago by Mark Robinson500
9.0 years ago by
Washington University in St. Louis, MO
Chris Miller21k wrote:

It's not about the skew of the distribution, it's about the variance. Take a look at the top two panels of this image (self link). The big difference is that an NB distribution can be shorter and fatter than a Poisson with the same mean, i.e. overdispersed.

The key to understanding the negative binomial distribution is that it's the same as taking lots of Poisson distributions with slightly different means and mixing them together (formally, a Poisson whose mean is itself drawn from a gamma distribution). Essentially, this is saying that there's some sort of bias in the data causing this variability, but we aren't sure exactly what's causing it. The obvious candidates (GC content, mappability, sequence composition, etc.) don't explain it all. Since it's difficult to correct for something when we don't know the source, we can do the next best thing, which is to model the distribution we see accurately. That's where NB distributions can overcome some of the limitations of Poisson (namely that the variance/mean ratio is always equal to one).
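As a rough illustration of "shorter and fatter", the sketch below evaluates the NB probability mass function next to a Poisson with the same mean. It assumes the standard mean/dispersion parameterization, in which a Poisson whose mean is gamma-distributed has variance mu + dispersion * mu^2. The values mu = 50 and dispersion = 0.2 are made up, and the function names are hypothetical helpers, not part of any RNA-Seq package.

```python
import math


def poisson_pmf(k, mu):
    """P(K = k) for a Poisson with mean mu, computed in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def negbin_pmf(k, mu, dispersion):
    """P(K = k) for a negative binomial with mean mu and variance mu + dispersion * mu**2."""
    r = 1.0 / dispersion  # the "size" parameter, i.e. the gamma shape of the mixing distribution
    log_p = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution by direct summation over `support`."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var


MU, DISPERSION = 50.0, 0.2  # illustrative values only
support = range(0, 2000)    # wide enough that the truncated tail is negligible here

pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, MU), support)
nb_mean, nb_var = mean_and_variance(lambda k: negbin_pmf(k, MU, DISPERSION), support)

pois_peak = max(poisson_pmf(k, MU) for k in support)
nb_peak = max(negbin_pmf(k, MU, DISPERSION) for k in support)

# Probability of seeing at least twice the mean under each model
pois_tail = sum(poisson_pmf(k, MU) for k in support if k >= 100)
nb_tail = sum(negbin_pmf(k, MU, DISPERSION) for k in support if k >= 100)

print(f"Poisson: mean={pois_mean:.1f} variance={pois_var:.1f} peak={pois_peak:.4f} P(K>=100)={pois_tail:.2e}")
print(f"NB:      mean={nb_mean:.1f} variance={nb_var:.1f} peak={nb_peak:.4f} P(K>=100)={nb_tail:.2e}")
```

Both distributions share the same mean, but the NB has a lower peak and much more mass in the tails, so a count that would look wildly significant under Poisson can be unremarkable under NB.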

ADD COMMENTlink modified 9.0 years ago • written 9.0 years ago by Chris Miller21k

Note: the figure in your paper deals with CNVs, not expression. That doesn't undermine your point about modeling variability; I'm just making sure people don't think that is an expression histogram.

ADD REPLYlink written 9.0 years ago by Jeremy Leipzig19k

Right. Sorry if that was confusing. The principle is the same, but the data is different.

ADD REPLYlink written 9.0 years ago by Chris Miller21k

The link was broken. Here is the link: Look for figure 1.

ADD REPLYlink written 21 months ago by Dataman310