Question

Why can the number of reads in a sample that are assigned to gene i be modeled by a negative binomial distribution?

9.8 years ago

jack ▴ 960

Hi,

I'm reading the paper "Differential expression analysis for sequence count data". In the model description, the authors state: "We assume that the number of reads in sample j that are assigned to gene i can be modeled by a negative binomial (NB) distribution".

I don't understand why it can be modeled by a negative binomial distribution.

Intuitively, the negative binomial distribution is the probability distribution of the number of independent trials needed to obtain k successes.

Could someone explain why their assumption makes sense?

next-gen RNA-Seq Assembly • 2.6k views

updated 2.5 years ago by Ram 43k • written 9.8 years ago by jack ▴ 960

Ram · Answer 1 · 2014-07-06

9.8 years ago

Devon Ryan 104k

Given that the expected expression of a gene i in a treatment group is n (given some sequencing depth), the count observed in sample j should be drawn from a distribution centered around n. The question, then, is how one should describe the variance of that distribution. Earlier methods assumed Poisson variance, which nicely models technical variance. There's also, however, biological variance, since samples are never identical. Thus, an over-dispersed Poisson distribution (i.e., a negative binomial distribution) describes the data well.
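To make the "over-dispersed Poisson" idea concrete, here is a minimal simulation sketch. It relies on a standard result: if each sample's true expression level is drawn from a gamma distribution (biological variation) and the reads are then Poisson-sampled around that level (technical variation), the resulting counts follow a negative binomial distribution with variance mu + dispersion * mu^2. The function names, the mean of 50, and the dispersion of 0.2 are illustrative assumptions, not values taken from the paper.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method.

    Adequate for the moderate means used here; not meant for very large lam.
    """
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_counts(mu, dispersion, n_samples, seed=0):
    """Simulate read counts for one gene across n_samples replicates.

    dispersion == 0 gives plain Poisson counts (technical noise only).
    dispersion > 0 first draws a per-sample expression level from a gamma
    distribution with mean mu (biological noise), then Poisson-samples it.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        if dispersion > 0:
            # shape = 1/dispersion, scale = mu*dispersion -> mean mu,
            # variance dispersion * mu^2
            lam = rng.gammavariate(1.0 / dispersion, mu * dispersion)
        else:
            lam = mu
        counts.append(sample_poisson(lam, rng))
    return counts


if __name__ == "__main__":
    mu, dispersion = 50.0, 0.2  # hypothetical values for illustration
    for label, d in (("Poisson only", 0.0), ("gamma-Poisson (NB)", dispersion)):
        counts = simulate_counts(mu, d, 5000, seed=1)
        print(label, "mean:", round(statistics.mean(counts), 1),
              "variance:", round(statistics.variance(counts), 1))
```

With these settings the Poisson-only variance should come out close to the mean, while the gamma-Poisson variance should be close to mu + dispersion * mu^2, i.e. roughly ten times larger. This also connects to the "trials until k successes" definition: that form and the mean/dispersion form are two parametrizations of the same distribution family, with the number of successes playing the role of 1/dispersion (which no longer has to be an integer).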

updated 2.5 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

But still, the negative binomial has its own definition and its own application scenarios. Why does RNA-seq fit that definition and those scenarios?

7.5 years ago by moxu ▴ 510

Yes, he did not answer your question. :)

3.2 years ago by iannuzzir91 • 0

Ram · Answer 2 · 2014-07-06

You're not alone in finding this confusing. See also this question and replies:

Why Does Rna-Seq Read Count Fit Poisson Distribution?

I'm not sure if this is right, but it seems to me that knowing why the distribution fits the data is not strictly necessary. The key is that the distribution fits, which means you can use it in statistical testing. In other words, the negative binomial distribution is useful for modeling RNA-Seq gene expression data, and since the model is pretty good, you can use it to find out whether a treatment or condition has changed a gene's expression. However, the assumption that the NB is a good fit might not hold if the assay changes a lot, and sequencing-based assays like RNA-Seq are constantly changing. That means you should test whether the NB still holds when working with an entirely new data set.
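As a rough illustration of what such a check could look like, the sketch below computes, for one gene's counts across replicates, the variance-to-mean ratio and a method-of-moments dispersion estimate, (variance - mean) / mean^2. A ratio near 1 is what a Poisson model predicts; a ratio well above 1 points toward over-dispersion of the kind the NB is meant to capture. The function name and the example counts are hypothetical, the counts are assumed to be already comparable across samples (for example, similar sequencing depth), and this is only a crude sanity check, not a formal goodness-of-fit test.

```python
import statistics


def overdispersion_summary(counts):
    """Summarize how far one gene's replicate counts depart from Poisson.

    Returns the sample mean, sample variance, variance-to-mean ratio, and a
    method-of-moments dispersion estimate clipped at zero.
    """
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # unbiased, divides by n - 1
    if mean > 0:
        var_to_mean = variance / mean
        dispersion = max(0.0, (variance - mean) / mean ** 2)
    else:
        var_to_mean = float("nan")
        dispersion = float("nan")
    return {
        "mean": mean,
        "variance": variance,
        "var_to_mean": var_to_mean,
        "dispersion": dispersion,
    }


if __name__ == "__main__":
    # Hypothetical counts for one gene in six replicates.
    replicate_counts = [112, 87, 143, 65, 130, 98]
    print(overdispersion_summary(replicate_counts))
```

In practice one would look at this across many genes, for instance by plotting variance against mean and checking whether the trend rises above the Poisson line the way an NB model expects.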

Good luck and if you find any other good links please post them.