Why can the number of reads in a sample that are assigned to gene i be modeled by a negative binomial distribution?
10.1 years ago
jack ▴ 980

Hi,

I'm reading the paper "Differential expression analysis for sequence count data". In the model description, the authors write: "We assume that the number of reads in sample j that are assigned to gene i can be modeled by a negative binomial (NB) distribution,".

I don't understand why it can be modeled by a negative binomial.

Intuitively, the negative binomial distribution is the probability distribution of the number of independent trials needed to reach k successes.

Could someone elaborate on why their assumption makes sense?

next-gen RNA-Seq Assembly • 2.7k views
10.1 years ago

Suppose the expected expression of gene i in a treatment group is n (for a given sequencing depth). The count observed in sample j should then be drawn from a distribution centered on n. The question is how to describe the variance of that distribution. Earlier methods assumed Poisson variance, which models technical variance well. There is also biological variance, however, since samples are never identical. An over-dispersed Poisson distribution (i.e., a negative binomial distribution) therefore describes the counts well.
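To make the "over-dispersed Poisson" idea concrete, here is a minimal simulation sketch. It assumes that the biological variation in a gene's true expression rate follows a gamma distribution and that sequencing adds Poisson noise on top; the resulting gamma-Poisson mixture is a negative binomial with variance mean + dispersion * mean^2. The function name `simulate_counts` and the values chosen for `mean` and `dispersion` are illustrative only and are not taken from the paper.

```python
import numpy as np


def simulate_counts(mean, dispersion, n_samples, rng):
    """Draw read counts for one gene across n_samples samples.

    dispersion == 0 gives plain Poisson counts (technical noise only).
    dispersion > 0 first draws a per-sample rate from a gamma distribution
    (assumed model for biological variation), then draws Poisson counts
    around that rate. The mixture is negative binomial.
    """
    if dispersion == 0:
        return rng.poisson(mean, size=n_samples)
    shape = 1.0 / dispersion      # gamma shape
    scale = mean * dispersion     # gamma scale, so shape * scale == mean
    rates = rng.gamma(shape, scale, size=n_samples)
    return rng.poisson(rates)


rng = np.random.default_rng(0)
mean, dispersion = 100.0, 0.1     # illustrative values

poisson_counts = simulate_counts(mean, 0.0, 200_000, rng)
nb_counts = simulate_counts(mean, dispersion, 200_000, rng)

# Theoretical variances: Poisson -> mean, NB -> mean + dispersion * mean**2
expected_nb_var = mean + dispersion * mean**2

print("Poisson  mean/var:", poisson_counts.mean(), poisson_counts.var())
print("NB       mean/var:", nb_counts.mean(), nb_counts.var())
print("NB expected var  :", expected_nb_var)
```

Both sets of counts are centered on the same mean, but the mixture has a much larger variance, which is the extra spread that the Poisson model cannot absorb.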


But still, the negative binomial has its own definition and application scenarios. Why does RNA-seq fit that definition and those scenarios?

10.1 years ago
Ann ★ 2.4k

You're not alone in finding this confusing. See also this question and replies:

Why Does Rna-Seq Read Count Fit Poisson Distribution?

I'm not sure if this is right, but it seems to me that knowing why the distribution fits the data is not strictly necessary. The key point is that the distribution fits, which means you can use it in statistical testing. In other words, the negative binomial (NB) distribution is useful for modeling RNA-Seq gene expression data, and since the model is pretty good, you can use it to find out whether a treatment or condition has changed a gene's expression. However, the assumption that the NB is a good fit might not hold if the assay changes a lot, and sequencing-based assays like RNA-Seq are constantly changing. This means you should test whether the NB still holds when working with an entirely new data set.
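One rough way to check this on a new data set is to compare each gene's variance across replicates with its mean. This is only a sketch of a method-of-moments diagnostic, not the estimation procedure used in the paper: under a Poisson model the estimate (var - mean) / mean^2 should sit near zero, while under an NB model it should sit near the dispersion. The helper name `moment_dispersion` and the simulated matrices are assumptions made for illustration; with real data you would pass in normalized counts from replicates of a single condition.

```python
import numpy as np


def moment_dispersion(counts):
    """Per-gene mean, variance and moment estimate of dispersion.

    counts: 2-D array, rows are genes, columns are replicate samples.
    The estimate (var - mean) / mean**2 is noisy with few replicates
    and can be negative; it is a diagnostic, not a fitted parameter.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = (var - mean) / mean**2
    return mean, var, disp


rng = np.random.default_rng(1)
n_genes, n_reps = 2000, 50        # illustrative sizes
mu, alpha = 200.0, 0.05           # illustrative mean and dispersion

# NumPy's negative_binomial(n, p) counts failures before n successes;
# n = 1/alpha and p = n / (n + mu) give mean mu and variance mu + alpha*mu**2.
n = 1.0 / alpha
p = n / (n + mu)
nb_matrix = rng.negative_binomial(n, p, size=(n_genes, n_reps))
pois_matrix = rng.poisson(mu, size=(n_genes, n_reps))

_, _, nb_disp = moment_dispersion(nb_matrix)
_, _, pois_disp = moment_dispersion(pois_matrix)

nb_median = float(np.median(nb_disp))
pois_median = float(np.median(pois_disp))
print("median dispersion, NB-simulated     :", nb_median)
print("median dispersion, Poisson-simulated:", pois_median)
```

If the estimates for real replicates cluster near zero, a Poisson model may already be adequate; if they are clearly positive, the counts are over-dispersed and an NB-style model is more plausible. A plot of variance against mean on log axes shows the same thing visually.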

Good luck and if you find any other good links please post them.