Negative Binomial and Poisson distribution of RNA-Seq
9.7 years ago
ashwini ▴ 100

Dear All,

I am a Biologist trying to understand the statistics of RNA-Seq data.

RNA-Seq counts from biological replicates are usually modelled with a negative binomial (NB) distribution, because the NB accounts for overdispersion in the data. However, I am not sure how to verify this for my own data.

Although I have understood these distributions from standard textbooks, I am unable to relate them to RNA-Seq.

Differential expression is my aim.

I have simulated data to test and understand some open-source tools such as edgeR, DESeq and Cufflinks.

I have a real data set too.

I have two conditions with four replicates each.

If I want to know whether my data fit an NB or a Poisson distribution, do I have to check this across the replicates of each gene within each condition?

If the above point is right, how do I do it?

Should I do a goodness-of-fit test such as the chi-square test, or is checking the mean-variance relationship enough?

Thanks in advance for your valuable inputs.

rna-seq Negative-binomial-distribution • 9.9k views
14
9.7 years ago

The Poisson distribution accounts for technical variance only. The NB distribution accounts for both technical and biological variance.

The NB distribution can also be written as a Poisson-gamma mixture. Imagine you have a single biological sample (an RNA extract) from which you take aliquots to make technical replicates. The counts for a gene across these technical replicates will be Poisson distributed.

Now imagine you have multiple biological samples, and you take multiple technical replicates from each one. You now have a separate Poisson distribution for each biological sample, each with its own mean. The variation of those Poisson means across biological samples can be described by a gamma distribution. This is why the NB distribution (a Poisson-gamma mixture) is used for RNA-seq.

You can also think of it this way: the rate parameter lambda of the Poisson distribution is itself gamma distributed.
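
To make the mixture idea concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration, not how edgeR or DESeq work internally. The function names, the mean of 20 and the dispersion of 0.2 are all made up for the example. With dispersion 0 every replicate shares the same lambda (pure Poisson, variance close to the mean); with dispersion > 0 each replicate gets its own gamma-distributed lambda, and the variance should land near mean + dispersion * mean^2.

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (fine for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_counts(mean, dispersion, n, rng):
    """Counts of one gene in n replicates (hypothetical helper).

    dispersion == 0: every replicate shares lambda = mean (technical noise only).
    dispersion  > 0: lambda ~ Gamma(shape=1/dispersion, scale=mean*dispersion),
                     so the counts follow a Poisson-gamma mixture, i.e. an NB.
    """
    if dispersion == 0:
        return [poisson_sample(mean, rng) for _ in range(n)]
    shape = 1.0 / dispersion
    scale = mean * dispersion
    return [poisson_sample(rng.gammavariate(shape, scale), rng) for _ in range(n)]


rng = random.Random(0)
technical = simulate_counts(mean=20, dispersion=0.0, n=5000, rng=rng)
biological = simulate_counts(mean=20, dispersion=0.2, n=5000, rng=rng)

for label, counts in (("technical", technical), ("biological", biological)):
    # Theory: variance ~ 20 for the Poisson case, ~ 20 + 0.2 * 20**2 = 100 for the mixture.
    print(label, round(statistics.fmean(counts), 1), round(statistics.variance(counts), 1))
```

Both sets should have a mean near 20, but only the mixture shows the extra (biological) variance.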

0

Thanks a lot for the reply. It is helpful.

In my post I also mentioned simulations. If I have simulated data, how can I check how well it fits a particular distribution (Poisson or NB)? Is a test of mean vs. variance, or of the dispersion, enough to decide whether the data fit a Poisson distribution?

4

It is important to remember that these distributions describe variance across replicates. I guess what you can do with your simulation is to produce thousands of simulated libraries. Generate them with the same library size so we don't have to normalize. We will treat each simulated library as a biological replicate. Then look at the distribution of tag counts for a specific transcript across all your biological replicates, and see whether the NB or the Poisson fits this distribution better.

For your real dataset, there probably aren't enough replicate libraries to fit an NB or a Poisson distribution to.
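
A rough sketch of that comparison for a single transcript, again in plain Python. The assumptions are mine: the NB size parameter is estimated by the method of moments (a shortcut, not the maximum-likelihood or shrinkage estimates that DE tools use), the two fits are compared by AIC, and `simulate_replicates` with mean 20 and dispersion 0.2 is a made-up stand-in for your own simulator. Counts are assumed to have a mean above zero.

```python
import math
import random
import statistics


def poisson_loglik(counts):
    """Log-likelihood of a Poisson with lambda set to the sample mean."""
    lam = statistics.fmean(counts)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts):
    """Log-likelihood of an NB whose mean and size come from the method of moments."""
    mu = statistics.fmean(counts)
    var = statistics.variance(counts)
    if var <= mu:
        # No overdispersion to model: the NB collapses to the Poisson.
        return poisson_loglik(counts)
    size = mu * mu / (var - mu)          # size = 1 / dispersion
    log_p = math.log(size / (size + mu))
    log_q = math.log(mu / (size + mu))
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * log_p + k * log_q
        for k in counts
    )


def aic(loglik, n_params):
    """Akaike information criterion; lower means a better trade-off."""
    return 2 * n_params - 2 * loglik


def simulate_replicates(mean, dispersion, n, rng):
    """Hypothetical simulator: gamma-distributed lambda, then a Poisson count."""
    counts = []
    for _ in range(n):
        lam = rng.gammavariate(1 / dispersion, mean * dispersion) if dispersion > 0 else mean
        threshold, k, p = math.exp(-lam), 0, rng.random()
        while p > threshold:             # Knuth's Poisson sampler
            k += 1
            p *= rng.random()
        counts.append(k)
    return counts


counts = simulate_replicates(mean=20, dispersion=0.2, n=2000, rng=random.Random(0))
print("Poisson AIC:", round(aic(poisson_loglik(counts), 1)))
print("NB AIC:     ", round(aic(nb_loglik(counts), 2)))
```

For overdispersed counts like these the NB AIC should come out clearly lower. With only a handful of replicates the two estimates are too noisy for this comparison to mean much.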

By the way, I attended a NGS conference last year at University of Nottingham. A group at University of Dundee presented their findings where they performed ~50 biological replicates of yeast(?) to see if current statistical theories hold up. If I remember correctly, they did see that NB fitted the data well. And they also said something like 6 biological replicates was optimal for good DE. And spike-ins also helped a lot for DE.

1

Do you remember anything in detail surrounding their usage of spike-ins? I'm guessing that they were using them for library-size normalization. The current thinking is generally that spike-ins aren't that useful for most non-single-cell experiments except where there's likely to be gross transcriptional amplification involved. So it'd be interesting if they showed a nice dataset that argued otherwise.

0

I think they did mention single cell experiments, but unfortunately, I do not remember any details. I guess we'll just have to wait for their publication.

4
9.7 years ago

If you have biological replicates, then they're pretty much guaranteed to fit a negative binomial distribution better than a Poisson distribution (otherwise, there would be no biological variance). If you want to check, plot variance against mean for each gene. If the points don't cluster on the variance == mean line, then the data are not Poisson.
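
A small sketch of that check without the plotting step. The gene names and counts below are invented purely for illustration, and the counts are assumed to be already comparable across libraries (same library size or normalized). For each gene it reports the mean, the sample variance and their ratio; a ratio near 1 is what Poisson predicts, and a ratio well above 1 indicates overdispersion.

```python
import statistics


def mean_variance_table(counts_by_gene):
    """Return (gene, mean, variance, variance/mean) for each gene."""
    rows = []
    for gene, counts in counts_by_gene.items():
        mean = statistics.fmean(counts)
        var = statistics.variance(counts)   # sample variance, needs >= 2 replicates
        ratio = var / mean if mean > 0 else float("nan")
        rows.append((gene, mean, var, ratio))
    return rows


# Made-up counts for three hypothetical genes, four replicates of one condition.
toy_counts = {
    "geneA": [16, 25, 19, 20],
    "geneB": [5, 40, 12, 23],
    "geneC": [100, 160, 70, 130],
}

for gene, mean, var, ratio in mean_variance_table(toy_counts):
    print(f"{gene}: mean={mean:.1f} variance={var:.1f} variance/mean={ratio:.2f}")
```

With only four replicates each per-gene variance is a noisy estimate, so look at the trend across all genes (ideally on log-log axes) rather than at any single gene.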