Dear All,

I am a Biologist trying to understand the statistics of RNA-Seq data.

I understand that RNA-Seq counts across biological replicates are usually modelled with a negative binomial (NB) distribution, because the NB accounts for overdispersion in the data, but I am not sure how to verify this for my own data.

Although I have understood these distributions from standard textbooks, I am unable to relate them to RNA-Seq.

Differential expression is my aim.

I have simulated data to test and understand some open-source tools such as edgeR, DESeq, Cufflinks, etc.

I have a real data set too.

I have two conditions with four replicates each.

If I want to know whether my data fits an NB or a Poisson distribution, do I have to check this across the replicates of each gene in each condition?

If the above point is right, how do I do it?

Should I do a goodness-of-fit test such as the chi-square test, or is checking the mean-variance relationship enough?

Thanks in advance for your valuable inputs.

Thanks a lot for the reply. It is helpful.

In my post, I also mentioned simulations. So, if I have simulated data, how can I check how well it fits a particular distribution (Poisson or NB)? Is a test of mean vs. variance, or of the dispersion, enough to be sure whether the data fits a Poisson distribution?

It is important to remember that these distributions describe variance across replicates. I guess what you can do with your simulation is to produce thousands of simulated libraries. Generate them with the same library size so that you don't have to normalize, and treat each simulation as a biological replicate. Then look at the distribution of tag counts for a specific transcript across all your biological replicates, and see whether this distribution fits the NB or the Poisson better.
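A rough sketch of that procedure in plain Python, in case it helps. The numbers in it are assumptions for illustration only: 2000 replicates, a mean count of 50 and a dispersion of 0.2. The helper names (sample_nb, compare_fits, and so on) are made up here and do not come from any package. The NB is fitted by method of moments rather than maximum likelihood, so the AIC values are only approximate, and edgeR and DESeq estimate dispersion differently.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson count (Knuth's method; adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_nb(mean, dispersion, rng):
    """Draw one NB count as a gamma-Poisson mixture.

    Parameterised so that variance = mean + dispersion * mean**2.
    """
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(lam, rng)


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts, mean, dispersion):
    """Log-likelihood of the counts under NB(mean, dispersion)."""
    r = 1.0 / dispersion
    log_p = math.log(r / (r + mean))
    log_q = math.log(mean / (r + mean))
    return sum(
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * log_p + k * log_q
        for k in counts
    )


def compare_fits(counts):
    """Fit Poisson and NB to one transcript's counts; lower AIC is better."""
    mean = statistics.fmean(counts)
    var = statistics.variance(counts)
    dispersion = (var - mean) / mean ** 2  # method of moments, may be <= 0
    result = {
        "mean": mean,
        "variance": var,
        "dispersion": dispersion,
        "aic_poisson": 2 * 1 - 2 * poisson_loglik(counts, mean),
        "aic_nb": None,  # left as None when there is no overdispersion to fit
    }
    if dispersion > 0:
        result["aic_nb"] = 2 * 2 - 2 * nb_loglik(counts, mean, dispersion)
    return result


if __name__ == "__main__":
    rng = random.Random(1)
    n_replicates = 2000  # assumed number of simulated libraries
    nb_counts = [sample_nb(50, 0.2, rng) for _ in range(n_replicates)]
    pois_counts = [sample_poisson(50, rng) for _ in range(n_replicates)]
    for label, counts in [("NB-simulated", nb_counts),
                          ("Poisson-simulated", pois_counts)]:
        print(label, compare_fits(counts))
```

For Poisson-like counts the estimated dispersion should sit close to zero and the variance close to the mean. For overdispersed counts the NB should give a clearly lower AIC than the Poisson.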

For your real dataset, there probably aren't enough replicate libraries to fit an NB or a Poisson to.
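With only four replicates per gene, one thing that can still be looked at is the mean-variance relationship pooled across many genes, as mentioned in the original question. Below is a minimal sketch of that. It assumes the counts are already normalized for library size, the toy numbers are made up, and the function names are hypothetical. Under a Poisson model the variance should be roughly equal to the mean. Under the NB it follows variance = mean + dispersion * mean^2, so the pooled figure is only a crude method-of-moments estimate of a common dispersion.

```python
import statistics


def mean_variance_table(counts_by_gene):
    """Per-gene mean and sample variance across the replicates of one condition."""
    return {
        gene: (statistics.fmean(counts), statistics.variance(counts))
        for gene, counts in counts_by_gene.items()
    }


def summarise_overdispersion(table):
    """Summarise how far the genes sit above the Poisson line (variance = mean)."""
    expressed = [(m, v) for m, v in table.values() if m > 0]
    # Fraction of expressed genes whose variance exceeds their mean
    frac_over = sum(v > m for m, v in expressed) / len(expressed)
    # Pooled method-of-moments dispersion from variance = mean + dispersion * mean^2
    pooled = sum(v - m for m, v in expressed) / sum(m * m for m, v in expressed)
    return frac_over, pooled


if __name__ == "__main__":
    # Made-up toy counts, four replicates per gene, for illustration only
    toy = {
        "geneA": [10, 12, 8, 10],
        "geneB": [100, 150, 50, 100],
        "geneC": [500, 420, 610, 470],
    }
    table = mean_variance_table(toy)
    for gene, (m, v) in table.items():
        print(f"{gene}: mean={m:.1f} variance={v:.1f} variance/mean={v / m:.2f}")
    print(summarise_overdispersion(table))
```

Any single gene's variance estimated from four values is very noisy, so only the trend across thousands of genes is informative here.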

By the way, I attended an NGS conference last year at the University of Nottingham. A group at the University of Dundee presented their findings where they performed ~50 biological replicates of yeast(?) to see if current statistical theories hold up. If I remember correctly, they did see that the NB fitted the data well. They also said something like 6 biological replicates being optimal for good DE, and that spike-ins helped a lot for DE.

Do you remember anything in detail surrounding their usage of spike-ins? I'm guessing that they were using them for library-size normalization. The current thinking is generally that spike-ins aren't that useful for most non-single-cell experiments except where there's likely to be gross transcriptional amplification involved. So it'd be interesting if they showed a nice dataset that argued otherwise.

I think they did mention single cell experiments, but unfortunately, I do not remember any details. I guess we'll just have to wait for their publication.