Question: Why is the NB distribution preferred in RNA-Seq while the normal distribution is preferred in microarrays?
1
CY550 wrote:

Most DE tools applied to RNA-Seq (such as DESeq) assume that gene expression follows a negative binomial distribution (because of both technical and biological variation), while DE tools that originated with microarrays (such as limma) assume a normal distribution. Is this difference due to some technical difference between RNA-Seq and microarrays?

rna-seq microarray • 377 views
written 7 months ago by CY550
1

RNA-seq is count data (discrete); microarray is measured data (continuous). This is a pretty big difference to start with.

From what I understand, counting RNA-Seq reads is like sampling the reads aligned to a specific gene from a pool of reads. It is a Poisson process with small p (probability) and large n (total reads). On top of that there is biological variation between samples, so we get a Poisson with extra variance, i.e. approximately a negative binomial distribution. For microarray data, I imagine we intuitively have the same technical variation (Poisson) and biological variation. Wouldn't this also give an NB distribution rather than a normal distribution?
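To make the "Poisson plus biological variation gives NB" argument concrete, here is a minimal simulation sketch. It assumes the biological variation can be modelled as a gamma-distributed expression level per sample; the function names, the mean of 20 and the dispersion of 0.5 are made up for illustration and are not taken from any DE package.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for modest lam)."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_gene(mean, dispersion, n_samples, rng):
    """Simulate read counts for one hypothetical gene across samples.

    dispersion == 0: every sample has the same true rate (technical noise only).
    dispersion > 0: the true rate is gamma-distributed around `mean`
    (assumed model of biological variation), which yields NB counts.
    """
    counts = []
    for _ in range(n_samples):
        if dispersion > 0:
            # Gamma with shape 1/dispersion and scale mean*dispersion has mean `mean`.
            rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
        else:
            rate = mean
        counts.append(sample_poisson(rate, rng))
    return counts


rng = random.Random(0)
poisson_counts = simulate_gene(mean=20.0, dispersion=0.0, n_samples=5000, rng=rng)
nb_counts = simulate_gene(mean=20.0, dispersion=0.5, n_samples=5000, rng=rng)

for label, counts in [("technical only", poisson_counts),
                      ("technical + biological", nb_counts)]:
    print(label,
          "mean:", round(statistics.mean(counts), 2),
          "variance:", round(statistics.variance(counts), 2))
```

With technical noise only, the variance should sit close to the mean. With the gamma layer added, the theoretical variance is mean + dispersion * mean^2, which is the overdispersion that NB-based tools model.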

5
dsull1.7k wrote:

Microarrays aren't Poisson processes. You aren't modeling discrete events, so you can't frame it as "what is the probability of observing k reads for a given gene". That's because microarrays are based on continuous signal intensities.

You can think of sequencing reads as success/fail trials (Bernoulli -> binomial -> Poisson); you can't think of continuous signal intensities that way.
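The Bernoulli -> binomial -> Poisson chain can be checked numerically. The sketch below compares the binomial probability of seeing k reads from a gene with the Poisson approximation; the library size (one million reads) and the per-read probability are hypothetical values picked only to illustrate "large n, small p".

```python
import math


def binomial_pmf(k, n, p):
    """P(k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(k events) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n_reads = 1_000_000   # hypothetical total reads in the library
p_gene = 5e-6         # hypothetical chance that a given read comes from this gene
lam = n_reads * p_gene

# Largest pointwise gap between the two distributions over k = 0..20.
max_diff = max(
    abs(binomial_pmf(k, n_reads, p_gene) - poisson_pmf(k, lam))
    for k in range(21)
)
print("expected reads for the gene:", lam)
print("largest difference between binomial and Poisson pmf:", max_diff)
```

The two distributions are practically indistinguishable in this regime, which is why read counts are usually treated as Poisson at the technical level. No comparable trial-counting picture exists for a continuous fluorescence intensity.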

Just because something has variation (all real data does) doesn't mean it's Poisson or shot noise. Just look at any graph of the Poisson distribution: the random variable takes discrete values. Can you get continuous signal intensities to fit such a distribution? No.

Also, you can use limma for RNA-seq (see limma-voom, which transforms the counts to log-CPM and estimates precision weights, so you aren't actually fitting raw count data). It works well for RNA-seq. Negative binomial is only one way to model RNA-seq data for DE analysis; many packages (e.g. sleuth, limma) don't model it that way.
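As a rough illustration of what "not fitting raw counts" means, here is a sketch of a log-CPM transformation of the kind voom starts from, as I understand it: log2((count + 0.5) / (library size + 1) * 1e6). This is not limma's code, the function name `log2_cpm` and the toy matrix are invented for the example, and the mean-variance trend and precision weights that voom adds on top are not shown.

```python
import math


def log2_cpm(counts):
    """Convert a genes x samples count matrix to log2 counts per million.

    `counts` is a list of rows (one per gene), each a list of per-sample counts.
    The 0.5 and 1.0 offsets keep zero counts finite (assumed voom-style offsets).
    """
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [
        [math.log2((c + 0.5) / (lib + 1.0) * 1e6) for c, lib in zip(row, lib_sizes)]
        for row in counts
    ]


# Toy data: 3 hypothetical genes, 2 samples; sample 2 was sequenced twice as deep.
toy_counts = [
    [0, 0],
    [100, 200],
    [900, 1800],
]

transformed = log2_cpm(toy_counts)
for gene_row in transformed:
    print([round(v, 3) for v in gene_row])
```

After the transformation the values are continuous and roughly comparable across libraries of different depth, which is what lets a normal-based linear model be applied to them.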

2

I would add: with a large enough sample you can almost always "hide behind the central limit theorem" and thus rely on normality. Small sample sizes require more accurate assumptions, and I think DESeq was created for small experiments and limma for large ones.

Modelling microarrays with the normal distribution also relies, I'd say, on the sample size being large enough. The data are not normal either: I was playing around with some microarray data and it surely is not (I had a question here or on stats.stackexchange about this). For one thing, different microarray experiments have different "levels of noise" (technical variance), and mixing many random variables ~N(mu, sigma_i), where sigma_i differs per experiment, does not yield a normally distributed sample.
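The point about mixing N(mu, sigma_i) with different sigma_i can be seen in a small simulation. The sketch below compares the excess kurtosis (0 for a true normal) of a single normal sample with that of a pooled sample whose noise level varies; the three sigma values are hypothetical and stand in for "experiments with different technical variance".

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment / variance^2 - 3."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((x - m) ** 2 for x in values)
    m4 = statistics.fmean((x - m) ** 4 for x in values)
    return m4 / m2**2 - 3.0


rng = random.Random(1)
n = 20000

# One experiment: a single noise level.
single = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Pooled experiments: same mean, but each value comes from an experiment
# with its own (hypothetical) noise level.
sigmas = [0.5, 1.0, 3.0]
mixture = [rng.gauss(0.0, rng.choice(sigmas)) for _ in range(n)]

print("excess kurtosis, single sigma:", round(excess_kurtosis(single), 3))
print("excess kurtosis, mixed sigmas:", round(excess_kurtosis(mixture), 3))
```

The pooled sample has clearly heavier tails than a normal distribution even though every component is exactly normal, so a normality test on pooled data can fail for this reason alone.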

1

Just to add to the answer: sleuth assumes log abundances are normally distributed.