Why is the NB distribution preferred for RNA-Seq while the normal distribution is preferred for microarrays?
1
1
2.9 years ago
CY ▴ 710

Most DE tools used for RNA-Seq (such as DESeq) assume that gene expression follows a negative binomial distribution (because of both technical and biological variation), while DE tools that originated with microarrays (such as limma) assume a normal distribution. Is this difference due to some technical difference between RNA-Seq and microarrays?

RNA-Seq Microarray • 2.4k views
1

RNA-seq is count data (discrete), microarray is measured data (continuous). This is a pretty big difference to start with.

0

From what I understand, counting reads in RNA-Seq is like sampling the reads aligned to a specific gene from a pool of reads. It represents a Poisson process where we have a small p (probability) and a large n (total reads). On top of that we have biological variation between samples. Therefore, we get a Poisson with larger variance, which is approximately a negative binomial distribution. For microarray data, I imagine we intuitively have the same technical variation (Poisson) and biological variation. Wouldn't this form an NB distribution as well, instead of a normal distribution?
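The "Poisson plus biological variation" reasoning can be checked with a small simulation. The sketch below is illustrative only: the function names, the gamma model for sample-to-sample variation, and the values mu = 20 and dispersion = 0.25 are assumptions chosen for the example, not taken from any DE package. Under these assumptions the pure Poisson counts should have variance close to mu, and the gamma-Poisson mixture (one way to arrive at a negative binomial) should have variance close to mu + dispersion * mu^2.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (fine for lam up to a few hundred)."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_counts(mu, dispersion, n, seed=0):
    """Return (technical_only, technical_plus_biological) count samples.

    technical_only: every sample shares the same expected count mu (pure Poisson).
    technical_plus_biological: each sample's expected count is first drawn from
    a gamma distribution with mean mu and variance dispersion * mu**2, and
    Poisson sampling is applied on top (a gamma-Poisson mixture).
    """
    rng = random.Random(seed)
    shape = 1.0 / dispersion   # gamma shape
    scale = mu * dispersion    # gamma scale, so that shape * scale == mu
    technical = [poisson_sample(rng, mu) for _ in range(n)]
    mixed = [poisson_sample(rng, rng.gammavariate(shape, scale)) for _ in range(n)]
    return technical, mixed


technical, mixed = simulate_counts(mu=20.0, dispersion=0.25, n=20000)
print("Poisson only:  mean", statistics.fmean(technical),
      "variance", statistics.variance(technical))
print("gamma-Poisson: mean", statistics.fmean(mixed),
      "variance", statistics.variance(mixed))
```

Both samples have roughly the same mean, but the mixture is strongly overdispersed relative to the Poisson, which is the behaviour the negative binomial is meant to capture.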

6
2.9 years ago
dsull ★ 4.0k

Microarrays aren't Poisson processes. You aren't modeling discrete events. We can't think of it like "what is the probability of having k reads for a given gene". That's because microarrays are based on continuous signal intensities.

You can think of sequencing reads as success/fail trials (Bernoulli -> binomial -> Poisson); you can't think of continuous signal intensities that way.

Just because something has variation (actually, all real data has variation) doesn't mean it's Poisson or shot noise. Just look at any graph of the Poisson distribution: the random variable takes discrete values. Can you get continuous signal intensities to fit such a distribution? No.

Also, you can use limma for RNA-seq (see: limma-voom, which applies a special transformation so you aren't actually fitting raw count data). It works well for RNA-seq. Negative binomial is only one way to model RNA-seq data for DE analysis; many packages (e.g. sleuth, limma) don't model it that way.
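To make the idea of "not fitting raw counts" concrete, here is a minimal sketch of a voom-style log-CPM transformation. It only illustrates the transformation step: the function name `log2_cpm` is hypothetical, the offsets of 0.5 and 1 follow the published voom formula as I understand it, and the mean-variance trend and precision weights that voom also estimates for limma are not reproduced here.

```python
import math


def log2_cpm(counts, prior=0.5):
    """Convert one sample's raw gene counts to log2 counts per million.

    The small offsets keep zero counts finite: each count is shifted by
    `prior` and the library size by 1 before taking the logarithm.
    """
    library_size = sum(counts)
    return [math.log2((c + prior) / (library_size + 1.0) * 1e6) for c in counts]


# Hypothetical three-gene sample, for illustration only.
print(log2_cpm([0, 10, 990]))
```

The output is a continuous value per gene, which is why a normal-based linear model becomes a plausible approximation after this step.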

2

I would add: with a large enough sample you can almost always "hide behind the central limit theorem" and thus rely on normality. Small sample sizes require more accurate assumptions, and I think DESeq was created for small experiments and limma for large ones.

Modelling microarrays with the normal distribution, I'd say, also relies on the sample size being large enough. Microarray data is not normal either: I was playing around with some microarray data and it surely is not (I had a question here or on stats.stackexchange on this issue). First of all, different microarray experiments have different "levels of noise" (technical variance), and mixing many random variables ~N(mu, sigma_i), where sigma_i is specific to each experiment, does not yield a normally distributed sample.
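The point about mixing N(mu, sigma_i) variables can be illustrated with a quick simulation. This is only a sketch: the four sigma values and the helper names are made up for the example. A pooled sample from several normal distributions with a common mean but different standard deviations has heavier tails than a single normal, which shows up as positive excess kurtosis (a normal distribution has excess kurtosis 0).

```python
import random
import statistics


def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth central moment / variance**2, minus 3."""
    m = statistics.fmean(xs)
    centered = [x - m for x in xs]
    m2 = statistics.fmean([c * c for c in centered])
    m4 = statistics.fmean([c ** 4 for c in centered])
    return m4 / (m2 * m2) - 3.0


def simulate_pooled(sigmas, n_per_experiment, seed=0):
    """Pool draws from N(0, sigma_i), one batch of draws per 'experiment'."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, s) for s in sigmas for _ in range(n_per_experiment)]


single = simulate_pooled([1.0], 20000)                 # one noise level
pooled = simulate_pooled([0.5, 1.0, 2.0, 4.0], 5000)   # four noise levels mixed
print("single normal, excess kurtosis:", excess_kurtosis(single))
print("pooled normals, excess kurtosis:", excess_kurtosis(pooled))
```

The single-sigma sample should give a value near zero, while the pooled sample should be clearly positive.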

0

Thanks German. A follow-up question: both the negative binomial and log-normal distributions are frequently used in DE analysis. If we ignore 'technical noise' for now, we can simplify these two distributions to binomial (or Poisson) and log-normal. My question is: how can both of these distributions be valid for describing RNA-Seq data? Putting data type aside (discrete vs. continuous), one takes the logarithm and the other doesn't.

1

Log-normal is less adequate. If, e.g., our RNA-seq is perfect, everything should be distributed according to Poisson, and taking the logarithm of a Poisson variable does not make the distribution normal. The Anscombe transformation (square root) should be used instead, and log is overkill. Thus, the log transformation is not a universal one for various degrees of overdispersion.

The negative binomial distribution cannot be simplified to a binomial. To a Poisson, in the limit of zero overdispersion: yes.
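A rough way to see the difference between the two transformations on pure Poisson data is to compare the variance after transforming. The sketch below assumes the usual form of the Anscombe transform, 2 * sqrt(x + 3/8), and uses log(x + 1) as the log transform (the +1 pseudocount is my assumption, needed to handle zeros); the helper names and the means 5, 20 and 80 are illustrative. For Poisson counts, the Anscombe-transformed variance should stay near 1 across means, while the log-transformed variance should shrink as the mean grows.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (fine for lam up to a few hundred)."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def transformed_variances(mean, n=10000, seed=0):
    """Return (variance after Anscombe, variance after log(x + 1)) for Poisson(mean) counts."""
    rng = random.Random(seed)
    counts = [poisson_sample(rng, mean) for _ in range(n)]
    anscombe = [2.0 * math.sqrt(c + 0.375) for c in counts]
    logged = [math.log(c + 1.0) for c in counts]
    return statistics.variance(anscombe), statistics.variance(logged)


for mean in (5.0, 20.0, 80.0):
    var_anscombe, var_log = transformed_variances(mean)
    print(f"mean={mean:5.1f}  Anscombe variance={var_anscombe:.3f}  log variance={var_log:.3f}")
```

In this idealized Poisson setting the log over-corrects: it compresses high counts so much that their variance falls well below that of low counts, whereas the square-root-type transform keeps the variance roughly constant.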

1

Some more discussion on log (and Anscombe) normalization here: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1.full

0

The whole point of fitting a model is to approximate the data. If you say "leave the data type aside", the whole inquiry becomes meaningless.

1

Just to add to the answer. Sleuth assumes log abundances are normally distributed.