
I have very often seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave it much thought. But I recently realized that I don't really understand what that means. Let's say we have a 100 x 20K matrix of RNA-Seq counts, where rows represent samples (say, lung cancer patients) and columns represent genes. Do we assume that the 100 values in each column (gene) follow a Poisson distribution? Or that the 20K values in each row (sample) follow a Poisson distribution? Or is each gene-sample pair drawn from its own Poisson distribution with its own mean? If the last is true, then I don't see how we could estimate the mean and variance of those 2 million different distributions, because we have only a single value from each of them.
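
To make the first reading concrete, here is a minimal Python sketch of a toy count matrix. It assumes, purely for illustration, that every sample shares one Poisson rate per gene (as if the samples were replicates sequenced to the same depth), which is not necessarily how any real RNA-Seq tool models the data. Under that assumption the values in one column are repeated draws from a single distribution, so a mean and variance can be estimated per gene. If every gene-sample pair had its own unrelated rate, the same code would have only one draw per distribution and nothing to average. The function names and the rates are hypothetical.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(gene_rates, n_samples, seed=0):
    """Rows are samples, columns are genes.

    ASSUMPTION (illustrative only): all samples share the same rate for a given gene.
    """
    rng = random.Random(seed)
    return [[sample_poisson(rng, lam) for lam in gene_rates] for _ in range(n_samples)]


def per_gene_mean_and_variance(counts):
    """Estimate mean and variance for each gene (column) across samples (rows)."""
    return [(mean(col), variance(col)) for col in zip(*counts)]


# Toy matrix: 100 samples x 3 genes with made-up rates.
rates = [2.0, 5.0, 10.0]
toy_counts = simulate_counts(rates, n_samples=100)
for lam, (m, v) in zip(rates, per_gene_mean_and_variance(toy_counts)):
    print(f"true rate {lam}: column mean {m:.2f}, column variance {v:.2f}")
```

For a Poisson distribution the mean and variance are equal, so under the shared-rate assumption each column's sample mean and sample variance should both land near that gene's rate.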

Also, I have seen many papers where microarray data are modeled by a p-variate Gaussian distribution, where p is the number of genes, although microarray data usually seem to be assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian model the data more accurately?
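
One way to see what separates the two assumptions is a small Python sketch with p = 2 genes. It compares the log-density of a bivariate Gaussian with the sum of two univariate Gaussian log-densities. The function names and the numbers are hypothetical. With correlation 0 the two agree exactly, so a multivariate Gaussian with no gene-gene correlation is the same model as one univariate Gaussian per gene. With a nonzero correlation they differ, and that extra term is what the multivariate form adds.

```python
import math


def univariate_logpdf(x, mu, sd):
    """Log-density of a univariate Gaussian with mean mu and standard deviation sd."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)


def bivariate_logpdf(x1, x2, mu1, mu2, sd1, sd2, rho):
    """Log-density of a bivariate Gaussian; rho is the correlation between the two genes."""
    d1, d2 = (x1 - mu1) / sd1, (x2 - mu2) / sd2
    z = d1**2 - 2 * rho * d1 * d2 + d2**2
    return -math.log(2 * math.pi * sd1 * sd2 * math.sqrt(1 - rho**2)) - z / (2 * (1 - rho**2))


# Made-up expression values for two genes in one sample.
x1, x2 = 0.3, -1.2
independent = univariate_logpdf(x1, 0.0, 1.0) + univariate_logpdf(x2, 0.5, 2.0)
print("sum of univariate log-densities:", independent)
print("bivariate, rho = 0.0:", bivariate_logpdf(x1, x2, 0.0, 0.5, 1.0, 2.0, 0.0))
print("bivariate, rho = 0.8:", bivariate_logpdf(x1, x2, 0.0, 0.5, 1.0, 2.0, 0.8))
```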

As you can see, I am totally confused. Can someone explain this in as intuitive and non-technical a way as possible (I am not a statistician)?

Written 19 months ago by ebrudermanver
