Question: What does it mean to model expression data by a probability distribution?
2.6 years ago by
ebrudermanver60 wrote:

I have very frequently seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave it much thought. But I recently realized that I don't really understand what that means. Let's say we have a 100 x 20K matrix of RNA-Seq counts where rows represent samples (say, lung cancer patients) and columns represent genes. Do we then assume that the set of 100 values in each column (gene) follows a Poisson distribution? Or do we assume that the set of 20K values in each row (sample) follows a Poisson distribution? Or does each gene-sample pair follow its own Poisson distribution with its own mean? If the last is true, then we have no way to compute the mean and variance of those 2 million different distributions, because we have only a single value from each of them.

Also, I have seen many papers where microarray data are modeled by a p-variate Gaussian distribution, where p is the number of genes, although microarray data usually seem to be assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian lead to more accurate modeling of the data?

As you can see, I am totally confused. Can someone explain this in as intuitive and non-technical a way as possible (I am not a statistician)?

modified 2.6 years ago by Jean-Karim Heriche22k • written 2.6 years ago by ebrudermanver60
2.6 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche22k wrote:

Modeling data by a distribution simply means we assume that the numbers are draws from that distribution. In the case of RNA-Seq, the read counts are assumed to follow a Poisson distribution. See this post for more on this. For microarray data, the cleaned-up, log-transformed expression levels are often assumed to be normally distributed. At the gene level, when the expression value is an average over many spots (for some arrays) or samples, that average tends towards a Gaussian distribution because of the central limit theorem. These are usually simplifying assumptions that make the data easier to deal with. Most of the time, real data have long tails or are over- or under-dispersed compared to standard distributions.
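
To make "the numbers come from this distribution" concrete, here is a minimal simulation sketch in plain Python. It rests on simplifying assumptions that are not in the post above: each gene has a single true mean shared by all samples (as if the samples were replicates of one condition), differences in sequencing depth are ignored, and the means in `gene_means` are made-up values. Under those assumptions every entry of the matrix is a single draw from its gene's Poisson distribution, and the 100 draws in a column can be pooled to estimate that gene's mean and variance.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)  # fixed seed so the run is reproducible

n_samples = 100
gene_means = [2.0, 5.0, 10.0, 20.0, 40.0]  # hypothetical true means, one per gene

# Rows are samples, columns are genes, as in the question.
# Each entry is one independent draw from its gene's Poisson distribution.
counts = [[sample_poisson(mu, rng) for mu in gene_means] for _ in range(n_samples)]

# Pool the draws in each column to estimate that gene's mean and variance.
estimates = []
for g, mu in enumerate(gene_means):
    column = [row[g] for row in counts]
    est_mean = statistics.mean(column)
    est_var = statistics.variance(column)
    estimates.append((mu, est_mean, est_var))
    print(f"gene {g}: true mean {mu:5.1f}  estimated mean {est_mean:6.2f}  "
          f"estimated variance {est_var:6.2f}")
```

For a Poisson distribution the variance equals the mean, so in this simulated matrix the two estimates for each gene should be roughly equal. On real counts the per-gene variance is often larger than the mean, which is the over-dispersion mentioned above.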

So, relating this to my actual question, are you saying that each column (gene) in the example I gave follows a Gaussian distribution? Why not each row (sample)?