What does it mean to model expression data by a probability distribution?
3.9 years ago

I have very often seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave it much thought. But I recently realized that I don't really understand what that means. Let's say we have a 100 x 20K matrix of RNA-Seq counts where rows represent samples (say, lung cancer patients) and columns represent genes. Do we assume that the 100 values in each column (gene) follow a Poisson distribution? Or that the 20K values in each row (sample) follow a Poisson distribution? Or that each gene-sample pair follows its own Poisson distribution with its own mean? If the last is true, then we have no way to estimate the mean and variance of those 2 million different distributions, because we have only a single value from each of them.

Also, I have seen many papers where microarray data is modeled by a p-variate Gaussian distribution, where p is the number of genes, although microarray data usually seems to be assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian model the data more accurately?

As you can see, I am totally confused. Can someone explain this in the most intuitive and least technical way possible (I am not a statistician)?

RNA-Seq distribution microarray
3.9 years ago

Modeling data by a distribution simply means we assume that the numbers were generated by that distribution. In the case of RNA-Seq, the read counts are assumed to follow a Poisson distribution. See this post for more on this. For microarray data, the cleaned-up, log-transformed expression levels are often assumed to be normally distributed. At the gene level, when the expression value is an average over many spots (for some arrays) or over many samples, that mean tends towards a Gaussian distribution because of the central limit theorem. These are usually simplifying assumptions that make the data easier to work with. Most of the time, real data have long tails or are over- or under-dispersed compared to the standard distributions.
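To make the Poisson assumption and the over-dispersion remark concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an illustrative assumption rather than something taken from real data: the expected count of 20, the gamma shape of 2, and the helper names `poisson_draw` and `simulate_gene` are all made up for the example. It draws counts for one hypothetical gene across samples in two ways: once with the same expected count in every sample (pure Poisson), and once with an expected count that itself varies from sample to sample.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_gene(n_samples, gene_mean=20.0, shape=2.0, seed=0):
    """Return (poisson_counts, overdispersed_counts) for one hypothetical gene."""
    rng = random.Random(seed)
    # Pure Poisson: every sample shares the same expected count for this gene.
    pure = [poisson_draw(rng, gene_mean) for _ in range(n_samples)]
    # Over-dispersed: each sample's expected count is itself random
    # (gamma-distributed with mean gene_mean), and the count is Poisson around it.
    noisy = [
        poisson_draw(rng, rng.gammavariate(shape, gene_mean / shape))
        for _ in range(n_samples)
    ]
    return pure, noisy


if __name__ == "__main__":
    pure, noisy = simulate_gene(n_samples=100)
    for label, counts in (("Poisson", pure), ("over-dispersed", noisy)):
        mean = statistics.mean(counts)
        var = statistics.variance(counts)
        print(f"{label}: mean={mean:.1f}, variance={var:.1f}")
```

Under a pure Poisson model the variance across samples should come out close to the mean, while in the second case the variance should be clearly larger than the mean, which is what "over-dispersed" refers to.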


So, relating this to my actual question, are you saying that each column (gene) in the example I gave follows a Gaussian distribution? Why not each row (sample)?


A sample x gene matrix represents the measured expression levels of the genes in each sample. Depending on how these results were obtained, each value can be seen as being generated by a Gaussian distribution, or even as the mean of such a distribution. In that case, each row (sample) can be modeled by a multivariate Gaussian composed of the distributions of all the genes. There is no reason to assume that all the values in a row or in a column are drawn from the same distribution. If you assume that each row (sample) can be modeled by just one Gaussian, then on average all genes would have the same expression level in that sample. If each column (gene) is modeled by one distribution, then each gene will have, on average, the same expression level in every sample.
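As a rough illustration of the "one multivariate Gaussian per sample" view, here is a small sketch using only the Python standard library. The three genes, their means and standard deviations, the correlation of 0.8 between the first two genes, and the function names are all hypothetical values chosen for the example. Each simulated row is one sample, and each gene has its own mean, so the column averages differ from gene to gene while genes can still be correlated with one another.

```python
import math
import random
import statistics


def simulate_samples(n_samples, seed=0):
    """Draw samples (rows) from a 3-gene multivariate Gaussian."""
    rng = random.Random(seed)
    means = (4.0, 9.0, 6.5)  # hypothetical per-gene mean log-expression
    sds = (1.0, 0.5, 1.5)    # hypothetical per-gene standard deviations
    rho = 0.8                # hypothetical correlation between gene 0 and gene 1
    rows = []
    for _ in range(n_samples):
        z0 = rng.gauss(0.0, 1.0)
        # Mixing z0 with fresh noise gives a standard normal correlated with z0.
        z1 = rho * z0 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)  # gene 2 is independent of the other two
        rows.append(tuple(m + s * z for m, s, z in zip(means, sds, (z0, z1, z2))))
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    rows = simulate_samples(100)
    genes = list(zip(*rows))  # transpose: one tuple of values per gene
    for i, values in enumerate(genes):
        print(f"gene {i}: mean across samples = {statistics.mean(values):.2f}")
    print(f"correlation gene 0 vs gene 1 = {pearson(genes[0], genes[1]):.2f}")
```

The univariate view corresponds to looking at one column at a time. The multivariate view adds the relationships between columns (here, the correlation between gene 0 and gene 1), which is information a set of separate univariate Gaussians cannot express.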