Question: What does it mean to model expression data by a probability distribution?
2.6 years ago by ebrudermanver:

I have very frequently seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave it much thought. But I recently realized that I don't really understand what that means. Let's say we have a 100 x 20K matrix of RNA-Seq counts where rows represent samples (say, lung cancer patients) and columns represent genes. Do we then assume that the set of 100 values in each column (gene) follows a Poisson distribution? Or do we assume that the set of 20K values in each row follows a Poisson distribution? Or does each gene-sample pair follow its own Poisson distribution with its own mean? If the last is true, then we have no way to compute the mean and variance of those 2 million different distributions, because we have only a single value from each of them.

Also, I have seen many papers where microarray data is modeled by a p-variate Gaussian distribution, where p is the number of genes, although microarray data usually seems to be assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian lead to a more accurate model of the data?

As you can see, I am totally confused. Can someone explain this in the most intuitive and least technical way possible (I am not a statistician)?

2.6 years ago by Jean-Karim Heriche (EMBL Heidelberg, Germany):

Modeling data by a distribution simply means we assume that the numbers were drawn from that distribution. In the case of RNA-Seq, the read counts are assumed to follow a Poisson distribution. See this post for more on this. For microarray data, the cleaned-up, log-transformed expression levels are often assumed to be normally distributed. At the gene level, when the expression value is an average over many spots (for some arrays) or samples, that average tends towards a Gaussian distribution because of the central limit theorem. These are usually simplifying assumptions that make the data easier to deal with. Most of the time, real data have long tails or are over- or under-dispersed compared to the standard distributions.
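To make the over-dispersion point concrete, here is a minimal simulation sketch. It is only an illustration, not an analysis of real RNA-Seq data: the mean count, the dispersion value, and the choice of a negative binomial as the over-dispersed alternative are all assumptions made for this example. A Poisson distribution forces the variance to equal the mean, so comparing the two is a quick check of whether the Poisson assumption is plausible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values chosen only for illustration.
n_draws = 100_000      # simulated counts for one gene
mean_count = 50.0      # assumed true mean read count
size = 5.0             # assumed negative binomial dispersion parameter

# Poisson counts: variance should be close to the mean.
poisson_counts = rng.poisson(lam=mean_count, size=n_draws)

# Negative binomial counts with the same mean but extra variance:
# variance = mean + mean**2 / size
p = size / (size + mean_count)
nb_counts = rng.negative_binomial(n=size, p=p, size=n_draws)

# Variance-to-mean ratio: about 1 for Poisson, larger if over-dispersed.
poisson_ratio = poisson_counts.var() / poisson_counts.mean()
nb_ratio = nb_counts.var() / nb_counts.mean()

print(f"Poisson variance/mean:           {poisson_ratio:.2f}")
print(f"Negative binomial variance/mean: {nb_ratio:.2f}")
```

On real counts, a variance-to-mean ratio well above 1 across replicates would suggest that a plain Poisson model is too restrictive.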


So, relating this to my actual question, are you saying that each column (gene) in the example I gave follows a Gaussian distribution? Why not each row (sample)?

Reply written 2.6 years ago by ebrudermanver

A sample x gene matrix holds the measured expression level of each gene in each sample. Depending on how these results were arrived at, each value can be seen as being generated by a Gaussian distribution, or even as the mean of such a distribution. In such cases, each row (sample) can be modeled by a multivariate Gaussian composed of the distributions of all the genes. There is no reason to assume that the values in a row or in a column are all drawn from the same distribution. If you assume that each row (sample) can be modeled by just one Gaussian, then on average all genes in that sample would have the same expression level. If each column (gene) is modeled by one distribution, then that gene will have, on average, the same expression level in every sample.
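A small simulation may help picture the multivariate view. This is only a sketch under made-up assumptions: the five gene means, the covariance matrix, and the matrix size are hypothetical and not taken from any real dataset. Each row (sample) is one draw from a multivariate Gaussian whose mean vector has one entry per gene, so the genes keep different average levels while every sample comes from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 100 samples (rows) and 5 genes (columns).
n_samples, n_genes = 100, 5

# Assumed per-gene mean (log-scale) expression levels.
gene_means = np.array([2.0, 5.0, 8.0, 11.0, 14.0])

# Assumed covariance: variance 0.7 per gene, covariance 0.2 between genes.
cov = 0.5 * np.eye(n_genes) + 0.2

# Each row is one draw from the same p-variate Gaussian (p = n_genes).
expr = rng.multivariate_normal(mean=gene_means, cov=cov, size=n_samples)

# Averaging down each column estimates that gene's mean.
col_means = expr.mean(axis=0)

# Fitting a single univariate Gaussian to one row would instead summarize
# all genes in that sample by one common mean.
single_row_mean = expr[0].mean()

print("Estimated per-gene means:", np.round(col_means, 2))
print("Mean of first sample across genes:", round(float(single_row_mean), 2))
```

The column means land near the assumed gene means, whereas a single Gaussian per row would summarize genes with very different levels by one number.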

Reply written 2.6 years ago by Jean-Karim Heriche