Question: What does it mean to model expression data by a probability distribution?
Asked 19 months ago by ebrudermanver:

I have frequently seen in papers, lecture notes, etc. that RNA-Seq data can be modeled by a Poisson distribution and microarray data by a Gaussian distribution, and I never gave it much thought. But I recently realized that I don't really understand what that means. Let's say we have a 100 x 20K matrix of RNA-Seq counts where rows represent samples (say, lung cancer patients) and columns represent genes. Do we assume that the 100 values in each column (gene) follow a Poisson distribution? Or that the 20K values in each row (sample) follow a Poisson distribution? Or does each gene-sample pair follow its own Poisson distribution with its own mean? If the last is true, then I don't see how to estimate the mean and variance of those 2 million different distributions, because we have only a single value from each of them.

Also, I have seen many papers where microarray data are modeled by a p-variate Gaussian distribution, where p is the number of genes, although microarray data usually seem to be assumed to follow a univariate Gaussian. What is the reason behind the multivariate assumption? Does a multivariate Gaussian model the data more accurately?

As you can see, I am totally confused. Can someone explain this in as intuitive and non-technical a way as possible (I am not a statistician)?

Answered 19 months ago by Jean-Karim Heriche (EMBL Heidelberg, Germany):

Modeling data by a distribution simply means that we assume the numbers are drawn from that distribution. For RNA-Seq, the read counts are assumed to follow a Poisson distribution. See this post for more on this. For microarray data, the cleaned-up, log-transformed expression levels are often assumed to be normally distributed. At the gene level, when the expression value is an average over many spots (for some arrays) or samples, that average tends towards a Gaussian distribution because of the central limit theorem. These are usually simplifying assumptions that make the data easier to work with. Most of the time, real data have long tails or are over- or under-dispersed compared to the standard distributions.
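
To make the Poisson assumption and the idea of over-dispersion concrete, here is a small simulation sketch. Everything in it is made up for illustration: the matrix size, the per-gene means, and the negative binomial dispersion value are assumptions, not estimates from real data. It draws counts for each gene (column) from a distribution with that gene's own mean, then compares the per-gene variance to the per-gene mean. Under a Poisson model the two should be roughly equal; in over-dispersed data the variance is larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 100, 200  # small hypothetical samples x genes matrix

# Assumption: each gene has its own true mean count (made-up values).
gene_means = rng.uniform(5, 500, size=n_genes)

# Poisson model: every entry of column j is drawn from Poisson(gene_means[j]).
poisson_counts = rng.poisson(lam=gene_means, size=(n_samples, n_genes))

# Over-dispersed alternative: negative binomial with the same means.
# In NumPy's (n, p) parameterisation, mean = n(1-p)/p and
# variance = mean + mean**2 / n, so a smaller n means more over-dispersion.
dispersion_n = 10.0  # hypothetical value
p = dispersion_n / (dispersion_n + gene_means)
nb_counts = rng.negative_binomial(n=dispersion_n, p=p, size=(n_samples, n_genes))


def variance_to_mean_ratio(counts):
    """Per-gene sample variance divided by per-gene sample mean."""
    return counts.var(axis=0, ddof=1) / counts.mean(axis=0)


poisson_ratio = np.median(variance_to_mean_ratio(poisson_counts))
nb_ratio = np.median(variance_to_mean_ratio(nb_counts))
print(f"median variance/mean, Poisson: {poisson_ratio:.2f}")
print(f"median variance/mean, negative binomial: {nb_ratio:.2f}")
```

Note that the mean and variance here are estimated per gene across the samples, which only works if we are willing to assume that the samples share the same underlying distribution for that gene.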


So, relating this to my actual question, are you saying that each column (gene) in the example I gave follows a Gaussian distribution? Why not each row (sample)?

Reply written 19 months ago by ebrudermanver

A sample x gene matrix holds the measured expression level of each gene in each sample. Depending on how these results were obtained, each value can be seen as being generated by a Gaussian distribution, or even as the mean of such a distribution. In such cases, each row (sample) can be modeled by a multivariate Gaussian composed of the distributions of all the genes. There is no reason to assume that all the values in a row or in a column are drawn from the same distribution. If you assume that each row (sample) can be modeled by just one Gaussian, then on average all genes would have the same expression level within that sample. If each column (gene) is modeled by one distribution, then that gene will have, on average, the same expression level in every sample.
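
As a rough illustration of the multivariate view, the sketch below treats each row (sample) as one draw from a p-variate Gaussian with one mean per gene. The four genes, their mean log-expression values, and the covariance matrix are all hypothetical; genes 0 and 1 are made correlated on purpose to show that the multivariate model can also describe how genes vary together, which separate univariate Gaussians cannot.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes = 500, 4

# Hypothetical per-gene mean log-expression values.
mu = np.array([2.0, 2.5, 8.0, 11.0])

# Hypothetical covariance: genes 0 and 1 are positively correlated,
# genes 2 and 3 are independent of everything else.
sigma = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Each row (sample) is one draw from the 4-variate Gaussian N(mu, sigma).
expr = rng.multivariate_normal(mean=mu, cov=sigma, size=n_samples)

# Parameters are estimated across samples: one mean per gene (column)
# and one covariance entry per pair of genes.
mu_hat = expr.mean(axis=0)
sigma_hat = np.cov(expr, rowvar=False)
corr_01 = sigma_hat[0, 1] / np.sqrt(sigma_hat[0, 0] * sigma_hat[1, 1])

print("estimated per-gene means:", np.round(mu_hat, 2))
print("estimated correlation between genes 0 and 1:", round(float(corr_01), 2))
```

Forcing a single univariate Gaussian onto a whole row would correspond to replacing `mu` with one shared value for all genes, which is the "all genes have the same expression level on average" situation described above.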

Reply written 19 months ago by Jean-Karim Heriche