10 weeks ago by
United States
First, have a look at what the relationship between mean and variance looks like for RNA-seq data: genes with very low read counts tend to have greater variability in their counts than genes with very high gene counts. The reason for that is that the measurement of the gene expression is inherently noisy and we never capture all available transcripts. Let's say there's a gene with exactly 5 transcripts in a given cell. If we're lucky, we might be able to catch all of them in one sample, while in another replicate, where the gene has the same number of transcripts, we may only manage to capture 1 or even 0 transcripts (I'm drastically simplifying here; there are numerous steps along the process where transcripts/read might get "lost"). So, in brief, the mean-variance relationship exists because the sample preparation and library preparation steps seem to have more trouble with reliably quantifying lowly expressed genes.
Here are two great examples from Wikipedia's entry on heteroskedasticity that have nothing to do with sequencing, but may give you a general feeling for what types of situations lend themselves to heteroskedasticity:
Heteroscedasticity often occurs when there is a large difference among the sizes of the observations.
A classic example of heteroscedasticity is that of income versus expenditure on meals. As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.
Imagine you are watching a rocket take off nearby and measuring the distance it has traveled once each second. In the first couple of seconds your measurements may be accurate to the nearest centimeter, say. However, 5 minutes later as the rocket recedes into space, the accuracy of your measurements may only be good to 100 m, because of the increased distance, atmospheric distortion and a variety of other factors. The data you collect would exhibit heteroscedasticity.