Mean-Variance Relationship in Single-Cell RNA-Seq Data
5 months ago
kw486 ▴ 10

Hi. Can anyone explain concisely why there is a relationship between mean and variance in single cell data?

I keep finding papers/vignettes which say this dependency exists and how to correct for it, but nowhere actually seems to suggest why it's there in the first place.

Thanks.

RNA-Seq variance deseq2 single cell • 601 views
5 months ago

First, have a look at what the relationship between mean and variance looks like for RNA-seq data: genes with very low read counts tend to have greater variability in their counts than genes with very high gene counts.

The reason is that measuring gene expression is inherently noisy, and we never capture all available transcripts. Let's say there's a gene with exactly 5 transcripts in a given cell. If we're lucky, we might catch all of them in one sample, while in another replicate, where the gene has the same number of transcripts, we may only manage to capture 1 or even 0 transcripts. (I'm drastically simplifying here; there are numerous steps along the process where transcripts/reads might get "lost".)

So, in brief, the mean-variance relationship exists because the sample preparation and library preparation steps have more trouble reliably quantifying lowly expressed genes.
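The 5-transcript thought experiment can be tried out numerically. The following is only a minimal sketch: it assumes each transcript is captured independently with the same probability (binomial sampling), and the capture probability of 0.2, the transcript numbers, and the function names are made up for illustration. Under that assumption the absolute variance grows with the mean, while the noise relative to the mean (the coefficient of variation) is much larger for the lowly expressed gene.

```python
import random
import statistics

def capture_counts(true_transcripts, capture_prob, n_cells, rng):
    """Simulate how many of a gene's transcripts are captured in each cell.

    Assumption: every transcript is captured independently with probability
    capture_prob, i.e. plain binomial sampling.
    """
    return [
        sum(rng.random() < capture_prob for _ in range(true_transcripts))
        for _ in range(n_cells)
    ]

def summarize(counts):
    """Return (mean, variance, coefficient of variation) of the counts."""
    mean = statistics.fmean(counts)
    var = statistics.variance(counts)
    cv = var ** 0.5 / mean  # noise relative to the mean
    return mean, var, cv

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical genes: 5 vs. 500 transcripts per cell, 20% capture efficiency.
low = summarize(capture_counts(5, 0.2, 2000, rng))
high = summarize(capture_counts(500, 0.2, 2000, rng))

print("low  (5 transcripts):   mean=%.2f var=%.2f cv=%.2f" % low)
print("high (500 transcripts): mean=%.2f var=%.2f cv=%.2f" % high)
```

Whether "variability" means the raw variance or the variance relative to the mean changes which gene looks noisier, so it is worth checking which one a given source is plotting.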

Here are two great examples from Wikipedia's entry on heteroskedasticity that have nothing to do with sequencing, but may give you a general feeling for what types of situations lend themselves to heteroskedasticity:

Heteroscedasticity often occurs when there is a large difference among the sizes of the observations.

A classic example of heteroscedasticity is that of income versus expenditure on meals. As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

Imagine you are watching a rocket take off nearby and measuring the distance it has traveled once each second. In the first couple of seconds your measurements may be accurate to the nearest centimeter, say. However, 5 minutes later as the rocket recedes into space, the accuracy of your measurements may only be good to 100 m, because of the increased distance, atmospheric distortion and a variety of other factors. The data you collect would exhibit heteroscedasticity.

Thanks for the response.

When you say "genes with very low read counts tend to have greater variability in their counts than genes with very high gene counts" - my understanding was the opposite, that as the mean increases, the variance also increases. To me this is also what is suggested by the example of expenditure on meals, that as mean wealth increases, the variance of food spend is higher. Perhaps I am misunderstanding the difference between variability and variance? Or are you using them interchangeably?

My question arose from looking at an hbctraining module that touched on normalising in order to control for this heteroskedasticity:

[Figure: mean vs. variance plot from the hbctraining module]

You are right, this is all a bit muddled up. For example, see the explanation for the exact same figure you're highlighting:

  1. The mean is not equal to the variance (the scatter of data points does not fall on the diagonal).
  2. For the genes with high mean expression, the variance across replicates tends to be greater than the mean (scatter is above the red line).
  3. For the genes with low mean expression we see quite a bit of scatter. We usually refer to this as “heteroscedasticity”. That is, for a given expression level in the low range we observe a lot of variability in the variance values.

Source: https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/01c_RNAseq_count_distribution.html
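To make points 1 and 2 of that list concrete, here is a small simulation sketch. It is not taken from the hbctraining lesson's code. It assumes that pure sampling noise behaves like a Poisson distribution (variance equal to the mean, i.e. points on the diagonal) and that extra replicate-to-replicate variation can be mimicked with a gamma-Poisson mixture (negative binomial), where variance = mean + dispersion × mean². The dispersion of 0.2, the replicate number, and the helper names are hypothetical choices for illustration.

```python
import random
import statistics

def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def overdispersed(mean, dispersion, rng):
    """Gamma-Poisson (negative binomial) draw.

    The Poisson rate itself varies between replicates, which gives
    variance = mean + dispersion * mean**2.
    """
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)  # shape, scale
    return poisson(lam, rng)

rng = random.Random(1)   # fixed seed for reproducibility
DISPERSION = 0.2         # hypothetical value, chosen for illustration only
N_REPLICATES = 2000      # far more than a real experiment, for stable estimates

results = {}
for mean in (1, 10, 100):
    pois = [poisson(mean, rng) for _ in range(N_REPLICATES)]
    nb = [overdispersed(mean, DISPERSION, rng) for _ in range(N_REPLICATES)]
    results[mean] = (statistics.variance(pois), statistics.variance(nb))
    print(f"mean={mean:>3}  Poisson var={results[mean][0]:8.1f}  "
          f"overdispersed var={results[mean][1]:8.1f}")
```

In this sketch the Poisson variances stay close to the mean, whereas the overdispersed variances pull away from it as the mean grows, which is the "scatter above the red line" pattern described for highly expressed genes.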

I'll try to write a summary of the different definitions and characteristics that different people/sources tend to refer to. To focus on the "WHY", though, I would like to emphasize that the root of the problems with RNA-seq count data remains the fact that the quantification of transcripts is inherently noisy.

Alright, so let me know if my write-up helps elucidate the issue.

This is incredibly helpful, Friederike. I have spent days googling this, and your write-up helped me tremendously. I couldn't figure out why log-transforming minimized the variance of highly expressed genes so severely. Now I do. It is not ideal, because it dismisses the biological importance of abundant transcripts. I also see that not transforming at all will miss important changes across lowly expressed genes, since they will show small variances. Thank you very much.
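The effect of the log transform described here can be reproduced with a toy example. This is only a sketch under simplifying assumptions: counts are drawn as pure Poisson sampling noise, the two genes (mean 2 and mean 500) are hypothetical, and log2(count + 1) stands in for whatever transformation a given pipeline actually applies.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def log_counts(counts):
    """log2 transform with a pseudocount of 1 so that zeros stay defined."""
    return [math.log2(c + 1) for c in counts]

rng = random.Random(2)  # fixed seed for reproducibility
N_CELLS = 1000

# Hypothetical lowly and highly expressed genes.
low = [poisson(2, rng) for _ in range(N_CELLS)]
high = [poisson(500, rng) for _ in range(N_CELLS)]

raw_var = (statistics.variance(low), statistics.variance(high))
log_var = (statistics.variance(log_counts(low)),
           statistics.variance(log_counts(high)))

print("raw counts:   var(low)=%.3f  var(high)=%.3f" % raw_var)
print("log2(x + 1):  var(low)=%.3f  var(high)=%.3f" % log_var)
```

On the raw scale the highly expressed gene has by far the larger variance; after the log transform the ordering flips and the lowly expressed gene dominates. Neither scale is variance-stable across the whole expression range in this sketch.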
