In high-throughput assays, this limitation can be overcome by pooling information across genes, specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment [1].

This is a quote from the DESeq2 paper.

**Update (2018-01-15)**: adding another quote from the DESeq paper:

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

But I haven't really found any justification for this assumption.

Tools for detecting differential expression, e.g. DESeq2 and edgeR, rely on the assumption that genes with similar expression also have similar variance. That is how they can estimate variance from a very small sample size (e.g. 3) in each of two groups (control vs disease). Otherwise, it would seem naive to estimate variance from just three measurements.
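To make the "just three measurements" concern concrete, here is a minimal simulation sketch. It assumes normally distributed values with a known true variance of 1, which is a simplification and not how count data behave; the seed and trial count are arbitrary. For normal data with three replicates, theory says the central 95% of variance estimates spans roughly 0.03 to 3.7 times the true variance, and the simulation should land close to that.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only
true_var = 1.0                   # assumed true variance (illustrative)
n_reps = 3                       # replicates per "gene"
n_trials = 100_000               # number of simulated genes

# Each row is one simulated gene measured in n_reps replicates.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, n_reps))

# Unbiased sample variance per row (ddof=1 divides by n - 1).
est_var = samples.var(axis=1, ddof=1)

lo, hi = np.quantile(est_var, [0.025, 0.975])
print(f"mean of the estimates: {est_var.mean():.2f}")
print(f"central 95% of the estimates: {lo:.2f} to {hi:.2f}")
```

The estimator is unbiased on average, but any single three-replicate estimate can be off by more than an order of magnitude, which is the motivation for borrowing information across genes.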

However, I wonder why this is a valid assumption. Previously it might have been difficult to get an answer, but now we have TCGA, so I plotted the **mean vs standard deviation** (just the square root of the variance) over all 173 leukemia (LAML) samples from TCGA for all 20,392 gene expression levels downloaded from http://firebrowse.org/.

I also plotted **mean vs variance** (just the standard deviation squared), curious about what it would look like.

The dashed line is the diagonal, plotted here just to give a sense of the slope of the scatter.
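For anyone who wants to reproduce this kind of plot, the per-gene statistics can be computed roughly as follows. This is only a sketch: the helper name `per_gene_stats` and the genes-in-rows, samples-in-columns layout are my assumptions, and the toy matrix is synthetic Poisson data, not the TCGA download.

```python
import numpy as np

def per_gene_stats(expr):
    """Return per-gene mean, standard deviation and variance.

    expr: 2-D array with genes in rows and samples in columns
    (an assumed layout; transpose first if yours is the other way round).
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1)
    # ddof=1 gives the unbiased sample variance (divide by n - 1).
    var = expr.var(axis=1, ddof=1)
    sd = np.sqrt(var)
    return mean, sd, var

# Synthetic stand-in for a genes x samples matrix; NOT the TCGA data.
rng = np.random.default_rng(0)
toy = rng.poisson(lam=[[5.0], [50.0], [500.0]], size=(3, 173))
mean, sd, var = per_gene_stats(toy)
for m, s, v in zip(mean, sd, var):
    print(f"mean={m:.1f}  sd={s:.1f}  var={v:.1f}")
```

Plotting `mean` against `sd` (or `var`) on log-log axes is usually easier to read than linear axes, because expression values span several orders of magnitude.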

I personally find the assumption hard to agree with. Any ideas?

**Update (2018-01-15)**: as suggested by @i.sudbery, I added a similar plot using raw read counts instead of TPM.

IMO, the last plot kind of supports the relationship between the mean and the standard deviation. See how the lower bound of the standard deviation increases with the mean. Of course, for some genes, the standard deviation is much higher than the average trend. This is expected from biological data, and it is also taken care of in DESeq2, where the standard deviation of the outliers is not shrunk.

PS: I think that the dashed lines are misleading and should be removed from your plots.

Why would you only focus on the lower bound, then? If the assumption is true, the above data should be essentially a line because, at a given mean expression level, there is only one value for the variance, isn't there?

Yes, DESeq2 handles them separately.

But I find it hard to argue even that the majority of the genes follow this assumption qualitatively. I hope to find more support, but I feel this assumption was made mostly in order to leverage the existence of a large number of genes and proceed with the analysis. Otherwise, with three replicates and a single gene, there isn't really a robust way to estimate the variance.

As i.sudbery stated in their last comment below, this is more about following a trend than an exact relationship. Also, note that the relationship was never described as linear (for edgeR, for instance, the relationship is assumed to be σ² = μ + αμ²).
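To see what a variance of the form μ + αμ² looks like in practice, here is a small simulation sketch using negative binomial counts. The dispersion value `alpha = 0.1`, the helper name `nb_counts`, and the chosen means are illustrative assumptions, not values taken from edgeR or from the data in the question. NumPy parameterises the negative binomial by a number of successes `n` and a success probability `p`; setting `n = 1/alpha` and `p = n / (n + mu)` gives mean μ and variance μ + αμ².

```python
import numpy as np

def nb_counts(mu, alpha, size, rng):
    """Draw negative binomial counts with mean mu and variance mu + alpha * mu**2."""
    n = 1.0 / alpha      # NumPy's "number of successes" parameter
    p = n / (n + mu)     # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(2)   # arbitrary seed
alpha = 0.1                      # hypothetical dispersion, for illustration only

for mu in (10.0, 100.0, 1000.0):
    counts = nb_counts(mu, alpha, size=200_000, rng=rng)
    expected = mu + alpha * mu**2
    print(f"mu={mu:g}: sample variance {counts.var(ddof=1):.0f}, "
          f"mu + alpha*mu^2 = {expected:.0f}")
```

At low means the μ term dominates and the variance is close to Poisson; at high means the αμ² term dominates, so the standard deviation grows roughly in proportion to the mean. That is a curve on a mean vs variance plot, not a straight line.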

You are right. This assumption is not perfect, but it has been proven helpful in many cases.

Thanks.

I didn't expect the relationship to be linear, either. It could be a non-linear curve.

I tend to agree with you. Do you mind giving a specific example showing that "it has been proven helpful in many cases"? My concern is that if this assumption doesn't hold, then how do we trust the genes identified as differentially expressed, except for some very obvious ones?