"In high-throughput assays, this limitation can be overcome by pooling information across genes, specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment."
This is a quote from the DESeq2 paper.
Update (2018-01-15): adding another quote, this time from the DESeq paper:
This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.
But I haven't really found any justification for this assumption.
Tools for detecting differential expression, e.g. DESeq2 and edgeR, rely on the assumption that genes with similar expression also have similar variance; that is how they can estimate a variance from a very small sample size (e.g. 3 per group, control vs disease). Otherwise, it would seem naive to estimate a variance from just three measurements.
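To make the pooling idea concrete, here is a minimal toy sketch on purely simulated data (not DESeq2's or edgeR's actual empirical-Bayes procedure, and the mean-variance trend is an assumption I chose): with only 3 replicates the naive per-gene variance is very noisy, but borrowing the variances of genes with a similar mean gives a much more stable estimate, provided variance really is a smooth function of mean expression.

```python
# Purely simulated toy example -- NOT DESeq2's or edgeR's actual procedure.
# With n = 3 replicates the naive per-gene variance is extremely noisy;
# borrowing information from genes with a similar mean stabilizes it,
# *if* variance really is a smooth function of mean expression.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_reps = 5000, 3

# Assume a smooth mean-variance trend (NB-like: var = mu + alpha * mu^2).
true_mean = rng.lognormal(mean=3.0, sigma=1.5, size=n_genes)
true_var = true_mean + 0.1 * true_mean**2
expr = rng.normal(loc=true_mean[:, None],
                  scale=np.sqrt(true_var)[:, None],
                  size=(n_genes, n_reps))

naive_var = expr.var(axis=1, ddof=1)      # estimated from only 3 values

# "Pooled" estimate: average naive variance of the ~100 genes whose mean
# expression is closest to that of the gene in question.
order = np.argsort(expr.mean(axis=1))
sorted_var = naive_var[order]
pooled_var = np.array([
    np.mean(sorted_var[max(0, i - 50): i + 50])
    for i in range(n_genes)
])

# Compare both estimates with the true variance (lower error is better).
err = lambda est: np.median(np.abs(np.log2(est / true_var[order])))
print("naive  median |log2 error|:", round(err(naive_var[order]), 2))
print("pooled median |log2 error|:", round(err(pooled_var), 2))
```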
However, I wonder why this is a valid assumption. Previously it might have been difficult to get an answer, but now we have TCGA, so I plotted the mean vs. the standard deviation (the square root of the variance) across all 173 leukemia (LAML) samples from TCGA, for all 20,392 gene expression levels downloaded from http://firebrowse.org/.
I also plotted the mean vs. the variance (the standard deviation squared), curious what it would look like.
The dashed line is the diagonal, plotted just to give a sense of the slope of the scatter.
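For reference, a sketch of how such plots can be made; the file name and layout (genes in rows, the 173 LAML samples in columns, values in TPM) are assumptions about the firebrowse download, not the exact file I used.

```python
import pandas as pd
import matplotlib.pyplot as plt

# File name and layout are placeholders for the firebrowse download.
tpm = pd.read_csv("LAML_gene_expression_TPM.txt", sep="\t", index_col=0)
tpm = tpm.loc[tpm.mean(axis=1) > 0]      # drop all-zero genes for the log axes

gene_mean = tpm.mean(axis=1)
gene_sd = tpm.std(axis=1)                # SD across the 173 samples
gene_var = gene_sd ** 2                  # variance = SD squared

fig, axes = plt.subplots(1, 2, figsize=(10, 4.5))
for ax, y, label in [(axes[0], gene_sd, "standard deviation (TPM)"),
                     (axes[1], gene_var, "variance (TPM^2)")]:
    ax.scatter(gene_mean, y, s=2, alpha=0.3)
    ax.set(xscale="log", yscale="log",
           xlabel="mean expression (TPM)", ylabel=label)
    lo, hi = gene_mean.min(), gene_mean.max()
    ax.plot([lo, hi], [lo, hi], "k--", linewidth=1)   # dashed diagonal (y = x)

fig.tight_layout()
plt.show()
```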
I personally find this assumption hard to accept. Any ideas?
Update (2018-01-15): as suggested by @i.sudbery, I added a similar plot using raw read counts instead of TPM.
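The same sketch with the raw-count file swapped in (the file name is again an assumption). On raw counts the dashed y = x line corresponds to the Poisson expectation (variance equal to the mean), so points lying above it indicate extra-Poisson dispersion.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Raw-count version; file name is an assumption about the download.
counts = pd.read_csv("LAML_gene_expression_raw_counts.txt", sep="\t", index_col=0)
counts = counts.loc[counts.mean(axis=1) > 0]

gene_mean = counts.mean(axis=1)
gene_var = counts.var(axis=1)

plt.figure(figsize=(5, 4.5))
plt.scatter(gene_mean, gene_var, s=2, alpha=0.3)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("mean raw count")
plt.ylabel("variance of raw count")
lo, hi = gene_mean.min(), gene_mean.max()
plt.plot([lo, hi], [lo, hi], "k--", linewidth=1)   # variance = mean (Poisson) line
plt.tight_layout()
plt.show()
```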