I've been exposed to what I believe is conflicting information about the relationship between gene read counts and variance or dispersion (perhaps these two terms need to be disentangled?).
For example, in Figure 3 of this paper, we can clearly see an increase in SD with the mean: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-304
...and similarly on p. 47 of this PDF, though with respect to variance instead of SD: http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf
While in this example, we see the opposite trend: https://www.researchgate.net/figure/Mean-variance-relationships-Gene-wise-means-and-variances-of-RNA-seq-data-are_fig1_260022492
...and similarly here, though with respect to dispersion instead of SD: "A: Small dispersion values in differential expression analysis"
The relationship between mean read counts and dispersion/variance seems to be an important consideration in RNA-seq analysis, and it is typically referenced as common knowledge in this field. But which of these trends captures that common knowledge?
My prior experience with data in general leads me to believe that higher read counts == more certainty == less variance, but, of course, due to biological variation, this might not be true (though perhaps that's exactly the distinction between variance and dispersion?).
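To check my own framing, I put together a small simulation (my own sketch, not from any of the linked papers), assuming a negative binomial model with a fixed dispersion alpha, as used in DESeq-style analyses. The value alpha = 0.1 is just an assumption for illustration. Under that model Var = mu + alpha * mu^2, so the raw variance and SD grow with the mean even when the dispersion itself is constant, while the coefficient of variation shrinks toward sqrt(alpha):

```python
# Sketch: negative binomial counts with a fixed (hypothetical) dispersion alpha.
# Shows that variance/SD increase with the mean while CV decreases,
# even though the dispersion parameter itself never changes.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                   # assumed dispersion, constant across genes
means = [5, 50, 500, 5000]

for mu in means:
    # numpy parameterizes the negative binomial by (n, p);
    # convert from (mu, alpha): n = 1/alpha, p = n / (n + mu)
    n = 1.0 / alpha
    p = n / (n + mu)
    counts = rng.negative_binomial(n, p, size=100_000)
    sd = counts.std()
    print(f"mean={mu:>5}  variance={counts.var():>10.1f}  SD={sd:>7.1f}  CV={sd / counts.mean():.3f}")
```

If I'm reading it correctly, this at least shows that a mean-vs-variance plot and a mean-vs-dispersion (or mean-vs-CV) plot can show opposite-looking trends from the very same data, which may be part of what is confusing me.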
If anyone can provide some insight, I would be very grateful.
Thanks for your response, Kevin.
I am still confused as to how the relationship could be so blatantly reversed in certain cases (e.g. in the first two links). If DESeq's publication captures the truest relationship between counts and dispersion, I would expect "nuance" between datasets to result in, perhaps, a less clear trend, but the complete reversal is a mystery to me.