Question

What is the importance of the rlog function (DESeq2) for downstream analysis?

3.8 years ago

Aspire ▴ 300

The DESeq2 vignette states:

The point of these two transformations, the VST and the rlog, is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.

and the documentation of rlog explains:

The transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis

I understand that "checking for outliers" means checking for outliers via a PCA plot (or something similar).

Why is minimizing differences (between samples) for rows with low counts important for the PCA plot?

Why does the variance have to be independent of the mean (homoscedasticity) for that?

deseq2 • 1.6k views

updated 3.8 years ago by i.sudbery 19k • written 3.8 years ago by Aspire ▴ 300

score 6 · Accepted Answer · 2020-06-22

In variance-based analyses such as PCA, clustering and LDA, the results are driven by the features with the highest variance. In count-based data such as RNA-seq, the variance depends on the mean: on the linear scale, a higher mean means a higher variance. If you were to run something like PCA on the linear scale, you would simply find that the result was dominated by the random noise in the high-mean features.

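However, because of the discrete nature of the data, the situation reverses on the log scale, where the variance is higher in the low-count features: 2 reads is twice as many as 1, and 1 read is infinitely more than 0, but 101 reads is only 1% more than 100. Thus, without some kind of regularization, your PCA will be dominated by very small changes in low-count features. The VST and the rlog supply that regularization by shrinking the differences between samples for low-count rows, so that neither end of the expression range dominates.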
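
To make the two effects concrete, here is a minimal simulation sketch, not DESeq2 code. It assumes pure Poisson noise (real RNA-seq counts are usually overdispersed), uses illustrative means of 2 and 100, and uses a plain `log2(count + 1)` transform with a pseudocount of 1 as a stand-in for "the log scale". The helper names `poisson` and `variances` are made up for this example.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def variances(lam, n=2000, seed=0):
    """Return (variance of raw counts, variance of log2(count + 1))."""
    rng = random.Random(seed)
    counts = [poisson(lam, rng) for _ in range(n)]
    linear_var = statistics.pvariance(counts)
    # Pseudocount of 1 is an assumption here, only to avoid log2(0).
    log_var = statistics.pvariance([math.log2(c + 1) for c in counts])
    return linear_var, log_var


# A low-count "gene" (mean 2) and a high-count "gene" (mean 100),
# with no real difference between samples: everything is noise.
low_linear, low_log = variances(2)
high_linear, high_log = variances(100)

print(f"linear scale: low-mean var = {low_linear:.2f}, high-mean var = {high_linear:.2f}")
print(f"log2 scale:   low-mean var = {low_log:.3f}, high-mean var = {high_log:.3f}")
```

On the linear scale the high-mean gene should show the larger variance, and after the naive log transform the ordering should flip, with the low-mean gene now the noisiest. That flip is the problem a variance-stabilizing step is meant to remove before PCA or clustering.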