DESeq2: vst() and varianceStabilizingTransformation()
3.6 years ago

Hi everyone, I'm exploring the DESeq2 package, in particular the varianceStabilizingTransformation() function. I can't quite understand the difference between it and the vst() function: when should I use each, and why would I prefer one over the other? Thank you

RNA-Seq sva R normalization • 12k views
3.6 years ago

The difference is subtle, but it means that vst() can perform the transformation more quickly.

vst() is, in fact, a wrapper around varianceStabilizingTransformation(): it first identifies 1,000 genes that are 'representative' of the dataset's dispersion trend, and uses the information from these to perform the transformation.

The key parameter in question is:

vst(..., nsub = 1000)


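You can see the speed difference yourself. A minimal sketch (using DESeq2's makeExampleDESeqDataSet() to simulate counts; on a real dataset the gap is typically larger):

```r
library(DESeq2)

# simulated dataset for illustration only -- substitute your own DESeqDataSet
dds <- makeExampleDESeqDataSet(n = 20000, m = 12)

# vst() fits the dispersion trend on a deterministic subset of nsub genes
system.time(vsd_fast <- vst(dds, nsub = 1000))

# varianceStabilizingTransformation() fits it on all genes, so it is slower
system.time(vsd_full <- varianceStabilizingTransformation(dds))
```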
There is also a difference relating to the usage of blind. From the vst() documentation:


This is a wrapper for the varianceStabilizingTransformation (VST) that provides much faster estimation of the dispersion trend used to determine the formula for the VST. The speed-up is accomplished by subsetting to a smaller number of genes in order to estimate this dispersion trend. The subset of genes is chosen deterministically, to span the range of genes' mean normalized count. This wrapper for the VST is not blind to the experimental design: the sample covariate information is used to estimate the global trend of genes' dispersion values over the genes' mean normalized count. It can be made strictly blind to experimental design by first assigning a design of ~1 before running this function, or by avoiding subsetting and using varianceStabilizingTransformation.

However, if you set blind = TRUE for vst(), it sets the design to ~ 1 for you, as you can see in its source code:

function (object, blind = TRUE, nsub = 1000, fitType = "parametric")
{
    ...
    if (blind) {
        design(object) <- ~1
    }
    matrixIn <- FALSE
    ...
    vsd <- varianceStabilizingTransformation(object, blind = FALSE)
    ...
}


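Putting this together, here is a minimal sketch of the two ways to obtain a blind transformation (using DESeq2's makeExampleDESeqDataSet() to build a toy dataset for illustration; substitute your own DESeqDataSet):

```r
library(DESeq2)

# toy dataset for illustration only
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# vst() with blind = TRUE resets the design to ~1 for you,
# but still estimates the dispersion trend from a subset of genes
vsd_subset <- vst(dds, blind = TRUE)

# strictly blind with no subsetting: set the design yourself
# and use the full transformation
design(dds) <- ~ 1
vsd_strict <- varianceStabilizingTransformation(dds, blind = TRUE)
```

For quick QC plots on a large dataset, the subset-based vst() is usually good enough; the full transformation is there if you want the trend estimated from every gene.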
This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.

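For the downstream uses mentioned above, a short sketch (again using a simulated dataset; plotPCA() and assay() are the standard DESeq2/SummarizedExperiment accessors):

```r
library(DESeq2)

# simulated dataset for illustration only
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)
vsd <- vst(dds, blind = TRUE)

# assay() returns the matrix of approximately homoskedastic values
mat <- assay(vsd)

# typical downstream uses: PCA and hierarchical clustering of samples
plotPCA(vsd, intgroup = "condition")
hc <- hclust(dist(t(mat)))
```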


Thank you very much, it is clearer now. I'm still not very comfortable with bioinformatics, so some concepts are a bit difficult for me to understand, even after studying the vignettes and documentation.

