Question

Questions regarding VST/rlog correction to plot sample distances between RNA-seq samples

5.7 years ago

salamandra ▴ 550

I would like to get sample distances between the different samples of an RNA-seq experiment. I read that the VST and rlog functions of the DESeq R package are good for applying a correction so that the standard deviation of a gene's expression across all samples doesn't change with its mean expression across all samples. My questions are:

1 - Should these corrections be applied after normalising raw counts for sequencing depth (with the DESeq() function) or directly applied on the raw data?

2 - To do a heatmap with a dendrogram representing the distances between samples, is it better to plot in a heatmap the values corrected with VST/rlog or FPKM values?

3 - The 'VST' method seems to be better for big sets (n>30). I have 3 samples, so does that mean I need to choose 'rlog' instead?

4 - In both methods we can set the parameter 'blind'. In which situations should I set it to 'TRUE' or 'FALSE'?

Regards.

RNA-Seq R DEseq • 4.9k views

updated 5.7 years ago by Kevin Blighe 87k • written 5.7 years ago by salamandra ▴ 550

score 8 · Accepted Answer · 2018-07-21

1 - Should these corrections be applied after normalising raw counts for sequencing depth (with the DESeq() function) or directly applied on the raw data?

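These transformations operate on the normalised expression levels ('counts'), as per the DESeq2 vignette. In practice, vst() and rlog() take the DESeqDataSet holding the raw counts and apply the size-factor normalisation themselves, so you do not need to normalise by hand first.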
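As a minimal sketch of how this looks in code, the example below runs both transformations on a DESeqDataSet. The data are simulated with DESeq2's `makeExampleDESeqDataSet()` purely for illustration; the sizes (5000 genes, 6 samples) and the object names are assumptions, not part of the original question.

```r
library(DESeq2)

# Simulated raw counts, for illustration only: 5000 genes x 6 samples
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 6)

# Size factors capture differences in sequencing depth.
# vst() and rlog() estimate them on their own if they are missing.
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Both functions take the DESeqDataSet and return transformed values
# on a log2 scale that already account for the size factors
vsd <- vst(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)

head(assay(vsd), 3)
head(assay(rld), 3)
```

With real data, `dds` would instead be built from your own count matrix and sample table, for example with `DESeqDataSetFromMatrix()`.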

2 - To do a heatmap with a dendrogram representing the distances between samples, is it better to plot in a heatmap the values corrected with VST/rlog or FPKM values?

Don't use FPKM values - the method of normalisation that produces FPKM expression levels should no longer be used for multi-sample studies. Instead, use either the VST- or rlog-transformed counts. Please see the answer that I gave earlier today: A: How to graphically tell if data has been normalized?
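A sample-distance heatmap from the transformed counts could be sketched roughly as follows. This uses simulated data from `makeExampleDESeqDataSet()` and base R's `dist()`, `hclust()` and `heatmap()`; the object names and dataset size are illustrative assumptions, and other plotting packages would work equally well.

```r
library(DESeq2)

# Simulated data, for illustration only
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 6)
vsd <- vst(dds, blind = TRUE)

# dist() computes distances between rows, so transpose the matrix
# to put samples in rows (Euclidean distance by default)
sample_dists <- dist(t(assay(vsd)))
dist_mat <- as.matrix(sample_dists)

# Cluster the samples and reuse the same tree for rows and columns
hc <- hclust(sample_dists)
heatmap(dist_mat,
        Rowv = as.dendrogram(hc),
        symm = TRUE,
        main = "Sample distances (VST)")
```

Swapping `vst()` for `rlog()` in the sketch gives the equivalent plot for rlog-transformed counts.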

Edit April 21, 2020: if you must use FPKM in a heatmap or for any downstream application, I would transform them to the z scale via the zFPKM package.

3 - The 'VST' method seems to be better for big sets (n>30). I have 3 samples, so does that mean I need to choose 'rlog' instead?

You can justify the use of either. rlog is not recommended for large datasets because it can take a very long time to run. I tend to check both, where possible, and find that the results don't change much between the two methods (provided that there are no outliers in your dataset).
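One way to check both, sketched here on simulated data, is to compare the pairwise sample distances obtained from each transformation. The `betaSD = 1` setting, which adds simulated condition effects, and all object names are assumptions made for the example.

```r
library(DESeq2)

# Simulated data, for illustration only; betaSD = 1 adds condition effects
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 6, betaSD = 1)

vsd <- vst(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)

# Pairwise Euclidean distances between samples under each transformation
d_vst  <- dist(t(assay(vsd)))
d_rlog <- dist(t(assay(rld)))

# Correlation between the two sets of pairwise distances
dist_cor <- cor(as.vector(d_vst), as.vector(d_rlog))
dist_cor

# Leaf order of the clustering tree under each transformation
hclust(d_vst)$order
hclust(d_rlog)$order
```

A high correlation and similar trees would suggest that the choice of transformation matters little for that dataset.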

4 - In both methods we can set the parameter 'blind'. In which situations should I set it to 'TRUE' or 'FALSE'?

Please read: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation
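As that vignette section describes it, `blind = TRUE` ignores the experimental design when estimating dispersions, which suits unbiased sample QA, whereas `blind = FALSE` uses the design and is meant for downstream analysis. A rough sketch of the two calls, on simulated data with assumed object names:

```r
library(DESeq2)

# Simulated data with a two-level 'condition' factor, for illustration only
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 6, betaSD = 1)
design(dds)  # ~ condition

# blind = TRUE: dispersions are re-estimated with an intercept-only design,
# so the transformation does not use the sample groups
vsd_qc <- vst(dds, blind = TRUE)

# blind = FALSE: dispersions are estimated using the design above
vsd_downstream <- vst(dds, blind = FALSE)

# Largest difference between the two sets of transformed values
max(abs(assay(vsd_qc) - assay(vsd_downstream)))
```

The same `blind` argument is accepted by `rlog()`.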

Kevin