Questions regarding VST/rlog correction to plot sample distances between RNA-seq samples
6.0 years ago
salamandra ▴ 550

I would like to get sample distances between the samples of an RNA-seq experiment. I read that the VST and rlog functions of the DESeq2 R package are good for applying a correction so that the standard deviation of a gene's expression across all samples doesn't change with the mean (of that gene's expression across all samples). My questions are:

1 - Should these corrections be applied after normalising raw counts for sequencing depth (with the DESeq() function) or directly applied on the raw data?

2 - To do a heatmap with a dendrogram representing the distances between samples, is it better to plot in a heatmap the values corrected with VST/rlog or FPKM values?

3 - The 'VST' method seems to be better for big sets (n>30). I have 3 samples, so does that mean I need to choose 'rlog' instead?

4 - In both methods we can set the parameter 'blind'. In which situations should I set it to 'TRUE', and in which to 'FALSE'?

Regards.

RNA-Seq R DEseq • 5.1k views
6.0 years ago

1 - Should these corrections be applied after normalising raw counts for sequencing depth (with the DESeq() function) or directly applied on the raw data?

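These take the DESeqDataSet holding the raw counts as input. Normalisation for sequencing depth (size factors) is applied as part of the transformation, so the values you get back are already on the normalised scale, as per the DESeq2 vignette. You don't need to normalise the counts yourself beforehand.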
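For what it's worth, vst() and rlog() accept a DESeqDataSet that has not been normalised yet, and estimate size factors themselves when none are present. A minimal sketch, assuming simulated data from makeExampleDESeqDataSet() as a stand-in for a real experiment (the dataset size and the lowered nsub value are my choices for illustration only):

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 2000 genes x 6 samples in 2 conditions (illustrative only)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# Check whether size factors have been estimated on this object yet
is.null(sizeFactors(dds))

# vst() is called directly on the raw-count object;
# nsub is lowered from its default because the toy dataset is small
vsd <- vst(dds, blind = FALSE, nsub = 500)
head(assay(vsd), 3)
```

The transformed matrix has the same genes and samples as the input counts.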

2 - To do a heatmap with a dendrogram representing the distances between samples, is it better to plot in a heatmap the values corrected with VST/rlog or FPKM values?

Don't use FPKM values - the method of normalisation that produces FPKM expression levels should no longer be used for multi-sample studies. Instead, use either the VST- or rlog-transformed counts. Please see the answer that I gave earlier today: A: How to graphically tell if data has been normalized?

Edit April 21, 2020: if you must use FPKM in a heatmap or for any downstream application, I would transform them to the Z-scale via the zFPKM package.
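A rough sketch of a sample-distance heatmap with a dendrogram, built from VST-transformed values. It assumes simulated data from makeExampleDESeqDataSet() in place of your own DESeqDataSet, and uses base R's dist(), hclust() and heatmap() rather than any particular plotting package:

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real dataset (illustrative only)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)
vsd <- vst(dds, blind = TRUE, nsub = 500)

# dist() compares rows, so transpose to get sample-to-sample Euclidean distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)

# Cluster the samples on those distances and draw the heatmap with its dendrogram
hc <- hclust(sampleDists)
heatmap(sampleDistMatrix,
        Rowv = as.dendrogram(hc),
        symm = TRUE,
        scale = "none")
```

With your own data, replace dds with your DESeqDataSet and use rlog() in place of vst() if you prefer.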

3 - The 'VST' method seems to be better for big sets (n>30). I have 3 samples, so does that mean I need to choose 'rlog' instead?

You can justify the use of either. rlog is not recommended for large datasets because it can take a very long time to run. I tend to check both, where possible, and find that the results don't change much between the two methods (provided that there are no outliers in your dataset).
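One way to check both is to correlate the pairwise sample distances obtained from each transformation. This is only a sketch on simulated data from makeExampleDESeqDataSet(), and a correlation is my own choice of summary, so treat it as a starting point:

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real dataset (illustrative only)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

vsd <- vst(dds, blind = TRUE, nsub = 500)
rld <- rlog(dds, blind = TRUE)

# Pairwise sample distances under each transformation
dVst  <- as.vector(dist(t(assay(vsd))))
dRlog <- as.vector(dist(t(assay(rld))))

# A value close to 1 means both methods order the sample distances similarly
cor(dVst, dRlog)
```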

4 - In both methods we can set the parameter 'blind'. In which situations should I set it to 'TRUE', and in which to 'FALSE'?

Please read: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#blind-dispersion-estimation
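In short, as I understand that section: blind = TRUE re-estimates the dispersions with an intercept-only design, so the transformation ignores the sample groups (useful for unbiased quality checks), while blind = FALSE uses the design already set on the object. A small sketch to compare the two on simulated data, assuming makeExampleDESeqDataSet() with betaSD = 1 to mimic real differences between conditions:

```r
library(DESeq2)

set.seed(1)
# betaSD > 0 simulates genuine differences between the two conditions
dds <- makeExampleDESeqDataSet(n = 2000, m = 6, betaSD = 1)

vsdBlind  <- vst(dds, blind = TRUE,  nsub = 500)  # design ignored (~ 1)
vsdDesign <- vst(dds, blind = FALSE, nsub = 500)  # uses design(dds), here ~ condition

# Distribution of absolute differences between the two sets of transformed values
summary(as.vector(abs(assay(vsdBlind) - assay(vsdDesign))))
```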

Kevin


Thank you. I have now noticed that applying VST to the raw values:

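library(DESeq2)
# Build the dataset from HTSeq count files (directory defaults to the working directory)
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design = ~ condition)
# Transform the raw counts directly, without running DESeq() first
vsd <- vst(ddsHTSeq, blind = FALSE)
head(assay(vsd), 3)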

gives the same results as applying it after DESeq() has been run:

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design = ~ condition)
# Run the full DESeq2 pipeline (size factors, dispersions, model fit) first
ddsHTSeq <- DESeq(ddsHTSeq)
vsd <- vst(ddsHTSeq, blind = FALSE)
head(assay(vsd), 3)

Also, in section 4.2 of this tutorial, the transformation seems to be applied before DESeq() is run, so maybe it does not matter whether the data have been normalised or not?


Right, as:

Sequencing depth correction is done automatically for the vst and rlog
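If you want to confirm this on your own data rather than by eye from head(), you can compare the two transformed matrices directly. A sketch under the assumption that simulated data from makeExampleDESeqDataSet() stands in for the HTSeq counts:

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real dataset (illustrative only)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# VST straight from the raw-count object
vsdRaw <- vst(dds, blind = FALSE, nsub = 500)

# VST after the full DESeq() pipeline has been run
ddsFit <- DESeq(dds, quiet = TRUE)
vsdFit <- vst(ddsFit, blind = FALSE, nsub = 500)

# Largest absolute difference between the two transformed matrices
maxDiff <- max(abs(assay(vsdRaw) - assay(vsdFit)))
maxDiff
```

A value at or very near zero would indicate that running DESeq() beforehand makes no practical difference to the transformed values.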
