Question

Precise order for RNA seq transformation and gene clustering.

3

Entering edit mode

3.6 years ago

TonyCN ▴ 60

Hello all, I'm very much working alone on a particular body of work that requires RNA seq analysis. I've been using the tutorial: "RNA-seq workflow: gene-level exploratory analysis and differential expression", with the the Salmon, tximeta, Bioconductor & RNASeq2 set of tools. The document is fantastic, full of information and I am very grateful that it exists. However, I'm almost at the end and despite a lot of side-reading I am still struggling with a good few steps. I don't have an experienced computational biologist to ask, so I listed my questions and was hoping I could get some help from the online community. The responses I've had so far have been extremely helpful.

I've given a brief overview of the steps in the order presented in the document and I've put my questions/concerns in bold.

Straight in at 3.1 Starting from SummarizedExperiment: In the document, a DESeqDataSet object is creating. Using my design this would be:

dds <- DESeqDataSet(gse, design = ~ Drug + batch)

A series of exploratory calculations and visualisations are performed which includes performing transformations with functions that come packaged with DESeq2:

vsd <- vst(dds, blind=TRUE)
rld <- rlog(dds, blind=TRUE)

A good few pages are dedicated to MDS, PCA and heatmaps using both transformations.

Followed by the actual differential expression pipeline:

dds <- DESeq(dds)

This is when I start getting a bit confused and lost. The tutorial then demonstrates two options to restrict gene expression comparison between the two groups. By false discovery rate threshold (padj) or by raising the log2 fold change threshold from 0 using the lfcThreshold. I picked changing the lfcThreshold:

resBromo <- results(dds, contrast = c("Drug","Bromocriptine","DMSO"), lfcThreshold = 1)

Question 1) Must I also still subset resBromo using a particular padj value? e.g.,

resBromo <- subset(resBromo, padj < 0.1)

Or am I right in assuming my FDR has been taken care of by adjusting the lfcThreshold? I have a feeling that I need to subset due to multiple testing conditions, and if that is the case it would be:

resBromo <- results(dds, contrast = c("Drug","Bromocriptine","DMSO"), lfcThreshold = 1)
resBromo <- subset(resBromo, padj < 0.1)

If I do not subset for padj<0.1, I'm left with 25,802 gene rows. If I do subset, it drops way down to 13 genes.

The next step in the tutorial would be clustering diff genes between groups control and case (in my case, "DMSO" and "Bromocriptine", but I also have 13 other drugs to be compared with the DMSO control). In the tutorial, they their code:

topVarGenes <- head(order(rowVars(assay(vsd)), decreasing=TRUE), 20)
mat <- assay(vsd)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col=anno)

As I'm not looking at an X vs, Y, but rather lots of different Ys (independent of one another) looking at the one X control group, I've subsetted out the columns of interest. This is the frist drug:

DMSOIndex <- c(1,2,3)
x <- vsd[,c(DMSOIndex,7,20,41)]
topVarGenes <- head(order(rowVars(assay(x)), decreasing=TRUE), 100)
mat <- assay(x)[topVarGenes,]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)[, c("Drug")])
pheatmap(mat, main="Bromocriptine", fontsize = 8)

Question 2) It feels strange that I'm using the transformed vsd object here, having first performed a good number of lfcThreshold and padj adjustments to the results() functions. Is that right? I can assume that the lfcThedhold and padj adjustments to the results() in the earlier step was an explanatory step only?

Question 3 my burning question) I've used VST to cluster genes. However, the VST transformation function was applied using the dds object from the DESeqDataSet() function. Later, we use that same dds object as an argument of the DESeq() function (that was the order in the document). Should I perform a new transformation of vsd and/or rlog using the dds object after DESeq function processing or the result from the DESeqDataSet as per the document?

Question 4) Although the tutorial uses a VST transformed example, I'm uncertain of whether I should be using rlog instead. When I retrieve the counts of the top gene from the dds object (after calling DESeq) and plot across all my drug samples (13 cases * 3, and then 3 DMSO controls) the normalizes counts are up to 120 (although some are very low). According to the tutorial, that's n>30, thus VST is helpful. However, when I restrict the top gene to a particular drug vs. DMSO, I'm looking at n<30, which suggests rlog. Should I be using different transformations for different case vs control scenarios, or stick with one and if so, which one? Also, I'm a bit concerned with whether plotting rlog data onto the pheatmap() function is appropriate. I read read the detailed reply to this thread Gene-based clustering of RNA-seq data and I'm unsure whether it applies to me.

I appreciate that I've got a lot here, that I'm not an expert in this field, but I'm doing by best and I appreciate any assistance.

RNA-Seq • 1.3k views

ADD COMMENT • link updated 3.4 years ago by ATpoint 82k • written 3.6 years ago by TonyCN ▴ 60

0

Entering edit mode

It's sad that this very well written post has elicited zero responses from the community. It would do well to have some responses here, and have this marked as some sort of a "mini-wiki" or something of that sort for users who'll eventually face the same struggles this poster did.

ADD REPLY • link 3.4 years ago by Dunois ★ 2.5k

0

Entering edit mode

Yeah well, this has been asked so many times before both here and on Bioc, plus is extensively described in the vignette, that you will almost certainly find previous posts if you really go for it using search engines of your choice. The DESeq2 vignette even has it under the FAQs. I understand it is overwhelmingly much to learn if you are new to the field but the answers to standard questions are out there, please try not to "blame" the community for not answering every single question, especially the repetitive (and lengthy) ones. Hope my answer below helps. Please comment if there are questions.

I also tried to summarize a basic RNA-seq workflow using edgeR here Basic normalization, batch correction and visualization of RNA-seq data maybe that is of interest as well, covering basic QC, batch correction and differential testing.

ADD REPLY • link 3.4 years ago by ATpoint 82k

0

Entering edit mode

ATpoint, first and foremost, thank you for posting an answer here, and also linking to that other post of yours!!

I concur that something of this sort has probably been asked before, and that answers/solutions/discussions to just about everything concerning RNAseq is just few searches away. However, to beginners and newcomers, that isn't readily apparent (how effective is a search when one doesn't know or understand the terminology enough to formulate the search criteria effectively?).

Further, as the software tools evolve, looking at solutions presented in old posts becomes less and less helpful as said solutions may no longer be applicable; this just makes a beginner's work all that more arduous.

But I'd also like to point out that in addition to the learning curve being steep, there just simply isn't much beginner-friendly material out there on RNAseq to begin with (that I was able to find, please do correct me if I am wrong). As an example: the DESeq2 and edgeR vignettes are hardly what I would call beginner-friendly. In edgeR's case the document is over a 100 pages long, and in DESeq2's case, non-linear and poorly delineated and bloated (from a newcomer's perspective).

And finally, my intention was not to blame the community. I apologize if it came off as an attack. I have no standing to blame/malign the community given I have contributed nothing to it yet, while having benefited immensely from it.

ADD REPLY • link 3.4 years ago by Dunois ★ 2.5k

score 1 · Answer 1 · 2020-12-06

First off, the reason why this question probably got no responses is the length. Try to be precise while not overloading the post with text, that scares people off. Second, many of the answers are actually answered in the DESeq2 vignette, but lets go through it:

1) The lfc argument is simple the Null hypothesis that you test against. By default the tool tests whether the difference in gene expression is different than zero (so the numeric value 0). In this case here with lfc=1 it tests whether the difference is at least 2-fold (because log2(2)=1). If you filter for a padj of 0.05 then this means that significant genes have at least a fold change of 2 (lfc=1). It is simply a way being more stringent with your Null hypothesis. This is explained: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tests-of-log2-fold-change-above-or-below-a-threshold

In short: Yes, you still have to filter for padj, but with identical padj threshold you most likely get fewer singificant genes because your testing was more stringent. If you don't filter then the 25thousand-something genes you describe are just all genes, regardless of being differential or not. The 16 would then those being significant at the padj threshold you set. lfc=1 is quite a lot, people usually set it to like lfc=log2(1.3) or so, with the effect that eventually the significant genes have a minumum FC or approximately 1.5 which became kind of a standard threshold in many papers, even though there is not a good rule or even a biological reason why 1.5 was a good choice, it just became a kind of convention. Usually, if you set a lfc threshold then the minimum FC that is still significant is a bit higher, so testing for lfc=log2(1.3) or so then typically gives minumum significant FCs of somewhat 1.4 or 1.5. edgeR also has that option, it is called glmTreat there.

2) vst and results are unrelated. The DESeq() and results() workflow tests for differential expression so whether there is statistical evidence that a gene is DE between conditions. vst and rlog and transformations that you can apply to you data in order to a) normalize the raw counts and b) apply a variance stabilization, as there is usually a dependency between the mean (so the expression level of a gene) and its variance between samples. More details: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization In short: vst and rlog are exclusively for purposes such as QC, clustering, heatmapping, visualization, machine learning purposes, not for differential expression. You may (even should as far as I am concerned) use the differential genes from the DESeq() pipeline for these steps such as clustering and extract the respective counts from the output of these transformation functions. Eventually the differential genes are the one that are of interest, therefore one typically uses them for clustering, using the vst or rlog-transformed counts, or alternatively the normalized counts on log2-scale, but for simplicity, just use vst which are normalized counts, on a log2-like scale and variance-stabilized so perfect for downstream analysis such as clustering.

3) It does not matter afaik, running vst on the dds object should get the same result no matter if running it before or after DESeq() was run on the dds. If you run it before and you set blind=FALSE then it will estimate the dispersion itself, and if after then it will take the dispersion from the dds object.

4) I do not fully understand the question. The thing is that rlog is quite slow if you have a lot of samples, therefore most people I know routinely use vst on the full dataset, and then filter for the samples they need.