Question

Tutorial:RNASeq tutorial for gene differential expression analysis

15

Entering edit mode

5.0 years ago

Thind amarinder ▴ 340

RNA-seq-tutorial-for-gene-differential-expression-analysis

This tutorial is based on Bioconductor packages, RNAseq gene differential expression analysis, including filtering, preprocessing, visualization, clustering, and Enrichment. I hope it will be useful for new Bioconductor users.

Required data files You should have a raw count and annotation/metadata file for running this analysis (In this tutorial example files are provided). Raw count files are usually obtained from tools such as featureCount, RSem, etc from Bam files.

Bioconductor packages to be installed

DESeq2

edgeR

biomaRt (Useful for gene filtering and annotations)

PCAtools (PCA detailed analysis)

ReactomePA (enrichment analysis)

Note: PCA and Enrichment analysis is based on Deseq2. However, users may be interested in considering only those genes that are commonly differentially expressed between DEseq2 and EdgeR.

R script and further instructions of the tutorial are available here.

https://github.com/amarinderthind/RNA-seq-tutorial-for-gene-differential-expression-analysis

We explained various bulk RNA-Seq applications in our recent review published in Briefing in Bioinformatics. https://doi.org/10.1093/bib/bbab259

Transcriptomics bioconductor RNA-Seq R • 5.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 5.0 years ago by Thind amarinder ▴ 340

0

Entering edit mode

Thank you for putting this together. I have couple of things as feedback that I disagree with though:

Here you use naive CPM calculation instead of TMM (you need to run cpm()) on a edgeR DGEList object for which calcNormFactors() has already been run, and CPMs that are used for PCA should be on the log2 scale. One typically also subsets the genes for those with large variances as a proxy for being different between samples, say the top 500 most variable genes on the log scale. That would then be consistent with this line because the DESeq2 PCA function by default uses the top 500 most variable genes, and since you use vst the data are already on the log scale. This'd make things consistent.

Here I'd rather run the recommended FilterByExpr() filter from edgeR rather than custom filtering to remove low counts as you did on top of the script.

All these three commands (the outputs) are included when running estimateDisp afaik.

The current recommendations of the edgeR authots (based on various Bioconductor posts) is to use the QLF framework rather than the LRT, referring to here.

As said for this here, I'd rather apply FilterByExpr than custom approaches.

In general when writing code that is intended for a tutorial you might want to check out Rmarkdown rather than a plain R script, it is really handy and the produced html reports are awesome both for display and also for code documentation.

I gave my two cents on what is my best practice for standard RNA-seq here: Basic normalization, batch correction and visualization of RNA-seq data

ADD REPLY • link 4.8 years ago by ATpoint 88k

0

Entering edit mode

Thank you very much for the feedback. Yes, there were corrections on PCAtools related lines. After verification of estimateDisp from edgeR documents, I may include FilterByExpr function to edgeR related part and again I need to check what's the best way for deseq2 in that case.

ADD REPLY • link 4.8 years ago by Thind amarinder ▴ 340

0

Entering edit mode

The DESeq2 manual explicitely states that no prefiltering is necessary. The independent filtering of the results function will take care of that internally.

ADD REPLY • link 4.8 years ago by ATpoint 88k

0

Entering edit mode

I see QLF has advantages over LRT, with some exceptions. Mentioned by Aaron Lun here..... where the dispersions are very large and the counts are very small, whereby some of the approximations in the QL framework seem to fail.

ADD REPLY • link 4.8 years ago by Thind amarinder ▴ 340