Tutorial: RNASeq tutorial for gene differential expression analysis
Written 5 months ago by Thind amarinder:


This tutorial covers RNA-seq differential gene expression analysis with Bioconductor packages, including filtering, preprocessing, visualization, clustering, and enrichment analysis. If you have questions or corrections, please feel free to post them. I hope it will be useful for new Bioconductor users.

Required data files: You need a raw count file and an annotation/metadata file to run this analysis (example files are provided with this tutorial). Raw count files are usually generated from BAM files with tools such as featureCounts or RSEM.

Bioconductor packages to be installed

DESeq2

edgeR

biomaRt (Useful for gene filtering and annotations)

PCAtools (PCA detailed analysis)

ReactomePA (enrichment analysis)

Note: The PCA and enrichment analysis are based on the DESeq2 results. However, users may prefer to consider only those genes that are called differentially expressed by both DESeq2 and edgeR.

The R script and further instructions for the tutorial are available here:

https://github.com/amarinderthind/RNA-seq-tutorial-for-gene-differential-expression-analysis

modified 3 months ago • written 5 months ago by Thind amarinder

Thank you for putting this together. I have a couple of points of feedback where I disagree, though:

Here you use a naive CPM calculation instead of TMM-normalized CPMs (you need to run cpm() on an edgeR DGEList object for which calcNormFactors() has already been run), and CPMs that are used for PCA should be on the log2 scale. One typically also subsets the genes to those with large variances as a proxy for being different between samples, say the top 500 most variable genes on the log scale. That would then be consistent with this line, because the DESeq2 PCA function by default uses the top 500 most variable genes, and since you use vst the data are already on the log scale.
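To make the arithmetic behind this point concrete, here is a minimal sketch in plain Python. It is not edgeR's implementation: the function names, the toy counts, and the simple pseudocount are assumptions made for illustration, and the normalization factors are placeholders for what calcNormFactors() would estimate. It only shows how a normalization factor rescales the library size before the log2 CPM is taken, and how genes can then be ranked by variance on the log scale.

```python
import math
from statistics import variance


def log2_cpm(counts, norm_factors, pseudocount=1.0):
    """Log2 counts per million using effective library sizes.

    counts: list of gene rows, each a list of raw counts per sample.
    norm_factors: one scaling factor per sample (placeholder for TMM factors).
    pseudocount: simple offset to avoid log2(0); edgeR handles this differently.
    """
    n_samples = len(norm_factors)
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    # Effective library size = raw library size * normalization factor.
    eff_sizes = [lib_sizes[j] * norm_factors[j] for j in range(n_samples)]
    return [
        [math.log2((row[j] + pseudocount) / eff_sizes[j] * 1e6)
         for j in range(n_samples)]
        for row in counts
    ]


def top_variable_genes(log_expr, gene_ids, n_top=500):
    """Return the ids of the n_top genes with the largest variance."""
    ranked = sorted(zip(gene_ids, log_expr),
                    key=lambda pair: variance(pair[1]),
                    reverse=True)
    return [gene_id for gene_id, _ in ranked[:n_top]]


# Toy data: 4 hypothetical genes x 4 samples.
gene_ids = ["geneA", "geneB", "geneC", "geneD"]
counts = [
    [120, 130, 110, 125],
    [5, 80, 6, 90],
    [300, 310, 290, 305],
    [40, 10, 45, 12],
]
norm_factors = [1.0, 1.0, 1.0, 1.0]  # placeholder values, not real TMM factors

log_expr = log2_cpm(counts, norm_factors)
print(top_variable_genes(log_expr, gene_ids, n_top=2))
```

In a real analysis the log2 CPM matrix would come from cpm() on the normalized DGEList, and only the gene selection step would need to be done by hand before the PCA.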

Here I'd rather run the recommended filterByExpr() filter from edgeR than the custom low-count filtering you did at the top of the script.

As far as I know, the outputs of all three of these commands are already produced by a single call to estimateDisp().

The current recommendation of the edgeR authors (based on various Bioconductor posts) is to use the QLF framework rather than the LRT, see here.

As said for this here, I'd rather apply filterByExpr() than custom approaches.

In general, when writing code intended for a tutorial, you might want to check out R Markdown rather than a plain R script. It is really handy, and the HTML reports it produces are great both for display and for code documentation.
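For readers wondering what such a filter does, the following is a simplified sketch of the kind of rule filterByExpr() applies, written in plain Python for illustration only. The function name is hypothetical, and the parameter names and default values are my assumptions about the edgeR defaults, so please check them against the edgeR documentation. The idea is to keep a gene if its CPM reaches a library-size-aware cutoff in at least as many samples as the smallest group contains, and if its total count is not negligible.

```python
from statistics import median


def filter_by_expr_sketch(counts, groups, min_count=10, min_total_count=15,
                          large_n=10, min_prop=0.7):
    """Return one keep/drop flag per gene (simplified, not edgeR's code).

    counts: list of gene rows, each a list of raw counts per sample.
    groups: group label of each sample, in column order.
    Assumes every sample has a non-zero library size.
    """
    n_samples = len(groups)
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]

    # Smallest group size decides in how many samples a gene must be expressed.
    group_sizes = {}
    for label in groups:
        group_sizes[label] = group_sizes.get(label, 0) + 1
    min_sample_size = min(group_sizes.values())
    if min_sample_size > large_n:
        min_sample_size = large_n + (min_sample_size - large_n) * min_prop

    # The CPM cutoff corresponds to min_count reads in a median-sized library.
    cpm_cutoff = min_count / median(lib_sizes) * 1e6

    keep = []
    for row in counts:
        cpms = [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        n_expressed = sum(value >= cpm_cutoff for value in cpms)
        keep.append(n_expressed >= min_sample_size
                    and sum(row) >= min_total_count)
    return keep


# Toy data: 3 hypothetical genes x 4 samples, two groups of two samples.
counts = [
    [50, 60, 0, 0],        # expressed in group A only
    [1, 0, 2, 1],          # low counts everywhere
    [949, 940, 998, 999],  # highly expressed everywhere
]
groups = ["A", "A", "B", "B"]
print(filter_by_expr_sketch(counts, groups))
```

The point of tying the cutoff to the group sizes is that a gene expressed in only one condition is still retained, which a fixed "count above X in all samples" rule would discard.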

I gave my two cents on what is my best practice for standard RNA-seq here: Basic normalization, batch correction and visualization of RNA-seq data

written 4 months ago by ATpoint

Thank you very much for the feedback. Yes, corrections were needed on the PCAtools-related lines. After checking estimateDisp in the edgeR documentation, I may add the filterByExpr function to the edgeR part, and I still need to check what the best approach is for DESeq2 in that case.

written 4 months ago by Thind amarinder

The DESeq2 manual explicitly states that no prefiltering is necessary. The independent filtering in the results() function will take care of that internally.

written 4 months ago by ATpoint

I see that QLF has advantages over LRT, with some exceptions. As mentioned by Aaron Lun here, the exceptions are cases where the dispersions are very large and the counts are very small, in which some of the approximations in the QL framework seem to fail.

written 3 months ago by Thind amarinder