Question

How to build matrices for DESeq2 from salmon quant.sf files

3.2 years ago

slin023 • 0

Greetings, I have generated read abundance files with salmon, but I got very confused when following this guide. First, they build a list of quant.sf files to help build the matrix:

% find wt_* GSNO_* -name "quant.sf" | tee quant_files.list

But I only have one quant.sf file and no isoform file. How do I generate a list to build the matrices? My salmon script was attached as an image.

Then, following the guide, the matrices are built with this command:

% $TRINITY_HOME/util/abundance_estimates_to_matrix.pl --est_method salmon \
    --out_prefix Trinity --name_sample_by_basedir \
    --quant_files quant_files.list \
    --gene_trans_map trinity_out_dir/Trinity.fasta.gene_trans_map

and DESeq2 is used to perform differential gene expression analysis with this script:

% $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
    --matrix Trinity.isoform.counts.matrix \
    --samples_file data/samples.txt \
    --method DESeq2 \
    --output DESeq2_trans

However, DESeq2 is a Bioconductor package used in RStudio, right? I work on an HPC cluster, so how does "TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl" use a DESeq2 package from RStudio? Excuse me, I have never used RStudio on an HPC and have little RStudio experience.

If anyone can provide some suggestions, please let me know. Thank you for your time!

assembly • 2.1k views

updated 3.2 years ago by ATpoint 82k • written 3.2 years ago by slin023 • 0

score 0 · Answer 1 · 2021-02-28

Warning: tl;dr will follow, but I guess some background information might be appreciated.

=> Salmon quantifies against a transcriptome, so the quant.sf files contain transcript abundance estimates. One commonly performs the DESeq2 analysis at the gene level, though. The reasons are:

a) It is more straightforward to interpret. Multiple transcripts of one gene might show different patterns (some go up, some go down), but it is often not clear what each transcript does in terms of function or how to interpret complex patterns. One is typically interested in the overall change of the gene and therefore sums the transcript counts to a single value representative of the gene.

b) Transcripts of a gene usually share most of their exon sequences. That means there is notable mapping uncertainty that would need to be taken into account when doing transcript-level analysis. DESeq2 does not support this. One would make use of the bootstrap replicates that salmon can produce (essentially it checks how reliably a read maps to each transcript and then generates bootstrap replicates with alternative mapping locations). Specialized software such as swish from the fishpond (Bioconductor) package can then use this information to decide how reliable the mapping is, and therefore how reliable the observed changes between transcripts are. Transcript-level analysis with DESeq2 (which does not use that information) is suboptimal, and in most cases gene-level analysis is more informative.

c) As you have fewer genes than transcripts, runtime and memory requirements are lower for gene-level than for transcript-level analysis, especially when the sample size is large, but these days that is probably not a relevant argument anymore.

In any case, you probably want to aggregate from the transcript level to the gene level. A common package for this is tximport from Bioconductor, see https://www.bioconductor.org/packages/release/bioc/html/tximport.html
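To make "summing transcript counts per gene" concrete, here is a minimal sketch with awk. It is only an illustration of the idea and not a replacement for tximport, which additionally tracks transcript lengths so that DESeq2 can correct for them. The IDs and counts are made-up mock data; the column layout assumed is the usual quant.sf one (Name, Length, EffectiveLength, TPM, NumReads) and a two-column map with the gene ID first and the transcript ID second, as in a Trinity gene_trans_map.

```shell
# Mock inputs for illustration only (hypothetical IDs and counts).
workdir=$(mktemp -d)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$workdir/quant.sf"
printf 'g1_i1\t1000\t850.0\t10.0\t100\n' >> "$workdir/quant.sf"
printf 'g1_i2\t1200\t1050.0\t5.0\t50\n' >> "$workdir/quant.sf"
printf 'g2_i1\t800\t650.0\t2.0\t20\n' >> "$workdir/quant.sf"

# Gene-to-transcript map: gene ID first, transcript ID second.
printf 'g1\tg1_i1\ng1\tg1_i2\ng2\tg2_i1\n' > "$workdir/gene_trans_map"

# First file: remember which gene each transcript belongs to.
# Second file: skip the header, add NumReads (column 5) to the gene total.
awk -F'\t' '
  NR == FNR { gene[$2] = $1; next }
  FNR > 1 && ($1 in gene) { total[gene[$1]] += $5 }
  END { for (g in total) printf "%s\t%s\n", g, total[g] }
' "$workdir/gene_trans_map" "$workdir/quant.sf" | sort > "$workdir/gene_counts.tsv"

cat "$workdir/gene_counts.tsv"
```

In this mock example the two isoforms of g1 collapse into one gene-level row, which is the value a gene-level analysis would work with.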

You can then load the gene-level tximport output directly into DESeq2, as described in the DESeq2 manual. The manual contains all the necessary code, and the same goes for the tximport procedure. There is actually no need for external tutorials if you simply read the manuals at Bioconductor.
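As a rough sketch of that hand-off, the snippet below writes a small R script and runs it with Rscript, which works in a plain terminal or a cluster job. The function calls (tximport, DESeqDataSetFromTximport, DESeq, results) are the ones documented in the tximport and DESeq2 vignettes; everything else is an assumption for illustration: the file names samples.txt, tx2gene.tsv and salmon_out/, the column names "sample" and "condition", and a design with a single condition factor.

```shell
# Write the R script to disk; every file name below is a placeholder.
cat > run_deseq2.R <<'EOF'
library(tximport)
library(DESeq2)

# samples.txt (hypothetical): header line, then one row per sample with
# the columns "sample" and "condition".
samples <- read.table("samples.txt", header = TRUE)
samples$condition <- factor(samples$condition)

# One salmon output folder per sample, each holding its own quant.sf.
files <- file.path("salmon_out", as.character(samples$sample), "quant.sf")
names(files) <- samples$sample

# tx2gene.tsv (hypothetical): transcript ID in column 1, gene ID in column 2.
tx2gene <- read.table("tx2gene.tsv", header = FALSE,
                      col.names = c("transcript", "gene"))

# Aggregate transcript-level estimates to gene level, then hand over to DESeq2.
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
dds <- DESeq(dds)
write.csv(as.data.frame(results(dds)), "deseq2_results.csv")
EOF

# Run it only where R and the inputs are actually present.
if command -v Rscript >/dev/null 2>&1 && [ -f samples.txt ]; then
  Rscript run_deseq2.R
else
  echo "Rscript or samples.txt not found; wrote run_deseq2.R only"
fi
```

Note that tximport expects the transcript ID in the first column of tx2gene, so a Trinity gene_trans_map (gene first, transcript second) would need its columns swapped.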

Finally, RStudio is just an interface, an IDE. R packages can be used from the R command line or from RStudio basically without difference. I am personally not a fan of wrapper scripts for differential analysis, as they do things under the hood that you have no control over. I would either use RStudio via the HPC (ssh -X connection) or simply download RStudio and run both tximport and DESeq2 on a local machine. It is not computationally demanding; any standard laptop can do it unless you have many hundreds of samples. That makes it much easier to do exploratory analysis of your data, following the DESeq2 manual. You simply download the folders salmon produced to your local machine and analyse them there. You can gzip-compress the quant.sf files if you are limited in disk space, but generally they should not take much space.
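A differential analysis needs one quant.sf per sample, so salmon is normally run once per sample with a separate output folder each time. Assuming that layout, a quick way to list the files before copying the folders is sketched below; the folder names are hypothetical and the mock directories only exist to make the snippet self-contained.

```shell
# Mock layout for illustration: one salmon output folder per sample
# (folder names are hypothetical).
workdir=$(mktemp -d)
for s in wt_rep1 wt_rep2 GSNO_rep1 GSNO_rep2; do
  mkdir -p "$workdir/$s"
  : > "$workdir/$s/quant.sf"
done

# List every quant.sf, one path per line, in a stable order.
(cd "$workdir" && find . -name "quant.sf" | sort > quant_files.list)
cat "$workdir/quant_files.list"
```

If such a listing shows only a single quant.sf, there is only one quantified sample, and no differential comparison between conditions is possible from it.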

Does that make sense to you?