Question: Building Salmon Index
SeroroO wrote (29 days ago):

Hi all!

I'm relatively new to RNA-seq, so I would really appreciate it if everything could be written as simply as possible. The aim of this post is to clarify a few things about building an index with Salmon, just to make sure that I've done everything correctly. The main question is how to build the index before quantifying with Salmon. Downstream, I will be quantifying paired-end (PE) total-RNA reads (.fastq) obtained from Illumina sequencing experiments.

The reference genome I'm using is that of Arabidopsis thaliana, obtained from Ensembl at the following link:

This is the code that I used on a Linux-based computer:

./salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i aThalianaindex -k 31

I had no issues running the code, but the more I dig online, the more I question whether my command is valid. Could someone clarify the following for me, or point me towards the right resources so that I can read up more? (P.S. I've tried reading the Salmon documentation, but to no avail.)

1) Is the cDNA file I've used for building the index the right transcriptome file to use? If not, which would be the right file? (Hopefully one available through Ensembl.)

2) I've seen someone merge the cDNA file (presumably the same one that I've used above) with an ncRNA file and use the merged file to build the index. Why was the ncRNA file merged in, and should I follow this method instead?

3) I've read on Salmon's own documentation that there are 2 methods of building a 'decoy-aware transcriptome'. Do I have to follow this method strictly? Would my results be significantly affected if I use my method for building an index mentioned above?

4) Referring specifically to the first of those two methods, the one using MashMap2: how do I use the script provided on Salmon's webpage? (I am actually totally lost on how I should build an index with this method, so any guidance from start to end would be helpful.)

Any help would be deeply appreciated! Yall can just answer part of the question that you're familiar with. Thanks for your help in advance!

index rna-seq salmon • 159 views
jared.andrews07 (Memphis, TN) wrote (29 days ago):
  1. Yes, that is the correct file for your current command.

  2. They were likely interested in non-coding RNAs (lncRNAs, etc) as well. If you don't particularly care about such genes, you do not have to do this. If you do, then yeah, you can concatenate them.

  3. You don't have to, and it can be somewhat annoying to put together. It improves quantification accuracy, but I'm not sure to what degree. I've had good success using both decoy-aware and "standard" indices. The method using MashMap2 is wildly resource intensive and takes quite a while. For human, I had to allocate like 128 GB to get it to complete - it is not a task you will be able to do on your local PC. If you don't have access to a compute cluster, just use the standard method. For human/mouse, there are prepared decoy-aware indices already available for download, but that isn't the case for your organism unfortunately.

  4. As for running that script, the usage command is below:

bash generateDecoyTranscriptome.sh [-j <N> =1 default] [-b <bedtools binary path> =bedtools default] [-m <mashmap binary path> =mashmap default] -a <gtf file> -g <genome fasta> -t <txome fasta> -o <output path>

You will need bedtools and mashmap installed and on your PATH before running it; both can be easily installed via conda. You'll also need the full genome sequence for your organism, which you can create from the Ensembl files by concatenating all the chromosome files together. The script outputs a gentrome.fa file and a decoys.txt file, which you then use to build the index that you quantify against.
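The two concatenation steps mentioned above (cDNA plus ncRNA for point 2, and the per-chromosome files for point 4) can be sketched as follows. This is a minimal sketch on made-up toy files: the file names, sequence IDs, and sequences are all placeholders, not the real Ensembl downloads.

```shell
set -eu

# Toy stand-ins for the real downloads; names and sequences are made up.
workdir=$(mktemp -d)
printf '>AT1G01010.1 cdna\nATGGAGGATC\n' | gzip -c > "$workdir/cdna.fa.gz"
printf '>AT1G01046.1 ncrna\nGGCTTAGCTA\n' | gzip -c > "$workdir/ncrna.fa.gz"
printf '>1 dna:chromosome\nACGTACGTAC\n' > "$workdir/chr1.fa"
printf '>2 dna:chromosome\nTTGACCATGA\n' > "$workdir/chr2.fa"

# Point 2: gzip files can be concatenated as they are; the result is still
# a valid gzip file containing both sets of transcripts.
cat "$workdir/cdna.fa.gz" "$workdir/ncrna.fa.gz" > "$workdir/txome.fa.gz"

# Point 4: join the per-chromosome FASTA files into one genome FASTA.
cat "$workdir/chr1.fa" "$workdir/chr2.fa" > "$workdir/genome.fa"

# Sanity check: count the sequence headers in each merged file.
n_tx=$(gunzip -c "$workdir/txome.fa.gz" | grep -c '^>')
n_chr=$(grep -c '^>' "$workdir/genome.fa")
echo "transcripts: $n_tx, chromosomes: $n_chr"
```

With the real files, the merged transcriptome would then be indexed with the same kind of command as in the question, e.g. ./salmon index -t txome.fa.gz -i aThalianaindex -k 31, where txome.fa.gz is the placeholder name used in the sketch.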


Hey! Thanks for the detailed reply and for clarifying all my concerns (even to the extent of writing the script usage command). Really appreciate your help! :)

Reply written 29 days ago by SeroroO.
ATpoint wrote (29 days ago):

Recent versions of salmon allow the decoy-aware index to be built by simply appending the entire reference genome to the transcriptome FASTA file, e.g. cat txtome.fa genome.fa > gentrome.fa. The only important thing is that the chromosomes come after the transcripts. You can then index this file, but the index will be larger than one built from the output of generateDecoyTranscriptome. With the latter strategy, MashMap and company scan the genome for regions that are similar to the actual transcriptome and use only those as decoys, so the resulting index is smaller than one built from the entire genome. It is indeed a bit cumbersome. I used the "whole genome" decoy approach since I am spoiled with an HPC that I can run my stuff on, where memory is not a limitation.

I am not sure how different the results will be if you have no decoys at all: probably some differences, but probably not a catastrophe either. If it is too cumbersome, then don't do it. To the best of my knowledge, salmon is the only pseudo/selective aligner that currently offers this. It is definitely a nice feature and may well improve the precision of quantification, but it is optional rather than a must.
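A minimal sketch of the "whole genome" decoy approach described above, again on made-up toy files (all names and sequences are placeholders). One step is not spelled out in the answer: as far as I know from the salmon documentation, the index command also needs a list of the decoy sequence names, passed with -d, so the sketch builds that list from the genome's FASTA headers.

```shell
set -eu

# Toy transcriptome and genome; names and sequences are made up.
workdir=$(mktemp -d)
printf '>AT1G01010.1 cdna chromosome:TAIR10:1\nATGGAGGATC\n' > "$workdir/txome.fa"
printf '>1 dna:chromosome chromosome:TAIR10:1\nACGTACGTAC\n' > "$workdir/genome.fa"
printf '>2 dna:chromosome chromosome:TAIR10:2\nTTGACCATGA\n' >> "$workdir/genome.fa"

# Decoy list: one genome sequence name per line, i.e. the first word of
# each header with the leading '>' removed.
grep '^>' "$workdir/genome.fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$workdir/decoys.txt"

# Transcripts first, chromosomes after them.
cat "$workdir/txome.fa" "$workdir/genome.fa" > "$workdir/gentrome.fa"

# Show the decoy names and the order of sequences in the combined file.
cat "$workdir/decoys.txt"
grep '^>' "$workdir/gentrome.fa" | cut -d ' ' -f 1
```

With the real files, the index would then be built along the lines of ./salmon index -t gentrome.fa -d decoys.txt -i aThaliana_decoy_index -k 31, where the index name is a placeholder.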


Wasn't aware of this, good to know!

Reply written 29 days ago by jared.andrews07.

Hey, thanks for taking the time to reply! Didn't know of this neat trick from the manual. I might just try this and compare the results to my current index. Have a great day ahead!

Reply written 28 days ago by SeroroO.