Building Salmon Index
21 months ago
SeroroO ▴ 30

Hi all!

I'm relatively new to RNA-seq, so I would really appreciate it if everything could be written as simply as possible. The aim of this post is to clarify a few things about building an index with Salmon, just to make sure that I've done everything correctly. The main question has to do with building an index before quantifying with Salmon. Downstream, I will be quantifying paired-end (PE) total-RNA reads (.fastq) obtained from Illumina sequencing experiments.

The reference genome I'm using is that of Arabidopsis thaliana, obtained from Ensembl at the following link:

This is the code that I used on a Linux-based computer:

./salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i aThalianaindex -k 31

I've had no issues running the code, but the more I dig online, the more I question whether my command is valid. Could someone clarify the following for me, or point me towards the right resources so that I can read up more? (P.S. I've tried reading the Salmon documentation, but to no avail.)

1) Is the cDNA file I've used for building the index the right transcriptome file to use? If not, which would be the right file? (Hopefully one available through Ensembl.)

2) I've seen someone merge the cDNA file (presumably the same one that I've used above) with an ncRNA file and use the merged file to build the index. Why was the ncRNA file merged in, and should I follow this method instead?

3) I've read on Salmon's own documentation that there are 2 methods of building a 'decoy-aware transcriptome'. Do I have to follow this method strictly? Would my results be significantly affected if I use my method for building an index mentioned above?

4) I've read in Salmon's own documentation that there are 2 methods of building a 'decoy-aware transcriptome'. Referring specifically to the first method, which uses MashMap2, how do I use the script provided on Salmon's webpage? (I'm actually totally lost on how I should build an index with this method, so any guidance from start to end would be helpful.)

Any help would be deeply appreciated! Y'all can just answer the part of the question that you're familiar with. Thanks in advance for your help!

index salmon rna-seq • 4.2k views
21 months ago
1. Yes, that is the correct file for your current command.

2. They were likely interested in non-coding RNAs (lncRNAs, etc.) as well. If you don't particularly care about such genes, you do not have to do this. If you do, then yes, you can concatenate the two files.

3. You don't have to, and it can be somewhat annoying to put together. It improves quantification accuracy, but I'm not sure to what degree. I've had good success using both decoy-aware and "standard" indices. The method using MashMap2 is wildly resource intensive and takes quite a while. For human, I had to allocate like 128 GB to get it to complete - it is not a task you will be able to do on your local PC. If you don't have access to a compute cluster, just use the standard method. For human/mouse, there are prepared decoy-aware indices already available for download, but that isn't the case for your organism unfortunately.

4. As for running that script, the usage command is below:

bash generateDecoyTranscriptome.sh [-j <N> =1 default] [-b <bedtools binary path> =bedtools default] [-m <mashmap binary path> =mashmap default] -a <gtf file> -g <genome fasta> -t <txome fasta> -o <output path>

You will need bedtools and mashmap installed and added to your PATH before running it; both can be easily installed via conda. You'll also need the full genome sequence for your organism, which you can create from the Ensembl files by concatenating all the chromosome files together. The script will output a gentrome.fa file and a decoys.txt file, which you then pass to salmon index (as the -t and -d arguments, respectively) to build the decoy-aware index used for quantification.
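
As a rough sketch of points 2 and 4, the steps could be wrapped in a few small shell functions. This is only an illustration under some assumptions: the function names and every file name in the example calls are placeholders I made up, the decoy script is assumed to sit in the current directory, and its inputs are assumed to be uncompressed files. Only the salmon index flags (-t, -d, -i, -k) and the script options from the usage line above are taken as given.

```shell
#!/usr/bin/env bash

# Point 2: concatenate gzipped FASTA files (e.g. cDNA + ncRNA) into one file.
# Concatenated gzip streams form a valid gzip file, so nothing is decompressed.
merge_fasta_gz() {
  local out="$1"
  shift
  cat "$@" > "$out"
}

# Point 4: decompress per-chromosome FASTA files into one genome FASTA.
build_genome_fasta() {
  local out="$1"
  shift
  gzip -dc "$@" > "$out"
}

# Point 4: run the decoy script, then index its two outputs.
# Assumes generateDecoyTranscriptome.sh is in the current directory and that
# salmon, bedtools and mashmap are on PATH; inputs are uncompressed files.
build_decoy_index() {
  local gtf="$1" genome="$2" txome="$3" outdir="$4" index="$5"
  bash generateDecoyTranscriptome.sh -j 4 \
    -a "$gtf" -g "$genome" -t "$txome" -o "$outdir" || return 1
  salmon index -t "$outdir/gentrome.fa" -d "$outdir/decoys.txt" \
    -i "$index" -k 31
}

# Example calls (all file names below are placeholders):
#   merge_fasta_gz merged.fa.gz cdna.all.fa.gz ncrna.fa.gz
#   salmon index -t merged.fa.gz -i aThalianaindex_ncrna -k 31
#   build_genome_fasta genome.fa chr1.fa.gz chr2.fa.gz
#   build_decoy_index annotation.gtf genome.fa txome.fa decoy_out aThalianaindex_decoy
```

Keeping the merge and the genome concatenation as separate helpers makes it easy to check the record counts (for example with grep -c '^>') before starting the much slower decoy step.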


Hey! Thanks for the detailed reply and for clarifying all my concerns (even to the extent of writing the script usage command). Really appreciate your help! :)


Hi Jared,

I am new to RNA-seq but am going to carry out an analysis on some human cancer samples. I noticed you mentioned here that there are prepared decoy-aware indices already available for human. Could you please direct me to these? I've looked on GENCODE for references, but I can't seem to find any that are described as decoy-aware.

Apologies if this is a silly question. I would be very grateful for any help. Thanks in advance, Alex


I would just make the index yourself, based on the reference you want to use. It is simple; follow https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
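
For reference, the whole-genome-as-decoy approach, as I understand the linked tutorial, comes down to three steps: list the genome sequence names as decoys, concatenate the transcriptome and genome (transcriptome first), and index the result. A minimal sketch follows; the function names and the file names in the example calls are hypothetical placeholders, and the thread count is arbitrary.

```shell
#!/usr/bin/env bash

# Write the genome's sequence names (text after '>' up to the first space),
# one per line; these are the decoy names salmon expects.
make_decoy_list() {
  local genome_gz="$1" decoys="$2"
  gzip -dc "$genome_gz" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > "$decoys"
}

# Concatenate transcriptome and genome. The transcriptome must come first
# so that the decoy sequences sit at the end of the combined file.
make_gentrome() {
  local out_gz="$1" txome_gz="$2" genome_gz="$3"
  cat "$txome_gz" "$genome_gz" > "$out_gz"
}

# Example calls (file names are placeholders; salmon must be on PATH):
#   make_decoy_list genome.fa.gz decoys.txt
#   make_gentrome gentrome.fa.gz transcripts.fa.gz genome.fa.gz
#   salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index
```

This route avoids MashMap entirely, at the cost of a larger index, because the full genome is included.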


alex-bain Pre-made, decoy-containing Salmon indexes for the human genome are available from the Refgenie site here. To download them, you will need to install the refgenie application by following the directions here.