Building Salmon Index
7 months ago
SeroroO ▴ 20

Hi all!

I'm relatively new to RNA-seq, so I would really appreciate it if everything could be written as simply as possible. The aim of this post is to clarify a few things about building an index with Salmon, just to ensure that I've done everything correctly. The main gist has to do with building an index before quantifying with Salmon. Downstream, I will be quantifying PE total-RNA reads (.fastq) obtained from Illumina sequencing experiments.

The reference genome I'm using is that of Arabidopsis thaliana, obtained from Ensembl @ the following link:

This is the code that I've used on a Linux-based computer:

./salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i aThalianaindex -k 31

I had no issues running the code, but the more I dig online, the more I question the validity of my code. Could someone clarify the following for me, or point me towards the right resources so that I can read up more? (P.S. I've tried reading the salmon documentation, but to no avail.)

1) Is the cDNA file I've used for building the index the right transcriptome file to use? If not, which would be the right file? (Hopefully available through Ensembl.)

2) I've seen someone merge the cDNA file (presumably the same one that I've used above) with an ncRNA file and use the merged file to build the index. Why was the ncRNA file merged in, and should I follow this method instead?

3) I've read on Salmon's own documentation that there are 2 methods of building a 'decoy-aware transcriptome'. Do I have to follow this method strictly? Would my results be significantly affected if I use my method for building an index mentioned above?

4) I've read on Salmon's own documentation that there are 2 methods of building a 'decoy-aware transcriptome'. Referring specifically to the first method using MashMap2, how do I use the script provided on salmon's webpage? (I am actually totally lost on how to build an index with this method, so any guidance from start to end would be helpful.)

Any help would be deeply appreciated! Y'all can just answer the part of the question that you're familiar with. Thanks for your help in advance!

7 months ago
1. Yes, that is the correct file for your current command.

2. They were likely interested in non-coding RNAs (lncRNAs, etc) as well. If you don't particularly care about such genes, you do not have to do this. If you do, then yeah, you can concatenate them.

3. You don't have to, and it can be somewhat annoying to put together. It improves quantification accuracy, but I'm not sure to what degree. I've had good success using both decoy-aware and "standard" indices. The method using MashMap2 is wildly resource intensive and takes quite a while. For human, I had to allocate like 128 GB to get it to complete - it is not a task you will be able to do on your local PC. If you don't have access to a compute cluster, just use the standard method. For human/mouse, there are prepared decoy-aware indices already available for download, but that isn't the case for your organism unfortunately.

4. As for running that script, the usage command is below:

bash generateDecoyTranscriptome.sh [-j <N> =1 default] [-b <bedtools binary path> =bedtools default] [-m <mashmap binary path> =mashmap default] -a <gtf file> -g <genome fasta> -t <txome fasta> -o <output path>

You will need bedtools and mashmap installed and added to your PATH before running it, both of which can be easily installed via conda. You'll also need the full genome sequence for your organism, which you can create from the Ensembl files by concatenating all the chromosome files together. The script will output gentrome.fa and decoys.txt files, which you then use to build the salmon index for quantification.
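
Points 2 and 4 above can be sketched roughly as follows. This is only a sketch under stated assumptions: the function names (merge_transcripts, build_genome, build_decoy_index) are made up for illustration, the file names in the usage comments are placeholders for whatever you downloaded, the thread count of 4 is arbitrary, and I'm assuming generateDecoyTranscriptome.sh wants uncompressed FASTA inputs. The -d option of salmon index takes the list of decoy names.

```shell
# Point 2 (optional): merge cDNA and ncRNA transcripts into one FASTA.
# Concatenated gzip files are themselves a valid gzip file, so plain cat works.
merge_transcripts() {
    cdna_gz="$1"; ncrna_gz="$2"; out_gz="$3"
    cat "$cdna_gz" "$ncrna_gz" > "$out_gz"
}

# Point 4: build one uncompressed genome FASTA from per-chromosome files.
# Usage: build_genome genome.fa chr1.fa.gz chr2.fa.gz ...
build_genome() {
    out_fa="$1"; shift
    gzip -dc "$@" > "$out_fa"
}

# Point 4: run the decoy script (flags taken from its usage line above),
# then index its two outputs. Needs bedtools and mashmap on your PATH.
build_decoy_index() {
    gtf="$1"; genome_fa="$2"; txome_fa="$3"; outdir="$4"; index="$5"
    bash generateDecoyTranscriptome.sh -j 4 \
        -a "$gtf" -g "$genome_fa" -t "$txome_fa" -o "$outdir"
    ./salmon index -t "$outdir/gentrome.fa" -d "$outdir/decoys.txt" \
        -i "$index" -k 31
}

# Example usage (all file names below are placeholders):
# merge_transcripts cdna.all.fa.gz ncrna.fa.gz cdna_ncrna.fa.gz
# gzip -dc cdna_ncrna.fa.gz > txome.fa
# build_genome genome.fa chromosome.*.fa.gz
# build_decoy_index annotation.gtf genome.fa txome.fa decoy_out aThalianaindex_decoy
```

If you only care about protein-coding transcripts, skip merge_transcripts and decompress the cDNA file directly.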


Hey! Thanks for the detailed reply and for clarifying all my concerns (even to the extent of writing the script usage command). Really appreciate your help! :)


Hi Jared,

I am new to RNA-seq but am going to carry out an analysis on some human cancer samples. I noticed you mentioned here that there are already prepared decoy-aware indices for human. I was wondering if you could please direct me to these? I've looked on GENCODE to find references, but I can't seem to find ones that specify they are decoy-aware.

Apologies if this is a silly question. I would be very grateful for any help. Thanks in advance, Alex


I would just make the index yourself, based on the reference you want to use. It is simple; follow https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/


alex-bain: Pre-made, decoy-containing salmon indexes for the human genome are available from the Refgenie site. You will need to install the refgenie application, following the directions on that site, to download them.

7 months ago
ATpoint 49k

The most recent versions of salmon allow the decoy-aware index to be built by simply appending the entire reference genome to the transcriptome FASTA file, e.g. cat txtome.fa genome.fa > gentrome.fa. The only important thing is that the chromosomes come after the transcripts. You can then index this file, but the index will be larger than one built from the output of generateDecoyTranscriptome. With the latter strategy, MashMap and company scan the genome for regions with some similarity to the actual transcriptome and use only those regions as decoys, so the resulting index is smaller than one built from the entire genome. It is indeed a bit cumbersome. I used the "whole genome" decoy approach since I am spoiled with an HPC where memory is not a limitation.

I am not sure how different the results will be if you have no decoy at all; there will probably be some differences, but probably nothing catastrophic. If it is too cumbersome, then don't do it. To the best of my knowledge, salmon is the only pseudo/selective aligner that currently offers this. It is definitely a nice feature and might indeed improve the precision of quantification, but it is optional rather than a must.
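
The "whole genome" decoy approach described above can be sketched as below. Treat it as a sketch under assumptions: the helper names (make_decoy_list, make_gentrome) and the file names in the usage comments are placeholders, and as far as I know salmon also wants the names of the decoy sequences (here, the chromosome names) passed to salmon index through its -d option, which is why a decoys.txt is written first.

```shell
# Write the genome sequence names to a decoy list: take each FASTA header,
# keep the text up to the first space, and drop the leading ">".
make_decoy_list() {
    genome_gz="$1"; decoys="$2"
    gzip -dc "$genome_gz" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > "$decoys"
}

# Concatenate transcripts first, genome second (the order matters).
# Concatenated gzip files are themselves a valid gzip file.
make_gentrome() {
    txome_gz="$1"; genome_gz="$2"; out_gz="$3"
    cat "$txome_gz" "$genome_gz" > "$out_gz"
}

# Example usage (genome.fa.gz and the index name are placeholders):
# make_decoy_list genome.fa.gz decoys.txt
# make_gentrome Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz genome.fa.gz gentrome.fa.gz
# ./salmon index -t gentrome.fa.gz -d decoys.txt -i aThalianaindex_decoy -k 31
```

Building both this index and the plain transcriptome index and comparing the quantification results is a reasonable way to see how much the decoys matter for your data.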


Wasn't aware of this, good to know!


Hey, thanks for taking the time to reply! Didn't know of this neat trick from the manual. I might just try this and compare the results to my current index. Have a great day ahead!