How does salmon deal with decoys?
1
1
14 months ago
Cindy ▴ 10

I'm using Salmon's selective alignment mode (--validateMappings) with a decoy-aware index generated from gencode.transcriptome.fa and decoys.txt. I want to confirm that salmon does not output quantifications for these decoy sequences; is that correct? The decoy sequences have names beginning with "GL....". As I understand it, they are only there to improve the quantifications for the sequences that are actually in the reference. Thanks in advance!

salmon rna RNA-Seq • 2.4k views
6
14 months ago
Rob 4.9k

Hi Cindy,

Any target in the indexed FASTA file whose name appears in the decoys.txt file (that is, any target that you tell salmon is a decoy) will _not_ appear in the quantification output. Reads can be selectively aligned to decoy sequences, but abundances for decoys are not computed or output. Rather, these sequences serve as potential explanations for reads that map better to the decoys than they do to the annotated transcriptome.
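
If you want to verify this on your own run, a minimal sketch along the following lines could help. It assumes the usual tab-separated quant.sf layout with a `Name` column and a decoys.txt with one name per line. The helper names (`read_decoy_names`, `decoys_in_quant`) and the toy transcript and scaffold names are made up for illustration and are not part of salmon.

```python
import csv
import io


def read_decoy_names(decoys_text):
    """Return the set of names listed one per line in decoys.txt-style text."""
    return {line.strip() for line in decoys_text.splitlines() if line.strip()}


def decoys_in_quant(quant_text, decoy_names):
    """Return the sorted names from the quant.sf 'Name' column that are decoys.

    An empty list means no decoy was reported in the quantification output.
    """
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    return sorted(row["Name"] for row in reader if row["Name"] in decoy_names)


# Toy inputs (hypothetical names and values, for illustration only).
toy_decoys = "GL_toy_scaffold_1\nGL_toy_scaffold_2\n"
toy_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "toy_transcript_1\t1500\t1320.5\t12.3\t40\n"
    "toy_transcript_2\t900\t720.0\t0.0\t0\n"
)

leaked = decoys_in_quant(toy_quant, read_decoy_names(toy_decoys))
print("decoys found in quant output:", leaked)
```

On real data you would read the two files from disk (for example `open("quant.sf").read()`) and pass their contents to the same functions; the expectation from the answer above is that the returned list is empty.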

0

Can someone please explain to me what these decoy sequences are? What are the advantages of using salmon with the --decoys parameter?

Can it also be used without them?

6

Hi Assa,

Sure. The decoy sequences are regions of the target genome that are sequence-similar to annotated transcripts. These are the regions of the genome most likely to cause mismapping (e.g., transcribed pseudogenes). There are three ways to run salmon: (a) with only the annotated transcriptome indexed; (b) with the annotated transcriptome plus a small set of decoys computed by using MashMap to search transcripts against the genome; and (c) with the annotated transcriptome plus the entire genome as decoy sequence.

Method (a) requires the fewest resources. Method (b) requires a good deal of resources to run the MashMap step, but the resulting index is similar in size to that of (a), and it avoids the most obvious cases of misalignment. Method (c) results in the largest index, but it is the most effective at avoiding potentially spurious mappings.

Salmon can be used without decoy sequences, and sometimes this is necessary (e.g., for a de novo assembly there will likely be no possibility of decoys). It can also be run without decoys in reference organisms. It is simply the case that decoys help avoid certain cases of misalignment that can't be adjudicated with the transcriptome alone, and can therefore lead to somewhat more robust abundance estimates when unannotated sequences are expressed.
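
For option (c), the preparation step can be sketched roughly as below: the genome sequence names go into decoys.txt, and the genome FASTA is appended after the transcriptome so that the decoys come last. This is only an illustrative sketch working on in-memory text; the function names and toy records are hypothetical, and it assumes a sequence name is the first whitespace-delimited token of the FASTA header.

```python
def fasta_names(fasta_text):
    """Return sequence names: the first token after '>' on each header line."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers that have no name
                names.append(tokens[0])
    return names


def build_gentrome(transcriptome_fasta, genome_fasta):
    """Return (gentrome FASTA text, decoys.txt text) for a full-genome decoy set.

    Transcripts come first and genome sequences are appended at the end;
    every genome sequence name is listed as a decoy, one per line.
    """
    if not transcriptome_fasta.endswith("\n"):
        transcriptome_fasta += "\n"  # keep the first genome header on its own line
    decoys_txt = "".join(name + "\n" for name in fasta_names(genome_fasta))
    return transcriptome_fasta + genome_fasta, decoys_txt


# Toy records (hypothetical names and sequences, for illustration only).
toy_transcripts = ">toy_tx1 some description\nACGTACGT\n>toy_tx2\nGGCCGGCC"
toy_genome = ">GL_toy_scaffold_1 unplaced scaffold\nACGTACGTTTTT\n>toy_chr1\nTTTTACGT\n"

gentrome, decoys_txt = build_gentrome(toy_transcripts, toy_genome)
print(decoys_txt, end="")

# After writing the two results to gentrome.fa and decoys.txt, the index
# would then be built with something like:
#   salmon index -t gentrome.fa -d decoys.txt -i salmon_index
```

For real genomes you would stream the files rather than hold them in memory (a plain file concatenation does the same job); the point here is only the ordering and the contents of decoys.txt.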

0

Thanks, that sounds logical and straightforward. Is this specific to Salmon, or does Kallisto also have something similar?

1

This is specific to salmon.

0

Hi Rob, thanks for the explanation. For option (c), should the decoy genome sequences that are concatenated onto the end of the transcriptome file be masked in any way, or should they just be plain old FASTA scaffolds? Cheers!