The main difference is in how much computational effort is spent on identifying the optimal alignment of each read. For certain tasks (e.g., variant detection), it's crucial to know, for each read, both where it arises from in the underlying reference and how it aligns to that reference sequence (since, e.g., repeated mismatches will be used as evidence of the existence of a variant). However, for many problems, including RNA-seq quantification, that precise information isn't necessary. If we know which transcripts a read arises from, and the likely position and orientation of the read on each of those transcripts, then we can use that information for accurate abundance estimation.

Additionally, when aligning to the transcriptome (rather than the genome), there are a lot of repetitive mappings, since alternative splicing gives rise to many identical / near-identical alignments for a read. That is, the read may align to different transcripts and positions, but these are produced as a result of alternative splicing, and so the underlying reference sequence is identical. This leads to a lot of redundant work solving identical alignment problems repeatedly. Salmon's mapping algorithm is optimized to deal with such repetition. Recent versions even include a --validateMappings flag which (optionally) _does_ perform an alignment of the read at the candidate location; however, to keep this process efficient, it is necessary to track when reads map to identical underlying reference sequence so as to avoid redundant work.
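
To make the "redundant work" point concrete, here is a minimal Python sketch. It is not Salmon's actual implementation; the function names (`edit_distance`, `score_candidates`), the toy exon sequences, and the use of plain edit distance as the alignment score are all assumptions made for illustration. The idea is only that when several candidate mappings cover the same underlying reference sequence, the alignment is computed once and reused.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (illustrative scorer)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score_candidates(
    read: str,
    transcripts: Dict[str, str],
    candidates: List[Tuple[str, int]],
) -> Tuple[Dict[Tuple[str, int], int], int]:
    """Score each (transcript, position) candidate for a single read.

    Candidates whose reference window is identical share one alignment.
    Returns the per-candidate scores and the number of alignments
    that were actually computed.
    """
    cache: Dict[str, int] = {}  # reference window -> score for this read
    scores: Dict[Tuple[str, int], int] = {}
    computed = 0
    for name, pos in candidates:
        window = transcripts[name][pos:pos + len(read)]
        if window not in cache:
            cache[window] = edit_distance(read, window)
            computed += 1
        scores[(name, pos)] = cache[window]
    return scores, computed


# Toy example (hypothetical sequences): three isoforms built from shared exons.
exon1, exon2, exon3 = "ACGTACGTAC", "GGGTTTCCCA", "TTGACCATGA"
transcripts = {
    "tx_A": exon1 + exon2 + exon3,
    "tx_B": exon2 + exon3,          # skips exon1
    "tx_C": exon1 + exon3,          # skips exon2
}
read = "GGGTTTCACA"                 # exon2 with one substitution
candidates = [("tx_A", 10), ("tx_B", 0), ("tx_C", 10)]

scores, computed = score_candidates(read, transcripts, candidates)
print(scores, computed)
```

In this toy case the candidates on `tx_A` and `tx_B` both cover the same exon, so only two alignments are computed for three candidate locations. A real tool would use a proper alignment scoring scheme and a more careful way of recognizing identical reference sequence, but the bookkeeping idea is the same.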
7 months ago by
Rob ♦ 3.2k