Hi @vincentpailler,
The main difference is in how much computational effort is expended on identifying the optimal alignment of each read. Basically, for certain tasks (e.g., variant detection), it's crucial to know, for each read, both where it arises from in the underlying reference and how it aligns to that reference sequence (since, e.g., repeated mismatches are used as evidence of the existence of a variant). However, for many problems, including RNA-seq quantification, that precise information isn't necessary. If we know which transcripts a read arises from, and the likely position and orientation of the read on each transcript, then we can use that information for accurate abundance estimation. So, the main difference is in how much effort is expended to find the optimal alignment between the reads and the reference.

Additionally, when aligning to the transcriptome (rather than the genome), there are a lot of repetitive mappings, since alternative splicing gives rise to many identical or near-identical alignments for a read. That is, the read may align to different transcripts and positions, but these arise as a result of alternative splicing, so the underlying reference sequence is identical. This leads to a lot of redundant work solving identical alignment problems repeatedly. Salmon's mapping algorithm is optimized to deal with such repetition. Recent versions even include a `--validateMappings` flag which (optionally) _does_ perform an alignment of the read at the candidate location --- however, to keep this process efficient, it is necessary to track when reads map to identical underlying reference sequence so as to avoid redundant work.
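The redundancy-avoidance idea described above can be sketched in a few lines of Python. This is a toy illustration (all names and the scoring function are hypothetical, not Salmon's actual code): when several transcript positions share the identical underlying sequence, the alignment is scored once and the result reused.

```python
# Toy sketch (hypothetical names): avoid re-aligning a read against
# identical reference sequence that appears at many transcript positions,
# as happens when alternative splicing duplicates exonic sequence.

def align_score(read, ref_seq):
    # Placeholder scoring: count matching bases. A real tool would run
    # something like banded Smith-Waterman here, which is the expensive
    # step worth caching.
    return sum(a == b for a, b in zip(read, ref_seq))

def score_candidates(read, candidates):
    """candidates: list of (transcript_id, pos, ref_seq) tuples."""
    cache = {}  # ref_seq -> score, computed at most once per unique sequence
    results = []
    for tid, pos, ref_seq in candidates:
        if ref_seq not in cache:
            cache[ref_seq] = align_score(read, ref_seq)
        results.append((tid, pos, cache[ref_seq]))
    return results

read = "ACGTACGT"
candidates = [
    ("tx1", 100, "ACGTACGT"),  # identical exonic sequence shared by
    ("tx2", 55,  "ACGTACGT"),  # three transcripts: aligned only once
    ("tx3", 10,  "ACGTACGT"),
    ("tx4", 7,   "ACGAACGT"),  # different sequence: aligned separately
]
print(score_candidates(read, candidates))
```

Here four candidate locations trigger only two alignment computations, because three of them share the same underlying sequence.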
written 2.5 years ago by Rob ♦ 4.6k
@Devon has given a simple explanation (one that does not go into the programming concepts behind the two approaches). If you are looking for programming differences, then please indicate that.
BTW: STAR, which is an alignment program, can also generate gene counts during alignment.

Thanks for answering me
But I was more looking for the programming differences (I read a paper about the RapMap algorithm, but I didn't find the information I needed to understand the main difference between mapping and quasi-mapping...)
Tagging: Rob
For me the biggest difference is that mapping gives you a BAM file, which you can work with further (visualize, etc.), whereas quasi-mapping just gives you counts. You can still get a lot of information from it, but it is more limited.
Thanks for this very comprehensive answer
I also thought the difference was that Salmon uses a suffix array together with a hash table during the indexing phase, whereas the other mappers do not.
But it is now much clearer to me
Thanks for answering me
But I am still stuck on one point. When we select the parameter k=29, for example, while indexing the transcriptome with Salmon, is the hash table filled only with the k-mers of length 29, and in that case, is there an overlap between these 29-base k-mers?
In the picture, for example, if the first k-mer from the read doesn't match in the hash table, does Salmon try another k-mer? In that case, how is this new k-mer computed?
[rapmap_algorithm]
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment should be posted under @Rob's answer below.
Also see this: How to add images to a Biostars post
If you index with k=29, then all 29-mers are put in the hash table. Further, longer matches can be found because a length-29 match points to a suffix array interval, which encodes all substrings, of any length, that have this 29-mer as a prefix. If you fail to look up a 29-mer during search, then you simply shift by 1 base on the read to the very next 29-mer and try to look up that one. If you find it, you extend the match to the longest possible length using the suffix array. If not, then you again move by 1 base to the next 29-mer.
Align reads to genes and count the number of reads per gene (read mapping), OR count the number of times unique sequences from each gene are present in the reads (quasi-mapping).
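The two counting strategies in this summary can be contrasted with a toy example. Everything below is illustrative only (tiny made-up genes and reads, exact substring matching standing in for real alignment); it is not how any particular tool implements either strategy.

```python
# Toy contrast of the two strategies: (1) assign whole reads to genes
# and count reads per gene; (2) count occurrences of gene-specific
# k-mers within the reads, without aligning whole reads.
from collections import Counter

genes = {"geneA": "ACGTACGTAA", "geneB": "TTGGCCTTGG"}
reads = ["ACGTACGT", "TTGGCCTT", "ACGTACGA"]

# (1) Read mapping flavour: a read counts for a gene only if it
# matches as an exact substring (stand-in for real alignment).
read_counts = Counter()
for read in reads:
    for gene, seq in genes.items():
        if read in seq:
            read_counts[gene] += 1

# (2) Quasi-mapping flavour: count how often each gene's unique
# k-mers appear across the reads.
k = 4
kmer_to_gene = {}
for gene, seq in genes.items():
    for i in range(len(seq) - k + 1):
        kmer_to_gene.setdefault(seq[i:i + k], set()).add(gene)

kmer_counts = Counter()
for read in reads:
    for i in range(len(read) - k + 1):
        owners = kmer_to_gene.get(read[i:i + k], set())
        if len(owners) == 1:  # only gene-specific k-mers count
            kmer_counts[next(iter(owners))] += 1

print(read_counts, kmer_counts)
```

Note how the third read, which carries a mismatch, contributes nothing under strategy (1) but still contributes most of its k-mers under strategy (2).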