Question: Mapping vs Quasi-Mapping
2
2.5 years ago by
pablo160
pablo160 wrote:

Hi guys,

I've got a question : what's the main difference between the mapping and the quasi-mapping? (I used Salmon)

Actually, I read many papers about the subject but I'm still struggling with that.

Both of these kinds of mapping use a suffix array and a hash table (k-mers) to increase the mapping speed.

I also know that quasi-mapping (like Salmon) kills two birds with one stone by mapping and couting reads for each transcripts

So, where is the difference if mapping and quasi mapping use a suffix array and a hash table

Best, Vincent

quasi-mapping salmon • 3.8k views
modified 8 months ago by biochemistry-tehran0 • written 2.5 years ago by pablo160

@Devon has a simple explanation (that does not go into programming concepts behind the two approaches). If you are looking for programming differences then please indicate so.

BTW: `STAR` which is an alignment program can also generate gene counts during alingments.

But I was more looking for programming differences (I read a paper about RapMap algorithm but I didn't find the main information to know what's the main difference between mapping and quasi mapping...)

Tagging: Rob

For me the biggest difference is that mapping gives you bam file which you can work with more (visualize, etc.), and quasi-mapping just gives you counts. You can still get a lot of information from it but it is limited.

Thanks for this very comprehensive answer

Because I also thought the difference was in the fact that salmon used a suffix array and a hash table during the index phase and not the other mappers

But it is now much clear for me

But I still get stuck on a point. When we select the parameter k=29 for example when we index the transcriptome with Salmon, does the hash table is filled only with the 29 bases length k-mers and in this case, is there an overlap between these 29 bases kmer?

In the picture, for example? if the first k-mer from the read doesn't match to the hash table, does Salmon try another k-mer ? In this case, how this new k-mer is computed?

[rapmap_algorithm]

Please use `ADD COMMENT/ADD REPLY` when responding to existing posts to keep threads logically organized.

This comment should be posted under @Rob's answer below.

Also see this: How to add images to a Biostars post

If you index with k=29, then all 29-merd are put in the hash table. Further, longer matches can be found because a length 29 match points to a suffix array interval, which encodes all substrings, of any length, that have this 29-mer as a prefix. If you fail to look up a 29 mer in search, then you simply shify 1 base on the read to the very next 29-mer and try to look up that one. If you find it, you extend the match to the longest possible using the suffix array. If not, then you again move by 1 base to the next 29mer.

Align reads to genes and count no. of reads per gene (Read mapping) OR Count no. of times unique sequences from each gene are present in the reads (Quasi mapping)

15
2.5 years ago by
Rob4.6k
United States
Rob4.6k wrote:

Hi @vincentpailler,

The main difference is in how much computational effort is expended on identifying the optimal alignment of each read. Basically, for certain tasks (e.g., variant detection), it's crucial to know for each read, both where it arises from in the underlying reference, and how it aligns to that reference sequence (since, e.g., repeated mismatches will be used as evidence of the existence of a variant). However, for many problems, including RNA-seq quantification, that precise information isn't necessary. If we know which transcripts a read arises from, and the likely position and orientation of the read on that transcript, then we can use that information for accurate abundance estimation. So, the main differences are in how much effort is expended to find the optimal alignment between the reads and the reference. Additionally, when aligning to the transcriptome (rather than the genome), there are a lot of repetitive mappings, since alternative splicing gives rise to many identical / near-identical alignments for a read. That is, the read may align to different transcripts and positions, but these are produces as a result of alternative splicing, and so the underlying reference sequence is identical. This leads to a lot of redundant work solving identical alignment problems repeatedly. Salmon's mapping algorithm is optimized to deal with such repetition. Recent version even include a `--validateMappings` flag which (optionally) _does_ perform an alignment of the read at the candidate location --- however, to keep this process efficient, it is necessary to track when reads map to identical underlying reference sequence so as to avoid redundant work.

Thanks a lot for that answer

So, the expectation-maximization algorithm used by Salmon is not used by other mappers like Hisat or STAR? And that's why Salmon is much faster

6

The expectation-maximization algorithm (or the variational Bayesian expectation-maximization algorithm) used by Salmon is not, in fact, used by mappers like Hisat or STAR. However, this isn't the main reason for the speed difference.

Hisat and STAR are general aligners. The goal is to take reads as input, and to output SAM/BAM files specifying the alignments of each of these reads. Here, an alignment is a nucleotide-to-nucleotide correspondence between the input read and the reference sequence. This is most often obtained by trying to find candidate locations where the read matches the genome, and then running a dynamic programming-based algorithm to determine the optimal nucleotide-to-nucelotide correspondence between the read and reference at this position. This second phase of the aligners can be slow, and sometimes, the dynamic program determines that this was not a suitable candidate in the first place, meaning that the computational effort was expended without reporting an alignment. The main reasons that salmon is faster than these tools (which were designed for a different and more general purpose applications) are:

1) HISAT and STAR are usually used to perform spliced alignment to the genome. Spliced alignment --- which must account for reads coming from transcripts that map across introns in the genome --- is an inherently more difficult task than non-spliced alignment (which salmon can do because it maps the reads directly to the spliced transcripts in the transcriptome).

2) HISAT and STAR produce full alignments for each read, usually emitted in the form of the CIGAR string in a SAM/BAM file, that specify how each nucleotide of the input read aligns with the reference sequence (i.e., was it a match, mismatch, or was it deleted from the reference or an insertion into it). This requires running a sometimes expensive dynamic program at each candidate alignment location, to determine this optimal correspondence. However, since that information isn't strictly required for quantification, salmon doesn't compute it; it needs only to find the transcripts, positions and orientations that match a read, not exactly how each base of this read corresponds to the reference. This means that it can skip this sometimes expensive step.

3) Salmon is optimized for speed, assuming a generally small and highly-repetitive reference sequence. Base-for-base, salmon's index takes more memory than STAR and much more memory than HISAT. However, since it maps reads to the transcriptome directly, rather than the genome, it has much less sequence to index. Thus, we can afford to spend more space on the index itself (and to do things that make mapping and lookup faster). It is also optimized for the common case in transcriptome mapping where reads map to many places (i.e. multimapping reads). Together, these things allow it to compute the information it needs for quantification more quickly than e.g. STAR / HISAT.

I only have a last question : does Salmon use a Burrows-Wheeler Transform algorithm? (which is like an "amelioration" of a suffix array, right?)

Because I read that mappers like HISAT use it but not Salmon which just uses a suffix array and a hash table

2

HISAT uses a hierarchical FM-index (which was the major data structure innovation of that work). STAR, in fact, uses an uncompressed suffix array, and does not make use of the FM-index or BWT. The fact that one does not need to recover sampled positions as one must do in the sampled FM-index is part of what makes STAR so fast. Salmon uses a combination of an uncompressed suffix array augmented with a hash table to narrow the ranges of the suffix array that need to be searched. As I alluded to above, the data structures used in Salmon are selected for speed at the expense of space. Using a sampled FM-index in place of the uncompressed suffix array would reduce the space required by the Salmon (or STAR) index, but at the expense of slowing down mapping. HISAT is able to remain fast while having a fairly small index because the smaller indices in the hierarchical FM-index are small enough to fit in a typical CPU's cache, and so operations on them are quite fast.