Question

Details on salmon index

0

Entering edit mode

5 months ago

Lorenzo • 0

Hello guys, I need a clarification about the way in which Salmon creates the index in mapping-based mode. I'm aware that Salmon creates a hash table (h) containing as keys the k-mers from the transcriptome and as values the suffix array (SA) intervals lexicographically sorted. Salmon then scans the reads until a k-mer present in h and mapping to a valid SA interval is found. Because of the lexicographic order of the suffixes in the SA, the software knows that this k-mer is a prefix of all of the suffixes appearing in the given interval. Then, the tools extends this match to the longest substring of the read that appears in the reference (Maximum Mappable Prefix, MMP). So, Salmon can associate the read to a specific transcript.

The process that is not completely clear to me is how Salmon can associate the identified suffix to a specific transcript?

The next question is related to the way in which Salmon identifies the next informative position (NIP). I read the RapMap paper that contains the algorithm utilized by Salmon to identify the NIP but I cannot understand it. Can you explain me please?

Thank you in advance for your time and effort to answer this question :)

Salmon • 739 views

ADD COMMENT • link updated 5 months ago by Rob 6.9k • written 5 months ago by Lorenzo • 0

0

Entering edit mode

You're lucky that the senior developer Rob is often around, so you probably get your experts answer soon.

ADD REPLY • link 5 months ago by ATpoint 85k

score 5 · Accepted Answer · 2024-05-16

Hi Lorenzo,

Thanks for your question. First, I'd like to clarify that several substantial updates have been made to salmon since its initial release, including the underlying index used for mapping (since version 1.0.0, salmon has used pufferfish as it's underlying index) and it has also adopted the selective-alignment algorithm described in this paper. Nonetheless, I'm happy to answer your questions about the RapMap index.

To answer your first question, the basic data structure used by RapMap is a _generalized_ suffix array, augmented with a prefix lookup table. The prefix lookup table is used only to accelerate search (i.e. rather than searching the entire suffix array by binary search, we can perform binary search only in the interval bounded by the k-mer that starts our query); so we can set this optimization aside for a moment, though it's important in practice to speed things up. The distinction between the suffix array and the generalized suffix array is that the generalized suffix array is build over a collection of strings S_1, S_2, ..., S_k, which you can imagine to be our transcripts. First, we construct the string S = S_1$S_2$S_3$...$S_k$. That is we concatenate all of the strings together, separating them with a $. This ensures that no query will match across the boundary between a pair of transcripts (because the $ is not part of the query alphabet which is {A, C, G, T}).

The next component of the index we need is some way, given an index i in the the concatenated string S, to tell to which of the k substrings it belongs. That is, in which substring (transcript) does the character S[i] occur? There are several ways to do this, but the way RapMap handles it is with a data structure called a rank-enabled bitvector. Imagine we build a bitvector (that is a single bit per character in S), where we record a 0 for all indices j where S[j] != $ and a 1 where S[j] = $. This bitvector, B, encodes the "boundaries" between substrings (transcripts). Now, one particularly cool data structure that one can build over such a bitvector is a succinct data structure (i.e. the data structure is small compared to the memory requried by the bitvector itself) that answers the "rank" query. Specifically, the query rank(B, j) tells me how many 1's have occurred in the bitvector B up through index j. Note; in our case, the rank at an index is exactly the ID of the transcript in which that character occurs. Consider an example. Say that we perform a search and find that the MMP of the query matches the suffix array interval I = [s, e) where s and e are just indices into the suffix array. This means the MMP of the query occurs in the text as a prefix of: S[ SA[s], : ], S[ SA[s+1], : ], ..., S[ SA[e-1], :]. So the question we have to answer for each match is, in which transcript does it occur? Well, S[ SA[s], : ] occurs in transcript rank(B, SA[s]), the next occurs in transcript rank(B, SA[s+1]) and so on. So for each match, we simply issue a rank query to transform from a global index in S to the transcript in which that index occurs. Rank operations are also _very_ fast (O(1) asymptotically, and practically fast as well). Finally, while you didn't explicitly ask this, it's also worth noting that we can easily get the relative offset with respect to the transcript as well. This is done using an operation called "select", which is akin to the inverse of rank. The operation select(B, i) gives me the index into B where the i-th 1 occurs. That is, the starting position of the i-th transcript. So, if we know the global position where a match occurs (call it i), then we can compute the position of the match in the transcript as i - select(B, rank(B, i)) — depending on the precise definition of rank and select one uses, one may have to add a +1 or -1 in the prior definition, but this is the main idea.

Finally, to your last question about computing the NIP; this is quite straightforward. Return to our example, and imagine that the MMP of a query matches to the suffix array at interval I = [s, e). Further, assume that the MMP is not equal to the entire query (otherwise there is nothing left to match). Then the NIP is computed as the longest common prefix between S[ SA[s], : ] and S[ SA[e-1], : ]. The longest common prefix of two strings is, as the name suggests, the length of the prefix that they share. This is computed directly by starting at the offset in both of these suffixes (S[ SA[s], : ], S[ SA[e-1], : ]) right after the end of the matched query, and directly comparing these suffixes until they differ. This gives us a number of bases to skip, which determines the next position in the query string which will be used to match against the suffix array.

Hopefully these explanations help answer your questions.