How is the FM-index built for multiple entries of a fasta file?
3
0
Entering edit mode
4.9 years ago
yifangt86 ▴ 60

I was wondering if anybody can explain how the FM-index was created for multiple entries of a sequence file (say, fasta format for a reference genome containing multiple chromosomes). In other words, how does the FM-index distinguish different sequences in a single fasta file?

While I was studying the burrows wheeler transformation (BWT) and FM-index on sequence manipulation, I seem to understand the idea of the BWT and FM-index for single string/sequence. 1) All the codes from github I can find are using single string/sequence as examples including the original FM index paper.
2) It seems to me a common strategy is to concatenate the sequences to get a single one, which brings to my original question: how does the FM-index distinguish the original entries? Remember the offset when they are joined? 3) The two programs bwa and bowtie2 are two complicated for me to understand the details.

I am trying to understand the implementation of FM-index in C/C++ to create the FM-index for sequences/strings. Thanks in advance.

sequence index software code • 2.5k views
0
Entering edit mode
4.9 years ago
d-cameron ★ 2.4k

1) the bwa implementation treats the genome as a single linear sequence

2) in bwa, this is stored as text in the .amb and .ann files. Each aligner will have their own conventions

3) software implementations take some time to familiarise yourself with. In the case of bwa, short variable names, terse comments, lack of documentation of functions and function arguments, and extensive code optimisations (such as bit packing) make it difficult understand what's going on. It's not just you that struggles to follow exactly what's going on.

0
Entering edit mode
4.9 years ago
Rob 5.1k

Hi yifangt86,

There are a few ways to go about building a multi-string BWT. Perhaps the simplest is the following. Consider a collection of N original strings, S1, S2, ... SN. Now, form a new string S = S1$S2$...$SN (you can use any separator that is a character that doesn't appear in the text). You also keep a bit vector B of the same length as S, which has a 1 wherever S has a$ and a 0 everywhere else. Now, you build the BWT on S. The fact that \$ doesn't occur in your alphabet prevents you from backward searching a pattern than spans the border between two strings. When you have the global position, say p, for an occurrence of a pattern in S, you can determine which of the input strings the pattern occurs in by performing a rank operation on B[p]. The rank returns the number of 1s in B up to and including position p, which is equal to the index of the input string where this pattern starts. There are a few different ways to build multi-string BWTs, but I always found this one most straightforward.

--Rob

0
Entering edit mode

In the past week I read some papers and notes. FM index contains 1) First column of the BWT transformed matrix (F column) which is lexicographically sorted, so only need A/C/T/G four characters; 2) The last column of BWT matrix (L column) which is the same length of the reference genome string(s). 3) The suffix array SA[] (or, only a sample of the SA[] to save space, from which the full SA[] can be calculated on-fly(?)); 4) Checkpoints table (occurrences/rank/tally table) of each character (A/C/G/T). The L column string can be compressed using wavelet tree or otherdata structures etc, but I am not sure. Here is a code fragment of the fm-index data structure from github I found:

 typedef struct _fmi {
char *bwt;
int *idxs;
int **rank_index;
unsigned char* lookup;
int endloc;
int C[5];
int len;
} fm_index;


Two questions that maybe more specific: 1) How the index is created for the genome (To simplify the process, only consider one string T at this moment)? 2) How the search is implemented against the FM-index from 1)? For example: given pattern P=ACGTCACA, how the index is searched for P.

Thanks a lot!

1
Entering edit mode

Regarding your first question --- there are many different algorithms for generating the BWT+FM-index. Different tools will make use of different algorithms. One approach is to build the suffix array (again, there are many algorithms for this like DC3, and other suffix sorting algorithms; this one is very popular). From the suffix array, one can immediately extract the L column of the BWT since there is a trivial correspondence between them (see equation 1 on the second page 2 of Ben's notes here). Regardless, once you have the SA and the BWT, computing the rank table is straight forward. In reality, there are different approaches that trade off construction speed for construction memory etc. Conceptually, however, building the suffix array and then deriving everything you need from it is straightforward.

Regarding your second question, you search for a pattern P in the BWT using "backward search". Ben Langmead has some brilliant slides illustrating the idea; look at slides 26 and on here. You basically start from the end of the string, and try to find increasing length suffixes of the pattern that appear in your text. The FM index lets you extend the length of the suffix you search in constant time per-character. If you can extend the suffix length to be all of P, you've immediately found that your pattern appears in the text, and the range of BWT rows in which it appears (when paired with your suffix array) let you enumerate those actual occurrences.

0
Entering edit mode

Hi Rob! Thank you so much for your time answering my tedious questions. I seem to understand the algorithms described in Ben Langmead's notes you mentioned, and I should be more specific with my questions which were actually on the details of implementation. I am studying the C code at this moment trying to understand each step in detail.

1. The FM index is a compresses string (BWT string, or L column) plus some other auxiliary components including the C[c] table, Occ(c. n) table. How is the FM-index stored in the memory? Maybe this is more of a general computer science question, Sorry for this dumb one.
2. What I meant in my second question is searching: how P=ACGTCACA is fed into the index file at searching step? In regular search, the subject (S, say genome) is searched for a pattern (P, here) using pointer in C, for example. I spent some time trying figure out how the search of P on the FM-index is done in the source code of the programs, but could not figure it out myself.

Using above fm_index struct as example, my questions can be:

• 1) How is the FM-index file saved, and
• 2) which part of the fm_index struct is searched for P?

Thank you very much again for your help!

0
Entering edit mode
4.9 years ago
yifangt86 ▴ 60

Thanks d-cameron, my question should fit into beginners category on learning the implementation of these important data structures & algorithms, which seem to be quite advanced level to me. Rob's explanation helped, and I am hoping anyone provides code segment, even pseudo code would help too. Thanks again!

0
Entering edit mode

The bwa paper includes pseudo-code for some steps, although probably not the particular step you are interested in.