7.4 years ago by
The most classical book in this area is Dan Gusfield's Algorithms on strings trees and sequence. It is particularly famous for the comprehensive introduction to suffix trees and its various applications in biological sequence analysis. Another noteworthy data structure is Enhanced Suffix Array (ESA). It mimics most of the suffix tree operations in a memory efficient way. ESA is implemented in vmatch, seqan and a few others. ESA and suffix tree are closely related. Reading Gusfield's book helps to understand the ESA paper.
A less related field is compressed full-text indexing where the most outstanding result in my view is FM-index published in 2000. FM-index has a weak link to suffix trie/tree in that we can simulate a top-down traversal but not bottom-up traversal with FM-index. The key advantage of FM-index is that it typically has a much smaller memory footprint than ESA and suffix tree. This is why FM-index is more widely used for human sequence alignment although suffix tree and ESA are more versatile data structures in general.
However, I have not read any good books on FM-index. Most of books talking about FM-index, including the one suggested by GWW, are a little light. These books seldom cover more than the first FM-index paper (a more recent version is easier to read, though you may want to ignore the compression part). From the books, you may know how to do exact match with FM-index, but may not know how to write your own practical implementations. Furthermore, there have been quite a few advances in the past few years which help the development of the BWT-based short-read mappers/assemblers. For example, bwa and bwa-sw are based on a 2008 paper, SOAP2 and the published SGA based on a paper published in 2009 and bowtie uses results in a 2007 paper. Without these recent works in computer science, we may not have these BWT-based mappers/assemblers or may just have crippled tools. No books so far introduce these latest findings. You have to go through papers if you really want to implement something practical.
You actually do not need to have much knowledge on suffix tree and ESA to understand FM-index. You can start to study FM-index right away (I did that), though you may want to read Gusfield's masterpiece ultimately. Also, for a quick start, it is recommended to reuse others' code to construct FM-index/BWT (I did that, too). Constructing BWT is non-trivial, but of less research interest to biologists.
7.4 years ago by
lh3 ♦ 31k