Question: Book For String Matching Algos
gravatar for Abhi
7.9 years ago by
United States
Abhi1.5k wrote:

Hi All

I am routinely coming across challenges in string matching which makes me belv its time I need to refresh my string comparison algos specifically with DNA type strings.

Any good resource/book you can recommend ? I am looking for something which talks about the advance string matching stuff using compressed algos like BWT etc.

Best, -Abhi

next-gen sequencing • 2.4k views
ADD COMMENTlink written 7.9 years ago by Abhi1.5k
gravatar for Pierre Lindenbaum
7.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

alt text

Algorithms on strings M. Crochemore, C. Hancart and T. Lecroq Cambridge University Press

See also:

alt text

Algorithms on strings, trees, and sequences: computer science and computational biology

ADD COMMENTlink written 7.9 years ago by Pierre Lindenbaum121k

Gusfield Algorithms on strings, trees and sequences is by far the best one I have read on the subject. Thanks for the other recommendations. Will check'em out.

ADD REPLYlink written 7.9 years ago by Arun2.3k
gravatar for lh3
7.9 years ago by
United States
lh331k wrote:

The most classical book in this area is Dan Gusfield's Algorithms on strings trees and sequence. It is particularly famous for the comprehensive introduction to suffix trees and its various applications in biological sequence analysis. Another noteworthy data structure is Enhanced Suffix Array (ESA). It mimics most of the suffix tree operations in a memory efficient way. ESA is implemented in vmatch, seqan and a few others. ESA and suffix tree are closely related. Reading Gusfield's book helps to understand the ESA paper.

A less related field is compressed full-text indexing where the most outstanding result in my view is FM-index published in 2000. FM-index has a weak link to suffix trie/tree in that we can simulate a top-down traversal but not bottom-up traversal with FM-index. The key advantage of FM-index is that it typically has a much smaller memory footprint than ESA and suffix tree. This is why FM-index is more widely used for human sequence alignment although suffix tree and ESA are more versatile data structures in general.

However, I have not read any good books on FM-index. Most of books talking about FM-index, including the one suggested by GWW, are a little light. These books seldom cover more than the first FM-index paper (a more recent version is easier to read, though you may want to ignore the compression part). From the books, you may know how to do exact match with FM-index, but may not know how to write your own practical implementations. Furthermore, there have been quite a few advances in the past few years which help the development of the BWT-based short-read mappers/assemblers. For example, bwa and bwa-sw are based on a 2008 paper, SOAP2 and the published SGA based on a paper published in 2009 and bowtie uses results in a 2007 paper. Without these recent works in computer science, we may not have these BWT-based mappers/assemblers or may just have crippled tools. No books so far introduce these latest findings. You have to go through papers if you really want to implement something practical.

You actually do not need to have much knowledge on suffix tree and ESA to understand FM-index. You can start to study FM-index right away (I did that), though you may want to read Gusfield's masterpiece ultimately. Also, for a quick start, it is recommended to reuse others' code to construct FM-index/BWT (I did that, too). Constructing BWT is non-trivial, but of less research interest to biologists.

ADD COMMENTlink written 7.9 years ago by lh331k

@lh3 : Thanks for a very well written advice. Could you please tell where I could get the code snippet for the FM-index/BWT algo. I am asking without any google search coz I belv you would know it precisely.

ADD REPLYlink written 7.9 years ago by Abhi1.5k
gravatar for Gww
7.9 years ago by
Gww2.6k wrote:

For the burrows-wheeler transform the best book I could find is this one here, it covers suffix trees, suffix arrays, and the burrows-wheeler transform; since they are all intimately related. It also covers advanced topics such as the FM-index.

The Burrows-Wheeler Transform :: Data Compression, Suffix Arrays, and Pattern Matching

by Donald Adjeroh, Timothy Bell, and Amar Mukherjee

alt text

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Gww2.6k
gravatar for Eric Fournier
7.9 years ago by
Eric Fournier1.4k
Quebec, Canada
Eric Fournier1.4k wrote: has information about Bioinformatics books. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology is one of the books that is recommended. I remmber reading parts of it and finding it enlightening.

ADD COMMENTlink written 7.9 years ago by Eric Fournier1.4k
gravatar for Martijn Van Iersel
7.9 years ago by
Martijn Van Iersel560 wrote:

This website is probably a good place to start: It lists dozens of String-searching algorithms.

The book Algorithms by Robert Sedgewick is good, it contains a whole chapter on substring searching. There is also a website with excerpts and code.

alt text

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Martijn Van Iersel560
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1015 users visited in the last hour