Question

Experience With Parser Generators For Common Formats

4

Entering edit mode

11.4 years ago

Fabian Bull ★ 1.3k

Bioinformatics analysis often deals with parsing file formats (e.g. GFF, VCF). Unfortunately there are many parsers out there, that only implement a subset of the specification, perform bad on large files or only work on some datasets.

Because parsing is a very old and well understood problem in computer science, I would like to solve this problem with a more formal approach:

Parser generators. The idea is to give the program a grammar (which is easy to develop for most bioinformatics formats) and get a parser as result.

My question: Has anybody experience with this approach? Did anyone use libraries like ANTLR or JavaCC successfully? If yes, did you check for performance problems on larger files?

parser • 5.5k views

ADD COMMENT • link updated 16 months ago by Ram 44k • written 11.4 years ago by Fabian Bull ★ 1.3k

score 3 · Answer 1 · 2013-03-10

Yacc et al are mainly useful for texts that do not fit the regular grammar. GFF/VCF/SAM can be parsed with regular expressions alone. More permissive grammars are not necessary.

I used to use flex/yacc in my early days in bioinformatics, but at the end of day, I think it is overkilling. EDIT: as to performance/flexibility, gcc switched from yacc to a hand-written C/C++ parser. If you really know what you are doing, hand-written parsers will be faster.

A little background on the theory of regex, yacc, context-free grammars, regex engine, DFA and their applications in computational biology. The following is poor-man's description. Not precise at all.

Formal grammar is the foundation of compilation theory in computer science and Chomsky hierarchy is the typical classification of formal grammars. In this hierarchy, regular grammar is the simplest but the least powerful grammar. Regex can be straightforwardly converted to NFA and then transformed to a DFA. Parsing is guaranteed to be done in linear time. A major limitation of regular grammar is that it ignores long-range correlation. For example, we cannot use regex to require matching unknown number of parentheses. Because of this limitation, the Newick format and HTML are not regular and cannot be parsed with a single regex.

Context-free grammar allows long-range interaction and is thus more powerful. According to wiki, it is "the theoretical basis for the syntax of most programming languages". However, this grammar cannot always be parsed in linear time and may require large memory to hold the entire parse tree. Parser generators usually implement a subset of this grammar that is simpler and more efficient. The LALR(1) grammar is such a subset, which yacc implements. IIRC, LALR(1) can be turned into a DFA. Parsing is achieved with a linear scan of the text with trivial memory footprint.

NFA and DFA are closely related to a Markov chain where each directed edge in the chain is associated with a symbol or token. Markov chain is somewhat the stochastic version of NFA. The stochastic version of context-free grammar (i.e. stochastic context-free grammar) has also been widely used for identifying RNA secondary structure and RNA alignment. It is also possible to describe some phylogenetic tree algorithms with the context-free grammar.

Now about regex engines. As is stated above, regex parsing can be done in linear time with the regex-NFA-DFA transformation. Implementing a DFA engine is non-trivial and frequently the limitation of regex is hurting. As a result, many popular regex engines are implemented with backtracking. A backtracking regex engine is easier to implement (requiring to up to 10-fold less code than a DFA regex engine), more flexible and is usually faster for simple patterns. The popular Perl regex and PCRE are using such an algorithm. However, a backtracking regex engine does not guarantee linear-time parsing. When you use a bad regex, it can be tens to hundreds of times slower than a DFA engine (for more discussions, see Russ Cox's article on PCRE vs. RE2). In my view, the worst-time time complexity is a far more significant concern than simplicity. That is why I said in the comment that we should drop Perl/PCRE for good: Perl/PCRE populated a bad algorithm for overcomplicated augment to regex, which misled programmers to think regex can be inefficient.

Back to the OP's question, most TAB-delimited bioinformatic format are regular. Going into more powerful but more complex grammars is unnecessary and inefficient. In my opinion, we should avoid non-regular formats unless necessary.

score 3 · Answer 2 · 2013-03-10

3

Entering edit mode

11.4 years ago

Pierre Lindenbaum 163k

my javaCC parsers ( I use javacc rather than antlr because the final *.jar doesn't need any dependencies )

my lex/yacc aka flex/bison parsers:

and for XML, you can just use java+xjc

ADD COMMENT • link 11.4 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

In treebest, the newick parser is written with flex/yacc, but later I realized that it was a mistake. I have written two other NH parsers in perl with regex (PS: of course not parse the full tree with one regex as NH is not regular) and in C with basic string functions. They are both simpler and faster than the treebest parser.

ADD REPLY • link 11.4 years ago by lh3 33k

score 2 · Answer 3 · 2013-03-10

One of the problems is that most often the dataformat is badly specified and with enough exceptions that a fully regular grammar can't capture all you need. As an example see the swiss-prot format or any fasta file format (at least what is in the header etc...). Which is why grammars fit awkwardly, besides that few bio-informaticians were trained in creating or using them.

score 1 · Answer 4 · 2013-03-10

This is a really good point, surprisingly few bioinformatics tools implement fast parsers via a parser generators. I am not sure why that is, I suppose people want to avoid adding another dependency when the file format appears to be simple.

Occasionally a number of solutions are proposed with flex/yacc/bison for example: http://www.biostars.org/search/?q=flex+OR+yacc+OR+bison&t=all&submit=

score 1 · Answer 5 · 2013-03-10

1

Entering edit mode

11.4 years ago

Alex Reynolds 35k

One problem with writing parsers is that people don't follow specifications, relying on the fact that the data are human-readable. Some research labs or projects will persist with the incorrect use of a format, because they are familiar with it. Grammars can't easily work around those problems.

ADD COMMENT • link 11.4 years ago by Alex Reynolds 35k

0

Entering edit mode

But if you writing data not conforming the specification, nothing can work well. The questions becomes: Which approach can be easier modified? And that would probably be 'handwritten'.

ADD REPLY • link 11.4 years ago by Fabian Bull ★ 1.3k

score 1 · Answer 6 · 2013-03-11

I use a packrat parser framework called PetitParser. It is recursive descent, supports memoization for speed, is 100% unit tested and already available in BioSmalltalk, a bioinformatics library I am writing for the Smalltalk programming system. The parser library has been applied to write parsers for both formal and natural languages due to its flexibility (grammar reuse, composition, transformation, pure Smalltalk syntax, etc). If you already have a specification (i.e. BNF) it is pretty straightforward to generate a parser. I have never experimented performance problems. For example, for matching in Smalltalk you just evaluate:

#dnaSequence asParser matches: 'TCGTACGA'.

and for parsing:

#dnaSequence asParser parse: 'TCGTACGA'.

which simply translates to something like:

(PPPredicateObjectParser anyOf: 'GATCNgatcn') plus end matches: 'actg545'.    " false "

(PPPredicateObjectParser anyOf: 'GATCNgatcn') plus end matches: 'TCGTACGA'. " true "

Aside from that idiom, my experience is that I have suffered learning ANTLR and the Spirit parser framework, although not for doing bioinformatics. Spirit it is an "object-oriented" parser library (that is, given the limitations of C++) but is the closest thing to PetitParser I have found. But they are good choices for those who do not pursue more than satisfying the "language + library = new program" equation, and there are lot of nice intentions in the people behind them.