Experience With Parser Generators For Common Formats
6
4
Entering edit mode
10.0 years ago
Fabian Bull ★ 1.3k

Bioinformatics analysis often deals with parsing file formats (e.g. GFF, VCF). Unfortunately there are many parsers out there, that only implement a subset of the specification, perform bad on large files or only work on some datasets.

Because parsing is a very old and well understood problem in computer science, I would like to solve this problem with a more formal approach: Parser generators . The idea is to give the program a grammar (which is easy to develop for most bioinformatics formats) and get a parser as result.

My question: Has anybody experience with this approach? Did anyone use libraries like ANTLR or JavaCC successfully? If yes, did you check for performance problems on larger files?

parser file • 4.7k views
3
Entering edit mode
10.0 years ago
lh3 33k

Yacc et al are mainly useful for texts that do not fit the regular grammar. GFF/VCF/SAM can be parsed with regular expressions alone. More permissive grammars are not necessary.

I used to use flex/yacc in my early days in bioinformatics, but at the end of day, I think it is overkilling. EDIT: as to performance/flexibility, gcc switched from yacc to a hand-written C/C++ parser. If you really know what you are doing, hand-written parsers will be faster.

A little background on the theory of regex, yacc, context-free grammars, regex engine, DFA and their applications in computational biology. The following is poor-man's description. Not precise at all.

Formal grammar is the foundation of compilation theory in computer science and Chomsky hierarchy is the typical classification of formal grammars. In this hierarchy, regular grammar is the simplest but the least powerful grammar. Regex can be straightforwardly converted to NFA and then transformed to a DFA. Parsing is guaranteed to be done in linear time. A major limitation of regular grammar is that it ignores long-range correlation. For example, we cannot use regex to require matching unknown number of parentheses. Because of this limitation, the Newick format and HTML are not regular and cannot be parsed with a single regex.

Context-free grammar allows long-range interaction and is thus more powerful. According to wiki, it is "the theoretical basis for the syntax of most programming languages". However, this grammar cannot always be parsed in linear time and may require large memory to hold the entire parse tree. Parser generators usually implement a subset of this grammar that is simpler and more efficient. The LALR(1) grammar is such a subset, which yacc implements. IIRC, LALR(1) can be turned into a DFA. Parsing is achieved with a linear scan of the text with trivial memory footprint.

NFA and DFA are closely related to a Markov chain where each directed edge in the chain is associated with a symbol or token. Markov chain is somewhat the stochastic version of NFA. The stochastic version of context-free grammar (i.e. stochastic context-free grammar) has also been widely used for identifying RNA secondary structure and RNA alignment. It is also possible to describe some phylogenetic tree algorithms with the context-free grammar.

Now about regex engines. As is stated above, regex parsing can be done in linear time with the regex-NFA-DFA transformation. Implementing a DFA engine is non-trivial and frequently the limitation of regex is hurting. As a result, many popular regex engines are implemented with backtracking. A backtracking regex engine is easier to implement (requiring to up to 10-fold less code than a DFA regex engine), more flexible and is usually faster for simple patterns. The popular Perl regex and PCRE are using such an algorithm. However, a backtracking regex engine does not guarantee linear-time parsing. When you use a bad regex, it can be tens to hundreds of times slower than a DFA engine (for more discussions, see Russ Cox's article on PCRE vs. RE2). In my view, the worst-time time complexity is a far more significant concern than simplicity. That is why I said in the comment that we should drop Perl/PCRE for good: Perl/PCRE populated a bad algorithm for overcomplicated augment to regex, which misled programmers to think regex can be inefficient.

Back to the OP's question, most TAB-delimited bioinformatic format are regular. Going into more powerful but more complex grammars is unnecessary and inefficient. In my opinion, we should avoid non-regular formats unless necessary.

1
Entering edit mode

If you know what you are doing, handwritten code is always faster. The questions are: Do we know what we are doing (most often we are no experts of the problem)? Is the performance gain worth the effort?

0
Entering edit mode

To clarify, by "hand-written" I meant to use either regex or basic string routines (just do not use yacc). With regex, you need less effort for better performance. With basic string routines, you do need more effort, but you can save hours when parsing huge files. After all, if you do not know something, why not learn it? To me, yacc is harder to learn than regex and proper use of string routines?

1
Entering edit mode

I do agree with simple parsers (VCF, json, newick, etc.. ) but when I've have to handle ambiguity, states, complexity, lookaheads, etc... I still prefer to use a code generator :-) . It's easier to maintain the code with lex/javacc too.

1
Entering edit mode

regexp matching can be deceiving - I remember once writing a seemingly trivial HTML regex parser that would occasionally run stupendously slowly, like orders of magnitude slower than expected even though the input was just slightly different, may be just another wrapping tag or something similar. Back then I did not know about the worst case scenarios for backtracking type of RE engines and I spent a day or more debugging and it was eye opening to understand what happened. Of course when someone knows about these it is easy to avoid/recognize and for simple rules one would not even need to do that in the first place.

Moreover I will admit even today I have no idea that on how to optimize them, what makes a faster or slower construct and I will also admit to having a hard time understanding regular expressions that other people write. It is true that I don't make use of them that frequently and my eyes perhaps are not trained to read them. Just some personal observations on regular expressions.

2
Entering edit mode

Chuck Norris can parse HTML with regex http://stackoverflow.com/a/1732454/58082

0
Entering edit mode

On regex, I think we should abandon PCRE. It ruins the reputation of regex, IMO. RE2 in C++ is a much better library. In C, there is regexp9 from plan9 (you can find the regexp9.{h,c} somewhere in my github). With a proper regex engine, you do not need to worry too much about inefficient regex.

3
Entering edit mode
10.0 years ago

my javaCC parsers ( I use javacc rather than antlr because the final *.jar doesn't need any dependencies )

my lex/yacc aka flex/bison parsers:

and for XML, you can just use java+xjc

0
Entering edit mode

In treebest, the newick parser is written with flex/yacc, but later I realized that it was a mistake. I have written two other NH parsers in perl with regex (PS: of course not parse the full tree with one regex as NH is not regular) and in C with basic string functions. They are both simpler and faster than the treebest parser.

2
Entering edit mode
10.0 years ago
Jerven ▴ 650

One of the problems is that most often the dataformat is badly specified and with enough exceptions that a fully regular grammar can't capture all you need. As an example see the swiss-prot format or any fasta file format (at least what is in the header etc...). Which is why grammars fit awkwardly, besides that few bio-informaticians were trained in creating or using them.

1
Entering edit mode
10.0 years ago

This is a really good point, surprisingly few bioinformatics tools implement fast parsers via a parser generators. I am not sure why that is, I suppose people want to avoid adding another dependency when the file format appears to be simple.

Occasionally a number of solutions are proposed with flex/yacc/bison for example: http://www.biostars.org/search/?q=flex+OR+yacc+OR+bison&t=all&submit=

1
Entering edit mode
10.0 years ago

One problem with writing parsers is that people don't follow specifications, relying on the fact that the data are human-readable. Some research labs or projects will persist with the incorrect use of a format, because they are familiar with it. Grammars can't easily work around those problems.

0
Entering edit mode

But if you writing data not conforming the specification, nothing can work well. The questions becomes: Which approach can be easier modified? And that would probably be 'handwritten'.

1
Entering edit mode
10.0 years ago
Hernán ▴ 220

I use a packrat parser framework called PetitParser. It is recursive descent, supports memoization for speed, is 100% unit tested and already available in BioSmalltalk, a bioinformatics library I am writing for the Smalltalk programming system. The parser library has been applied to write parsers for both formal and natural languages due to its flexibility (grammar reuse, composition, transformation, pure Smalltalk syntax, etc). If you already have a specification (i.e. BNF) it is pretty straightforward to generate a parser. I have never experimented performance problems. For example, for matching in Smalltalk you just evaluate:

#dnaSequence asParser matches: 'TCGTACGA'.


and for parsing:

#dnaSequence asParser parse: 'TCGTACGA'.


which simply translates to something like:

(PPPredicateObjectParser anyOf: 'GATCNgatcn') plus end matches: 'actg545'.    " false "

(PPPredicateObjectParser anyOf: 'GATCNgatcn') plus end matches: 'TCGTACGA'. " true "


Aside from that idiom, my experience is that I have suffered learning ANTLR and the Spirit parser framework, although not for doing bioinformatics. Spirit it is an "object-oriented" parser library (that is, given the limitations of C++) but is the closest thing to PetitParser I have found. But they are good choices for those who do not pursue more than satisfying the "language + library = new program" equation, and there are lot of nice intentions in the people behind them.

0
Entering edit mode

For regexps, I should definitely recommend RE2C. It is fast, one of the fastest parser generators for RE known (as fast as or faster then handwritten parser, for it generates direct-executable, not table-based parsers) and very flexible (it only takes some time to learn how it works).