Evolving a parser design strategy and tools

(For some background/context to this post see my Generative Programming page.)

A slightly unusual aspect of my parser authoring tools is that there are three different levels of design involved. The classic parser writing tools, Lex and Yacc, separate the process into two levels: the lexer, which converts the input into tokens, roughly equating to converting characters into words and punctuation in normal reading, and then the parser, which recognises the structure, the sentences and paragraphs. Some other tools, for example ANTLR, merge the two parts into a single level. My tools, however, have three levels, for now at least. The reason for this is the way they evolved, but I think there is also an advantage to it, as it works well with the way I, and I assume others, read and think about source code.
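To make the two classic levels concrete, here is a deliberately tiny sketch (nothing to do with my actual tools, and with an invented token set): the lexer turns characters into a flat list of tokens, and the parser then checks that those tokens form a recognised structure.

```php
<?php
// Toy lexer: characters -> tokens.
function toyLex(string $src): array {
    preg_match_all('/\s*([A-Za-z_]\w*|\d+|[=+*();])/', $src, $m);
    return $m[1];                      // e.g. ['total', '=', 'price', '*', 'qty']
}

// Toy parser: tokens -> structure (here, just the shape "name = name * name").
function toyParse(array $tokens): bool {
    return count($tokens) === 5
        && ctype_alpha($tokens[0][0])
        && $tokens[1] === '='
        && $tokens[3] === '*';
}

var_dump(toyParse(toyLex('total = price * qty')));   // bool(true)
```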

My early parsers (other than the much earlier attempt at a BASIC interpreter mentioned in my previous post) were designed for fairly trivial languages. The first one made use of PHP’s inbuilt tokenizer to lex the input, and then passed the list of tokens to a simple hand-written parser.
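The shape of it was roughly this (a simplified sketch rather than the original code): PHP’s built-in token_get_all() does the lexing, and a hand-written loop walks the resulting token list.

```php
<?php
// Lex a fragment of PHP with the built-in tokenizer, then walk the tokens by hand.
$tokens = token_get_all('<?php $total = $price * $qty;');

foreach ($tokens as $token) {
    if (is_array($token)) {
        // Multi-character tokens come back as [id, text, line].
        [$id, $text] = $token;
        echo token_name($id), ' => ', trim($text), PHP_EOL;
    } else {
        // Single-character tokens (operators, punctuation) are plain strings.
        echo 'char => ', $token, PHP_EOL;
    }
}
```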

There was a substantial gap before my next attempt to create a parser, which used regular expressions to create a more optimised lexer. Over a number of iterations I gradually increased the complexity of the regular expressions, so that rather than matching tokens they would match common phrases, putting variable values into named sub-matches. I also created a tool, and indeed a little language, for defining these lexers. As my lexers got more powerful, my handwritten parsers became simpler, and eventually it became clear that only a tiny bit of code after each successful match was needed to create a parser. The obvious solution seemed to be to just extend my lexer definition language to support the limited amount of further functionality required.
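To give an idea of what a phrase-level pattern looks like (a made-up example, not one of my actual lexer definitions): a whole assignment statement is matched in one go, the interesting parts are captured as named sub-matches, and only a tiny action is needed afterwards.

```php
<?php
// One "phrase" pattern instead of individual tokens: a simple assignment statement.
$phrase = '/^\s*(?<target>[A-Za-z_]\w*)\s*=\s*(?<value>[^;]+);/';

$assignments = [];
if (preg_match($phrase, 'count = items + 1;', $m)) {
    // The "tiny bit of code after each successful match":
    $assignments[$m['target']] = trim($m['value']);
}

print_r($assignments);   // Array ( [count] => items + 1 )
```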

However, although this worked quite well up to a point, it had one major flaw: as my regular expressions became more complex, reused sections became more common. If one of these sections had to change, for example the subexpression that matched a variable name (which could include nested arrays), updating it became really unwieldy. It became clear that I needed a much more readable syntax for my parser rules and, more importantly, the ability to include rules within other rules. The overall structure of the parsers, however, with rules matching phrases followed by very simple code to build the data representation of the code and change state, worked very well. It also created parsers that were very readable, apart from the over-complex regular expressions, and therefore maintainable and easy to extend.
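A contrived sketch of the problem: the shared sub-expression has to be pasted (or interpolated) into every phrase pattern that needs it, so changing it means revisiting and re-testing all of them.

```php
<?php
// A shared sub-expression for a variable name with (simplified) array subscripts.
// The real thing would also have to cope with nesting, making it longer still.
// Change this, and every pattern built from it has to be re-checked.
$varName = '[A-Za-z_]\w*(?:\[[^\]]*\])*';

$assignPhrase    = "/^(?<target>$varName)\s*=\s*(?<value>.+);/";
$incrementPhrase = "/^(?<target>$varName)\+\+;/";
$swapPhrase      = "/^swap\s+(?<a>$varName)\s*,\s*(?<b>$varName);/";
```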

My next iteration, which I see as the fourth generation in my development of code generation tools, returned to having a separate lexer feeding tokens to simple parser rules. The lexer is simple – the language writer just selects which common rules are needed for their language, and provides lists of keywords and operators. In most cases this is sufficient; occasionally custom regular expressions are also needed, but these are much simpler than the ones that were being used in the previous generation. The parser rules are a sort of mix between conventional eBNF notation and regular expressions. Like eBNF, they consist of a mix of terminal tokens and other rules, with quantifiers and, where appropriate, grouping into alternative choices; but like regular expressions (and PEG parsers) the alternative choices have precedence, and, also like regular expressions, there are named sub-matches.
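The important difference from plain eBNF is that, as in a PEG, the alternatives are tried in order and the first one to match wins. A minimal sketch of that idea over a list of tokens (invented helper shapes, not the generated code):

```php
<?php
// Ordered choice: try each alternative in turn, commit to the first that matches.
// Each alternative takes (tokens, position) and returns either
// [named-sub-matches, new-position] or null.
function orderedChoice(array $alternatives, array $tokens, int $pos): ?array {
    foreach ($alternatives as $alt) {
        $result = $alt($tokens, $pos);
        if ($result !== null) {
            return $result;    // first match wins; later alternatives are never tried
        }
    }
    return null;
}

$keywordIf  = fn(array $t, int $p) => ($t[$p] ?? null) === 'if' ? [['kind' => 'if'], $p + 1] : null;
$identifier = fn(array $t, int $p) => preg_match('/^\w+$/', $t[$p] ?? '') ? [['name' => $t[$p]], $p + 1] : null;

// 'if' matches the first alternative, so the identifier alternative is never consulted.
var_dump(orderedChoice([$keywordIf, $identifier], ['if', 'x'], 0));
```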

Initially my intention was to use the parser rules very much like the regular expressions in my previous generation of parsers: there would be a lexer to create an initial list of tokens, and then rules collected into a number of states, each of which would be followed by a small amount of code to be run when it matched. A sort of halfway house between a conventional machine-generated parser and a handwritten recursive descent parser, with many of the advantages of both. As I created the tool to help me write these rules, it became clear that for test purposes I needed something slightly more: a final stage with very simple rules that would provide the structure of the eventual parser, so that I could see that my set of rules was sufficiently complete to parse my examples. Initially I intended this to be just a throwaway test system, but when it came to creating actual parsers, I found it easiest to generate code from it to provide the structure. At the moment it is a fairly “Heath Robinson” setup, and the final parsers are created in PHP, but my intention is to create a proper interpreted little domain specific language (as the previous generation had) to create complete parsers for generative programming tools.
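Roughly, the intermediate shape I had in mind looks like this (invented names, heavily simplified, with the rule matching itself elided): each state holds a list of rule/action pairs, and the small action runs when its rule matches.

```php
<?php
// Each state is a list of rule/action pairs; the action is the small piece of
// code run when that rule matches.
$states = [
    'top' => [
        ['rule' => 'function_header', 'action' => function (array $m, array &$ast, string &$state) {
            $ast[] = ['fn' => $m['name'], 'body' => []];
            $state = 'body';                     // a match can switch the current state
        }],
    ],
    'body' => [
        ['rule' => 'assignment', 'action' => function (array $m, array &$ast, string &$state) {
            $ast[count($ast) - 1]['body'][] = [$m['target'], $m['value']];
        }],
        ['rule' => 'end', 'action' => function (array $m, array &$ast, string &$state) {
            $state = 'top';
        }],
    ],
];
```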

This is the current state, and why my parser development has ended up with three levels. The rules for the final level are much simpler than the eBNF-like parser rules, but are collected into states, and have the extra feature of being able to trigger a change of state. They are, in effect, the structure of a recursive descent parser, with each state becoming a method. While the other two levels of my development are very clearly part of the final parser, this third stage is really just an outline that can be used to generate the structure. I think once an initial parser has been developed this bit should really be thrown away, and any further modifications at this level made directly in the code.
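In other words, the generated outline amounts to something along these lines (a hypothetical skeleton with invented names, not real generated output), with one method per state and the detail then filled in, and later edited, directly in the code:

```php
<?php
// Skeleton of the recursive-descent structure: one method per state.
class ParserSkeleton
{
    public function __construct(private array $tokens, private int $pos = 0) {}

    // State "top": try each of its rules in order, descending on a match.
    public function parseTop(): void {
        while ($this->pos < count($this->tokens)) {
            // ... try the "top" rules; a match may trigger the change to "body" ...
            $this->parseBody();
        }
    }

    // State "body": entered when a "top" rule triggers the state change;
    // returning is the change back to "top".
    public function parseBody(): void {
        // ... try the "body" rules ...
        $this->pos++;
    }
}
```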

The third iteration of my parser development process was fantastic for being able to update and merge languages, and I need to do some work to make sure that this advantage carries through to the fourth (and probably final) iteration.
