Looking at Guido van Rossum’s PEG blogs (part 1)

[Header image: a semi-abstract painting representing source code being processed into a syntax tree.]

A few weeks ago my supervisor, Jeremy, suggested that I read through Guido van Rossum’s series of blog posts about PEG parsers. I’ve only just got round to reading them properly, but I think they’re going to be very useful for me. The posts were written at the start of the process of rewriting the Python parser, and so are focused on addressing real problems. They are also written in a very clear style, which makes the concepts being described much easier to understand.

By reading this series, I hope to gain a better understanding of where my parsers sit within the taxonomy of parsers, and also a clearer picture of what I’ve done well and what could be improved.

Van Rossum starts the blog series with an explanation of why he started looking at PEG parsing: the original Python parser used a variant of the LL(1) algorithm, and was generated from custom EBNF notation by a custom generator, pgen. Two of the limitations of this were the single-token lookahead of LL(1) and a lack of support for left-recursive rules. To illustrate these, and also for later discussion, he uses this toy language grammar:


statement: assignment | expr | if_statement
expr: expr '+' term | expr '-' term | term
term: term '*' atom | term '/' atom | atom
atom: NAME | NUMBER | '(' expr ')'
assignment: target '=' expr
target: NAME
if_statement: 'if' expr ':' statement
Fig 1. Guido van Rossum’s toy language grammar

Left-recursive rules are rules that have recursion as their first term, e.g. term: term '*' atom | term '/' atom | atom. The solution is to rearrange the rule into a form that avoids left recursion and uses repetition instead, e.g. term: atom (('*' | '/') atom)*. Left recursion is not supported in any regular expression language that I know of, and in part two of the series van Rossum mentions that PEG grammars are usually used to generate recursive descent parsers, which do not (yet) support left-recursive grammars; his point here is possibly that he’d already used slightly PEG-like syntax to simplify pgen. Like pgen and regular expressions, my parser rules don’t support left recursion. The same rearrangement van Rossum used is available to me, but I’ve also largely avoided the issue by using a Pratt parser for expressions. The lookahead limitation vanishes with PEG grammars, as they have whatever lookahead is required.
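To make the rearrangement concrete, here is a small sketch of how the rewritten term rule can be parsed with a simple loop over a pre-tokenised list. This is purely illustrative and not code from the blog (my own expression parsing uses a Pratt parser instead); parse_atom is kept trivial, and the loop folds operators to the left so the usual associativity of '*' and '/' is preserved.

# Illustrative sketch only: parsing the rewritten rule
#   term: atom (('*' | '/') atom)*
# over a pre-tokenised list, without any left recursion.

def parse_term(tokens, pos=0):
    """Return (node, next_pos) for a term starting at tokens[pos]."""
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ('*', '/'):
        op = tokens[pos]
        right, pos = parse_atom(tokens, pos + 1)
        node = (op, node, right)        # fold left, keeping left associativity
    return node, pos

def parse_atom(tokens, pos):
    return tokens[pos], pos + 1         # assume every atom is a single token here

print(parse_term(['2', '*', '3', '/', '4']))
# (('/', ('*', '2', '3'), '4'), 5)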

In the form of a definition for my parser, the toy example is slightly simplified and no longer has left recursion, because expressions use my simple Pratt parser definition rather than an EBNF-like one. The rest is very similar, though I’ve used my lexer’s default names for tokens, IDENT, INT and FLOAT, rather than NAME and NUMBER.


statement: assignment | expr | if_statement
expr: @expression IDENT, INT, FLOAT; ('+', '-'), ('*', '/')
assignment: target '=' expr
target: IDENT
if_statement: 'if' expr ':' statement
Fig 2. Van Rossum’s toy language recast into my grammar

In the second blog post van Rossum starts discussing the implementation of a PEG parser. This is where there is a major difference between our approaches, due to our differing end targets. The Python parser needs to be fast, as it runs at the start of every Python program’s execution, but it is also very stable, only being updated when the Python language changes. My parsers are intended to be used by developers, on small programs in small languages, so speed is not an issue, but the assumption is that these little languages will be modified to fit different requirements. So van Rossum treats a rule as a definition from which to generate code, whereas I take a rule as the input to an interpreter. One similarity is that we have both opted to have separate lexers, while many modern parser generators, e.g. ANTLR, define lexing within the grammar itself. Van Rossum’s reason was the complexity of tokenising Python, a language where white-space has meaning. I had a few reasons:

  • I felt that the grammars would be more readable, and therefore more maintainable, if lexing was defined separately.
  • I wanted the lexer definition to be able to generate a syntax highlighter that could be used directly.
  • I felt that defining lexing in a grammar involved a lot of boilerplate that would be better provided in a reusable, and ideally pick-and-mix, form.
  • I wanted to support Python-style languages in a way that was easy to reuse, and so wanted structure-defining white-space rules to be available in a predefined form.

In the blog post, van Rossum shows pseudo-code for a recursive descent parser that could be generated from a PEG grammar for his toy language. One thing he discusses is how this interfaces with the tokeniser or lexer. He makes the point that if tokenising is fully carried out before parsing, the tokeniser may fail at a point well past the position where the parser would have failed, which can result in confusing errors. This isn’t something I had thought of, so at the moment my lexer runs first, and the parser only runs if lexing succeeds. Maybe I should rethink this.
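To give a flavour of that pseudo-code, here is a rough sketch, written from memory rather than copied from the blog, of what a couple of the generated rule methods might look like. Each grammar rule becomes a method that records the tokeniser position, tries its alternatives in order, and rewinds on failure; the mark/reset/expect interface is an assumption on my part, and the remaining rule methods (assignment, expr and so on) would follow the same shape.

# Sketch of the generated recursive descent style (names and interface assumed,
# not taken from van Rossum's posts). Each rule method returns a node on
# success, or None after rewinding the tokenizer on failure.

class ToyParser:
    def __init__(self, tokenizer):
        self.tok = tokenizer

    def statement(self):
        # statement: assignment | expr | if_statement
        pos = self.tok.mark()
        if (node := self.assignment()) is not None:
            return node
        self.tok.reset(pos)                 # backtrack before the next alternative
        if (node := self.expr()) is not None:
            return node
        self.tok.reset(pos)
        if (node := self.if_statement()) is not None:
            return node
        self.tok.reset(pos)
        return None

    def if_statement(self):
        # if_statement: 'if' expr ':' statement
        pos = self.tok.mark()
        if (self.tok.expect('if') is not None
                and (cond := self.expr()) is not None
                and self.tok.expect(':') is not None
                and (body := self.statement()) is not None):
            return ('if', cond, body)
        self.tok.reset(pos)
        return None

    # assignment(), expr(), target() and atom() are omitted here; each follows
    # the same mark / try alternatives / reset pattern.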

He goes on to discuss backtracking. His approach to tokenising adds a small extra complexity here, but it is easily solved by keeping an array of tokens in parallel with the on-demand tokeniser. Pgen created a ‘parse tree’, which was then processed to generate an AST; the new PEG approach has methods returning nodes, building an AST directly. My approach is more inspired by regular expressions, with rules returning a match data structure, which is similar to, but not the same as, an AST node, and requires less later processing.
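The token-array idea is simple enough to sketch. In my own illustrative version below (again, not the blog’s code), tokens are pulled lazily from the underlying lexer but cached in a list, so mark() is just the current index and reset() just moves it back; this is also the interface the parser sketch above assumes.

# Sketch of a backtracking-friendly tokenizer: tokens are read on demand from
# an underlying iterator but kept in a list, so rewinding is just an index move.

class Tokenizer:
    def __init__(self, token_iter):
        self.token_iter = iter(token_iter)   # the on-demand lexer
        self.tokens = []                     # every token seen so far
        self.pos = 0                         # current position in self.tokens

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def peek(self):
        while self.pos >= len(self.tokens):
            self.tokens.append(next(self.token_iter, None))   # None marks end of input
        return self.tokens[self.pos]

    def next(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, expected):
        if self.peek() == expected:
            return self.next()
        return None


# Example: rewinding after a failed alternative.
tok = Tokenizer(['if', 'x', ':', 'y'])
pos = tok.mark()
assert tok.expect('(') is None      # first alternative fails...
tok.reset(pos)
assert tok.expect('if') == 'if'     # ...rewind and try the next one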

In the third post, van Rossum starts describing how to generate the parser from a grammar. At some point in the future I might experiment with this, but for now I’m sticking with my interpreted alternative solution.
