Adding named matches to parser rules

One of the nice features of my early regular expression based parsers was that it was easy to get data out by defining sub-matches in the rules. For example, to match the start of a class (using PHP preg_match syntax,) a rule might be defined as

‘/class\s+(?<name>[A-Za-z_]\w*)\s*{/’

Regular expression match of the start of a class, with named sub-match ‘name’.

This simple example only returns one named sub-match, the class name, however it is possible to return a large number of sub-matches. In a more eBNF like syntax, the same match could be defined as something like:

KWD_class IDENT ‘{‘

eBNF equivalent (where KWD_class and IDENT are defined by the lexer.)

As I want to use these eBNF like rules in a similar way to regular expressions, to create a sort of mix between a generated parser and a hand written recursive decent parser, returning sub-match information is essential, and named-matches make that easier to work with. For this reason I’ve added extra syntax to my dialect of eBNF. The PCRE syntax used in PHP didn’t fit, as it requires parentheses, which I’d rather only use when really needed for clarity. My solution id simply to prefix rules or groups that are to be sub-matches with a variable name followed by a colon.

KWD_class name:IDENT ‘{‘

The above eBNF, with a named sub-match.

If course, there is more information available than just the matched string, and in some cases I’d like to know what lexer rule was matched (e.g. integer or identifier), without having to re-check. For this reason the match data is slightly richer than that returned from a PCRE regular expression.

{
    _value: “class myClass {“,
    _rule: false,
    _errors: [],
    name: {
        _value: “myClass”,
        _rule: “IDENT”
    }
}

JSON match data for above rule run on string ‘class myClass {‘

A limitation of PCRE matches is that when a list is matched, only a single string can be returned, so depending on where the sub-match is, you will either get the whole list as a single string, or the last entry. To get round this I borrows a bit of syntax from PHP. In PHP when an array variable followed by [] is assigned a value, that value is appended as a new member of the array.

KWD_class name:IDENT (KWD_implements interface[]:IDENT(‘,’  interface[]:IDENT)*)? ‘{‘

List example rule, which matches a variable number of interface names.

class myClass implements iA, iB, iC {

Example input for list example rule.

{
    _value: “class myClass implements iA, iB, iC {“,
    _rule: false,
    _errors: [],
    name: {
        _value: “myClass”,
        _rule: “IDENT”
    },
    interface: [
        {
            _value: “iA”,
            _rule: “IDENT”
        }, 
        {
            _value: “iB”,
            _rule: “IDENT”
        }, 
        {
            _value: “iC”,
            _rule: “IDENT”
        }]
}

JSON representation of the match data for the list example above.

A further complexity, that regular expressions do not have, is that these eBNF rules can be, and indeed, in real examples almost always are, nested. This adds further complexity, but I think my solution works well enough. To demonstrate, I’ll split the above example into two rules.

identList : identifiers[]:IDENT(‘,’  identifiers[]:IDENT)*

KWD_class name:IDENT (KWD_implements interfaces:identList)? ‘{‘

A version of the above, with the interface names coming from a reusable identifier list rule.

In this example, the nested rule, identList is assigned to a named variable, and its match data will be held there.

{
    _value: “class myClass implements iA, iB, iC {“,
    _rule: false,
    _errors: [],
    name: {
        _value: “myClass”,
        _rule: “IDENT”
    },
    interfaces: {
        _value: “iA, iB, iC”,
        _rule: “identList”,
        _errors: [],
        identifiers: [
            {
                _value: “iA”,
                _rule: “IDENT”
            }, 
            {
                _value: “iB”,
                _rule: “IDENT”
            }, 
            {
                _value: “iC”,
                _rule: “IDENT”
            }]
    }
}

JSON output showing nested rule data.

If there had been no name assigned to the list of interface identifier names, the variable identifiers would just cascade down into the root, as shown below.

{
    _value: “class myClass implements iA, iB, iC {“,
    _rule: false,
    _errors: [],
    name: {
        _value: “myClass”,
        _rule: “IDENT”
    },
    identifiers: [
        {
            _value: “iA”,
            _rule: “IDENT”
        }, 
        {
            _value: “iB”,
            _rule: “IDENT”
        }, 
        {
            _value: “iC”,
            _rule: “IDENT”
        }]
}

JSON output showing nested rule data, but no nested rule variable.

I’m quite pleased with how well this works. This variant of eBNF, working with an interpreter, creating a form of PEG grammar for sections of code is working well as a way of creating simple, maintainable parsers fairly quickly. At the moment the actual parsers are written in PHP, but the intension is to have a little parser definition language that allows even easier design of parsers for other little domain specific languages, an of course, for defining itself.

Niall B's blog

Adding named matches to parser rules

Leave a Reply Cancel reply