Writing source code for human readers

I’ve been trying to recreate Alex Kladov’s implementation of the Pratt parser algorithm, described in “Simple but Powerful Pratt Parsing“, in PHP, to make sure I fully understand it. Kladov’s implementation is in Rust. There are also translations to Python and TypeScript linked from his github archive. I set out trying to read the three implementations, in languages I’m not fluent in, rather than the description, while doing my own implementation. Perhaps that was a mistake, but it did get me thinking about how programming style affects readability, and how programs should be written if their purpose is to demonstrate an algorithm. Kladov’s blog provides a good, though wordy, description of the algorithm, and I expect the other two implementations were, like mine, written to help the authors fully understand it, so it is understandable that the code is not written for readability, but also frustrating.

In 1968 Edsger Dijkstra wrote a letter to the editor of Communications of the ACM, “go to statement considered harmful[1], which has been fairly influential, and many high level languages, including Rust, Python and TypeScript, follow his advice to not have a goto statement. Except, they sort-of do really…

A few years later, William A Wulf wrote a paper “A case against the GOTO[2], which argued that there are some situations where goto is valid, but these should be supported with alternative language constructs.

The three languages I mentioned above have the break, and the continue statements for jumping from inside loops, which are really just special cases of goto under different names. TypeScript/JavaScript is particularly blatant, as it allows continue to jump to a flag, i.e. it is just goto renamed, and slightly constrained.

On the whole I’m more in agreement with Dijkstra than Wulf: I think both break (other than in case statements) and continue tend to lead to hard to read code, and are almost never needed. Modern editors make code with nested if statements much more readable, and the condition of an if gives a statement of intent that is missing with the pseudo gotos. Another use of break is to avoid multiple returns – there are times when that’s a good idea, but, in my opinion, there are a lot more occasions when returning as soon as the routine is finished leads to much clearer code. Anthony Steele’s blog post “The Single Return Law” discusses this.

All three of the examples I’ve been looking at make use of break to avoid multiple returns, and two also use continue, all in a very large loop, which makes reading the code, while writing an equivalent tricky.

Of course, other programmers may well disagree, and if the style they’re used to involves using break and continue to perform goto-like jumps to avoid a couple of extra ifs, and returns, maybe they find that more readable.

The three implementations of Pratt’s parser algorithm are also free of comments, though it could be argued that Kladov’s detailed description suffices.

The use of lots of formatted text describing adjacent pieces of code has some resemblance to Donald Knuth’s Literate Programming, an example of which is in one of Jon Bentley’s programming pearls, “A literate Program”[3]. (Note, in the PDF of this there is a translation of := (assignment equals) to ←, which makes the code hard to read for those familiar with Pascal.) However, Knuth’s literate programming created a document that could be processed to create TeX, and hence PDFs for printing, or source code, while this is effectively two separate documents with significant duplication.

In Code Complete, McConnell[4] has a chapter on “Steps in Building A Routine”, which suggests writing out the process in an English like language, “program design language” or PDL, and then writing the routine. The PDL should give the meaning rather than implementation detail of the code. I’ve found this a useful approach in the past when writing something complex, and the PDL then becomes rather effective comments.

In the end, rather than trying to debug my PHP, I wrote out a PDL like description of the Pratt algorithm, probably rather closer to pseudo-code than McConnell would approve of. This was an easier process than translating direct to PHP, and I was then able to retro-fit that to my PHP, and quickly spotted most of my errors.

Having does this, I realise my own expression algorithm is closer to Dijkstra’s shunting yard than Pratt’s recursive version. At the moment my algorithm and mini-grammar definition don’t support prefix and postfix operators, or the ternary operator ( a ? b : c ), both of which are supported in these examples. I’ll be adding prefix and postfix operators to both the grammar and the interpreter soon. The ternary operator (or rather potential family of operators,) will take a bit of thought. For now it can be included in the PEG grammar rules, but cannot be used inside another expression.

function expr_bp(min_bp) returns Atom or Cons variables: token : a token from the lexer op : a token renamed, once it’s certain it’s an operator… lhs : left hand side of current expression, an Atom or Cons rhs : right hand side of current expression, an Atom or Cons mhs : ‘middle’ hand side, used for the ternary operator ( lhs ? mhs : rhs ) l_bp : left ‘binding power’ r_bp : right ‘binding power’ token = read next token if token is an Atom lhs = token else if token is ‘(‘ lhs = expr_bp(0) take next token and assert it is ‘)’ else if token is another op // py 78, rust 9, ts, 71 op = token r_bp = prefix_binding_power(op), python code doesn’t deal with error (non-prefix token) here. rhs = recursivly call expr_bp limited by r_bp lhs = Cons(tuple) (op, rhs) else report an error and give up. loop token = peek at next token in lexer. if token is EOF break out of loop else if token is an operator //py 89, rust 20, ts 87 op = token else report an error and give up. if op aka token has a postfix binding power l_bp = token’s postfix binding power if(l_bp < min_bp) break out of the loop, ‘single return point rule’ (i.e. return lhs) consume next token from lexer, i.e. the one that was being used, so it’s gone if op is ‘[‘ rhs = the contents of [ ] got by recursively calling expr_bp lhs = a new Cons(tuple), consisting of ‘[‘ and rhs consume a token from lexer, asserting that it is ‘]’ else lhs = a new Cons(tuple), consisting of op and a tuple of the previous value of lhs and null //py 106, rust 37, ts 107 at this point rust and ts jump to next turn of loop, py uses elif (like this) else if token has an infix binding power l_bp, r_bp = the infix binding powers for left and right if l_bp is less than min_bp break out of the loop, ‘single return point rule’ (i.e. return lhs) consume next token from lexer, you jumped the previous point where this was done if op is ‘?’ (Ternary operator needs special handling) mhs = the ‘true’ result option got by recursively calling expr_bp consume a token from lexer, asserting that it is ‘:’ rhs = the ‘false’ result option got by recursively calling expr_bp lhs = a new Cons(tuple) op, (lhs, mhs, rhs) else (all the simple two parameter ops) rhs = recursively call expr_bp limoted by r_bp, to get any higher binding power’s that need dealt with first lhs = Cons(op, (lhs, rhs)) else end the loop return lhs, which should be the expression, or the sub expression this call was for.
PDL/Pseudo-code for the main routing in Pratt’s algorithm
  1. Dijkstra, Edsger W. “Letters to the editor: go to statement considered harmful.” Communications of the ACM 11.3 (1968): 147-148. ↩︎
  2. Wulf, William A. “A case against the GOTO.” Proceedings of the ACM annual conference-Volume 2. 1972. ↩︎
  3. Bentley, Jon, Don Knuth, and Doug McIlroy. “Programming pearls: A literate program.” Communications of the ACM 29.6 (1986): 471-483 ↩︎
  4. McConnell, Steve. Code complete. Pearson Education, 2004. ↩︎

Leave a Reply

Your email address will not be published. Required fields are marked *