Writing source code for human readers

I’ve been trying to recreate Alex Kladov’s implementation of the Pratt parser algorithm, described in “Simple but Powerful Pratt Parsing“, in PHP, to make sure I fully understand it. Kladov’s implementation is in Rust. There are also translations to Python and TypeScript linked from his github archive. I set out trying to read the three implementations, in languages I’m not fluent in, rather than the description, while doing my own implementation. Perhaps that was a mistake, but it did get me thinking about how programming style affects readability, and how programs should be written if their purpose is to demonstrate an algorithm. Kladov’s blog provides a good, though wordy, description of the algorithm, and I expect the other two implementations were, like mine, written to help the authors fully understand it, so it is understandable that the code is not written for readability, but also frustrating.

In 1968 Edsger Dijkstra wrote a letter to the editor of Communications of the ACM, “go to statement considered harmful“^[¹^], which has been fairly influential, and many high level languages, including Rust, Python and TypeScript, follow his advice to not have a goto statement. Except, they sort-of do really…

A few years later, William A Wulf wrote a paper “A case against the GOTO“^[²^], which argued that there are some situations where goto is valid, but these should be supported with alternative language constructs.

The three languages I mentioned above have the break, and the continue statements for jumping from inside loops, which are really just special cases of goto under different names. TypeScript/JavaScript is particularly blatant, as it allows continue to jump to a flag, i.e. it is just goto renamed, and slightly constrained.

On the whole I’m more in agreement with Dijkstra than Wulf: I think both break (other than in case statements) and continue tend to lead to hard to read code, and are almost never needed. Modern editors make code with nested if statements much more readable, and the condition of an if gives a statement of intent that is missing with the pseudo gotos. Another use of break is to avoid multiple returns – there are times when that’s a good idea, but, in my opinion, there are a lot more occasions when returning as soon as the routine is finished leads to much clearer code. Anthony Steele’s blog post “The Single Return Law” discusses this.

All three of the examples I’ve been looking at make use of break to avoid multiple returns, and two also use continue, all in a very large loop, which makes reading the code, while writing an equivalent tricky.

Of course, other programmers may well disagree, and if the style they’re used to involves using break and continue to perform goto-like jumps to avoid a couple of extra ifs, and returns, maybe they find that more readable.

The three implementations of Pratt’s parser algorithm are also free of comments, though it could be argued that Kladov’s detailed description suffices.

The use of lots of formatted text describing adjacent pieces of code has some resemblance to Donald Knuth’s Literate Programming, an example of which is in one of Jon Bentley’s programming pearls, “A literate Program”^[³^]. (Note, in the PDF of this there is a translation of := (assignment equals) to ←, which makes the code hard to read for those familiar with Pascal.) However, Knuth’s literate programming created a document that could be processed to create TeX, and hence PDFs for printing, or source code, while this is effectively two separate documents with significant duplication.

In Code Complete, McConnell^[⁴^] has a chapter on “Steps in Building A Routine”, which suggests writing out the process in an English like language, “program design language” or PDL, and then writing the routine. The PDL should give the meaning rather than implementation detail of the code. I’ve found this a useful approach in the past when writing something complex, and the PDL then becomes rather effective comments.

In the end, rather than trying to debug my PHP, I wrote out a PDL like description of the Pratt algorithm, probably rather closer to pseudo-code than McConnell would approve of. This was an easier process than translating direct to PHP, and I was then able to retro-fit that to my PHP, and quickly spotted most of my errors.

Having does this, I realise my own expression algorithm is closer to Dijkstra’s shunting yard than Pratt’s recursive version. At the moment my algorithm and mini-grammar definition don’t support prefix and postfix operators, or the ternary operator ( a ? b : c ), both of which are supported in these examples. I’ll be adding prefix and postfix operators to both the grammar and the interpreter soon. The ternary operator (or rather potential family of operators,) will take a bit of thought. For now it can be included in the PEG grammar rules, but cannot be used inside another expression.

function expr_bp(min_bp) returns Atom or Cons
    variables:
	   token : a token from the lexer
	   op : a token renamed, once it’s certain it’s an operator…
	   lhs : left hand side of current expression, an Atom or Cons
	   rhs : right hand side of current expression, an Atom or Cons
	   mhs : ‘middle’ hand side, used for the ternary operator ( lhs ? mhs : rhs )
	   l_bp : left ‘binding power’ 
	   r_bp : right ‘binding power’


    token = read next token
	if token is an Atom
	   lhs = token
	else if token is ‘(‘
	    lhs = expr_bp(0)     
	    take next token and assert it is ‘)’
	else if token is another op // py 78, rust 9, ts, 71
	    op = token
	    r_bp = prefix_binding_power(op), python code doesn’t deal with error (non-prefix token) here. 
	    rhs = recursivly call expr_bp limited by r_bp
		lhs = Cons(tuple) (op, rhs)
	else
	   report an error and give up.
	
	loop
	    token = peek at next token in lexer.
		if token is EOF
		    break out of loop
		else if token is an operator                 //py 89, rust 20, ts 87 
		    op = token
        else
            report an error and give up.

        if op aka token has a postfix binding power		
		    l_bp = token’s postfix binding power
			if(l_bp < min_bp)
			    break out of the loop, ‘single return point rule’  (i.e. return lhs)
			
			consume next token from lexer, i.e. the one that was being used, so it’s gone
			if op is ‘[‘
			    rhs = the contents of [ ] got by recursively calling expr_bp
				lhs = a new Cons(tuple), consisting of ‘[‘ and rhs
				consume a token from lexer, asserting that it is ‘]’
			else
			    lhs = a new  Cons(tuple), consisting of op and a tuple of the previous value of lhs and null
				//py 106, rust 37, ts 107 
			
		    at this point rust and ts jump to next turn of loop, py uses  elif (like this)
		else if token has an infix binding power
		    l_bp, r_bp = the infix binding powers for left and right
			if l_bp is less than min_bp break out of the loop,  ‘single return point rule’ (i.e. return lhs)
			consume next token from lexer, you jumped the previous point where this was done
			
			if op is ‘?’ (Ternary operator needs special handling)
			    mhs = the ‘true’ result option got by recursively calling expr_bp
				consume a token from lexer, asserting that it is ‘:’
				rhs = the ‘false’ result option got by recursively calling expr_bp
				lhs = a new Cons(tuple) op, (lhs, mhs, rhs)
            else (all the simple two parameter ops)				
			    rhs = recursively call expr_bp limoted by r_bp, to get any higher binding power’s that need dealt with first 
				lhs = Cons(op, (lhs, rhs))
		else
            end the loop
     
    return lhs, which should be the expression, or the sub expression this call was for.

PDL/Pseudo-code for the main routing in Pratt’s algorithm

Dijkstra, Edsger W. “Letters to the editor: go to statement considered harmful.” Communications of the ACM 11.3 (1968): 147-148. ↩︎
Wulf, William A. “A case against the GOTO.” Proceedings of the ACM annual conference-Volume 2. 1972. ↩︎
Bentley, Jon, Don Knuth, and Doug McIlroy. “Programming pearls: A literate program.” Communications of the ACM 29.6 (1986): 471-483 ↩︎
McConnell, Steve. Code complete. Pearson Education, 2004. ↩︎

Niall B's blog

Writing source code for human readers

Leave a Reply Cancel reply