I recently came across the book “Domain-Specific Languages: Effective Modeling, Automation, and Reuse” by Andrzej Wąsowski and Thorsten Berger[1], which provides a very useful overview of current standard practice. The chapter on code generation is of particular interest, as this is an area where I’ve struggled to find much in the literature. (The code generation chapter has 13 references, the second lowest of any chapter in the book. The average number of references per chapter is just over 43.)
The chapter starts off with the claim that writing an interpreter is “often the cheapest and the easiest way of implementing dynamic semantics.” I’m assuming that by “dynamic semantics” the authors mean something like a macro language that can change the behaviour of a program, rather than the philosophical meaning, which relates to how the meaning of natural language can vary with context. Given that assumption, an interpreter seems the obvious solution, but they also list reasons for avoiding interpreters, such as speed and memory use. This is the first example of a recurring minor irritation in the chapter: a tendency to conflate specific use cases and general principles. It becomes clear later that the authors’ view is the more general interpretation: that writing an interpreter for a DSL is often easier than writing a code generator. From my own experience, I can’t find any reason to disagree with them, but at the same time I feel this is a result of immature tools for code generation, rather than it genuinely being a harder problem.
Code generation is described as a “model to text” (M2T) transformation, and the chapter describes three approaches to this: using the visitor pattern to programmatically generate code based on the structure of the input; using a template based on the intended output; and a hybrid approach, mixing the other two. Rather than describing the examples included for each, I’ll pick out a few points that struck me as interesting. Although the description of the process as a model to text transformation is accurate, I feel that it also gives an impression of relative simplicity, like generating skeleton code from UML, rather than conveying the full potential of transforming the abstract syntax tree generated from a specialised language into a general-purpose language.
The chapter suggests that the first stage of developing a code generator should always be to write a “reference output implementation”. This aligns with my practice, though up until now I have been referring to these as “exemplars”.
The first example uses the visitor pattern and recursion. The example seems to have a lot of code for very little output; however, in a larger project a lot of the code would be reused, so the generator would grow very much more slowly than its output. The structure of the generator is also close to the structure of an interpreter. The authors note that memory management can be a challenge with this approach to code generation, as it involves creating and concatenating large numbers of strings. Languages such as Java that create a new object every time two strings are joined can consume memory faster than their garbage collectors free it. The solution suggested is to use a simple string tree class.
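To make the idea concrete, here is a minimal sketch of a recursive generator that builds a string tree instead of concatenating strings. It is not the book's code: the expression model (Num, Add, Mul), the StrTree class and the ExprGen object are all names I have made up for illustration, and I use Scala pattern matching where the book's example uses the visitor pattern. I am also assuming that a "simple string tree" means a structure that holds fragments by reference and flattens them once at the end.

```scala
// A tiny expression model (hypothetical; not taken from the book).
sealed trait Expr
case class Num(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

// A simple string tree: fragments are linked together, not copied,
// until the whole output is flattened once at the end.
sealed trait StrTree {
  // Flatten the tree into one string using a single StringBuilder.
  def render: String = {
    val sb = new StringBuilder
    def go(t: StrTree): Unit = t match {
      case Leaf(s)     => sb.append(s); ()
      case Node(parts) => parts.foreach(go)
    }
    go(this)
    sb.toString
  }
}
case class Leaf(text: String) extends StrTree
case class Node(parts: List[StrTree]) extends StrTree

object ExprGen {
  // Recursive walk over the model with one case per node type,
  // which is the role the visit methods play in the visitor pattern.
  def gen(e: Expr): StrTree = e match {
    case Num(v)    => Leaf(v.toString)
    case Add(l, r) => Node(List(Leaf("("), gen(l), Leaf(" + "), gen(r), Leaf(")")))
    case Mul(l, r) => Node(List(Leaf("("), gen(l), Leaf(" * "), gen(r), Leaf(")")))
  }
}
```

Calling ExprGen.gen(Add(Num(1), Mul(Num(2), Num(3)))).render gives the text (1 + (2 * 3)). No intermediate strings are built while walking the model; the only copying happens in render.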
One thing that the authors suggest is using a pretty-print library as part of the code generator. This makes a lot of sense, provided the pretty printer is sophisticated enough to work with representations of blocks. However, such a printer is very likely to be fairly language-specific as well, so it may not be so good for a general-purpose code generator library.
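As a rough illustration of what I mean by a representation of blocks, here is a minimal sketch of a pretty printer over a small document model. The names Doc, Line, Block and Pretty are hypothetical and not from the book or any particular library. Note how the brace style is hard-wired into the renderer, which is exactly the language-specific part.

```scala
// A minimal document model for a pretty printer (hypothetical names).
sealed trait Doc
case class Line(text: String) extends Doc                     // one line of output
case class Block(header: String, body: List[Doc]) extends Doc // header { body }

object Pretty {
  // Render a document, indenting each nested block by `indent` spaces.
  // The "{" and "}" delimiters assume a C-like target language.
  def render(doc: Doc, indent: Int = 2): String = {
    def go(d: Doc, depth: Int): List[String] = {
      val pad = " " * (depth * indent)
      d match {
        case Line(t) => List(pad + t)
        case Block(h, body) =>
          (pad + h + " {") :: body.flatMap(go(_, depth + 1)) ::: List(pad + "}")
      }
    }
    go(doc, 0).mkString("\n")
  }
}
```

The generator then only has to say what is nested inside what, for example Block("class A", List(Line("int x;"))), and the indentation follows from the structure.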
The first template example uses Scala as an implementation language, which is a particularly good language for this purpose. Scala includes interpolated strings that can have function calls as well as variables embedded in them. It also has a multiline string syntax that makes it rare to need escape characters, and a stripMargin method that allows a character to mark the real start of indented lines. This is a huge aid to readability.
The sample above shows some of these features of Scala. This is a function exampleGen that takes an object of type AnObject as a parameter and returns a string. There is no explicit return statement: the last expression, in this case the string, is taken as the return value. The string is declared with three quotes to indicate that it is a multiline string, and these are preceded by an s to indicate that it is interpolated. The $ character is used to indicate interpolated variables, with { } optionally around them for readability; the ${ ... } syntax also allows expressions to be included.
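The features described here can be sketched in a few lines. This is my own minimal example and not the listing from the book: only the names exampleGen and AnObject come from the description above, while the fields of AnObject, the Templates object and the Java-like output are assumptions made for illustration.

```scala
// Hypothetical model class; the fields are assumptions for illustration.
case class AnObject(name: String, fields: List[String])

object Templates {
  // Takes an AnObject and returns the generated text.
  // There is no return statement: the final expression is the result.
  def exampleGen(obj: AnObject): String = {
    // One generated declaration per field, joined with newlines.
    val fieldLines = obj.fields.map(f => s"  int $f;").mkString("\n")
    // s"""...""" is an interpolated multiline string; stripMargin removes
    // everything up to and including the leading | on each line.
    s"""|class ${obj.name} {
        |$fieldLines
        |  int size() { return ${obj.fields.length}; }
        |}""".stripMargin
  }
}
```

For AnObject("Point", List("x", "y")) this produces a four-line class Point with two int fields and a size() method returning 2. The template can be indented to match the surrounding Scala code without that indentation leaking into the output.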
Scala is not a language I’ve ever looked at before, but I think I should have a closer look to see if there are other concepts that could be useful for a template language.
Another template example uses the specialist Xtend/Xpand template language, which extends Java. It looks conceptually similar to my older templates, other than being embedded in a conventional language, which I feel makes it less readable.
Following the examples there are some guidelines for implementing DSL code generators. Most of these are very sensible, although a few go back to the pattern of confusing the specific and the general. One thing they recommend (Guideline 9.4) is doing a binding time analysis. This is something I realise I have always done, but never really thought about as being a formal named stage of the process.
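As I understand the term, binding time analysis means deciding, for each piece of information, whether it is known when the generator runs or only when the generated code runs. The sketch below is a minimal illustration of that decision under this assumption; the Binding, GenTime and RunTime types, the BindingDemo object and the Java-like output are all hypothetical and not taken from the book.

```scala
// When a value becomes known (hypothetical classification for illustration).
sealed trait Binding
case class GenTime(value: Int) extends Binding   // known when the generator runs
case class RunTime(name: String) extends Binding // known only when the output runs

object BindingDemo {
  // Emit a Java-like method: a generation-time value is baked in as a
  // literal, while a run-time value becomes a parameter of the method.
  def genScale(factor: Binding): String = factor match {
    case GenTime(v)    => s"int scale(int x) { return x * $v; }"
    case RunTime(name) => s"int scale(int x, int $name) { return x * $name; }"
  }
}
```

With GenTime(3) the factor disappears from the generated interface entirely; with RunTime("k") the generated code has to carry it around as a parameter.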
The final section of the chapter is on quality assurance and testing. The authors note that this is quite a challenging area, and although it is only a fairly short section, I think a proper discussion of this deserves a separate blog post.
[1] A. Wąsowski and T. Berger, Domain-Specific Languages: Effective Modeling, Automation, and Reuse, 1st ed. 2023. Cham: Springer International Publishing, 2023. doi: 10.1007/978-3-031-23669-3.