Turn your scripting language into a code generator

Abstract:

Generating code is easy: Just use the print/write/puts statement of any programming language to send the target code to an output file. But as the code generator grows, readability suffers. Escape characters start taking over, and it becomes harder to see the difference between the target code and the generator code.

This article shows you how to make this easier by injecting an additional step: first generate the script that contains the print/write/puts statements, then run that script to produce the final output. This technique cleanly separates the target code (the final output) from the generator code, and it has some other benefits too.

Code generation, the traditional way

Let's begin with one of the most common preprocessors: CPP, the C preprocessor. It "generates code" by means of directives such as #ifdef and #include. This allows C and C++ developers to configure their code for various platforms and environments.

But CPP is not very powerful. Its only control flow is if-then-else; it has no loops, no file I/O, and no other means of real computation. It serves its purpose as a simple preprocessor, but it cannot be used for advanced code generation. We can do much better than that.

So you bring out your favorite scripting language. Sure enough, it has loops, file I/O and all the other stuff we need. In addition, it has a 'puts' or 'print' or 'writeLine' statement which sends a string to an output file. That's how scripting languages have generated code (or other structured text output) for ages.

Here's how you could do it in Tcl:

 1 foreach animal { Bear Cat Dog } {
 2    puts "class $animal : public Animal \{"
 3    puts "public:"
 4    puts "   $animal () \{"
 5    puts "      printf(\"Creating a new \\\"$animal\\\".\\n\");"
 6    puts "   \}"
 7    puts "   ~$animal ();"
 8    puts "\}"
 9    puts ""
10 }

You can do similar stuff in Python, Ruby, Lua and dozens of other languages.

This is already much more powerful than using CPP, because we now have a complete programming language available to generate code. But in spite of its power, this approach has serious drawbacks. You have to write "puts" all over the place, and you need to juggle a lot of backslashes to escape special characters. "Cumbersome" is an understatement: this code is ugly and hard to maintain. I probably don't have to tell you what happens when you forget a single backslash or quote somewhere.

The process of adding quotes, backslashes and "puts" is a drag. I call this the decoration of the target code: Instead of writing your target code the way it will look in the final output, you need to "decorate" it with additional stuff. The target code is hidden inside a lot of decoration that you frankly don't want to see.

So here's an idea: Why not write a little tool that automates the decoration for us?


Automatic decoration

My first attempt at such a tool was called g2pp. The name stands for something like "PreProcessor that Generates code in 2 steps". The first step produces the script with all the escape characters that we wrote by hand earlier; the second step executes this script to produce the final output.

The main "innovation" of this tool is that it operates in 2 different modes, rather like the 'vi' editor:

The g2pp tool was written specifically to use Tcl as the preprocessing language, but you can apply the same idea to other languages too. The output can be anything: C++, XML, or unformatted text. Paulo Ferreira, an engineer in Portugal, even uses g2pp to generate VHDL.

Here's a simple example of the tool's input:

example_01/animals.in
 1 foreach animal { Bear Cat Dog } {@
 2 class $(animal) : public Animal {
 3 public:
 4    $(animal) () {
 5       printf("Creating a new \"$(animal)\".\n");
 6    }
 7    ~$(animal) ();
 8 }
 9 @}

We start in copy mode, where we write a normal Tcl script. The 'foreach' loop is exactly the same Tcl code that we wrote by hand before. To produce output, you do not invoke "puts"; instead you switch to decoration mode using the '@' sign. Everything you write in decoration mode is automatically decorated by the tool. Values can be spliced into the output using the $(...) syntax. You stay in decoration mode until the next '@' sign, which takes you back to copy mode.

In this example, you could think of "copy mode" as "tcl mode", and "decoration mode" as "C++ mode".

When I feed the above snippet of code to g2pp, it produces the following Tcl script for me:

example_01/animals.tcl
 1 foreach animal { Bear Cat Dog } {
 2 puts "
 3 class ${animal} : public Animal \{
 4 public:
 5    ${animal} () \{
 6       printf(\"Creating a new \\\"${animal}\\\".\\n\");
 7    \}
 8    ~${animal} ();
 9 \}
10 "
11 }

You can see that the preprocessor basically copied the foreach loop verbatim from the original input (because it was written in copy mode). The body of the loop is now one large call to puts, produced by the '@' signs in the input. The escape characters, as you can verify, are all nicely in place. The variable values are now spliced with Tcl's syntax: ${...}. The string literal we send to 'puts' is spread over multiple lines, but we could easily rewrite the tool to produce separate lines, each with its own 'puts' in front.
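
The per-line transformation itself is simple. Here is a rough sketch (my own Tcl, not g2pp's actual source) of the kind of helper such a tool needs: it escapes the characters that are special to Tcl and wraps the result in a 'puts':

proc decorate_line {line} {
    # Escape backslashes, quotes, dollars, brackets and braces.
    # 'string map' does not rescan its own replacements, so a single
    # pass is enough.  A real tool would additionally translate
    # $(name) into ${name}, so that Tcl can still splice values in.
    set escaped [string map [list \\ \\\\ \" \\\" \$ \\\$ \[ \\\[ \{ \\\{ \} \\\}] $line]
    return "puts \"$escaped\""
}

# For example:
#   decorate_line {printf("Creating a new \"Bear\".\n");}
# returns:
#   puts "printf(\"Creating a new \\\"Bear\\\".\\n\");"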

That's it. There really isn't anything to it, is there? In fact, you can think of the g2 "language" as plain Tcl with only one additional "command": A very strange command called '@', which behaves like a rather intelligent replacement for "puts".

When we run this intermediate Tcl script, it produces the following output:

example_01/animals.out
 1 class Bear : public Animal {
 2 public:
 3    Bear () {
 4       printf("Creating a new \"Bear\".\n");
 5    }
 6    ~Bear ();
 7 }
 8 
 9 
10 class Cat : public Animal {
11 public:
12    Cat () {
13       printf("Creating a new \"Cat\".\n");
14    }
15    ~Cat ();
16 }
17 
18 
19 class Dog : public Animal {
20 public:
21    Dog () {
22       printf("Creating a new \"Dog\".\n");
23    }
24    ~Dog ();
25 }

And that's exactly the output we wanted. Note that the Tcl interpreter does most of the work: it reads and interprets the foreach loop and the 'puts' command, and performs the variable substitutions. We did not have to implement a new language; we just used an existing one. To use a language other than Tcl, the tool has to be slightly rewritten (mostly just to change 'puts' into something else, and to use a different syntax for variable substitution). We will look at more examples below.

So far, so good. Now for the bad news. First of all, this technique begins to break down for larger projects. After a while, the input is littered with '@' signs and it becomes difficult to tell whether you're in copy mode or in decoration mode. We could ease this pain by replacing the '@' signs with asymmetric markers such as '<@' and '@>', to make the transitions between the modes visually distinct. This would make our input look a lot like JSP or PHP, which use similar syntax to generate HTML pages.
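
With such markers, the animals input might look something like this (purely hypothetical syntax; g2pp itself does not support these markers):

foreach animal { Bear Cat Dog } {<@
class $(animal) : public Animal {
   ...
};
@>}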

A much more severe problem is indentation. In the example above, you may have noticed that the input and the intermediate Tcl script are not properly indented. The foreach loop and the class statement are at the same level of indentation, even though the latter is logically nested inside the former. This may not seem like a big problem, but it becomes a show-stopper when using Python as one of the languages. Python demands correct indentation, and g2pp can't deliver.

For a good example of chaotic indentation, let's look at the following slightly more complicated input:

example_02/indent_01.in
 1 foreach shape {Rectangle Triangle Circle} {
 2    @class $(shape) : public Shape {
 3       @if { $shape == "Circle" } {
 4          @public:@
 5       } else {
 6          @private:@
 7       }@
 8       $(shape)() {}; // Default constructor.
 9    };@
10 }

The proliferation of '@' signs is bad enough, but just look at the indentation in the intermediate script:

example_02/indent_01.tcl
 1 foreach shape {Rectangle Triangle Circle} {
 2 
 3 puts "class ${shape} : public Shape \{
 4       "
 5 if { $shape == "Circle" } {
 6 
 7 puts "public:"
 8 
 9       } else {
10 
11 puts "private:"
12 
13       }
14 puts "
15       ${shape}() \{\}; // Default constructor.
16    \};"
17 
18 }

and in the final output:

example_02/indent_01.out
 1 class Rectangle : public Shape {
 2 
 3 private:
 4 
 5       Rectangle() {}; // Default constructor.
 6    };
 7 class Triangle : public Shape {
 8 
 9 private:
10 
11       Triangle() {}; // Default constructor.
12    };
13 class Circle : public Shape {
14 
15 public:
16 
17       Circle() {}; // Default constructor.
18    };

This is not exactly readable, to put it mildly. For large projects, I have found g2pp impossible to work with. You may think that you could fix the indentation by placing the '@' signs differently, but it will never look exactly the way you want it to. The truth is that g2pp is totally unaware of indentation and nesting, and it cannot produce properly indented code, except in extremely simple cases.

Don't worry. The next section will introduce a solution.

So, to wrap up, we have the following advantages:

  1. A complete scripting language, with loops, file I/O and libraries, is available to drive the code generation.
  2. The target code is written (almost) exactly as it will appear in the final output; the tool adds the quotes, backslashes and 'puts' calls for us.
  3. The existing Tcl interpreter does the heavy lifting, so we did not have to implement a new language.

And we have the following disadvantages:

  1. The input gets littered with '@' signs, and it becomes hard to tell whether you are in copy mode or in decoration mode.
  2. The tool is unaware of indentation and nesting, so neither the intermediate script nor the final output is properly indented.
  3. As a consequence, it cannot be used with indentation-sensitive languages such as Python.

If you want to play around with g2pp, you can download the C code. It's not the most robust tool, and it shows its age, but you should find it easy to read (only about 300 lines, all in a single file) and easy to adapt to your own needs.


Line-based preprocessing to the rescue

It took me a long time to figure out how to tackle the indentation problems we discussed above. The key to the solution turns out to be line-based preprocessing. Instead of allowing a "mode switch" anywhere in the input, we now allow it only between lines. Each line in the input "belongs" to either the preprocessing language, or the target language. Each line has an indentation, and it turns out to be very easy to divide this indentation between the intermediate script and the target output.

Since the resulting tool is a front-end (aka preprocessor) that works with lines, I called it FrontLine. Let's kick off immediately with an example.

example_04/tst_tcl_animals.in
 1 =! tcl
 2 =! cxx
 3 
 4 = cxx
 5 #include "animal.h"
 6 
 7 = tcl
 8 foreach animal {Bear Cat Dog} {
 9    set lowercase [string tolower $animal]
10    set uppercase [string toupper $animal]
11 
12    = cxx
13    class $(animal) : public Animal {
14    public:
15       $(animal)() {
16          printf("Constructor: creating a new $(lowercase).\n");
17          = tcl
18          if { $animal == "Cat" } {
19             = cxx
20             printf("A CAT HAS 9 LIVES.\n");
21          } else {
22             = cxx
23             printf("A $(uppercase) HAS ONLY 1 LIFE.\n");
24          }
25       }
26    };
27 
28 }

The lines beginning with a '=' sign are directives that tell the tool what to do. They do not end up in the intermediate script or the target output. The first 2 directives declare the languages we're going to use: 'tcl' as the preprocessing language, and 'cxx' as the target language.

Once these 2 languages are set up, we switch between them with the = tcl or = cxx directives. These directives are "scoped", i.e. they are fully aware of line nesting. So when we come back from an inner = cxx block, we automatically pop back into the outer = tcl block without needing an additional directive. This allows us to close the foreach-loop at line 8 with a curly brace at line 28 without having to go back into tcl mode first.

Note that each directive is on a line of its own, so we can no longer switch between modes in the middle of a line, only between lines.

Apart from replacing a lot of '@' signs with directives, nothing much has changed in the input. Inside a C++ block, we still use $(...) for variable substitution. But now take a look at the intermediate script that FrontLine generates from our input:

example_04/tst_tcl_animals.out.tcl
 1 #!/usr/bin/tclsh
 2 
 3 
 4 puts "#include \"animal.h\""
 5 puts ""
 6 foreach animal {Bear Cat Dog} {
 7    set lowercase [string tolower $animal]
 8    set uppercase [string toupper $animal]
 9 
10    puts "class ${animal} : public Animal \{"
11    puts "public:"
12    puts "   ${animal}() \{"
13    puts "      printf(\"Constructor: creating a new ${lowercase}.\\n\");"
14    if { $animal == "Cat" } {
15       puts "      printf(\"A CAT HAS 9 LIVES.\\n\");"
16    } else {
17       puts "      printf(\"A ${uppercase} HAS ONLY 1 LIFE.\\n\");"
18    }
19    puts "   \}"
20    puts "\};"
21    puts ""
22 }

As before, the Tcl code has been copied verbatim, and the target code has been decorated. But this time, the body of the foreach loop is correctly indented! The rest of the input indentation went into the string literals after the puts commands, and this too is correctly indented. Notice how prepro-level indentation and target-level indentation are mixed in the input, but cleanly separated in the script. Running this script in turn leads to correct indentation in the final output:

example_04/tst_tcl_animals.out
 1 #include "animal.h"
 2 
 3 class Bear : public Animal {
 4 public:
 5    Bear() {
 6       printf("Constructor: creating a new bear.\n");
 7       printf("A BEAR HAS ONLY 1 LIFE.\n");
 8    }
 9 };
10 
11 class Cat : public Animal {
12 public:
13    Cat() {
14       printf("Constructor: creating a new cat.\n");
15       printf("A CAT HAS 9 LIVES.\n");
16    }
17 };
18 
19 class Dog : public Animal {
20 public:
21    Dog() {
22       printf("Constructor: creating a new dog.\n");
23       printf("A DOG HAS ONLY 1 LIFE.\n");
24    }
25 };

Obviously, the final output contains only target-level indentation. Compare the 3 files and see if you can trace where each piece of indentation comes from.

Whew. So this time, the input, the intermediate script and the output are all correctly indented! The key to this humble success is that we divided the input indentation into 2 parts: one for the script, and one for the target output. How can the tool decide where the indentation should go? This turns out to be surprisingly easy and natural.

Remember that every line belongs unambiguously to one of the 2 languages. The indentation rule is then as follows: Whenever a line is indented with respect to its predecessor line, the extra indentation goes to the language of that predecessor. So the indentation after the foreach line must be allocated to Tcl, because the foreach itself is written in Tcl. The indentation after the public label goes to C++, because the label itself is written in C++.

This indentation rule makes sense when you think about why our lines are indented in the first place. The body of the foreach loop is indented because it is logically part of the loop. The foreach line causes the indentation, so it makes sense to allocate this indentation to the Tcl language.
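
In code, the rule can be captured with a small stack of indentation "slices", each owned by one of the 2 languages. The sketch below is my own illustration (with 'prepro' and 'target' as stand-ins for the two languages), not FrontLine's actual implementation, and it ignores corner cases such as dedents that do not line up with an earlier indent:

set slices      {}        ;# stack of {width owner} pairs, innermost last
set prev_indent 0
set prev_lang   prepro

proc slice_total {slices} {
    set n 0
    foreach s $slices { incr n [lindex $s 0] }
    return $n
}

# Call once per input line with its leading-space count and its language.
# Returns {prepro_indent target_indent}: how many columns go in front of
# the generated 'puts', and how many go inside the string literal.
proc split_indent {indent lang} {
    global slices prev_indent prev_lang
    if {$indent > $prev_indent} {
        # Extra indentation belongs to the language of the previous line.
        lappend slices [list [expr {$indent - $prev_indent}] $prev_lang]
    } else {
        # Dedent: drop the slices that are no longer covered.
        while {[slice_total $slices] > $indent} {
            set slices [lrange $slices 0 end-1]
        }
    }
    set prev_indent $indent
    set prev_lang   $lang
    set prepro 0
    set target 0
    foreach s $slices {
        lassign $s width owner
        if {$owner eq "prepro"} { incr prepro $width } else { incr target $width }
    }
    return [list $prepro $target]
}

Feeding the lines of the animals example to this procedure reproduces exactly the split you saw above: the innermost printf lines get 6 columns in front of the 'puts' and 6 columns inside the string literal.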

This trick works not only for Tcl, but for any language that is line-based and properly indented. In practice, even freeform languages (like C++ or C#) can be used, because even in such languages we typically write line-based and correctly indented code anyway, don't we ;-)


Additional features, and some downsides

To support languages other than Tcl, you need to implement a new Language class in FrontLine. This is very easy in practice. The class must have a method for decorating a given line of input, and another method for executing the intermediate script. Here is an example for Ruby:

 1 oblet_class Language_ruby : Language {
 2    # This class implements Ruby support in FrontLine.
 3 
 4    proc decorate {fod prepro_indent target_indent text} {
 5       # Decorate a line and send it to the intermediate script
 6       # (which is given by the file descriptor 'fod').
 7       # The other parameters are the line's indentation
 8       # (divided between the preprocessor and target languages),
 9       # and its text after the indentation.
10 
11       # Escape all characters that are special in Ruby.
12       set str ""
13       for {set i 0} {$i < [string length $text]} {incr i} {
14          set c [string index $text $i]
15          if {    $c == "\\" || $c == "\#" \
16               || $c == "\"" || $c == "\'" } {
17             append str "\\"
18          }
19          append str $c
20       }
21 
22       # Handle all the $(xxx) substitutions in the line's text.
23       while { [regexp {^(.*)\$\(([a-zA-Z0-9_]+)\)(.*)$} $str -> pre var post] } {
24          # String substitution in Ruby is done with '#'.
25          set str "${pre}#\{${var}\}${post}"
26       }
27 
28       # Send the decorated line to the intermediate script.
29       # Ruby uses 'puts' to print lines.
30       # Note the different locations of the indentations!
31       puts $fod "${prepro_indent}puts \"${target_indent}${str}\""
32    }
33 
34    proc execute {script output_filename} {
35       # Execute 'script' (which is the intermediate Ruby script we generated),
36       # and send its output to 'output_filename'.
37 
38       catch {eval exec /usr/bin/ruby $script} result
39       set fod [open $output_filename w]
40       puts $fod $result
41       close $fod
42    }
43 }

This is all you need to do. The class has to implement a decorate method which, as you can see, escapes all the special characters and replaces variable substitution with the correct Ruby syntax. It then sends the result to the intermediate Ruby script (that's what the puts $fod line is for). Note that the method receives 2 pieces of indentation: the prepro_indent is for Ruby itself and goes before the call to puts. The target_indent is for the final output, so it becomes part of the string literal after the puts.
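
For instance, a hypothetical call such as

   decorate $fod "   " "      " {printf("Hello, $(name)!\n");}

(ignoring how the class dispatches its methods) appends this line to the intermediate Ruby script:

   puts "      printf(\"Hello, #{name}!\\n\");"

with the 3 spaces of prepro_indent placed in front of 'puts' and the 6 spaces of target_indent tucked inside the string literal.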

We also implement an execute method to turn the intermediate Ruby script into its final output. The example is clearly written for Linux, since it executes /usr/bin/ruby to interpret the script. Similar code will work on other platforms.

The Language_ruby class is written in Tcl because the FrontLine tool itself is too. Even if you don't know Tcl, you should be able to adapt this example to support your own scripting language. If not, just let me know and I'll be glad to help.

Adding this little class to the FrontLine tool allows us to use Ruby as the preprocessing language:

example_05/tst_ruby_shapes.in
 1 *! ruby
 2 *! txt
 3 
 4 ["magenta", "red", "yellow"].each { |color|
 5    * txt
 6    Current color: $(color).
 7       * ruby
 8       first= true
 9       ["ellipse", "rectangle", "triangle"].each do |shape|
10          if(first)
11             * txt
12             We have a $(color) $(shape),
13          else
14             * txt
15             and a $(color) $(shape),
16          end
17          first= false
18       end
19    * txt
20    and those are all the $(color) shapes.
21 }

This example also shows another feature: the directives can start with characters other than '='. The tool simply takes the first character of the input file (in this case '*') and uses that as the directive marker. The first 2 lines in the input have to be directives anyway (to inform the tool of the 2 languages we intend to use). This lets you choose a marker that does not interfere with your languages (e.g. when generating C++ you cannot use '*', because it may occur as the first character of a pointer dereference).

Here's the intermediate Ruby script:

example_05/tst_ruby_shapes.out.ruby
 1 ["magenta", "red", "yellow"].each { |color|
 2    puts "Current color: #{color}."
 3    first= true
 4    ["ellipse", "rectangle", "triangle"].each do |shape|
 5       if(first)
 6          puts "   We have a #{color} #{shape},"
 7       else
 8          puts "   and a #{color} #{shape},"
 9       end
10       first= false
11    end
12    puts "and those are all the #{color} shapes."
13 }

And the final output:

example_05/tst_ruby_shapes.out
 1 Current color: magenta.
 2    We have a magenta ellipse,
 3    and a magenta rectangle,
 4    and a magenta triangle,
 5 and those are all the magenta shapes.
 6 Current color: red.
 7    We have a red ellipse,
 8    and a red rectangle,
 9    and a red triangle,
10 and those are all the red shapes.
11 Current color: yellow.
12    We have a yellow ellipse,
13    and a yellow rectangle,
14    and a yellow triangle,
15 and those are all the yellow shapes.

When too many directives start to clutter the input, FrontLine offers another mechanism to indicate the language for each line: explicit markers. The following example (which uses Lua as the preprocessing language) shows that you can reserve a marker of your choice for any language, and use it to mark individual lines:

example_06/tst_lua_messages.in
 1 =! lua
 2 =! txt -marker :
 3 
 4 -- This is a Lua function with an optional 'exclam' parameter.
 5 function go(msg, exclam)
 6    if(exclam) then
 7       msg= msg .. "!"
 8       urgency= "urgent"
 9       = txt
10 
11       Listen up!
12    else
13       msg= msg .. "."
14       urgency= "everyday"
15 
16       -- We want a blank line in front of every message:
17 :
18    end
19 
20 :  I received an $(urgency) message for you:
21 :            "$(msg)"
22 end
23 
24 -- Note that the strings do not end in punctuation;
25 -- they get it from the 'go' function.
26 
27 go("Help, I can't find the cheese", true)
28 go("And there is a pair of shoes in the fridge", true)
29 go("Never mind... I found the cheese under the coat rack")

We choose ':' as a marker for the target language. You can see 3 lines with this marker (one of them is a "blank" line, i.e. it does not contain anything other than the marker). Note that the marker is considered as part of the indentation, to guarantee maximal readability; the markers are replaced by spaces before further processing. This implies that you cannot use markers for lines without indentation (e.g. at the start of the input file); you must use directives there.
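
Internally, the marker handling can be as simple as the snippet below (a guess at the mechanism, not FrontLine's actual code; 'marker' and 'marker_language' are hypothetical variables holding the registered marker character and its language):

# If a line starts with the marker registered for some language,
# tag the line with that language and replace the marker by a space,
# so that from here on it counts as ordinary indentation.
if {[string index $line 0] eq $marker} {
    set line_language $marker_language
    set line " [string range $line 1 end]"
}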

Here is the resulting output:

example_06/tst_lua_messages.out
 1 Listen up!
 2 I received an urgent message for you:
 3           "Help, I can't find the cheese!"
 4 
 5 Listen up!
 6 I received an urgent message for you:
 7           "And there is a pair of shoes in the fridge!"
 8 
 9 I received an everyday message for you:
10           "Never mind... I found the cheese under the coat rack."

I have not explicitly mentioned it, but there are 4 (four!) languages involved in the FrontLine process. In this particular Lua example:

  1. Lua itself, the preprocessing language.
  2. The target language, which in this case is just raw text.
  3. The directive language, i.e. the lines beginning with '='. This is not a full programming language of course, only a (very) small domain-specific language that helps glue the other 2 languages together.
  4. The language in which the FrontLine tool itself is implemented (in this case Tcl). All language support classes (such as the Language_ruby class above) are also written in Tcl.

FrontLine is far from perfect. It blindly uses the input indentation to determine the output indentation. This fails when you put some of your code generation inside a function in the preprocessing language. A more advanced tool could fix this by allowing explicit indentation directives, or by sending the output temporarily to a string which can then be spliced at the right place.
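
As a rough illustration of that last idea, the generated script could collect its output in a variable instead of printing it immediately, and splice it in later at whatever indentation is needed. A sketch (my own, not something FrontLine produces today):

# Collect generated lines in a list instead of printing them directly.
set captured {}

proc emit {line} {
    global captured
    lappend captured $line
}

# Later, flush the captured lines at the indentation the caller wants.
proc flush_captured {indent} {
    global captured
    foreach line $captured {
        puts "${indent}${line}"
    }
    set captured {}
}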

You can download FrontLine here and implement support for your own favorite scripting language by writing a new Language_xxx class. Don't hesitate to mail me for help. Also, please send me all your improvements, extensions, new ideas, and ports to other languages. I will publish all material on this website.


Other preprocessing tools

Programmers have been using tools to generate target programs for ages. The idea of switching between copy mode and decoration mode is not new. When using PHP to produce HTML files, you switch between "PHP mode" and "HTML mode" using <?php ... ?> tags. Another tool that follows a similar approach is SNIP, by Fred Wild. For a quick idea, have a look at one of his examples, where you will see things like this:

 1 .EAch_obj
 2            :class  <obj.name> ;
 3 .ENd

The commands starting with a '.' are the preprocessing commands. The other lines, beginning with a ':', are target lines. No decoration is needed, only a special syntax with <angle brackets> to splice values into the output.

Unlike SNIP, the techniques in this article use an existing language (Tcl, Ruby, Python, Lua, ...) as the preprocessing language. This lowers the learning curve, and makes all the existing libraries and extensions of your favorite language available. And instead of developing a complete interpreter, we just use a very simple tool that produces an intermediate script, and let the existing interpreter run that script. Adding a new step makes the process of code generation simpler and more flexible.

Another great tool for Tcl-based preprocessing is the Expand Tool by Will Duquette. It is geared towards generating HTML, but can be used for other purposes too. It also avoids creating a new preprocessing language by using an existing one.

Conclusion: