How to avoid parsing

Abstract:

Parsing is the process of turning a stream into a structure, e.g. turning a file or string into a syntax tree.
If the input stream follows a very simple format, you can often avoid writing a parser at all. This article shows a few of the many possible tricks to obtain a parser almost for free.
Many of these tricks require a very precise input syntax. This works for small data formats, but fails for more advanced input languages. You mileage may vary.

A simple example

Suppose we want to write a tool for painting various shapes. The tool allows you to create drawings and save those in a human-readable file format. Obviously, the tool must also be able to read such files back from disk, and this is where it gets interesting: It looks like we're going to need a parser.

The format of the saved files could look like this:

 1 rectangle 10 10 200 100 red
 2 circle 50 50 20 green
 3 group car {
 4   circle 20 200 50 black
 5   circle 130 200 50 black
 6   rectangle 10 100 140 180 blue
 7 }

The format contains one line for each shape to be drawn. Shapes have coordinates, colors, and other data attached to them. To make things a bit more challenging (and interesting), we also want to support nested groups of shapes, such as the group called 'car' in the example, which contains 2 black wheel shapes and a rectangular car body (not exactly aerodynamic, but it gets us from A to B).

Let's figure out how to parse this format without writing a parser for it (at least not in the traditional sense). We may sometimes have to tweak the format slightly to fit our needs, but this is a small price to pay for a free parser!

Use regular expressions for simple input formats

Regular expressions are well-known, and are supported in many programming languages.

You typically use them to parse simple line-based formats, such as the input of a Unix filter, or little configuration files. In our case, it would be easy to parse the lines that describe individual shapes, and a bit harder to parse nested data such as the group. The more structure we want in our format (e.g. complex nested structures, ambiguous input, back-references to earlier defined items), the harder it will be to use regular expressions.

You can already find plenty of information about regular expressions in books and on the web, so I won't go into them here. Instead, let me show you some tricks that you may never have seen before.

Use existing syntax

By tweaking the data format slightly, we can make it fit the syntax of an existing language. For example, instead of writing a Python parser for our shapes data, what if we just use Python itself as the data format? Our painting tool would then just produce output that looks like this:

code_python/data.py

 1 import shapes
 2 shapes.rectangle(10, 10, 200, 100, "red")
 3 shapes.circle(50, 50, 20, "green")

You write some Python functions called 'rectangle' and 'circle', put them in a module called 'shapes', and then just import your data file directly:

code_python/prog.py

 1 # This is how you "parse" the data format: just import it!
 2 import data

Python is now doing all the parsing for you; your data file is just a Python script. The "data" actually consists of a sequence of calls to python functions. You write those functions so that they fill a data structure from their argument values. The resulting structure looks as if we had "parsed" it from the input file explicitly.

Note that we had to tweak the input format to make it fit the syntax of python. We need parentheses around the shape parameters, and commas between them. We also need quotes around the color names to turn them into Python string literals.

For nested data structures, we need a bit more work. One solution would be to write some Python functions that turn their arguments into objects and return those. You can then create the data structure in a sequence of explicit steps:

 1 # Nested data can be "linearized" into a sequence of calls, like this:
 2 grp_car= shapes.group("car")
 3 grp_car.add(shapes.circle(20, 200, 50, "black"))
 4 grp_car.add(shapes.circle(130, 200, 50, "black"))
 5 grp_car.add(shapes.rectangle(10, 100, 140, 180, "blue"))

This gets the job done, but it's not exactly elegant. We've completely lost the clean, nested structure in the original input file. It makes our input less human-readable, and starts to look more like a Python script than a data format. The more complex the format becomes, the less useful this parsing technique will turn out to be.

This trick works for many languages, not only Python. It works best with dynamic/scripting languages, because they let you import/source/open/include scripts dynamically. But I've even used it with C++, where the "data file" was a piece of calls to C++ constructors. This file had to be compiled into an executable before it could be used, so I basically used the C++ compiler to do all the parsing for me. For some purposes, this may be sufficient. And it serves as a good example of how far you can stretch the concept of "parsing". Quite far, that one.

One drawback of this technique is that our data format has to strictly follow the syntax of the "host" language. A language like Python has a rather strict syntax, so you cannot "mold" it into any format you like. We'll look at more flexible languages next.

Another drawback, oddly, is that the data format now has too much power. It inherits the full power of its host language (e.g. python), which may be Too Much Of A Good Thing. For example, our data file could now contain full-blown control flow and other statements, like these:

code_python/data.py

 1 xList= [10, 20, 30]
 2 for x in xList:
 3   shapes.rectangle(x, x, 10, 10, "blue")

This is not really a data format anymore, but a script or program. It makes the data less readable. Of course, you can always just "flatten" the format into its most basic form, by importing the script and then writing out the resulting data structure without the use of any fancy statements. In Python, it is even very easy to produce the same data in another format, which would help you migrate to a more language-neutral form of parsing later. So you can use this trick as a first step, and move to more advanced parsing as the need arises.

Abuse existing syntax

We now take the previous trick a few miles further. Unlike Python, some languages have an extremely flexible syntax. In fact, you could argue that languages like Lisp don't even have a syntax at all. Lisp allows you to define macros that seem to extend the language with new syntactic constructs. I don't know enough about Lisp to provide any details, but I will give an example in another language that I know a bit better: Tcl (the Tool Command Language).

Like Lisp, Tcl has a very flexible syntax. Every statement in Tcl is just the name of a procedure to be called, followed by the arguments. The arguments are separated by white space, so no extra commas or parentheses are needed. The great thing is that procedure names can contain non-alphanumeric characters, so you can make a command called '//' or even '.'. The arguments can be nested blocks of statements, so we get nested structures for free.

To parse our shape data, we need some Tcl procedures like these:

 1 proc rectangle {x y w h color} { ... }
 2 proc circle {x y radius color} { ... }
 3 proc group {name body} {
 4    ...
 5    uplevel $body
 6    ...
 7 }

I have omitted the bodies of the procedures so we can focus on their interface.

The 'rectangle' and 'circle' procedures can be called simply like this:

 1 rectangle 10 10 200 100 red
 2 circle 50 50 20 green

This is exactly the input format we started from. No quotes, no commas, no parentheses. Tcl does not require quotes around strings if they do not contain spaces. And it separates arguments with white space rather than commas or other separators. This gives us a very simple, clean syntax. OK, I admit it: I already had Tcl in mind when presenting the data format at the start of this article ;-)

The 'group' procedure requires a bit of explanation. It takes 2 arguments: The name of the group, and the "body" of the group:

 1 proc group {name body} {
 2    ...
 3    uplevel $body
 4    ...
 5 }

The body contains Tcl statements recursively. Luckily, Tcl has a built-in uplevel command which allows us to evaluate such a block of statements. Some of those statements may turn out to be invocations of 'rectangle' or 'circle', so we need to implement those so that they inject their new shapes into the enclosing group. The 'group' procedure needs to set up the enclosing group for the nested calls, e.g. by storing it on a global stack.

If any of the statements in the group body is a call to 'group', the whole thing gets invoked recursively and the new group gets injected into the old one by means of the global stack. This is how the nested data structure grows.

You probably guessed that in Tcl, statement blocks are written with curly braces. So we can invoke the 'group' procedure like this:

 1 group car {
 2    circle 20 200 50 black
 3    circle 130 200 50 black
 4    rectangle 10 100 140 180 blue
 5 }

Again, this is exactly the input format we wanted. It is much easier to preserve nesting and other more complex structures in our syntax, because Tcl is flexible enough to accomodate them.

Taking it too far

You can take the previous technique pretty far, and in fact it is all too easy to take it too far. Consider the following set of Tcl procedures (details obviously omitted):

 1 proc // {args} { ... }
 2 proc class {name args} { ... }
 3 proc public: {} { ... }
 4 proc private: {} { ... }
 5 proc int {name args} { ... }
 6 proc string {name args} { ... }
 7 proc ctor {args} { ... }

This little set of procedures, when properly implemented, allows you to write the following, in Tcl:

code_tcl/syn_c.tcl

 1 // This looks vaguely like a C++ class, but it's actually Tcl!
 2 class Dog : public Animal {
 3 private:
 4    int m_age = 5;
 5    string m_name;
 6 public:
 7    ctor {}
 8    ctor {int age}
 9    ctor {int age , string name}
10 };

You have to admit this is pretty cool. Without going into the details, you can probably guess that the procedure called '//' simply ignores all its arguments, and that the procedure called 'class' evaluates its final argument as a block of statements using 'uplevel'. This block of statements in turn contains declarations for constructors and data members, and invcations of the 'private:' and 'public:' procedures. Thanks to the flexibility of Tcl syntax, we've written a kind of "almost-C++ parser" in just a few simple lines of Tcl code.

You can go quite far in mimicking other languages (here C++), and the whole point is that you do not have to write a parser, just a set of simple procedures.

However, this technique can quickly run out of control. Just think about all the little details that could go wrong:

 1 // You cannot use [square brackets] in comments.

Square brackets have a special meaning in Tcl. You cannot just use them anywhere you want. You may think you're using them in a comment, but you're actually using them in a regular Tcl statement, which happens to be a call to the '//' procedure!

 1 int m_age= 5;

Forgetting a space before the '=' may change the behaviour of the 'int' procedure. It is very hard to write 'int' so that it takes care of all those subtleties. In fact, you would have to use regular expressions or another parsing technique to make 'int' sufficiently robust.

In other words: sculpting an existing syntax into a new one can be extremely powerful, and can provide advanced syntax such as nesting with very little effort; but use it carefully or the details will take revenge.

If you want to dig deeper, here's a paper on how to use Tcl for parsing. Now, let's move on.

Parsing an object-oriented interface

So far, we've been parsing data for a drawing tool. We now turn to another important application: parsing a class schema.

A schema is a set of class definitions, each with some member variables and member functions, but without any code in the function bodies. It describes an object-oriented interface without revealing its implementation.

Here is a simple schema for kitchen gear:

 1 class Mixer {
 2   member brand : string
 3   member motor : Motor
 4 }
 5 
 6 class Motor {
 7   member brand  : string
 8   member speed  : int
 9 }

We could use this schema to control the speed of a motor in an actual bread mixer, or we could use it to store information for a catalog of appliances. I want to stress that we're not going to develop a parser for such catalog data, but for the the schema itself.

Why would you want to parse a class schema? To generate code from it. The generated code can take care of a lot of boring low-level details for you. For example, it could automatically make your C++ mixer code available in a scripting language. SWIG is a brilliant tool that turns a "schema" (actually just a C++ header file) into a library for accessing the C++ code from a scripting language (including Tcl, Ruby, Python and many others).

Another very common idea is to make your application available for testing on a PC, via a serial connection (or even remotely via the internet). The communication between the mixer and the PC contains a lot of boilerplate code which can easily be generated. Whenever you add a new class (such as PowerSupply) to the schema, the generator produces all the code to package and unpackage remote message calls to this class and its methods. This allows you to invoke instances of this class from a remote PC, to test or interrogate the embedded system while it was running on the mixer.

One approach to parsing a class schema is the one we used in the previous section: we abuse the syntax of an existing language. Here is an outline, again in Tcl:

 1 proc class {class_name args} {
 2    if { [lindex $args 0] == ":" } {
 3       ... (Set up an inheritance relation between this class and its base classes)
 4    }
 5 
 6    # Remember which class we're in.  Yes, I know, global variables are evil.
 7    global current_class_name
 8    set current_class_name $class_name
 9 
10    # Evaluate the class body using 'eval' or 'uplevel'.
11    eval [lindex $args end]
12 
13    # We're no longer in a class body.
14    set current_class_name ""
15 }
16 
17 proc member {args} {
18    global current_class_name
19    ... (add the new member to the current class)
20 }
21 
22 proc method {args} {
23    global current_class_name
24    ... (add the new method to the current class)
25 }

I omitted some of the details to show you just the big picture. But it only takes a few 100 lines to fill in those details. Once we've set up the 3 procedures (class, member and method), we can write a schema like the one above and "parse" it simply by executing it from a Tcl script:

 1 source schema.txt

That's it! The procedures simply read their arguments and set up a data structure. From that structure, you then output the target code.

As before, this approach has its limitations. It works wonders for simple schemas in simple applications. In the next section, we look at an alternative which is both more powerful and more robust.

Using reflection to parse a schema

Here's our little schema again, this time in C# syntax:

 1 [Interface]
 2 public abstract class Mixer {
 3     public string brand;
 4     [Child]
 5     public Motor motor;
 6 }
 7 
 8 [Interface]
 9 public abstract class Motor
10 {
11     public abstract void start(double speed);
12     public abstract void stop();
13 
14     public string brand;
15     public int speed;
16 }

The reason I switch to C#, is because it has excellent reflection facilities. Instead of writing a C# parser, we can use reflection to iterate over the classes and their members. I even added some methods to the Motor class, because methods can be handled just as easily. Both methods are abstract because we do not intend to implement them, they're only an interface for the code generator. You can also see that the motor member is marked as Child, which indicates that the motor is a part of the mixer.

The classes themselves are marked with the Interface attribute, so that the code generator knows that it has to "parse" this class and generate code for it. You can invent your own attributes to give your schema more semantics. You could also use actual C# interfaces, but then you have to use properties instead of member variables.

To parse the schema, we use reflection:

 1 // Iterate all the types in the executable.
 2 foreach (Type tp in Assembly.GetExecutingAssembly().GetTypes().OrderBy(t => t.Name))
 3 {
 4    // Keep only the types that have an 'Interface' attribute on them.
 5    if(tp.GetCustomAttributes(typeof(Interface), false).Count() > 0)
 6    {
 7       if (tp.IsEnum)
 8       {
 9          ... // Handle enumerations.
10       }
11       else if (tp.IsValueType)
12       {
13          ... // Handle structs.
14       }
15       else
16       {
17          // Iterate the members of the type.
18          foreach (MemberInfo memberInfo in tp.GetMembers().OrderBy(m => m.Name))
19          {
20             if (memberInfo.DeclaringType != tp)
21             {
22                 // Ignore this member: it's inherited from elsewhere.
23                 continue;
24             }
25 
26             string memberKind = memberInfo.MemberType.ToString();
27 
28             if (memberKind == "Field")
29             {
30                FieldInfo fieldInfo = (FieldInfo)memberInfo;
31                ... // Handle a member variable.
32             }
33             else if (memberKind == "Method")
34             {
35                MethodInfo methodInfo = (MethodInfo)memberInfo;
36                ... // Handle a method.
37             }
38             else if (memberKind == "Constructor")
39             {
40                ... // Handle a constructor.
41             }
42             else
43             {
44                ... // Raise an error (unsupported kind of class member).
45             }
46          }
47       }
48    }
49 }

As you can see, we do not really "parse" anything, we just run over the type system. The C# parser has already created all the information on attributes and members, and has made it available to us through reflection.

Conclusion:

I hope this article shows that "parsing" can take many different forms.
In many situations, you can avoid writing a parser by using existing tools, existing syntax, reflection, or other techniques.
Regular expressions are a widely supported and easy tool for parsing simple formats, but they fail for more complex formats.
You can use a host language to do the heavy lifting for you. Some languages like Tcl have a very flexible syntax so you can mimick many kinds of input formats, including nested structures. Again, this fails for more complex formats.
In dynamic languages, you can often use reflection to walk over a type system. This is a great trick if you need to parse an object-oriented interface/schema, e.g. to generate code for marshalling or scripted access to an object-oriented system.

Previous: Parsing: introduction

Next: The input stream