on in Antlr4
Writing parsers: the right way
At least once in your coding adventures you will land on writing your own parser. Whether it is to read a custom text format or interpret a custom language, the ultimate structure and solution will likely be the same. You will need to create a module that takes strings as input and produces meaningful data structures out of it. Let’s see what the best approach to this task is by taking on a parsing problem on a simple language and ‘evolving’ a solution of our own.
To be exact, we’ll be looking at parsing .NET XML Documentation Comments. While this format is mostly XML, it does contain custom URI strings to denote a code target (type, property, field etc.) for the documentation. Some examples of these strings can be seen below.
Looks fairly straightforward, right? Well… We shall see! Let’s try to solve this in a naïve way.
In the naïve case, the parsing problem is approached as a purely coding task. While this may work for simple formats, the code often ends up rather ugly and contains a number of bugs.
Code like this is not unusual for parsing - let’s pick it apart to underline why exactly it is bad.
First of all, it relies on a number of magical constants: it uses characters like dot/parenthesis and offsets to identify individual parts of the string. Hard-coding offsets is prone to errors (infamous off-by-one error), as it is difficult to follow the code and, as a result, it isn’t immediately clear why offset has the value it does.
Secondly, it’s difficult to consider all the edge cases and the code looks really clunky if you do do this. For example, String.IndexOf and String.LastIndexOf can return -1 if the token we are looking for could not be found. Additionally, we need to check the length of the string before indexing to ensure that it’s long enough.
If you still are convinced that writing a parser manually would work out, let’s look at more advanced cases of the chosen format. Personally, I think the generic types push this over the edge and suggest looking for a better approach.
If you’ve done a course in Computer Science, you probably have been introduced to a more formal approach to parsing - using grammars. In this case, grammar is a set of rules that defines what the format must look like. Grammars are laid out in a logical way and enable you to effortlessly cover all the edge cases we previously discussed.
I have chosen antlr4 for this article, which stands for ANother Tool for Language Recognition. This is a rather popular choice that is supported well by both the community and industry (Microsoft used it in ASP.NET to optimise CSS/JS). More importantly, it supports generating parsers in many programming languages including C#, which means that we’ll be able to integrate with it easily. You can follow the official guide (linked above) to get your copy of antlr4 up and running.
Writing a grammar
So, let’s dive into declaring our first very own grammar. The grammar we shall
build will cover a simple case of
and ignore any whitespace. Before formalising the grammar, we need to break
down this string into the individual components. First of all, we can see that
it starts with a prefix
T:, which indicates that the code unit is a C# type.
It is then followed by namespace
Microsoft.Extensions.DependencyInjection.Extensions, which is made of
Extensions. Finally, the string is concluded with
ServiceCollectionDescriptorExtensions after a dot, which is a type name.
Now, let’s formalise these rules in antlr4 grammar format. We shall start with the top-level rules. The grammar below defines the components of the string exactly the same way as we established previously.
Now, we need to define what
namespace is. As we mentioned previously, the
namespace is a sequence of dot-separated components. Based on our C# knowledge,
we also know that all types have to be enclosed with a namespace. This means
that at least one namespace component has to be present. We can formalise this
rule in the following manner.
Here you can see a classic pattern for defining a sequence to parse. If you
have done regular expressions before, you will find this rule quite
straightforward to interpret. Basically, we expect at least one
namespace_component to be present. Afterwards, 0 or more (defined by
namespace_component items can follow, each prepended with a dot. This will
match both simple namespaces (e.g., System) and more complex ones (e.g.,
Descending on a lower level, we now need to define what
type_name is. Both of these are C# code entity names, so they have to be
alphanumeric and start with a letter. This can be represented by the following
[a-zA-Z][a-zA-Z0-9]*. This can be formalised with the
following grammar rules (note that all-caps rules are lexer rule).
After combining all of these rules together, we will end up with a grammar below. While it may seem a bit excessive at first, these building blocks save us a lot of time later on when more advanced constructs are added. Note that we also added a rule for whitespace and redirected matching tokens to a hidden channel, which allows us to ignore whitespace.
antlr4 is a parser generator, so it generates a parser in the given programming language. We can generate a parser from our grammar using the command shown below. As you can see, the system is made of 2 primary components: lexer and parser. This is something we previously omitted for simplicity, but essentially lexer converts text into low-level tokens whilst parser combines them into high-level structures with rules.
Needless to say, we will also need to compile the parser before we can use it.
This can be done with a classic invocation of Java compiler,
Afterwards, we can attempt to parse the above-mentioned string.
As you can see, we have successfully parsed the string into a type declaration. Using similar techniques, we can extend the grammar to parse other entities like methods, properties, fields and alike. After developing the parser, the question becomes - how do we utilise it in our program? Previously, you might have noticed that antlr4 CLI tool generates Java code. Luckily, that’s not the only generator available - languages like Python, C#, C++, Go and others are available.
Generating a parser
In my case, I was interested in making use of the parser in C# programming language. Pure C# solution (i.e., one that does not require Java installed) is currently available as pre-release Antlr4.CodeGenerator 4.6.5 package. Integrating it into your build process is as simple as dropping a few lines into your .csproj file.
This solution is fully integrated into your project and works in all development (Visual Studio, Visual Studio Code etc.) and integration environments (TFS Build, CLI etc.). This means that building your project remains a simple msbuild invocation. Additionally, you avoid checking in auto-generated files into source control (which is a good practice). Unlike Visual Studio Code, Visual Studio is also smart enough to perform auto-complete on the auto-generated files (Parser/Lexer classes).
Integrating with antlr4
The lexer and parser will be generated in the
specified. However, there still is a bit of glue code required in order to turn
an arbitrary string into meaningful objects. Firstly, we need to initialise the
antlr4 parsing pipeline.
Having created a parser, we are now capable of extracting the knowledge from
the parse tree that the former produces. antlr4 offers two different
approaches for this task: listener and visitor. First one is based on the idea
of listening for tokens, accumulating knowledge in the listener state and
returning it when requested. Needless to say, using this pattern will likely
result in a large amount of long listener classes. The second approach is based
on the idea that there is a one-to-one mapping between grammar rules and
business objects. While this may not work for all the grammars, it allows us to
avoid writing unnecessary code. This factor is what pushed us to opt for the
visitor pattern. The generation of a visitor base class can be enabled by
Visitor to true in your .csproj file.
As you can see in a very basic example above, we transform a
XmlDocTargetType instance. This code can then be further extended to
support translation of other rules. Making use of the visitor we created is
also fairly straightforward, see the snippet below for reference (
the object we created earlier).
Smashing! This way you can parse your custom text formats with ease and knowing that it’s virtually unbreakable. What is more, your code will remain lean and easily readable. So, happy parsing and see you again!