TL;DR dev blog

10 Jan 2021

FinTS101 Ep. 3: The Parse Tree

The TL;DR

  • I managed to set up my little project with node.js and ANTLR.
  • The code so far outputs tagged elements from a parse of a segmentkopf element, based on a first grammar.
  • I had to overcome some problems translating Java examples to JavaScript for what I tried to do.
  • I will stick with ANTLR.

Project setup

Node.js

I skip that part since I’ve done the setup before and it’s not really the focus of the project (yet – I plan to add some CI and deployments at some point). Simple installation steps are described in the project's README.

ANTLR

The ANTLR setup is twofold: installing the code generator and the ANTLR runtime.

There is a really useful Visual Studio Code extension which comes with a bundled jar binary. For me, syntax highlighting and generating the .js files worked very well with it. Generate-on-save will produce updated source files right into the project. The more direct way to generate the files is to install the jar yourself and call it on the command line with some parameters.

The grammar

The first thing I was interested in was setting up a grammar for a pretty basic FinTS element called segmentkopf.

A “segmentkopf” looks like this:

HIBNK:1:1:1

The grammar for it:

grammar FinTS;

segmentkopf  : segmentkennung DEG_SEP segmentnummer DEG_SEP segmentversion (DEG_SEP bezugssegment)?;

segmentkennung : DT_AN+ ;
segmentnummer : DT_num ;
segmentversion : DT_num ;
bezugssegment : DT_num ;

DT_AN : [A-Z] ;

// H.1.1 Syntaxzeichen
DEG_SEP : ':' ;

// Datenformate
DT_num : '0' | [1-9] [0-9]* ; // Numerisch, keine fuehrenden Nullen

(Forgive the German names and comments, but since the spec is written in German, too, I’ll stick to its terms, following a domain-driven approach.)

Parser rules start with a lowercase letter, lexer rules with an uppercase one.

I decided to create rules for the seemingly trivial numbers like segmentnummer because

  • it will make my experimental parse tree more interesting
  • it will give more possibilities to visually mark or even validate the field values.

Some actual JavaScript code

After some initial Node module related trouble (more on that later), I came up with the following:

// ANTLR 4.8 runtime and the generated lexer/parser (CommonJS requires)
const antlr4 = require('antlr4/index');
const FinTSLexer = require('./antlr/FinTSLexer.js').FinTSLexer;
const FinTSParser = require('./antlr/FinTSParser.js').FinTSParser;

// wires up the usual ANTLR pipeline: input -> lexer -> token stream -> parser
function getParser(input) {
    const chars = new antlr4.InputStream(input);
    const lexer = new FinTSLexer(chars);
    const tokens = new antlr4.CommonTokenStream(lexer);
    const parser = new FinTSParser(tokens);
    parser.buildParseTrees = true;
    return parser;
}

module.exports.getParser = getParser;

The module exports a function that returns a parser for whatever input one passes in. Let’s see it in use:

const http = require('http');
const fints = require('./fints.js');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  const input = "HIBNK:1:1:1";
  res.write(`input: ${input}\n`);

  // parse the input, starting at the 'segmentkopf' rule
  const parser = fints.getParser(input);
  const treeSegmentkopf = parser.segmentkopf();
  const treeString = treeSegmentkopf.toStringTree();

  res.end(`unmodified parse tree for 'segmentkopf' rule via toStringTree():\n${treeString}\n`);
}).listen(3101);

Running this will give the following output:

input: HIBNK:1:1:1
unmodified parse tree for 'segmentkopf' rule via toStringTree():
([] ([10] H I B N K) : ([12] 1) : ([14] 1) : ([16] 1))

So one can clearly see that the parser is doing its job of creating the tree structure. The built-in toStringTree() method provides a simple representation of it. Still, the numbers [10], [12] and [14] are not really usable for displaying insightful segment information. It turns out these numbers do not map to a matched rule but to internal ATN state numbers (the invoking state of each rule context), which only mean something to the parser itself.
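
As a hedged aside: if the JavaScript runtime mirrors the Java signature toStringTree(ruleNames, recognizer), which I have not verified in detail, the rule names can be printed directly by continuing the server example like this:

// Hedged aside: assumes the JS runtime accepts toStringTree(ruleNames, recognizer)
// like the Java API does; parser.ruleNames is the array mentioned further below.
const readableTree = treeSegmentkopf.toStringTree(parser.ruleNames, parser);
// expected to look roughly like: (segmentkopf (segmentkennung H I B N K) : (segmentnummer 1) ...)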

To get information out of the nodes of the parse tree and build something from it, we will have to traverse the tree and perform some action. ANTLR offers two built-in ways for that: implementations of the Visitor and the Listener pattern. Both rely primarily on overriding empty default methods in autogenerated code, just where the specific behavior is wanted.

The two main differences are

  • that methods called by a visitor can return values
  • that in a visitor, the node methods are responsible for traversing the tree further down, or wherever they want to go, e.g. certain parts of the tree can be skipped; a listener, in contrast, is driven by a walker (see the sketch after this list).
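
To make the listener side concrete, here is a minimal sketch. It assumes the generated listener lives in ./antlr/FinTSListener.js and that the method names are derived from the rule names as in the Java runtime; none of this is in my project yet.

// Minimal listener sketch, under the assumptions stated above.
const antlr4 = require('antlr4/index');
const FinTSListener = require('./antlr/FinTSListener.js').FinTSListener;

class SegmentkopfLogger extends FinTSListener {
    // only the empty default methods we care about are overridden
    enterSegmentkopf(ctx) {
        console.log('entering segmentkopf:', ctx.getText());
    }
    exitSegmentnummer(ctx) {
        console.log('segmentnummer was:', ctx.getText());
    }
}

// the walker drives the traversal; listener methods cannot skip subtrees or return values
antlr4.tree.ParseTreeWalker.DEFAULT.walk(new SegmentkopfLogger(), treeSegmentkopf);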

I wondered what the best way for my purpose could be and how to implement it in JavaScript. What I did not expect was to get confronted with JavaScript inheritance problems so quickly. How would I make a visitor call an overridden method? How could I avoid implementing the same thing for every grammar rule?

After digging into _prototype_ and the like, it turns out I don’t really need any of that. The ANTLR wiki’s JavaScript page describes a visitor that neither uses the autogenerated visitor code nor relies on specialized classes (more on that later). (If you want to see some JavaScript code for a visitor, see this SO post.)

I’ll skip some intermediate results and show you the current state of things:

// .. init and parsing code as before

// runtime classes used by the visitor (not shown in the original snippet;
// require paths as in the antlr4 4.8 CommonJS runtime)
const ParserRuleContext = require('antlr4/ParserRuleContext').ParserRuleContext;
const TerminalNode = require('antlr4/tree/Tree').TerminalNode;

// rule and token names live in plain arrays on the generated parser
const parserRules = parser.ruleNames;
const symbolicNames = parser.symbolicNames;

class Visitor {
    visitChildren(ctx) {
      if (!ctx) {
        return;
      }

      if (!ctx.children) {
        return;
      }

      let tagName = 'unknownTag';
      if (ctx instanceof ParserRuleContext) {
        tagName = parserRules[ctx.ruleIndex];
        let result = `\n<${tagName}>`;

        ctx.children.forEach(child => {
          const childText = child.getText();
          if (child.children && child.children.length !== 0) {
            // a rule node: recurse by letting it accept the same visitor again
            result += child.accept(this);
          } else {
            if (child instanceof TerminalNode) {
              const index = child.getSymbol().type; // why is getType() not defined?
              const symbolName = symbolicNames[index];
              result += ` <${symbolName}>${childText}</${symbolName}> `;
            }
          }
        });

        return result + `</${tagName}>\n`;
      }
    }
}

const visitResult = treeSegmentkopf.accept(new Visitor());

What’s happening here? The visitor does a depth-first traversal. For every node with children an opening tag is written, each child is visited recursively, then the closing tag is written. Nodes without children are terminals, which in the current grammar are always represented by their literal text (i.e. a single character).

The trick here really is calling accept() on the rule context with the visitor, which in turn calls visitChildren() with the changing contexts of the visited rules while passing in the same visitor every time. (It’s a good interview question to ask what this refers to here.)
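
For reference, the dispatch boils down to something like the following sketch of the runtime behavior as I understand it (not the actual source):

// Sketch of the dispatch: a rule context hands itself back to whatever visitor it is given.
class RuleContextSketch {
    accept(visitor) {
        // `this` is the rule context being visited; `visitor` stays the same object throughout
        return visitor.visitChildren(this);
    }
}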

The output of visitResult is

<segmentkopf>
<segmentkennung> <DT_AN>H</DT_AN>  <DT_AN>I</DT_AN>  <DT_AN>B</DT_AN>  <DT_AN>N</DT_AN>  <DT_AN>K</DT_AN> </segmentkennung>
 <DEG_SEP>:</DEG_SEP> 
<segmentnummer> <DT_num>1</DT_num> </segmentnummer>
 <DEG_SEP>:</DEG_SEP> 
<segmentversion> <DT_num>1</DT_num> </segmentversion>
 <DEG_SEP>:</DEG_SEP> 
<bezugssegment> <DT_num>1</DT_num> </bezugssegment>
</segmentkopf>

I’m pretty happy with that. Plus sides:

  • It shows me structured output that has the metainformation I want.
  • All parser and parsing process information is cut off.
  • I’m optimistic it will be easy to adapt to HTML or another structured format (see the sketch after this list).
  • I don’t have to interact with the autogenerated code by ANTLR and its objects on an inheritance level.
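
To make the HTML idea concrete, here is a rough sketch; the helper names and CSS classes are my own placeholders and nothing of this exists in the project yet.

// Rough sketch of the HTML variant: the visitor would emit these spans
// instead of the pseudo-XML tags above. Helper names and classes are placeholders.
function openRuleTag(ruleName) {
    return `<span class="rule ${ruleName}">`;
}

function closeRuleTag() {
    return '</span>';
}

function terminalTag(symbolName, text) {
    return `<span class="token ${symbolName}">${text}</span>`;
}

// e.g. terminalTag('DT_AN', 'H') gives: <span class="token DT_AN">H</span>
console.log(terminalTag('DT_AN', 'H'));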

For special-casing the output of specific nodes in the future, I can write methods on a listener that annotate the tree, run that listener first, and then let the visitor pick up the annotations in a second pass, integrating the new information into the output.
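
A rough sketch of what I have in mind; the property name and the annotated rule are arbitrary picks, and I again assume the generated ./antlr/FinTSListener.js from the listener sketch above.

// Sketch only: annotate nodes in a listener pass, read the annotation in the visitor pass.
// The property name 'annotation' and the chosen rule are arbitrary examples.
const antlr4 = require('antlr4/index');
const FinTSListener = require('./antlr/FinTSListener.js').FinTSListener;

class AnnotatingListener extends FinTSListener {
    exitSegmentversion(ctx) {
        // e.g. mark any segment version other than 1 for highlighting later
        if (ctx.getText() !== '1') {
            ctx.annotation = 'unexpected-version';
        }
    }
}

// first pass: let the walker run the listener over the tree
antlr4.tree.ParseTreeWalker.DEFAULT.walk(new AnnotatingListener(), treeSegmentkopf);
// second pass: in Visitor.visitChildren, ctx.annotation can now be folded into the tags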

Impediments and solved problems

Coming up with that was nontrivial for me. I believe it is the unfortunate combination of me being rather new to JavaScript and ANTLR not really treating JavaScript as the first-class citizen the project claims it to be.

The default ANTLR npm installation put version 4.9 into place, which uses ES6 modules. What I have done so far with node.js uses CommonJS modules and module.exports, which works differently from the export/import of ES6. Just switching a few keywords did not do the trick, and I didn’t feel like investigating the SyntaxError: Cannot use import statement outside a module for too long. I found a project that ran into this trouble after upgrading; they found the handling of the corresponding ANTLR GitHub issue quite unhelpful and finally decided to get rid of ANTLR in their project completely. My solution was to postpone the problem and go with version 4.8, which does not have the issue and behaves as expected.

Second, being new to JavaScript, its typical APIs and its debugging, it was an extra burden to translate example code and API docs into JavaScript, since both are primarily available for Java. It’s not my final judgment, but I feel that the JavaScript runtime is not completely up to date with all features, nor as close to the Java API as is claimed. That may be an oversight, an API that simply hasn’t been updated, or a deliberate difference due to a more JavaScript-typical API design. One example are the token and rule names which make up my output. The Java side documents Vocabulary, getRuleNames() and more; in the end, the JavaScript runtime just puts these names into plain arrays on the parser. For a rule name, you get the index via ctx.ruleIndex, for a terminal via getSymbol().type, which at least in the latter case resembles the Java a bit. That was just confusing and I’m still not quite certain this is how things are intended. But due to the lack of official JavaScript documentation, that’s what I’m using now. By the way, the ANTLR npm module does not come with API docs, so Visual Studio Code could not offer useful API information.
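
Condensed into two small helpers, the lookups I ended up with look like this (assuming a parser instance from getParser() and nodes from its parse tree):

// The name lookups described above, condensed into helpers.
function ruleNameOf(parser, ruleCtx) {
    return parser.ruleNames[ruleCtx.ruleIndex];                  // e.g. 'segmentkopf'
}

function tokenNameOf(parser, terminalNode) {
    return parser.symbolicNames[terminalNode.getSymbol().type];  // e.g. 'DT_AN'
}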

Finally, the introductory example for the JavaScript target helped my purpose a lot, but it’s not code you will come up with just from reading the docs or the book, I feel. Ah, the book. It’s the official documentation and introduction to ANTLR, and without it one misses quite a few things. It’s useful to get your hands on a copy if you intend to work with ANTLR seriously. But then again, the accept() approach is maybe too new; it is not mentioned in the book at all.

Less relevant notes:

  • Not related to my solution, but I stumbled across reports where the C++-based parser behaved poorly (many times slower than the Java implementation), which is a bit off-putting. It would be uncool to experience performance issues in otherwise blazing fast C++ if I were to implement a larger grammar there. (I probably won’t.) The bug report was initially declined as a usage problem rather than something ANTLR was responsible for; the situation was still unclear last time I checked.
  • The build and release process seems to be not so polished either; I read about some trouble with the C# release, I believe, where I really disliked the tone when things didn’t go as expected.

After all that ranting I admit that the project and its code will most likely be stable enough for my needs. After all, I might only need the parse tree, and the rules won’t get too complicated. The community is active, and I see my first impressions confirmed: with GitHub, a mailing list and Stack Overflow there are several entry points for help with otherwise undocumented solutions.

So after finally putting all this together, I’m looking forward to continuing with the coding. I plan to create some colorized HTML.