afdko

MakeOTFEXE Feature File Parser Notes

Document version 0.2 Last updated 24 May 2021

1. Introduction

In 2021 the code in makeotfexe that parses and processes feature files was upgraded from a pccts (Antlr 1) implementation in C to an Antlr 4 implementation in C++. One reason for the change was to provide a more contemporary and better documented context for implementation of future changes to the format of feature files, including additional commands for variable fonts. This document discusses the new system, sometimes in contrast to the previous system, to aid those future changes.

There is a healthy amount of Antlr 4 documentation available online, and also a book you can buy.

2. Antlr Files

The pccts-based parser had a single source file with lexical tokens, the feature file grammar, and snippets of C code (mostly calls to functions defined in feat.c) to process the files. In the Antlr 4 implementation the lexer is defined primarily in FeatLexerBase.g4 and the file grammar in FeatParser.g4. Neither of these includes target-language code, so both could be used for feature file parsing in other target languages such as Java or Python.

The additional file FeatLexer.g4 imports FeatLexerBase.g4 and has a small amount of C++ code to recognize anon blocks. It is this file that defines the actual Lexer and the generated files are accordingly FeatLexer.h and FeatLexer.cpp.

The parser is similarly implemented by FeatParser.h and FeatParser.cpp and there is also an abstract FeatParserVisitor class and a FeatParserBaseVisitor implementation generated by Antlr 4. FeatParser.h is the most useful file to refer to; it has the naming conventions and internal structure of each of the nodes of the parse tree.

2.1 Generating

All of the derived files can currently be regenerated by running python BuildGrammar.py in the hotconv source directory. This assumes that antlr4 is installed, is in your path, and has a version matching the one hard-coded in the script. The command also has a -c option that removes all the generated files. (However, because the files are tracked in git, you typically want to include them in any updates.)

2.2 Antlr Runtime and Versions

The root CMakeLists.txt file has a line like set(ANTLR4_TAG tags/4.9.2). hotconv/BuildGrammar.py has a matching line antlr_version = "4.9.2". These should be updated together to ensure the runtime (which is pulled down from the Antlr 4 git repository) matches the generated files. When you update the version remember to clean and regenerate the grammar.

3. Other Files

The new files FeatCtx.h and FeatCtx.cpp correspond to the old feat.c. This C++ class mostly consists of utility and adapter code that should be recognizable to people familiar with the previous system. The new files FeatVisitor.h and FeatVisitor.cpp correspond to the snippets of C in the old featgram.g, but in contrast with FeatCtx the new code is quite different.

4. `FeatVisitor` and the Visitor Semantic

Antlr 4 can be used in different ways but its authors recommend using the parser to build a parse tree and then traversing that tree with code written in the target language. Antlr can optionally produce “listeners” and “visitors” and the makeotfexe code uses the latter. In effect there is one virtual method corresponding to each of the types of node in the tree. The default implementation for a node just calls the method for each child node passing it the child context. One processes the tree by replacing the default implementation for a given node with one to do the processing.
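
The default-versus-override behavior described above can be sketched without Antlr at all. The following is a minimal illustration only: Node, BlockNode, StatementNode, Visitor, StatementCollector and collectExample are hypothetical stand-ins invented for this sketch, not the classes that Antlr 4 generates.

```cpp
#include <cassert>
#include <initializer_list>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for generated parse tree classes (not the Antlr API).
struct Visitor;
struct BlockNode;
struct StatementNode;

struct Node {
    std::vector<std::unique_ptr<Node>> children;
    virtual ~Node() = default;
    virtual void accept(Visitor &v) = 0;
};

struct Visitor {
    virtual ~Visitor() = default;
    // Default behavior: do nothing here, just descend into each child.
    virtual void visitChildren(Node &n) {
        for (auto &c : n.children) c->accept(*this);
    }
    // One virtual method per node type.
    virtual void visitBlock(BlockNode &n);
    virtual void visitStatement(StatementNode &n);
};

struct BlockNode : Node {
    std::string tag;
    void accept(Visitor &v) override { v.visitBlock(*this); }
};

struct StatementNode : Node {
    std::string text;
    void accept(Visitor &v) override { v.visitStatement(*this); }
};

inline void Visitor::visitBlock(BlockNode &n) { visitChildren(n); }
inline void Visitor::visitStatement(StatementNode &n) { visitChildren(n); }

// Overrides only the node type it cares about; blocks keep the default.
struct StatementCollector : Visitor {
    std::vector<std::string> seen;
    void visitStatement(StatementNode &n) override {
        seen.push_back(n.text);
        visitChildren(n);
    }
};

// Builds a block with two statements and returns the statements visited.
inline std::vector<std::string> collectExample() {
    BlockNode block;
    block.tag = "liga";
    for (const char *t : {"sub f i by f_i;", "sub f l by f_l;"}) {
        auto s = std::make_unique<StatementNode>();
        s->text = t;
        block.children.push_back(std::move(s));
    }
    StatementCollector v;
    block.accept(v);
    assert(v.seen.size() == block.children.size());
    return v.seen;
}
```

Because StatementCollector does not override visitBlock(), the block node falls through to the default, which is what lets the traversal reach the statements underneath it.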

Here, for example, is the grammar of a featureBlock:

featureBlock:
    FEATURE starttag=tag USE_EXTENSION? LCBRACE
    featureStatement+
    RCBRACE endtag=tag SEMI
;

This is the corresponding Context class in FeatParser.h:

  class  FeatureBlockContext : public antlr4::ParserRuleContext {
  public:
    FeatParser::TagContext *starttag = nullptr;
    FeatParser::TagContext *endtag = nullptr;
    FeatureBlockContext(antlr4::ParserRuleContext *parent, size_t invokingState);
    virtual size_t getRuleIndex() const override;
    antlr4::tree::TerminalNode *FEATURE();
    antlr4::tree::TerminalNode *LCBRACE();
    antlr4::tree::TerminalNode *RCBRACE();
    antlr4::tree::TerminalNode *SEMI();
    std::vector<TagContext *> tag();
    TagContext* tag(size_t i);
    antlr4::tree::TerminalNode *USE_EXTENSION();
    std::vector<FeatureStatementContext *> featureStatement();
    FeatureStatementContext* featureStatement(size_t i);


    virtual antlrcpp::Any accept(antlr4::tree::ParseTreeVisitor *visitor) override;

  };

And this is a simplified version of the present visitor method for featureBlock nodes:

antlrcpp::Any FeatVisitor::visitFeatureBlock(FeatParser::FeatureBlockContext *ctx) {
    if ( stage == vExtract ) {
        Tag t = checkTag(ctx->starttag, ctx->endtag);
        TOK(ctx);
        fc->startFeature(t);
        if ( ctx->USE_EXTENSION() != nullptr )
            fc->flagExtension(false);
    }

    for (auto i : ctx->featureStatement())
        visitFeatureStatement(i);

    if ( stage == vExtract ) {
        TOK(ctx->endtag);
        fc->endFeature();
    }
    return nullptr;
}

First note the antlr4::tree::TerminalNode pointers in the context object. These can be checked for nullptr when keywords are optional (as with USE_EXTENSION), otherwise they can generally be ignored.

checkTag() in visitFeatureBlock() is a utility method that verifies the start and end tags are equal (or outputs an error) and returns the tag. startFeature() is a FeatCtx method that prepares for new feature statements and endFeature() is a corresponding method that wraps up feature processing. In between, the method calls visitFeatureStatement() on each child featureStatement node in order, fulfilling its role as a “visitor”.
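
To make the start/end tag check concrete, here is a rough sketch of what such a utility might look like. It is not the actual checkTag() implementation: the packed four-byte Tag representation, the packTag() helper, the string-based signature, and the error message text are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: a tag is four bytes packed into one integer.
using Tag = std::uint32_t;

// Hypothetical helper: packs up to four characters, padding with spaces.
inline Tag packTag(const std::string &s) {
    assert(!s.empty() && s.size() <= 4);
    Tag t = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        unsigned char c = i < s.size() ? static_cast<unsigned char>(s[i]) : ' ';
        t = (t << 8) | c;
    }
    return t;
}

// Hypothetical checker: records an error on mismatch and always
// returns the start tag so the caller can continue processing.
inline Tag checkTagSketch(const std::string &start, const std::string &end,
                          std::vector<std::string> &errors) {
    Tag st = packTag(start);
    if (st != packTag(end))
        errors.push_back("End tag '" + end + "' does not match start tag '" +
                         start + "'");
    return st;
}
```

Returning the start tag even on a mismatch lets the caller keep going and report further problems in the same run, which is one plausible reason for a check of this shape.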

The stage guards represent two stages of tree processing: vInclude and vExtract. vInclude only involves opening and parsing included feature files, and therefore the vInclude processing stage needs to reach each include node without doing anything else. The parse tree is processed in vExtract.
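
The two-stage idea can be sketched as running the same walker twice with a different stage value each time. This is a simplified illustration only: the flat Item list and the TwoStageWalker class are hypothetical, and the sketch does not model how the contents of an included file are processed during vExtract.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <vector>

enum Stage { vInclude, vExtract };

// Hypothetical flattened "tree": each item is an include or a statement.
struct Item {
    bool isInclude;
    std::string text;
};

struct TwoStageWalker {
    Stage stage = vInclude;
    std::vector<std::string> includesOpened;
    std::vector<std::string> statementsProcessed;

    // Every node is reached in both stages; the guards decide what happens.
    void visitItem(const Item &it) {
        if (it.isInclude) {
            if (stage == vInclude) includesOpened.push_back(it.text);
        } else if (stage == vExtract) {
            statementsProcessed.push_back(it.text);
        }
    }

    // Walks the whole input once per stage, includes first.
    void run(const std::vector<Item> &items) {
        for (Stage s : {vInclude, vExtract}) {
            stage = s;
            for (const Item &it : items) visitItem(it);
        }
        assert(stage == vExtract);  // both passes have completed
    }
};
```

Calling run() on a list with one include and two statements would leave one entry in includesOpened and two in statementsProcessed, with no statement touched during the first pass.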

The remaining unreferenced syntax (besides the antlrcpp::Any return value, which is an unused Antlr-ism) is TOK(). This is actually overloaded, with one method and two method templates in FeatVisitor.h that accept and return tree nodes or tokens. TOK should be called on a relevant child node, or sometimes on the current node (as in TOK(ctx)), before calling out to a FeatCtx method; it sets the token used to report the line number and character offset of a warning or error.
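
One way an overload set like this can be arranged is sketched below. It is an assumption-laden illustration rather than the code in FeatVisitor.h: Token, TerminalNode, RuleContext, TagContext and TokenTracker are hypothetical stand-ins for the Antlr and FeatVisitor types, and the use of std::enable_if to separate the two templates is a guess at a workable technique.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for Antlr's token and tree node types.
struct Token { std::size_t line; std::size_t column; };
struct TerminalNode {
    Token *symbol = nullptr;
    Token *getSymbol() const { return symbol; }
};
struct RuleContext {
    Token *start = nullptr;
    Token *getStart() const { return start; }
};
struct TagContext : RuleContext {};

class TokenTracker {
public:
    // Plain overload: a raw token becomes the current token.
    Token *TOK(Token *t) { current_ = t; return t; }

    // Template for rule contexts: remember the context's first token.
    template <class T, typename std::enable_if<
                  std::is_base_of<RuleContext, T>::value, int>::type = 0>
    T *TOK(T *ctx) { if (ctx) current_ = ctx->getStart(); return ctx; }

    // Template for terminal nodes: remember the node's own token.
    template <class T, typename std::enable_if<
                  std::is_base_of<TerminalNode, T>::value, int>::type = 0>
    T *TOK(T *node) { if (node) current_ = node->getSymbol(); return node; }

    const Token *current() const { return current_; }

    // Line number a diagnostic would be attributed to.
    std::size_t line() const { assert(current_ != nullptr); return current_->line; }

private:
    Token *current_ = nullptr;  // token used for the next warning or error
};
```

Each overload returns its argument unchanged, so a call can wrap an expression in place (for example, passing the result of TOK on a child node straight to another function) while still updating the token used for diagnostics.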

5. Include directives

If you need to add another context that supports an include directive just follow the model of one of the existing contexts. Don’t forget to add a new EOF-including File node toward the bottom of FeatParser.g4.

6. Document revisions

v0.2 [24 May 2021]: Update when feature complete

v0.1 [11 May 2021]: First version
