Copyright 2021 Adobe. All Rights Reserved. This software is licensed as OpenSource, under the Apache License, Version 2.0. This license is available at: http://opensource.org/licenses/Apache-2.0.
Document version 0.2 Last updated 24 May 2021
In 2021 the code in makeotfexe that parses and processes feature files was
upgraded from a pccts (Antlr 1) implementation in C to an Antlr 4
implementation in C++. One reason for the change was to provide a more
contemporary and better documented context for implementation of future changes
to the format of feature files, including additional commands for variable
fonts. This document discusses the new system, sometimes in contrast to the
previous system, to aid those future changes.
A healthy amount of Antlr 4 documentation is available online, and there is also a book you can buy.
The pccts-based parser had a single source file containing the lexical tokens,
the feature file grammar, and snippets of C code (mostly calls to functions
defined in feat.c) to process the files. In the Antlr 4 implementation the
lexer is defined primarily in FeatLexerBase.g4 and the file grammar in
FeatParser.g4. Neither of these includes target-language code, so both could
be used for feature file parsing in other target languages such as Java or
Python.
The additional file FeatLexer.g4 imports FeatLexerBase.g4 and has a small
amount of C++ code to recognize anon blocks. It is this file that defines
the actual lexer, and the generated files are accordingly FeatLexer.h and
FeatLexer.cpp.
The parser is similarly implemented by FeatParser.h and FeatParser.cpp, and
there is also an abstract FeatParserVisitor class and a FeatParserBaseVisitor
implementation generated by Antlr 4. FeatParser.h is the most useful file to
refer to; it has the naming conventions and internal structure of each of the
nodes of the parse tree.
All of the derived files can currently be regenerated by running
python BuildGrammar.py in the hotconv source directory. This assumes that
antlr4 is installed, is in your path, and has a version matching the one
hard-coded in the script. The command also has a -c option that removes all
the generated files. (However, because the files are tracked in git you
typically want to include them in any updates.)
The root CMakeLists.txt file has a line like set(ANTLR4_TAG tags/4.9.2).
hotconv/BuildGrammar.py has a matching line antlr_version = "4.9.2". These
should be updated together to ensure the runtime (which is pulled down from
the Antlr 4 git repository) matches the generated files. When you update the
version, remember to clean and regenerate the grammar.
The new files FeatCtx.h and FeatCtx.cpp correspond to the old feat.c.
This C++ class mostly consists of utility and adapter code that should be
recognizable to people familiar with the previous system. The new files
FeatVisitor.h and FeatVisitor.cpp correspond to the snippets of C in
the old featgram.g, but in contrast with FeatCtx the new code is quite
different.
FeatVisitor and the Visitor Semantic
Antlr 4 can be used in different ways, but its authors recommend using the
parser to build a parse tree and then traversing that tree with code written in
the target language. Antlr can optionally produce “listeners” and “visitors”,
and the makeotfexe code uses the latter. In effect there is one virtual
method corresponding to each of the types of node in the tree. The default
implementation for a node just calls the method for each child node, passing it
the child context. One processes the tree by replacing the default
implementation for a given node with one that does the processing.
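The default-versus-override behaviour described above can be sketched without the Antlr runtime. The following is only a miniature stand-in: Node, Kind, BaseVisitor, EventVisitor and walkSample are hypothetical names invented for illustration, and the generated FeatParserBaseVisitor is considerably richer.

```cpp
// Miniature stand-in for a generated visitor. All names here are
// hypothetical and are not part of makeotfexe or the Antlr 4 runtime.
#include <cassert>
#include <string>
#include <vector>

enum class Kind { File, FeatureBlock, Statement };

struct Node {
    Kind kind;
    std::vector<Node *> children;  // non-owning, to keep the sketch short
};

class BaseVisitor {
 public:
    virtual ~BaseVisitor() = default;

    // Dispatch on the node type, the job accept() does in generated code.
    void visit(Node *ctx) {
        assert(ctx != nullptr);
        switch (ctx->kind) {
            case Kind::File: visitFile(ctx); break;
            case Kind::FeatureBlock: visitFeatureBlock(ctx); break;
            case Kind::Statement: visitStatement(ctx); break;
        }
    }

    // One virtual method per node type; the default just visits the children.
    virtual void visitFile(Node *ctx) { visitChildren(ctx); }
    virtual void visitFeatureBlock(Node *ctx) { visitChildren(ctx); }
    virtual void visitStatement(Node *ctx) { visitChildren(ctx); }

 protected:
    void visitChildren(Node *ctx) {
        for (Node *child : ctx->children) visit(child);
    }
};

// Overrides only the node types it wants to process.
class EventVisitor : public BaseVisitor {
 public:
    std::vector<std::string> events;

    void visitFeatureBlock(Node *ctx) override {
        events.push_back("start feature");
        visitChildren(ctx);  // an override must descend explicitly
        events.push_back("end feature");
    }
    void visitStatement(Node *) override { events.push_back("statement"); }
};

// Builds File -> FeatureBlock -> {Statement, Statement} and walks it.
std::vector<std::string> walkSample() {
    Node s1{Kind::Statement, {}}, s2{Kind::Statement, {}};
    Node block{Kind::FeatureBlock, {&s1, &s2}};
    Node file{Kind::File, {&block}};
    EventVisitor v;
    v.visit(&file);
    return v.events;
}
```

Because visitFile is not overridden, the walk reaches the feature block through the default implementation, while the overridden methods record events in tree order.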
Here, for example, is the grammar of a featureBlock:
    featureBlock:
        FEATURE starttag=tag USE_EXTENSION? LCBRACE
        featureStatement+
        RCBRACE endtag=tag SEMI
    ;
This is the corresponding Context class in FeatParser.h:
    class FeatureBlockContext : public antlr4::ParserRuleContext {
    public:
        FeatParser::TagContext *starttag = nullptr;
        FeatParser::TagContext *endtag = nullptr;
        FeatureBlockContext(antlr4::ParserRuleContext *parent, size_t invokingState);
        virtual size_t getRuleIndex() const override;
        antlr4::tree::TerminalNode *FEATURE();
        antlr4::tree::TerminalNode *LCBRACE();
        antlr4::tree::TerminalNode *RCBRACE();
        antlr4::tree::TerminalNode *SEMI();
        std::vector<TagContext *> tag();
        TagContext* tag(size_t i);
        antlr4::tree::TerminalNode *USE_EXTENSION();
        std::vector<FeatureStatementContext *> featureStatement();
        FeatureStatementContext* featureStatement(size_t i);
        virtual antlrcpp::Any accept(antlr4::tree::ParseTreeVisitor *visitor) override;
    };
And this is a simplified version of the present visitor method for featureBlock
nodes:
    antlrcpp::Any FeatVisitor::visitFeatureBlock(FeatParser::FeatureBlockContext *ctx) {
        if ( stage == vExtract ) {
            Tag t = checkTag(ctx->starttag, ctx->endtag);
            TOK(ctx);
            fc->startFeature(t);
            if ( ctx->USE_EXTENSION() != nullptr )
                fc->flagExtension(false);
        }
        for (auto i : ctx->featureStatement())
            visitFeatureStatement(i);
        if ( stage == vExtract ) {
            TOK(ctx->endtag);
            fc->endFeature();
        }
        return nullptr;
    }
First note the antlr4::tree::TerminalNode pointers in the context object.
These can be checked for nullptr when keywords are optional (as with
USE_EXTENSION); otherwise they can generally be ignored.
checkTag() in visitFeatureBlock() is a utility method that verifies the start
and end tags are equal (or outputs an error) and returns the tag.
startFeature() is a FeatCtx method that prepares for new feature statements
and endFeature() is a corresponding method that wraps up feature processing.
In between, the method calls visitFeatureStatement() on each child
featureStatement node in order, fulfilling its role as a “visitor”.
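The tag check described above amounts to "compare, report, return". A minimal sketch follows, assuming tags are plain strings and errors are collected in a vector; TagChecker is a hypothetical class, and the real checkTag() takes TagContext pointers and reports through the makeotfexe message machinery.

```cpp
// Hypothetical sketch of a start/end tag check. The real checkTag() in
// FeatVisitor works on parse tree contexts, not on strings.
#include <cassert>
#include <string>
#include <vector>

class TagChecker {
 public:
    std::vector<std::string> errors;  // stand-in for real error reporting

    // Returns the start tag, recording an error if the end tag differs.
    std::string checkTag(const std::string &start, const std::string &end) {
        assert(!start.empty());
        if (start != end)
            errors.push_back("end tag '" + end +
                             "' does not match start tag '" + start + "'");
        return start;
    }
};
```

Returning the start tag even on a mismatch lets the caller carry on processing the block after the error has been noted, which is one plausible reading of "outputs an error and returns the tag".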
The stage guards reflect the two stages of tree processing: vInclude and
vExtract. The vInclude stage only opens and parses included feature files,
so it needs to reach each include node without doing anything else. The parse
tree is actually processed in vExtract.
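The two-stage traversal can be illustrated with a small self-contained sketch. Only the stage names vInclude and vExtract come from the text above; Node, Kind, StagedVisitor and runBothStages are hypothetical, and the sketch merely records file names where the real code would open and parse them.

```cpp
// Hypothetical sketch of a visitor that walks the same tree twice,
// guarding its work with a stage flag.
#include <cassert>
#include <string>
#include <vector>

enum Stage { vInclude, vExtract };
enum class Kind { File, Include, FeatureBlock, Statement };

struct Node {
    Kind kind;
    std::string text;              // file name, feature tag or statement
    std::vector<Node *> children;  // non-owning, to keep the sketch short
};

class StagedVisitor {
 public:
    Stage stage = vInclude;
    std::vector<std::string> opened;     // includes noted during vInclude
    std::vector<std::string> extracted;  // work done during vExtract

    void visit(Node *ctx) {
        assert(ctx != nullptr);
        switch (ctx->kind) {
            case Kind::File:
                visitChildren(ctx);
                break;
            case Kind::Include:
                // The real code would open and parse the file here.
                if (stage == vInclude) opened.push_back(ctx->text);
                break;
            case Kind::FeatureBlock:
                if (stage == vExtract) extracted.push_back("start " + ctx->text);
                visitChildren(ctx);  // descend in both stages
                if (stage == vExtract) extracted.push_back("end " + ctx->text);
                break;
            case Kind::Statement:
                if (stage == vExtract) extracted.push_back(ctx->text);
                break;
        }
    }

 private:
    void visitChildren(Node *ctx) {
        for (Node *child : ctx->children) visit(child);
    }
};

// Walks a small tree once per stage and returns the visitor for inspection.
StagedVisitor runBothStages() {
    Node inc{Kind::Include, "common.fea", {}};
    Node stmt{Kind::Statement, "sub a by b", {}};
    Node block{Kind::FeatureBlock, "liga", {&inc, &stmt}};
    Node file{Kind::File, "main.fea", {&block}};
    StagedVisitor v;
    v.stage = vInclude;
    v.visit(&file);
    v.stage = vExtract;
    v.visit(&file);
    return v;
}
```

The feature block descends into its children in both stages, which is what allows the vInclude pass to reach an include nested inside a block while skipping everything else.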
The remaining unreferenced syntax (besides the antlrcpp::Any return value,
which is an unused Antlr-ism) is the TOK() method. This is actually
overloaded: FeatVisitor.h has one method and two method templates that
accept and return tree nodes or tokens. TOK() should be called on a relevant
child node, or sometimes on the current node (as in TOK(ctx)), before calling
out to a FeatCtx method, so that the token used to report the line number and
character offset of a warning or error is set.
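One way a method plus two method templates could be arranged is sketched below. This is an assumption about the general shape, not a copy of FeatVisitor.h: Token, TerminalNode, RuleContext and TagContext are simplified stand-ins for the Antlr runtime types, and TokenTracker and lineAfterTok are hypothetical.

```cpp
// Hypothetical sketch of an overloaded TOK() that records the "current"
// token and hands its argument straight back to the caller.
#include <cassert>
#include <type_traits>

struct Token { int line; int column; };

// Simplified stand-ins for the runtime's terminal node and rule context.
struct TerminalNode {
    Token *symbol = nullptr;
    Token *getSymbol() const { return symbol; }
};
struct RuleContext {
    Token *start = nullptr;
    Token *getStart() const { return start; }
};
struct TagContext : RuleContext {};

class TokenTracker {
 public:
    const Token *current = nullptr;  // token used for later diagnostics

    // Plain method: a bare token.
    Token *TOK(Token *t) { current = t; return t; }

    // Template for terminal nodes: use the node's symbol.
    template <class T, typename std::enable_if<
                  std::is_base_of<TerminalNode, T>::value, int>::type = 0>
    T *TOK(T *node) { current = node->getSymbol(); return node; }

    // Template for rule contexts: use the first token of the rule.
    template <class T, typename std::enable_if<
                  std::is_base_of<RuleContext, T>::value, int>::type = 0>
    T *TOK(T *ctx) { current = ctx->getStart(); return ctx; }
};

// Shows that TOK() returns its argument with the static type preserved.
int lineAfterTok() {
    Token t{7, 4};
    TagContext tag;
    tag.start = &t;
    TokenTracker tracker;
    TagContext *same = tracker.TOK(&tag);
    assert(same == &tag);
    return tracker.current->line;
}
```

Returning the argument with its original type means a TOK() call can be nested inside another call's argument list, and the templates keep the derived context type intact rather than decaying to a base pointer.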
If you need to add another context that supports an include directive, just
follow the model of one of the existing contexts. Don’t forget to add a new
EOF-including File node toward the bottom of FeatParser.g4.
v0.2 [24 May 2021]: Update when feature complete
v0.1 [11 May 2021]: First version