ANTLR Parsing and C++, Part 2: Building a Python Parser

Matt Scarpino

5.00/5 (3 votes)

1 Aug 2021CPOL11 min read

11.9K

556

Explains how to generate Python parsing code with ANTLR and use the code to create a Python parser in C++

The preceding article presented the basics of ANTLR parser generation and explained how to access the generated code in C++. This article goes further and shows how to use ANTLR to code an advanced Python parser in C++.

1 Introduction

The preceding article explained how ANTLR-generated code can be accessed in a C++ application. The grammar was trivially simple, so the generated code wasn't suitable for demonstrating ANTLR's advanced capabilities. This article generates parsing code from a Python grammar and presents an application that uses the code to analyze a Python script.

This article presents a number of advanced features related to ANTLR grammars. Then it explains how to analyze parse trees using listeners and visitors.

2 Parsing Python

You can find several grammar files on ANTLR's Github page. In the Python folder, you'll find different grammars related to the Python programming language. This article is concerned with the grammar file in the python/python3-cpp folder, Python3.g4. This is provided in this article's Python3_grammar.zip file, and you can generate code from it by entering the following command:

java -jar antlr-4.9.2-complete.jar -Dlanguage=Cpp -visitor Python3.g4

The -visitor flag is new, and it tells ANTLR to generate code for visitor classes. I'll explain what visitor classes are later in this article.

Right now, I'd like to present an overview of Python's grammar. Then I'll explain how to build this article's example application, which analyzes the parse tree of a short program using a listener and a visitor.

2.1 Fundamental Parsing Rules

Python has become immensely popular and a large part of the reason is simplicity. Python is easy to read, easy to code, and easy to debug. It's also easy to parse, which makes it ideal for advanced ANTLR development. This discussion focuses on parsing the code in the test.py file, which is presented in Listing 1.

Listing 1: test.py

Python

letters = ["a", "b", "c"]
for letter in letters:
    print(letter)
print("Done")

As the parser analyzes this text, it forms a parse tree whose nodes represent grammar rules. Figure 1 shows what the upper part of the parse tree looks like.

Figure 1: Example Python Parse Tree (Abridged)

To understand the nodes of this tree, you need to be familiar with the rules of the Python grammar. An input file consists of statements and newline characters. Statements can be simple or compound, and a simple statement consists of one or more small statements followed by newline characters. Python3.g4 expresses these relationships in the following way:

file_input: (NEWLINE | stmt)* EOF;

stmt: simple_stmt | compound_stmt;

simple_stmt: small_stmt (';' small_stmt)* (';')? NEWLINE;

small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt);

The test.py code has three statements: two simple statements and one compound statement. The following discussion presents the rules for simple statements and then introduces the rules for compound statements.

2.1.1 Simple Statements

The first line of test.py is a simple statement. To be specific, it's an expression represented by the expr_stmt rule:

expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) | 
    ('=' (yield_expr|testlist_star_expr))*);

In the first line of test.py, the letters variable name and the assigned array are both identified by the testlist_star_expr rule:

testlist_star_expr: (test|star_expr) (',' (test|star_expr))* (',')?;

The test rule recognizes any general expression and is defined in the following way:

test: or_test ('if' or_test 'else' test)? | lambdef;

test_nocond: or_test | lambdef_nocond;

or_test: and_test ('or' and_test)*;

and_test: not_test ('and' not_test)*;

not_test: 'not' not_test | comparison;

comparison: expr (comp_op expr)*;

If the test node doesn't contain if, and, or, or not, it boils down to an expr. The expr rules are given as follows:

expr: xor_expr ('|' xor_expr)*;

xor_expr: and_expr ('^' and_expr)*;

and_expr: shift_expr ('&' shift_expr)*;

shift_expr: arith_expr (('<<'|'>>') arith_expr)*;

arith_expr: term (('+'|'-') term)*;

term: factor (('*'|'@'|'/'|'%'|'//') factor)*;

factor: ('+'|'-'|'~') factor | power;

power: atom_expr ('**' factor)?;

atom_expr: (AWAIT)? atom trailer*;

If the expr doesn't contain logical or mathematical operators, it boils down to an atom, which is defined with the following rule:

atom: ('(' (yield_expr|testlist_comp)? ')' |
       '[' (testlist_comp)? ']' |
       '{' (dictorsetmaker)? '}' |
       NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False');

At this point, we can see how the parser will analyze the first line of test.py. The letters variable is represented by the NAME token.

NAME: ID_START ID_CONTINUE*;

In this lexer rule, ID_START is a fragment that identifies all the characters that a Python variable name can use. ID_CONTINUE identifies all the characters that can make up the rest of the name.

On the right side of the assignment, the list of three items is represented by a testlist_comp.

testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* (',')? );

This indicates that a testlist_comp is essentially a comma-separated set of test nodes. We looked at the test rules earlier.

2.1.2 Compound Statements

In the test.py script, the second line of code contains a compound statement. According to the Python grammar, a compound statement can have one of nine forms:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | 
    classdef | decorated | async_stmt;

The for_stmt rule represents for loops. It's defined in the following way:

for_stmt: 'for' exprlist 'in' testlist ':' suite ('else' ':' suite)?;

The exprlist and testlist rules are both straightforward.

exprlist: (expr|star_expr) (',' (expr|star_expr))* (',')?;

testlist: test (',' test)* (',')?;

After the first line of the for loop, the following lines are represented by the suite rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT;

This indicates that the for loop must be followed by a simple statement or by one or more indented/dedented statements. The INDENT and DEDENT tokens are particularly interesting, and I'll discuss them shortly.

You can look through the other compound statements in a similar way. You don't need to be an expert on Python or its grammar to follow this article, but the better you understand its grammar, the better you'll understand its parse trees.

2.2 Building the Example Project

To demonstrate Python parsing in C++, this article provides two zip files containing project code. The first zip file, Python3_vs.zip, contains project code for Visual Studio. The second, Python3_gnu.zip, contains code for GNU-based build systems.

Like the project from the preceding article, these projects access ANTLR-generated code through the main.cpp source file. Listing 2 presents its code.

Listing 2: main.cpp

C++

int main(int argc, const char* argv[]) {
    
    // Create an input file stream
    std::ifstream infile("test.py");

    // Create an ANTLR stream from the file stream
    antlr4::ANTLRInputStream input(infile);

    // Create a lexer from the ANTLR stream
    Python3Lexer lexer(&input);

    // Create a token stream from the lexer
    antlr4::CommonTokenStream tokens(&lexer);

    // Create a parser from the token stream
    Python3Parser parser(&tokens);    

    // Associate a visitor with the Suite context
    Python3BaseVisitor visitor;
    std::string str = visitor.visitSuite(parser.suite()).as<std::string>();
    std::cout << "Visitor output: " << str << std::endl;

    // Associate a listener with the file_input context
    std::cout << "Listener output" << std::endl;
    antlr4::tree::ParseTreeWalker walker;
    antlr4::ParserRuleContext* fileInput = parser.file_input();
    Python3BaseListener* listener = new Python3BaseListener();
    walker.walk(listener, fileInput);

    return 0;
}

As in the preceding article, the main function creates an ANTLRInputStream containing the input text. Then it creates a lexer (Python3Lexer) and a parser (Python3Parser). After creating the parser, the function analyzes the parse tree with a visitor (Python3BaseVisitor) and a listener (Python3BaseListener).

This article explains listeners and visitors later on. First, it's important to understand some of the advanced capabilties used in the Python3.g4 grammar file.

3 Advanced Grammar Features

The Python3.g4 grammar file is much more complex than the Expression.g4 grammar presented in the preceding article. In addition to rules and the grammar identification, it contains three new features:

A tokens section that identifies tokens outside of lexer rules
Code blocks following names like @lexer::header and @lexer::members
Code blocks following rule definitions

This section looks at each of these features and explains how they're used.

3.1 The tokens Section

After the grammar identification, the Python3.g4 grammar has the following line:

Python

tokens { INDENT, DEDENT }

This creates two tokens that don't have associated lexer rules. They don't have rules because they require special processing. This processing is defined using actions, which I'll discuss next.

3.2 Named Actions

In Python, code blocks are indicated by indentation, not punctuation. If consecutive lines have the same indentation, the interpreter assumes that they belong to the same code block. The exact number of spaces in the indentation isn't important, but every line in a block must be indented the same number of spaces.

The tokens section creates the INDENT token to represent moving text right and the DEDENT token to represent moving text left. Because of Python's special requirements, these tokens can't be defined with lexer rules. Instead, the grammar needs to provide special code that creates these tokens. This code is contained in curly braces following @lexer::members.

C++

@lexer::members {
  private:
  std::vector<std::unique_ptr<antlr4::Token>> m_tokens;
  std::stack<int> m_indents;
  ...
  
  public:
  ...
}

This code takes care of processing INDENT and DEDENT tokens. It's surrounded by curly braces, so it identifies an action, and will be inserted into the ANTLR-generated code. This action follows @lexer::members, so it will be added to the lexer's properties and functions. Actions can also follow names like @lexer::header, @parser::header, and @parser::members.

If an action follows @header, it will be added to the header region of the generated lexer and parser. If an action follows @members, it will be added to the members of the generated lexer class and the generated parser class.

3.3 Anonymous Actions

Most of the actions in Python3.g4 don't follow names like @parser::header or @lexer::members. Instead, they're located inside or after rules. For example, the NEWLINE rule is given as follows:

NEWLINE: ( {atStartOfInput()}? SPACES | ( '\r'? '\n' | '\r' | '\f' ) SPACES?) { ,,, }

This rule has two actions. The first calls the lexer's atStartOfInput() function during the rule's processing. The second is given by the block of code at the end of the rule. Because the second action is at the end, it will be called after the lexer finds a valid NEWLINE token.

If an action is associated with a lexer rule, its code can access attributes of the rule's token by preceding the token's name with $ and following it with a dot and the attribute. For example, if an action is associated with a rule for the ID token, it can access the token's line with $ID.line and the token's text with $ID.text.

4 Analyzing Parse Trees

Many applications monitor the parser's analysis and perform operations when specific rules are encountered. This is called walking the parse tree. ANTLR provides two mechanisms for walking parse trees: listeners and visitors. Listeners are easier to use but visitors provide more control.

4.1 Listeners

When ANTLR generates code from a grammar, it creates two listener classes: one ending in -Listener and one ending in -BaseListener. Figure 2 illustrates the class hierarchy of the listeners generated from the Python3 grammar.

Figure 2: Listener Class Hierarchy

If you look at the header files for the Python3Listener and Python3BaseListener classes, you'll see that they both have the same set of functions: two functions for each parser rule. For example, the Python3 grammar has a rule called if_stmt, so listener classes have functions called enterIf_stmt and exitIf_stmt. The enter function is called before the rule is processed and the exit function is called after the rule has been processed.

In the Python3Listener class, the enter/exit functions are pure virtual. In the Python3BaseListener class, the functions have empty code blocks. Therefore, if you want to code your own listener class, you should make it a subclass of Python3Listener. If you only want to provide code for a few enter/exit functions, you should modify the Python3BaseListener code.

An example will clarify how this works. In this article's example code, the Python3BaseListener contains a function that's called whenever the parser enters the for_stmt rule:

C++

void Python3BaseListener::enterFor_stmt(Python3Parser::For_stmtContext* forCtx) {

    // Display text of for statement context
    std::cout << "For statement text: " << forCtx->getText() << std::endl;

    // Display text of expr context
    Python3Parser::ExprContext* exprCtx = forCtx->testlist()->test()[0]->or_test()[0]->
                                and_test()[0]->not_test()[0]->comparison()->expr()[0];
    std::cout << "Expr text: " << exprCtx->getText() << std::endl;

    // Display text of atom context
    Python3Parser::AtomContext* atomCtx = exprCtx->xor_expr()[0]->and_expr()[0]->
    shift_expr()[0]->arith_expr()[0]->term()[0]->factor()[0]->power()->atom_expr()->atom();
    std::cout << "Atom text: " << atomCtx->getText() << std::endl;
}

This function accesses different rule contexts associated with the for_stmt. Then it calls getText() to print the string associated with three of the rule contexts.

Once you've written your listener functions, you need to associate the listener with the parser. Creating this association requires four steps:

Create an instance of the antl4::tree::ParseTreeWalker class.
Obtain a pointer to the parser rule context that you want the listener to process.
Create an instance of your listener.
Call the ParseTreeWalker's walk function with the listener and the parser rule context.

The following code, taken from main.cpp, shows how these four operations can be performed.

C++

antlr4::tree::ParseTreeWalker walker;
antlr4::ParserRuleContext* fileInput = parser.file_input();
Python3BaseListener* listener = new Python3BaseListener();
walker.walk(listener, fileInput);

Listeners are easy to use and understand, but they have two limitations. First, applications can't control which nodes are processed. For example, the preceding listener code runs for every for_stmt rule, and the application can't change this. Also, listener functions always return void, so there's no easy way to transfer information to other classes and functions. To make up for these shortcomings, ANTLR provides visitors.

4.2 Visitors

By default, ANTLR doesn't generate code for visitor classes. But if the -visitor flag is added to the generation command, ANTLR will generate four additional source files. For the Python3.g4 grammar, these files are Python3Visitor.h, Python3Visitor.cpp, Python3BaseVisitor.h, and Python3BaseVisitor.cpp.

The Python3Visitor.h file declares a class called Python3Visitor and a series of pure virtual functions. The Python3BaseVisitor.h file declares a subclass of Python3Visitor called Python3BaseVisitor. This subclass has the same functions as Python3Visitor, but these functions have code. If you want to create a new visitor class, it's a good idea to code a custom subclass of Python3Visitor. If you'd rather write code for individual visitor functions, add the functions to Python3BaseVisitor.cpp.

The names of visitor functions all start with visit and end with the name of a parser rule. Each visitor function provides a pointer to the corresponding rule context. For example, the function visitSuite is called to visit the context corresponding to the grammar's suite rule. It provides a pointer to the SuiteContext for each rule processed. The following code presents the definition of visitSuite in Python3BaseVisitor.h.

C++

virtual antlrcpp::Any visitSuite(Python3Parser::SuiteContext *ctx) override {
    return visitChildren(ctx);
}

By default, every visitor function calls visitChildren to access the children of the rule context. Also, every visitor function returns an antlrcpp::Any. The Any struct is based on Boost's Any class, and it can be set equal to any value. To access the value of an Any instance, you need to call its as function with the desired data type. For example, the following code calls visitSuite and accesses its return value as a string.

C++

std::string res = visitSuite(ctx).as<std::string>();

To understand visitor functions, it helps to look at an example. The following function in Python3BaseVisitor.cpp returns the text corresponding to the SuiteContext.

C++

antlrcpp::Any Python3BaseVisitor::visitSuite(Python3Parser::SuiteContext* ctx) {
    return ctx->getText();
}

To associate this visitor with the parser, the main function creates an instance of the visitor and calls its visitSuite function with the parser's SuiteContext. Then it casts the result to a string and prints it to standard output.

C++

Python3BaseVisitor visitor;
std::string str = visitor.visitSuite(parser.suite()).as<std::string>();
std::cout << "Visitor output: " << str << std::endl;

This example isn't very interesting, but it demonstrates two important features of visitors. First, the application can control which rules are visited by the visitor. Second, it shows how the return value of a visitor function can be set and accessed.

History

1^st August, 2021: Article was first submitted

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)