The ast package in Python's standard library provides a wealth of capabilities for parsing, analyzing, and generating Python code. This article discusses these capabilities and demonstrates how they can be used in code. It also introduces the capabilities of the astor package.
1. Introduction
As StackOverflow makes clear, Python's popularity has risen dramatically in recent years. As a result, more software tools need to be able to read and analyze Python code. The ast package, which ships with Python's standard library, provides many capabilities for this purpose, and the goal of this article is to introduce its features.
AST stands for abstract syntax tree, and a later section will explain what these trees are and why they're important. The ast package makes it possible to read Python code into ASTs, and each type of node in an AST is represented by its own Python class.
The last part of this article discusses the astor (AST observe/rewrite) package. This provides helpful functions for reading and writing ASTs.
2. Two Fundamental Functions
Before I get into the details of Python analysis, I'd like to start with a simple example that shows why the ast package is so useful. There are two fundamental functions to know:
parse(code_str)
- creates an abstract syntax tree from a string containing Python code
dump(node, annotate_fields=True, include_attributes=False, *, indent=None)
- converts an abstract syntax tree into a string
To demonstrate how these functions are used, the following code calls ast.parse to create an abstract syntax tree for a simple for loop. Then it calls ast.dump to convert the tree into a string.
import ast

tree = ast.parse("for i in range(10):\n\tprint('Hi there!')")
print(ast.dump(tree, indent=4))
When this code is executed, it produces the following result:
Module(
    body=[
        For(
            target=Name(id='i', ctx=Store()),
            iter=Call(
                func=Name(id='range', ctx=Load()),
                args=[
                    Constant(value=10)],
                keywords=[]),
            body=[
                Expr(
                    value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[
                            Constant(value='Hi there!')],
                        keywords=[]))],
            orelse=[])],
    type_ignores=[])
To a casual coder, this may look like a complete mess. But it's important to anyone trying to build tools that analyze Python code. This output identifies the structure of the code, and the structure is given in the form of an abstract syntax tree.
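As a minimal sketch of why this structure matters to tools, the following code summarizes the same for loop by counting the kinds of nodes in its tree. It relies on ast.walk, which visits every node in the tree; the variable names are my own choices for illustration.

```python
import ast
from collections import Counter

source = "for i in range(10):\n\tprint('Hi there!')"
tree = ast.parse(source)

# ast.walk yields every node in the tree (in no guaranteed order),
# so counting class names gives a quick summary of the structure.
node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

for name, count in sorted(node_counts.items()):
    print(f"{name}: {count}")
```

A tool could use a summary like this to detect, for example, how many function calls a piece of code contains without ever executing it.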
3. Abstract Syntax Trees (ASTs)
To make sense of the parser's result, it's important to understand abstract syntax trees, or ASTs. These trees embody the structure of a document's content, whether it's written in a programming language like Python or a natural language like English.
This section explains what ASTs are and then presents the classes that represent nodes of an AST. But before I introduce abstract syntax trees, I'd like to take a step back and explain what tree structures are.
3.1 Tree Structures
When data elements form a hierarchy beginning with a single element, the elements and their relationships can be expressed as a tree. Common trees include organization charts, file navigators, and family trees. Tree structures are frequently encountered in software development, particularly in networking, graphics, and text analysis.
When working with trees, developers rely on a common set of terms:
- Each element in a tree is called a node.
- The topmost element is called the root node.
- If a node is connected to nodes below it, the first node is called a parent node and the connected nodes are the children of the parent.
- Every node except the root has a parent node. A node with one or more children is a branch node and a node without children is called a leaf node.
Figure 1 depicts a simple tree. Node E is the root node and Nodes B, C, and D are its children. Nodes A and F are the children of B and Node G is the child of D. Nodes A, F, C, and G have no children, so they're leaf nodes. The other nodes are branch nodes.
Figure 1: A Simple Tree Hierarchy
Each node in the tree has a depth value that identifies how many connections separate it from the root. In this example, Node E has a depth of 0, Node C has a depth of 1, and Node G has a depth of 2.
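To make these terms concrete, here is a small sketch that stores the tree from Figure 1 in a dictionary and computes the depth of each node. The dictionary layout and the depths function are illustrative choices, not part of any library.

```python
# The tree from Figure 1, written as a mapping from each parent to its children.
children = {
    "E": ["B", "C", "D"],
    "B": ["A", "F"],
    "D": ["G"],
}

def depths(root, children):
    """Return a dict mapping every node to its number of connections from the root."""
    result = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            result[child] = result[node] + 1
            stack.append(child)
    return result

node_depths = depths("E", children)

# A leaf node is any node that never appears as a parent.
leaves = sorted(n for n in node_depths if n not in children)

print(node_depths["E"], node_depths["C"], node_depths["G"])  # 0 1 2
print(leaves)  # ['A', 'C', 'F', 'G']
```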
3.2 Abstract Syntax Trees (ASTs)
When I was in grade school, we had to analyze sentences using sentence trees. The root node represents the entire sentence and has two children: one for the subject and one for the predicate. In a simple sentence, subjects are represented by noun phrases and predicates are represented by verb phrases. Figure 2 presents the tree for the sentence: This sentence is simple.
Figure 2: Example Sentence Tree
In this tree, the leaf nodes contain the individual strings that make up the text. The branch nodes identify the purpose of each leaf node and the role it plays in the sentence.
If you can see how sentence trees represent English sentences, you won't have any trouble understanding how abstract syntax trees represent code written in Python. When ast.parse analyzes Python code, the root node takes one of four forms:
- module - collection of statements
- function type - argument and return types of a function, as written in a signature type comment (not a function definition)
- interactive - collection of statements in an interactive session
- expression - simple expression
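The root type is selected by the mode argument of ast.parse. The sketch below parses one short string in each mode and prints the class of the resulting root node. It assumes Python 3.8 or later, since the func_type mode is not available in earlier versions, and the sample strings are my own.

```python
import ast

# Each call uses a different mode; the default mode is 'exec'.
module_root = ast.parse("x = 1")
expression_root = ast.parse("x + 1", mode="eval")
interactive_root = ast.parse("x = 1", mode="single")
func_type_root = ast.parse("(int, str) -> bool", mode="func_type")

for root in (module_root, expression_root, interactive_root, func_type_root):
    print(type(root).__name__)
```

Note that in func_type mode the input is a signature type comment rather than a function definition. An ordinary def statement parsed in the default mode appears as a statement inside a module.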
Figure 3 illustrates the AST for the simple Python for loop presented earlier. The root node is a module.
Figure 3: Example Python AST
Almost every Python AST I've encountered has a module as its root node. A module is made up of one or more statements, and most types of statements are made up of one or more expressions. The following discussions explore the topics of statements and expressions.
3.2.1 Statements
In the preceding AST, the module contains a single statement that represents a for loop. In addition to for loops, AST statements can represent function definitions, class definitions, while loops, if statements, return statements, and import statements.
Each statement node has one or more children, and the number and types of its children depend on the statement's type. For example, a function definition has at least four children: an identifier, arguments, a decorator list, and a set of statements that form its body. To see this, the following code parses Python code that defines a function named foo.
tree = ast.parse("def foo():\n\tprint('Hello!')")
print(ast.dump(tree, indent=4))
The second line creates the following string from the AST:
Module(
    body=[
        FunctionDef(
            name='foo',
            args=arguments(
                posonlyargs=[],
                args=[],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]),
            body=[
                Expr(
                    value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[
                            Constant(value='Hello!')],
                        keywords=[]))],
            decorator_list=[])],
    type_ignores=[])
Reading from left to right, it's clear that the root node is a module and its child is a function definition. The function definition has four children, and the child representing the body has one child because the function's body contains one line of code.
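The same children can be read directly from the tree instead of from the dumped string. The following sketch pulls the function definition out of the module's body and prints each of its four children; the variable name func is an arbitrary choice.

```python
import ast

tree = ast.parse("def foo():\n\tprint('Hello!')")
func = tree.body[0]          # the module's only statement

print(type(func).__name__)   # FunctionDef
print(func.name)             # foo
print(len(func.args.args))   # 0
print(len(func.body))        # 1
print(func.decorator_list)   # []
```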
Class definitions are particularly important, and each has five children: a name, zero or more base classes, zero or more keywords, zero or more statements, and zero or more decorators. Each method in a class is represented by a function definition statement.
To demonstrate this, consider the following simple class definition:
class Example:
    def __init__(self):
        self.prop = 4

    def printProp(self):
        print(self.prop)
The following code parses this class definition to obtain an AST.
tree = ast.parse("class Example:\n\tdef __init__(self):\n\t\tself.prop = 4\n\n\tdef printProp(self):\n\t\tprint(self.prop)")
Rather than print out the entire AST, Figure 4 illustrates its top-level nodes.
Figure 4: AST for a Class Definition
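To check the top-level nodes without printing the whole tree, a short sketch like the following can inspect the class definition node directly. It parses the same class and lists the name, the empty bases, keywords, and decorators, and the names of the methods found in the body.

```python
import ast

source = (
    "class Example:\n"
    "\tdef __init__(self):\n"
    "\t\tself.prop = 4\n"
    "\n"
    "\tdef printProp(self):\n"
    "\t\tprint(self.prop)"
)
tree = ast.parse(source)
class_node = tree.body[0]    # the module's only statement

# Each method appears in the class body as a function definition statement.
method_names = [
    stmt.name for stmt in class_node.body if isinstance(stmt, ast.FunctionDef)
]

print(class_node.name)                                                   # Example
print(class_node.bases, class_node.keywords, class_node.decorator_list)  # [] [] []
print(method_names)                                                      # ['__init__', 'printProp']
```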
Many statements, such as return statements and import statements, are very simple. But other statements, such as if statements and assignment statements, are composed of child structures called expressions. I'll discuss these next.
3.2.2 Expressions
We're all familiar with mathematical expressions like 2+2 and 8*9, but expressions in a Python AST are harder to pin down. There's no clear distinction between a statement and an expression, and in fact, an expression can be a statement. In a Python AST, an expression can take one of several different forms, including the following:
- binary, unary, and boolean operations
- comparisons involving values and containers
- function calls (not function definitions)
- containers (lists, tuples, dicts, sets)
- attributes, subscripts, and slices
- constants and names (strings)
The last bullet is important. Almost every leaf node in an AST will be a name or a constant, so it's important to distinguish between the two expressions. A name is an identifier, such as a function name, class name, or variable name. A constant is any value that isn't an identifier.
To see how expressions are parsed, it helps to look at an example. The following code parses a simple mathematical expression and prints its AST.
tree = ast.parse("(x+3)*5")
print(ast.dump(tree, indent=4))
The printed AST is given as follows:
Module(
    body=[
        Expr(
            value=BinOp(
                left=BinOp(
                    left=Name(id='x', ctx=Load()),
                    op=Add(),
                    right=Constant(value=3)),
                op=Mult(),
                right=Constant(value=5)))],
    type_ignores=[])
This module contains a single statement, and that statement is an expression. The expression consists of two binary operations: addition and multiplication. The variable x is identified by a Name node and the two numeric values are identified by Constant nodes.
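The distinction between names and constants is easy to check in code. As a rough sketch, the following walks the tree for the same expression and collects the identifiers, the constant values, and the operators. The results are sorted because ast.walk makes no promise about the order in which it yields nodes.

```python
import ast

tree = ast.parse("(x+3)*5")

# Leaf nodes: identifiers are Name nodes, literal values are Constant nodes.
names = sorted(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))
constants = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))

# Branch nodes: each BinOp stores its operator in the op property.
operators = sorted(type(n.op).__name__ for n in ast.walk(tree) if isinstance(n, ast.BinOp))

print(names)      # ['x']
print(constants)  # [3, 5]
print(operators)  # ['Add', 'Mult']
```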
3.3 AST Classes
Every node type in the Python AST has a corresponding class in the ast package. Modules are represented by instances of the Module class and expression statements are represented by Expr instances. Function definitions are represented by FunctionDef instances and class definitions are represented by ClassDef instances.
Every child of a node corresponds to a property of the corresponding class. In Figure 4, the class definition node has children named name, body, bases, keywords, and decorator list. To store this information, the ClassDef class has properties named name, body, bases, keywords, and decorator_list.
Every node class extends the central AST class. Each class has a _fields property, and the statement and expression nodes produced by the parser also carry properties that identify the node's location in the source:
_fields
- a tuple containing the names of the node's children (which correspond to class properties)
lineno
- first line number containing the node
end_lineno
- last line number containing the node
col_offset
- zero-based column offset at which the node begins
end_col_offset
- column offset at which the node ends
For example, the following code lists the children of an if statement:
print(ast.If._fields)
The printed output is ('test', 'body', 'orelse').
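The properties can also be read from a parsed node rather than from the class. The sketch below parses a small if statement of my own and prints the type of each child along with the node's location. The location properties used here are the ones defined by the standard ast module: lineno, col_offset, end_lineno, and end_col_offset.

```python
import ast

source = "if x > 0:\n    y = 1\nelse:\n    y = 2"
if_node = ast.parse(source).body[0]

# _fields names the children; getattr retrieves each one.
for field in if_node._fields:
    print(field, "->", type(getattr(if_node, field)).__name__)

# The parser records where the statement starts and ends in the source.
print(if_node.lineno, if_node.col_offset)          # 1 0
print(if_node.end_lineno, if_node.end_col_offset)  # 4 9
```

Line numbers start at 1 and column offsets start at 0, so the statement above begins at line 1, column 0 and ends on line 4.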
There isn't a lot of documentation on the node classes and their constructors. But you can see how a node is constructed by looking at the output of the dump function. To convert this output into a constructor, simply preface each node class with the ast prefix. For example, the following code relies on the preceding output to define an expression containing two binary operations:
# Inner operation: x + 3
firstOp = ast.BinOp(left=ast.Name(id='x', ctx=ast.Load()),
    op=ast.Add(), right=ast.Constant(value=3))
# Outer operation: (x + 3) * 5, wrapped in an expression statement
secondOp = ast.BinOp(left=firstOp, op=ast.Mult(), right=ast.Constant(value=5))
e = ast.Expr(value=secondOp)
Once you understand how to instantiate node classes, you can programmatically construct ASTs. Then you can generate Python code from an AST using the ASTOR package, which I'll discuss next.
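A constructed tree can also be executed directly. As a minimal sketch, the following code builds the tree for (x+3)*5 by hand, compiles it with Python's built-in compile function, and evaluates it with x set to 2. It assumes the tree is wrapped in an ast.Expression root, which is what compile expects in eval mode, and it calls ast.fix_missing_locations because hand-built nodes have no line or column information.

```python
import ast

# Build (x + 3) * 5 node by node, mirroring the dump output.
inner = ast.BinOp(left=ast.Name(id="x", ctx=ast.Load()),
                  op=ast.Add(),
                  right=ast.Constant(value=3))
outer = ast.BinOp(left=inner, op=ast.Mult(), right=ast.Constant(value=5))

# compile() needs an Expression root for mode="eval" and location info
# on every node; fix_missing_locations fills in the latter.
root = ast.fix_missing_locations(ast.Expression(body=outer))
code = compile(root, filename="<ast>", mode="eval")

result = eval(code, {"x": 2})
print(result)  # 25
```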
4. Using ASTOR
To augment the capabilities of the ast package, Berker Peksag released astor, which stands for AST Observe/Rewrite. If you have pip available, you can install astor with the command pip install astor. As of this writing, the current version is 0.8.1.
astor provides a number of useful classes and functions that simplify working with Python ASTs. Table 1 lists six of these functions and provides a description of each.
Table 1: Functions of the astor Package
Function | Description |
to_source(ast, indent_with=' '*4, add_line_information=False) | Convert an AST to Python code |
code_to_ast(codeobj) | Recompile a module into an AST and extract a sub-AST for the function |
parse_file(file) | Parse a Python file into an AST |
dump_tree(node, name=None, initial_indent='', indentation=' ', maxline=120, maxmerged=80) | Pretty print an AST with indentation |
strip_tree(node) | Recursively remove attributes from an AST |
iter_node(node, unknown=None) | Iterate over an AST node |
The first function, to_source, is particularly helpful because it accepts an AST (or a node) and returns the corresponding Python code as a string. To demonstrate this, the following code calls ast.parse to obtain an AST for a function definition. Then it calls astor.to_source to convert the AST to Python code.
import ast
import astor

tree = ast.parse("def foo():\n\tprint('Hello!')")
print(astor.to_source(tree))
The output of the second line is given as follows:
def foo():
    print('Hello!')
In this manner, a Python script can generate Python code programmatically. This can be very helpful when you need to translate text from one language into Python.
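Code generation becomes more interesting when the tree is modified before it is written out. The sketch below renames the function in the tree and then regenerates the source. To keep the example free of third-party dependencies, it uses ast.unparse, a standard library function available in Python 3.9 and later, in place of astor.to_source. The RenameFunction class is a hypothetical helper written for this example and is not part of either package.

```python
import ast

class RenameFunction(ast.NodeTransformer):
    """Rename every function called old_name to new_name (illustrative helper)."""

    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name

    def visit_FunctionDef(self, node):
        self.generic_visit(node)          # also process nested functions
        if node.name == self.old_name:
            node.name = self.new_name
        return node

tree = ast.parse("def foo():\n\tprint('Hello!')")
tree = RenameFunction("foo", "greet").visit(tree)

# ast.unparse requires Python 3.9 or later.
generated = ast.unparse(tree)
print(generated)
```

With astor installed, passing the modified tree to astor.to_source should serve the same purpose on older Python versions.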
5. History
- 22nd August, 2021: Initial publication
- 24th August, 2021: Added link to ASTOR