Preface
Is there an area where a new programming language developed from scratch can be applied? For the first glance, there are no requirements which cannot be covered by the existing languages. However, such areas exist, e.g., systems where parallel calculation is required. It is remarkable that the computer society has been looking at a language to support grammar parsers itself! BNF (Backus Naur Form) notation and its derivatives (ABND, EBNF, etc.) are widely used for formal specifications of programming languages.
Although such specifications are really good for both requirements and design, they do not help for parser programming. Another approach is to generate code for parsers. The most famous tool is yacc (Yet Another Compiler Compiler) the precompile routine which can generate C parser code from syntactic definition in BNF-like form. This solution is considered as heavy and the software community has been looking for something easier.
For C++, the approach is obvious: a lite solution should be presented as embedded domain specific language (EDSL). It means BNF-like statements have to be mapped to both C++ keywords and overloaded operators. One known example is boost::spirit
template library. This paper introduces the BNFLite C++ template library and demonstrates how to create the simplest byte-code formula compiler using this tool.
Introduction to BNFLite
The BNFLite implements embedded domain specific language approach for grammar specifications when all "production rules" are constructed as instances of C++ classes concatenated by means of overloaded arithmetic operators. For example, the integer number in BNF specs can be described as:
<digit> ::= <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>
<number> ::= <digit> | <digit> <number>
This representation can be written in C++ using BNFlite approach:
#include "bnflite.h" // include source code lib
using namespace bnf;
Lexem Digit = Token("0") | "1" | "2" | "4" | "5" | "6" | "7" | "8" | "9";
Lexem Number;
Number = Digit | Digit + Number;
Execution of this C++ code produces an internal date structure for further parsing procedures.
Requirements for Formula Compiler
Let's specify base requirements to the language of the formula compiler to be developed.
- Byte-code compiler shall be founded on expression like language:
- The following base types shall be supported:
<INTEGER>
| <FLOAT>
| <STRING>
- The compiler shall support unary negative operation and basic binary arithmetic operations
- The compiler shall be able to generate code to call C++ functions defined in body program
This is enough for design of simple programming language, which can be described in Backus-Naur Form.
Formal BNF term is called "production rule". Each rule except "terminal" is a conjunction of a series of more concrete rules or terminals: production_rule ::= <rule_1>...<rule_n> | <rule_n_1>...<rule_m>
. So, the designed expression-like language can be presented by the following terms:
<operand> ::= <INTEGER> | <FLOAT> | <STRING>
<operator> ::= "+" | "-" | "*" | "/"
<arguments> ::= <EMPTY> | <expression> | < arguments> "," <expression>
<function> ::= <IDENTIFIER> "(" < arguments > ")"
<expression> ::= <operand> | <function> | "-"
<expression> | <expression> <operator> <expression> | "(" <expression> ")"
Such decryption is formally correct but it has two known issues. Firstly, binary operator precedence is not presented. Secondly, it contains redundant number of recursions which can be substituted by repetitions.
Augmented BNF specifications introduce constructions like <a>*<b><element>
to support repetition where <a>
and <b>
imply at least <a>
and at most <b>
occurrences of the element. For example, 3*3<element>
allows exactly three and 1*2<element>
allows one or two. *<element>
simplified construction allows any number (from 0 to infinity). Alternatively, 1*<element>
encloses an optional element (at least one) and it can be optionally presented as [element]
.
Full specifications of the formula compiler language in ABNF form are not so complex:
ALPHA = 'A'-'Z' | 'a'-'z'
DIGIT = '0'-'9'
digit1_9 = '1'-'9'
string = '\"' *( ALPHA | DIGIT | '_' ) '\"'
e = 'e' | 'E'
exp = e ['-' | '+'] 1*DIGIT
frac = '.' 1*DIGIT
int = '0' | ( digit1_9 *DIGIT )
number = [ '-' ] int [ frac ] [ exp ]
operand = int | number | string
operator = '+' | '-' | '*' | '/'
identifier = (ALPHA | '_') *( ALPHA | DIGIT | '_' )
arguments ::= expression *( ',' expression )
function ::= identifier '(' [arguments] ')'
elementary ::= operand | function | ('-' elementary) | ('(' expression ')')
primary = elementary *(('/' elementary) | ('*' elementary))
expression = primary *(('-' primary) | ('+' primary))
These specs are more computer friendly than previous ones.
Short Design Notes
The simple byte-code compiler is a minor extension of the expression calculator. For example, the expression 2 + 3 * 4
can be converted to tree.
└──+
├── 2
└──*
├── 3
└── 4
It can be written on C manner “add(2, mul(3, 4))
” form. Let's write it reversely: ” (2,(3,4)mul)add
”. This form is called as reverse polish notation (RPN) and byte-codes can be easily generated:
Int(2) Int(3), Int(4), Mul<I,I>, Addl<I,I>
Int
(number) operation pushes the number to the stack. Mul
/Add
operation pops two parameters from the stack and pushes the result.
Practical benefit of the formula compiler comes only when functions are used. Let's consider “liner map” 2 + 3 *GetX()
. The byte-code will be:
Int(2) Int(3), Call<0>, Mul<I,I>, Addl<I,I>
For example, this functionally can be cyclically applied to X database column to obtain Y column (Moreover, the provided demo does four calculations simultaneously, but it is rather the scope of another paper).
All byte-codes consist of two fields: type and union for integer, float and pointer to string.
struct byte_code
{
OpCode type;
union {
int val_i;
float val_f;
const char* val_s;
};
byte_code(): type(opNop), val_i(0) {};
The union represents the operand for the simple “immediate addressing mode”. The type is the operation code based upon the following enumerator:
enum OpCode
{
opFatal = -1, opNop = 0,
opInt = 1, opFloat = 2, opStr = 3, opMaskType = 0x03,
opError = 1, opNeg = 2, opPos = 3, opCall = 4,
opToInt = 5, opToFloat = 6, opToStr = 7,
opAdd = 2, opSub = 3, opMul = 4, opDiv = 5,
};
The real operation code is a little more complex. It contains types for stack operands.
BNFLite Grammar classes (Token, Lexem and Rule)
The BNFLite description is similar to EBNF specifications above.
Number
Token digit1_9('1', '9');
Token DIGIT("0123456789");
Lexem i_digit = 1*DIGIT;
Lexem frac_ = "." + i_digit;
Lexem int_ = "0" | digit1_9 + *DIGIT;
Lexem exp_ = "Ee" + !Token("+-") + i_digit;
Lexem number_ = !Token("-") + int_ + !frac_ + !exp_;
Rule number = number_;
The Token
class represents terminals, which are symbols. The Lexem
constructs string
s of symbols. Parsing is a process of analyzing tokens and lexemes of the input stream. The unary C operator `*
` means the construction can be repeated from 0 to infinity. The binary C operator `*
` means the construction like 1*DIGIT
can be repeated from 1 to infinity. The unary C operator `!
` means the construction is optional.
Practically, Rule
is similar to Lexem
except for callbacks and spaces sensitivity. The construction Rule number_ = !Token("-") + int_ + !frac_ + !exp_;
allows incorrect spaces, e.g., between integer and fractional parts.
Strings and identifiers
Token az_("_"); az_.Add('A', 'Z'); az_.Add('a', 'z');
Token az01_(az_); az01_.Add('0', '9');
Token all(1,255); all.Remove("\"");
Lexem identifier = az_ + *(az01_);
Lexem quotedstring = "\"" + *all + "\"";
The Token all
represents all symbols except quote;
Major parsing rules
Rule expression;
Rule unary;
Rule function = identifier + "(" + !(expression + *("," + expression)) + ")";
Rule elementary = AcceptFirst()
| "(" + expression + ")"
| function
| number
| quotedstring + printErr
| unary;
unary = Token("-") + elementary;
Rule primary = elementary + *("*%/" + elementary);
expression = primary + *("+-" + primary);
Expression
and unary
rules are recursive. They need to be declared earlier. The AcceptFirst()
statement changes parser behavior from the default mode to “accept best” for this particular Rule
. After that, the parser chooses a first appropriate construction instead of the most appropriate one.
Parser Calls and Callbacks
Recursive descent parser implies tree composition. The user needs a means to track the tree traversal recursion. First of all, the interface object structure to interact with the parser should be specified.
typedef Interface< std::list<byte_code> > Gen;
const char* tail = 0;
Gen result;
int tst = Analyze(expression, expr.c_str(), &tail, result);
Parsing starts by the Analyze
call with the following arguments:
expression
– root Rule object of BNFlite gramma expr.c_str()
– string of the expression to be compiled to byte-code tail
– pointer to end of text in case of successful parsing or to place where parsing is stopped due to error result
– final object from users callbacks tst
– contains result of parsing implemented as bitwise flags, error if negative
The Callback function accepts vector of byte-code lists formed early. It returns the list of new formed byte-codes. For example, Gen DoNumber(std::vector<Gen>& res)
accepts the one element vector representing a number. It has to return the result Gen
object with filled user's data
(either Int<>
or Float<>
).
In common case, the callback Gen DoBinary(std::vector<Gen>& res)
accepts the left byte-code vector, the sign of operation(+**/
), and the right byte-code vector. The callback just joins left and right byte-code vectors and generates the tail byte-code according to the sign.
Comparison with Boost::Spirit Implementation
This is a grammar part implemented by means of boost::spirit
template library:
namespace q = boost::spirit::qi;
typedef boost::variant<
int, float, std::string,
boost::recursive_wrapper<struct binary_node>,
boost::recursive_wrapper<struct unary_node>,
boost::recursive_wrapper<struct call_node>
> branch_t;
template <typename Iterator>
struct SParser: q::grammar<Iterator, branch_t, q::space_type>
{
q::rule<Iterator, branch_t, q::space_type> expression, primary, elementary, unary;
q::rule<Iterator, std::string()> quotedstring;
q::rule<Iterator, std::string()> identifier;
q::rule<Iterator, std::vector<branch_t>, q::space_type> arglist;
q::rule<Iterator, call_node, q::space_type> function;
SParser() : SParser::base_type(expression)
{
using boost::phoenix::construct;
expression = primary[q::_val = q::_1]
>> *('+' > primary[q::_val = construct<binary_node>(q::_val, q::_1, '+')]
| '-' > primary[q::_val = construct<binary_node>(q::_val, q::_1, '-')] );
primary = elementary[q::_val = q::_1]
>> *('*' > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '*')]
| '/' > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '/')] );
unary = '-' > elementary[q::_val = construct<unary_node>(q::_1, '-')];
elementary = q::real_parser<float, q::strict_real_policies<float>>()[q::_val = q::_1]
| q::int_[q::_val = q::_1]
| quotedstring[q::_val = q::_1]
| ('(' > expression > ')')[q::_val = q::_1]
| function[q::_val = q::_1]
| unary[q::_val = q::_1];
quotedstring = '"' > *(q::char_ - '"') > '"';
identifier = (q::alpha | q::char_('_')) >> *(q::alnum | q::char_('_'));
arglist = '(' > (expression % ',')[q::_val = q::_1] > ')';
function = (identifier > arglist)[q::_val = construct<call_node>(q::_1, q::_2)];
on_error(expression, std::cout << boost::phoenix::val("Error "));
}
};
Please, refer to boost::spirit
documentation for details. The example is provided for demonstration similarity of boost::spirit
grammar to BNFlite one.
Boost::spirit
is a more mature tool and does not use callbacks. For each “rule” it calls constructors of classes which should be member of boost::variant
. Therefore, boost::spirit
cannot be used for one-pass compilers. However, boost::variant
can be effective if something more complex is required (e.g. byte-code optimizers).
Problems of boost::spirit
lie in another area. Any inclusion of boost::spirit
considerably increases project compiler time. Another significant drawback of boost::spirit
is related to run-time debugging of the grammar - it is not possible at all!
Debugging of BNFLite Grammar
Writing grammar by EDSL is unusual and the user does not have full understanding about the parser. If the Analyze
call returns an error for the correct text, then the user always should take into consideration the possibility of grammar bugs.
Return code
Return code from Analyze call can contain flags related to the grammar. For example, eBadRule
, eBadLexem
flags mean the tree of rules is not properly built.
Names and breakpoints
The user can assign a name to the Rule
. It can help to track recursive descent parser using Rule::_parse
function. Debugger stack (history of function calls) can inform which Rule was applied and when. The user just needs to watch the this.name
variable. It is not as difficult as it seems at first glance.
Grammar subsets
Analyze
function can be applied as unit test to any Rule representing subset of grammar.
Tracing
Function with prototype “bool foo(const char* lexem, size_t len)
” can be used in BNFLite expressions for both reasons: to obtain temporary results and to inform about predicted errors.
This function will print the parsed number:
static bool DebugNumber(const char* lexem, size_t len)
{ printf("The number is: %.*s;\n", len, lexem); return true; }
Rule number = number_ + DebugNumber;
The function should return true
if result is correct.
Let's assume the numbers with leading ‘+
’ are not allowed.
static bool ErrorPlusNumber(const char* lexem, size_t len)
{printf("The number %.*s with plus is not allowed\n", len, lexem); return false;}
Lexem number_p = !Token("+") + int_ + !frac_ + !exp_ + ErrorPlusNumber;
Rule number = number_ | number_p;
The function should return false
to inform the parser the result is not fit. C++11 constructions like below are also possible:
Rule number = number_ | [](const char* lexem, size_t len)
{ return !printf("The number %.*s with plus is not allowed\n", len, lexem); }
Building and Run
The attached simplest formula compiler package has the following C++ files:
- main.cpp - starter of byte-code compiler and interpreter
- parser.cpp - BNFLite parser with grammar section and callbacks
- code_gen.cpp - byte-code generator
- code_lib.cpp - several examples of built-in functions (e.g POW(2,3) - power: 2*2*2)
- code_run.cpp - byte-code interpreter (used SSE2 for parallel calculation of 4 formulas)
The package has several prebuilt projects but generally it can be built as:
>$ g++ -O2 -march=pentium4 -std=c++14 -I.. code_gen.cpp parser.cpp code_lib.cpp main.cpp code_run.cpp
Function GetX()
returns four values: 0, 1, 2, 3
. It can be used in expression to be compiled and calculated:
> $ a.exe "2 + 3 *GetX()"
5 byte-codes in: 2+3*GetX()
Byte-code: Int(2),Int(3),opCall<I>,opMul<I,I>,opAdd<I,I>;
result = 2, 5, 8, 11;
Conclusion
The presented formula compiler is the realization of several ideas in order to demonstrate its feasibility. First of all, it is parallel computation according to compiled formula. Secondly, it introduces BNFLite header library to spread existing concept of applicability of BNF forms. Now a fast parser implementation can be easily created for specific customer requirements where BNF-like language is used.
References