Introduction
Converting EBNF to BNF by hand is not difficult, but it soon becomes tedious after a few iterations. Surely there must be a better way? As it turns out, yes, there is!
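As a taste of what the conversion involves, an EBNF repetition such as {',' item} has no direct BNF equivalent and must be hoisted into a new left-recursive rule (the generated rule names here follow the lhs_n scheme produced by the code in this article):

```
EBNF:
list = item {',' item};

BNF:
list: item list_1;
list_1: %empty | list_1 ',' item;
```

The optional form [x] maps to a rule with an %empty alternative in the same way, and one-or-more forms map to left recursion without the %empty alternative.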
Background
Having whipped parsertl into shape this year, I wanted to try out some real-world grammars. As we use IronPython as part of our product at work, Python seemed like an ideal candidate. The grammar is freely available online, but it is described using EBNF.
Using the code
In order to use the following code you will need both parsertl and lexertl:
http://www.benhanson.net/parsertl.html
http://www.benhanson.net/lexertl.html
#include <cassert>
#include <cstring>
#include <iostream>
#include <map>
#include <stack>
#include <stdexcept>
#include <string>
#include "../parsertl17/include/parsertl/generator.hpp"
#include "../parsertl17/include/parsertl/lookup.hpp"
#include "../lexertl17/include/lexertl/iterator.hpp"
#include "../lexertl17/include/lexertl/memory_file.hpp"
void read_ebnf(const char *start_, const char *end_str_)
{
    using token = parsertl::token<lexertl::citerator>;
    parsertl::rules grules_;
    lexertl::rules lrules_;
    lexertl::state_machine lsm_;
    // State shared by the semantic actions below
    struct tdata
    {
        parsertl::state_machine _sm;
        parsertl::match_results _results;
        token::token_vector _productions;
        // Name of the rule currently being converted
        std::string _lhs;
        // Right hand sides under construction
        std::stack<std::string> _rhs;
        // Per-lhs counter used to name generated rules
        std::map<std::string, std::size_t> _new_rule_ids;
        // Generated rules waiting to be output
        std::stack<std::pair<std::string, std::string>> _new_rules;
        // (lhs, rhs) pairs already generated, for de-duplication
        std::map<std::pair<std::string, std::string>, std::string> _seen;
    } data_;
    std::map<std::size_t, void (*)(tdata &)> actions_map_;

    grules_.token("IDENTIFIER LHS TERMINAL");
    grules_.push("start", "grammar");
    grules_.push("grammar", "rule "
        "| grammar rule");
    // Output the finished rule followed by any rules generated for it
    actions_map_[grules_.push("rule", "lhs rhs_or opt_semi")] =
        [](tdata &data_)
        {
            assert(data_._rhs.empty() || data_._rhs.size() == 1);
            std::cout << data_._lhs << ": ";

            if (!data_._rhs.empty())
            {
                std::cout << data_._rhs.top();
                data_._rhs.pop();
            }

            std::cout << ";\n";

            while (!data_._new_rules.empty())
            {
                std::cout << data_._new_rules.top().first;
                std::cout << ": ";
                std::cout << data_._new_rules.top().second;
                std::cout << ";\n";
                data_._new_rules.pop();
            }
        };
    grules_.push("opt_semi", "%empty | ';'");
    // Store the rule name, trimming whitespace before the ':' or '='
    actions_map_[grules_.push("lhs", "LHS")] =
        [](tdata &data_)
        {
            const token &token_ =
                data_._results.dollar(0, data_._sm, data_._productions);
            const char *end_ = token_.second - 1;

            while (strchr(" \t\r\n", *(end_ - 1))) --end_;

            data_._lhs.assign(token_.first, end_);
        };
    grules_.push("rhs_or", "opt_list");
    // Join alternatives back together with '|'
    actions_map_[grules_.push("rhs_or", "rhs_or '|' opt_list")] =
        [](tdata &data_)
        {
            const token &token_ =
                data_._results.dollar(1, data_._sm, data_._productions);
            std::string r_ = token_.str() + ' ' + data_._rhs.top();

            data_._rhs.pop();

            if (data_._rhs.empty())
            {
                data_._rhs.push(r_);
            }
            else
            {
                data_._rhs.top() += ' ' + r_;
            }
        };
    grules_.push("opt_list", "%empty | rhs_list");
    grules_.push("rhs_list", "rhs");
    // Concatenate the items of a sequence
    actions_map_[grules_.push("rhs_list", "rhs_list opt_comma rhs")] =
        [](tdata &data_)
        {
            std::string r_ = data_._rhs.top();

            data_._rhs.pop();
            data_._rhs.top() += ' ' + r_;
        };
    grules_.push("opt_comma", "%empty | ','");
    actions_map_[grules_.push("rhs", "IDENTIFIER")] =
        [](tdata &data_)
        {
            const token &token_ =
                data_._results.dollar(0, data_._sm, data_._productions);

            data_._rhs.push(token_.str());
        };
    actions_map_[grules_.push("rhs", "TERMINAL")] =
        [](tdata &data_)
        {
            const token &token_ =
                data_._results.dollar(0, data_._sm, data_._productions);

            data_._rhs.push(token_.str());
        };
    // [x] (optional): generate "lhs_n: %empty | x;" and refer to lhs_n,
    // reusing an identical previously generated rule where possible
    actions_map_[grules_.push("rhs", "'[' rhs_or ']'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            pair_.second = "%empty | " + data_._rhs.top();

            auto pair2_ = std::pair(data_._lhs, pair_.second);
            auto iter_ = data_._seen.find(pair2_);

            if (iter_ == data_._seen.end())
            {
                ++counter_;
                pair_.first = data_._lhs + '_' + std::to_string(counter_);
                data_._rhs.top() = pair_.first;
                data_._new_rules.push(pair_);
                data_._seen[pair2_] = pair_.first;
            }
            else
            {
                data_._rhs.top() = iter_->second;
            }
        };
    // x? is equivalent to [x]
    actions_map_[grules_.push("rhs", "rhs '?'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            ++counter_;
            pair_.first = data_._lhs + '_' + std::to_string(counter_);
            pair_.second = "%empty | " + data_._rhs.top();
            data_._rhs.top() = pair_.first;
            data_._new_rules.push(pair_);
        };
    // {x} (zero or more): generate the left-recursive rule
    // "lhs_n: %empty | lhs_n x;" and refer to lhs_n
    actions_map_[grules_.push("rhs", "'{' rhs_or '}'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            ++counter_;
            pair_.first = data_._lhs + '_' + std::to_string(counter_);
            pair_.second = "%empty | " + pair_.first + ' ' + data_._rhs.top();
            data_._rhs.top() = pair_.first;
            data_._new_rules.push(pair_);
        };
    // x* is equivalent to {x}
    actions_map_[grules_.push("rhs", "rhs '*'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            ++counter_;
            pair_.first = data_._lhs + '_' + std::to_string(counter_);
            pair_.second = "%empty | " + pair_.first + ' ' + data_._rhs.top();
            data_._rhs.top() = pair_.first;
            data_._new_rules.push(pair_);
        };
    // {x}- (one or more): generate "lhs_n: x | lhs_n x;"
    actions_map_[grules_.push("rhs", "'{' rhs_or '}' '-'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            ++counter_;
            pair_.first = data_._lhs + '_' + std::to_string(counter_);
            pair_.second = data_._rhs.top() + " | " +
                pair_.first + ' ' + data_._rhs.top();
            data_._rhs.top() = pair_.first;
            data_._new_rules.push(pair_);
        };
    // x+ is equivalent to {x}-
    actions_map_[grules_.push("rhs", "rhs '+'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            ++counter_;
            pair_.first = data_._lhs + '_' + std::to_string(counter_);
            pair_.second = data_._rhs.top() + " | " +
                pair_.first + ' ' + data_._rhs.top();
            data_._rhs.top() = pair_.first;
            data_._new_rules.push(pair_);
        };
    // (x): hoist the group into its own rule, de-duplicating as for [x]
    actions_map_[grules_.push("rhs", "'(' rhs_or ')'")] =
        [](tdata &data_)
        {
            std::size_t &counter_ = data_._new_rule_ids[data_._lhs];
            std::pair<std::string, std::string> pair_;

            pair_.second = data_._rhs.top();

            auto pair2_ = std::pair(data_._lhs, pair_.second);
            auto iter_ = data_._seen.find(pair2_);

            if (iter_ == data_._seen.end())
            {
                ++counter_;
                pair_.first = data_._lhs + '_' + std::to_string(counter_);
                data_._rhs.top() = pair_.first;
                data_._new_rules.push(pair_);
                data_._seen[pair2_] = pair_.first;
            }
            else
            {
                data_._rhs.top() = iter_->second;
            }
        };
    parsertl::generator::build(grules_, data_._sm);

    // Lexer rules for the EBNF tokens
    lrules_.insert_macro("NAME", "[A-Za-z][_0-9A-Za-z]*");
    lrules_.push("{NAME}", grules_.token_id("IDENTIFIER"));
    // A name followed by ':' or '=' introduces a rule
    lrules_.push("{NAME}\\s*[:=]", grules_.token_id("LHS"));
    lrules_.push(",", grules_.token_id("','"));
    lrules_.push(";", grules_.token_id("';'"));
    lrules_.push("\\[", grules_.token_id("'['"));
    lrules_.push("\\]", grules_.token_id("']'"));
    lrules_.push("[?]", grules_.token_id("'?'"));
    lrules_.push("[{]", grules_.token_id("'{'"));
    lrules_.push("[}]", grules_.token_id("'}'"));
    lrules_.push("[*]", grules_.token_id("'*'"));
    lrules_.push("[(]", grules_.token_id("'('"));
    lrules_.push("[)]", grules_.token_id("')'"));
    lrules_.push("[|]", grules_.token_id("'|'"));
    lrules_.push("[+]", grules_.token_id("'+'"));
    lrules_.push("-", grules_.token_id("'-'"));
    // Single or double quoted terminals, including escape sequences
    lrules_.push("'(\\\\([^0-9cx]|[0-9]{1,3}|c[@a-zA-Z]|x\\d+)|[^'])+'|"
        "[\"](\\\\([^0-9cx]|[0-9]{1,3}|c[@a-zA-Z]|x\\d+)|[^\"])+[\"]",
        grules_.token_id("TERMINAL"));
    // Skip '#' comments, whitespace and '(* ... *)' comments
    lrules_.push("#[^\r\n]*|\\s+|[(][*](.{+}[\r\n])*?[*][)]", lrules_.skip());
    lexertl::generator::build(lrules_, lsm_);

    lexertl::citerator iter_(start_, end_str_, lsm_);
    lexertl::citerator end_;

    data_._results.reset(iter_->id, data_._sm);
    std::cout << "%%\n";

    // Parse, calling the matching action on every reduction
    while (data_._results.entry.action != parsertl::action::error &&
        data_._results.entry.action != parsertl::action::accept)
    {
        if (data_._results.entry.action == parsertl::action::reduce)
        {
            auto i_ = actions_map_.find(data_._results.entry.param);

            if (i_ != actions_map_.end())
            {
                i_->second(data_);
            }
        }

        parsertl::lookup(iter_, data_._sm, data_._results, data_._productions);
    }

    if (data_._results.entry.action == parsertl::action::error)
        throw std::runtime_error("Syntax error");
    else
        std::cout << "%%\n";
}
int main()
{
    try
    {
        lexertl::memory_file mf_("grammars/python/python.ebnf");

        if (!mf_.data())
            throw std::runtime_error("Unable to open grammars/python/python.ebnf");

        read_ebnf(mf_.data(), mf_.data() + mf_.size());
    }
    catch (const std::exception &e)
    {
        std::cout << e.what() << '\n';
    }
}
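Each EBNF operator boils down to a small string rewrite inside the matching semantic action. As a standalone sketch (closure_rule is a name invented here for illustration; it is not part of the program above), the zero-or-more case behaves like this:

```cpp
#include <map>
#include <string>
#include <utility>

// Sketch of the rewrite performed for EBNF '{ x }' (or 'x *'): replace the
// repeated item with a fresh rule "<lhs>_<n>" whose body is left-recursive,
// as the '{' rhs_or '}' action in read_ebnf() builds it.
std::pair<std::string, std::string> closure_rule(const std::string &lhs_,
    const std::string &item_, std::map<std::string, std::size_t> &ids_)
{
    // Bump the per-lhs counter to name the generated rule
    const std::string name_ = lhs_ + '_' + std::to_string(++ids_[lhs_]);

    return { name_, "%empty | " + name_ + ' ' + item_ };
}
```

Calling closure_rule("list", "',' item", ids_) yields the rule name list_1 paired with the body %empty | list_1 ',' item, which is exactly the shape of the generated rules shown in the converter's output.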
Points of Interest
Note that grammars written in EBNF often use multi-character literals. parsertl accepts these because it assigns token ids automatically, but if you want to use the resulting BNF grammar with yacc or bison, you will have to convert them to named tokens by hand.
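For example, given converter output containing a quoted multi-character operator, a bison version needs a declared token in its place (the token name LE below is my own choosing):

```
Converter output:
comparison_1: %empty | comparison_1 '<=' expr;

Hand-converted for yacc/bison:
%token LE
comparison_1: %empty | comparison_1 LE expr;
```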
Note that most EBNF grammars actually describe LL grammars, and whilst LL is a subset of LR, it is not a subset of LALR. This is true of the Python grammar mentioned at the beginning. Fortunately, a little manual intervention resolves the warnings reported when the converted grammar is run through parsertl or bison.
First, add the token declarations:
%token DEDENT ENDMARKER INDENT NAME NEWLINE NUMBER STRING
Remove or comment out the following rules:
single_input
eval_input
eval_input_1
encoding_decl
Change
import_from_4: import_from_2 dotted_name | import_from_3;
to
import_from_4: dotted_name | import_from_3 dotted_name | import_from_3;
and remove rule import_from_2.
Ideally, I would like such conversions to be automated too. That will take more research; if it proves not to be reasonably achievable, it may well be worth supporting LL(1) in addition to LALR(1) in parsertl, given LL's popularity with modern real-world grammars such as Python's.
History
01/04/16 First Release.
04/04/16 Now copes with empty rules.
05/04/16 Now showing manual conversion of BNF for Python to be LALR(1) compatible (without warnings).
28/09/16 Fixed ambiguous grammar by introducing LHS token. Introduced lambda map for actions.
22/03/18 Switched .|\n to .{+}[\r\n] to include \r for Windows users.
30/04/19 Corrected #include paths.
06/05/19 Now auto de-duplicating rules for [] and (). Updated Python section to latest version.
21/01/23 Updated to latest parsertl interfaces.
15/02/24 Updated to lexertl17 and parsertl17.