NetBase - A Minimal .NET Database with a Small SQL parser, Part 2

buttonpusher

4.93/5 (11 votes)

3 Nov 2009LGPL34 min read

31K

114

NetBase is a small database system that reads and writes to DBF-like files, with a SQL front-end. The second part of this series discusses the SQL parser.

Download source code from part 1 - 66.94 KB

Introduction

NetBase is a C# project that creates and reads DBF-like files. It supports a tiny subset of SQL to create, insert, and select data from a table. This small project has been an interesting crutch for writing a basic parser. The project also includes a small utility that provides a front-end.

See the first part of this two-part series for more information:

Introduction and a look at the DBF file format (the storage namespace)
The SQL parser and query builder (the sql namespace) - this article

This code is also hosted in github at http://github.com/buttonpusher/NetBase/.

Before we start, to follow through the examples, you should open up the NetBase.Viewer application and run the following SQL statements to set up a table ready for the examples below.

SQL

CREATE TABLE part2 (first,second,third)
INSERT INTO part2 (first,second,third) VALUES (1,2,3)
INSERT INTO part2 (first,second,third) VALUES (4,5,6)
INSERT INTO part2 (first,second,third) VALUES (7,8,9)
INSERT INTO part2 (first,second,third) VALUES (4,10,11)

Parsing Basics

The process of parsing involves building a data structure that represents the meaning of a text string. In the NetBase application, this data structure then determines what actions need to be made on the database to give the desired results. The parser in NetBase uses techniques very similar to those described by Jack Crenshaw's "Let's Build a Compiler". This type of parser is called a "recursive descent parser". There is a lot of theory about parsing and compiling; however, we can take a quick overview at the basics behind this kind of parsing, in this case, in the context of SQL.

The idea is that each SQL statement can be described as a tree structure. The root node of the tree encompasses the whole string:

SQL

SELECT third FROM part2 WHERE first = 4

The first layer of the tree divides the whole string into meaningful divisions. In the case of NetBase, this string would be divided into two branches of the tree, the SELECT and FROM clauses:

SQL

SELECT third FROM part2

and the WHERE clause:

SQL

WHERE first = 4

The idea of a recursive descent parser is that each node in the tree - subdivision of the string - is parsed by a function, which calls other functions that parse the child nodes of the tree. This is the descending aspect. The recursive aspect is simply that functions may eventually call themselves, although it is worth noting that this does not happen in NetBase's simplistic version of SQL.

The code involved in parsing our SELECT statement in NetBase is as follows - it has been reorganized to better illustrate this particular example.

private void Expression()
{
    if (tokenizer.Token == "SELECT")
    {
        SelectExpression();
    }
    else if (tokenizer.Token == "INSERT")
    {
        InsertExpression();
    }
    else if (tokenizer.Token == "CREATE")
    {
        CreateExpression();
    }
    else
    {
        throw new QuerySyntaxException("Syntax Error: Expected \"select\";" + 
                                       " \"insert\"; or \"create table\" clause");
    }
}

private void SelectExpression()
{
    tokenizer.Match("SELECT");
    Query = new SelectQuery();

    FieldList();

    tokenizer.Match("FROM");

    Query.TableName = tokenizer.Token;
    tokenizer.GetToken();


    if (tokenizer.Token == "JOIN")
    {
        JoinClause();
    }

    if (tokenizer.Token == "WHERE")
    {
        WhereClause();
    }

}

private void FieldList()
{
    if (tokenizer.Token == "*")
    {
        tokenizer.Match("*");
        ((SelectQuery)this.Query).SelectAll = true;
        return;
    }

    do
    {
        if (tokenizer.Token == ",") tokenizer.Match(",");
        ((SelectQuery)this.Query).Fields.Add(tokenizer.Token);
        tokenizer.GetToken();
    } while (tokenizer.Token == ",");
}

private void WhereClause()
{
    tokenizer.Match("WHERE");
    
    do
    {
        if (tokenizer.Token == "AND") tokenizer.Match("AND");

        string Field = tokenizer.Token;
        tokenizer.GetToken();
        tokenizer.Match("=");
        string value = tokenizer.Token;
        Where w = new Where();
        w.Field = Field;
        w.Value = value;
        ((SelectQuery)this.Query).Wheres.Add(w);
        tokenizer.GetToken();
    } while (tokenizer.Token == "AND");
}

It is possible to see from this code snippet that, for example, the WHERE clause supports only AND and not OR. Note that the JOIN functionality is incomplete.

It may be useful to step through this in a debugger to see how the internal representation of the query is constructed. The QueryBuilder class is very large, and it would definitely be possible to improve the design of this part of the database.

Using the Parsed Representation to Run the Query

The parsed representation of the query above is now wholly contained in an instance of SelectQuery. From here, we can determine what is required to return the right results. This is done by the Selector class in the executive namespace.

The Execute function of this class starts by allocating a Storage.MemoryTable instance to store the results of the query. It then constructs the set of columns required from the Fields function (which contains the set of fields in the SELECT clause of the original query string).

MemoryTable nt = new MemoryTable();

ITable t = db.Tables[Query.TableName];

// We're going to start by building a column
// list for our new tables.
if (Query.SelectAll)
{
    // @TODO these should be deep copies
    foreach (Column c in t.Columns)
    {
        nt.Columns.Add(c);
    }

    foreach (Join j in Query.Joins)
    {
        ITable jt = db.Tables[j.TableName];
        foreach (Column c in jt.Columns)
        {
            nt.Columns.Add(c);
        }
    }
}
else
{
    // Uh-oh, we haven't implemented support
    // for joins when using named fields.
    foreach (string s in Query.Fields)
    {
        Column c = t.Columns.Find(o => o.Name == s);
        nt.Columns.Add(c);
    }
}

The next step is to move through the rows of the source table, using the Filter method of the sql.Where class to determine which rows should be included. If a row should be included, then the contents of the appropriate columns are copied into a new row, which is then added to the MemoryTable.

// "High performance NetBase" is a book yet to
// be written. We're going to do a full table
// seek every time, and we don't have any support
// for indexes.
t.Reset();
while (t.HasRows)
{
    Row r = t.NextRow();
    bool pass = true;
    foreach (Where w in Query.Wheres)
    {
        if (!w.Filter(r))
        {
            pass = false;
            break;
        }
    }

    if (pass)
    {
        Row nr = new Row(nt);
        nr.CopyData(r, nt.Columns);

        nt.AddRow(nr);
    }

Note that in the above snippet, I have excised a section of unfinished code which attempts to implement joins, because this functionality is incomplete and irrelevant to this article.

The MemoryTable nt now contains the results of the query, and can be returned up through the API to the front-end, or perhaps in a later project, sent across the network to a client application. The results of the query as displayed in the NetBase.Viewer application will be:

third

The comments about indexes are referring to the inefficient way in which every row in the table is checked to see if it matches the where clause. Modern RDBMSs use - amongst other techniques - data structures like b-trees to efficiently seek for matching rows, and therefore do not always have to scan through the entire table.

Conclusion

There are a lot of additional subprojects that could be interesting to work on with NetBase.

Implementing the join functionality
Implementing a network layer
Enabling concurrent access (locking)
Adding b-trees, indexes, and other performance optimisations

Perhaps at some point, I or someone else will undertake these projects; they'd be interesting by themselves.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3)