Introduction
NetBase
is a C# project that creates and reads DBF-like files. It supports a tiny subset of SQL to create, insert, and select data from a table. This small project has been an interesting crutch for writing a basic parser. The project also includes a small utility that provides a front-end.
See the first part of this two-part series for more information:
- Introduction and a look at the DBF file format (the storage namespace)
- The SQL parser and query builder (the
sql
namespace) - this article
This code is also hosted in github at http://github.com/buttonpusher/NetBase/.
Before we start, to follow through the examples, you should open up the NetBase.Viewer
application and run the following SQL statements to set up a table ready for the examples below.
CREATE TABLE part2 (first,second,third)
INSERT INTO part2 (first,second,third) VALUES (1,2,3)
INSERT INTO part2 (first,second,third) VALUES (4,5,6)
INSERT INTO part2 (first,second,third) VALUES (7,8,9)
INSERT INTO part2 (first,second,third) VALUES (4,10,11)
Parsing Basics
The process of parsing involves building a data structure that represents the meaning of a text string. In the NetBase
application, this data structure then determines what actions need to be made on the database to give the desired results. The parser in NetBase
uses techniques very similar to those described by Jack Crenshaw's "Let's Build a Compiler". This type of parser is called a "recursive descent parser". There is a lot of theory about parsing and compiling; however, we can take a quick overview at the basics behind this kind of parsing, in this case, in the context of SQL.
The idea is that each SQL statement can be described as a tree structure. The root node of the tree encompasses the whole string:
SELECT third FROM part2 WHERE first = 4
The first layer of the tree divides the whole string into meaningful divisions. In the case of NetBase
, this string would be divided into two branches of the tree, the SELECT
and FROM
clauses:
SELECT third FROM part2
and the WHERE
clause:
WHERE first = 4
The idea of a recursive descent parser is that each node in the tree - subdivision of the string - is parsed by a function, which calls other functions that parse the child nodes of the tree. This is the descending aspect. The recursive aspect is simply that functions may eventually call themselves, although it is worth noting that this does not happen in NetBase
's simplistic version of SQL.
The code involved in parsing our SELECT
statement in NetBase
is as follows - it has been reorganized to better illustrate this particular example.
private void Expression()
{
if (tokenizer.Token == "SELECT")
{
SelectExpression();
}
else if (tokenizer.Token == "INSERT")
{
InsertExpression();
}
else if (tokenizer.Token == "CREATE")
{
CreateExpression();
}
else
{
throw new QuerySyntaxException("Syntax Error: Expected \"select\";" +
" \"insert\"; or \"create table\" clause");
}
}
private void SelectExpression()
{
tokenizer.Match("SELECT");
Query = new SelectQuery();
FieldList();
tokenizer.Match("FROM");
Query.TableName = tokenizer.Token;
tokenizer.GetToken();
if (tokenizer.Token == "JOIN")
{
JoinClause();
}
if (tokenizer.Token == "WHERE")
{
WhereClause();
}
}
private void FieldList()
{
if (tokenizer.Token == "*")
{
tokenizer.Match("*");
((SelectQuery)this.Query).SelectAll = true;
return;
}
do
{
if (tokenizer.Token == ",") tokenizer.Match(",");
((SelectQuery)this.Query).Fields.Add(tokenizer.Token);
tokenizer.GetToken();
} while (tokenizer.Token == ",");
}
private void WhereClause()
{
tokenizer.Match("WHERE");
do
{
if (tokenizer.Token == "AND") tokenizer.Match("AND");
string Field = tokenizer.Token;
tokenizer.GetToken();
tokenizer.Match("=");
string value = tokenizer.Token;
Where w = new Where();
w.Field = Field;
w.Value = value;
((SelectQuery)this.Query).Wheres.Add(w);
tokenizer.GetToken();
} while (tokenizer.Token == "AND");
}
It is possible to see from this code snippet that, for example, the WHERE
clause supports only AND
and not OR
. Note that the JOIN
functionality is incomplete.
It may be useful to step through this in a debugger to see how the internal representation of the query is constructed. The QueryBuilder
class is very large, and it would definitely be possible to improve the design of this part of the database.
Using the Parsed Representation to Run the Query
The parsed representation of the query above is now wholly contained in an instance of SelectQuery
. From here, we can determine what is required to return the right results. This is done by the Selector
class in the executive namespace.
The Execute
function of this class starts by allocating a Storage.MemoryTable
instance to store the results of the query. It then constructs the set of columns required from the Fields
function (which contains the set of fields in the SELECT
clause of the original query string).
MemoryTable nt = new MemoryTable();
ITable t = db.Tables[Query.TableName];
if (Query.SelectAll)
{
foreach (Column c in t.Columns)
{
nt.Columns.Add(c);
}
foreach (Join j in Query.Joins)
{
ITable jt = db.Tables[j.TableName];
foreach (Column c in jt.Columns)
{
nt.Columns.Add(c);
}
}
}
else
{
foreach (string s in Query.Fields)
{
Column c = t.Columns.Find(o => o.Name == s);
nt.Columns.Add(c);
}
}
The next step is to move through the rows of the source table, using the Filter
method of the sql.Where
class to determine which rows should be included. If a row should be included, then the contents of the appropriate columns are copied into a new row, which is then added to the MemoryTable
.
t.Reset();
while (t.HasRows)
{
Row r = t.NextRow();
bool pass = true;
foreach (Where w in Query.Wheres)
{
if (!w.Filter(r))
{
pass = false;
break;
}
}
if (pass)
{
Row nr = new Row(nt);
nr.CopyData(r, nt.Columns);
nt.AddRow(nr);
}
Note that in the above snippet, I have excised a section of unfinished code which attempts to implement joins, because this functionality is incomplete and irrelevant to this article.
The MemoryTable
nt
now contains the results of the query, and can be returned up through the API to the front-end, or perhaps in a later project, sent across the network to a client application. The results of the query as displayed in the NetBase.Viewer
application will be:
The comments about indexes are referring to the inefficient way in which every row in the table is checked to see if it matches the where
clause. Modern RDBMSs use - amongst other techniques - data structures like b-trees to efficiently seek for matching rows, and therefore do not always have to scan through the entire table.
Conclusion
There are a lot of additional subprojects that could be interesting to work on with NetBase
.
- Implementing the
join
functionality - Implementing a network layer
- Enabling concurrent access (locking)
- Adding b-trees, indexes, and other performance optimisations
Perhaps at some point, I or someone else will undertake these projects; they'd be interesting by themselves.