Introduction
For some time now, I've wanted to write an application that would let me execute queries against my music collection. Having been burning my CDs to MP3 and WMA for years, I have thousands of songs, and I've been looking for something a bit more robust than what most music players I've used (like Windows Media Player) allow. With the introduction of LINQ in .NET 3.5, this seemed like a good opportunity to create an app I'd been thinking about for a while and, at the same time, dig into a new technology. This set of code is the result.
My ultimate goal is to create a set of libraries that can be used to execute SQL queries against any data source that can represent itself in a relational, semi-relational, or tabular format.
Update
11/11/2009 - An evolution of the query evaluation code is available in an updated version: Dynamically evaluated SQL LINQ queries. The updated version operates on any IEnumerable<T>, not just IDictionary<string, object>.
Since the initial posting of this article, I've continued to toy around with this set of code, and have made some modifications and additions.
- User Interface
  - Syntax highlighting (thanks Uri Guy)
  - Added some nice looking icons from FAMFAMFAM
  - Trace output window
  - DataGridView printing and print preview (thanks Salan Al-Ani)
  - Added a Schema View dialog that displays the available file property reader namespaces and the properties that each exposes
  - Other miscellaneous changes
- SQL syntax support
  - Added support for expressions within aggregates (e.g., SELECT MAX(duration / 1000))
  - Added DELETE support (disabled in the linked code, but easily enabled by removing a throw statement)
  - Added the ability to reference fields by alias or name in sub clauses
- File property reader refactoring
  - Abstracted file property reading into a pluggable implementation
  - Separated System.IO.File properties from media properties
  - Added an (experimental) assembly file property reader
- Separated SQL evaluation (SqlLinq) from file property reading
- Reorganized the SyntaxTree namespace
- Miscellaneous evaluation bug fixes
Background
The first thing I came up against with LINQ, and with the vast majority of examples of its usage, was that they all assume implicit knowledge of the structure of the underlying data, and compile-time knowledge of the structure of the desired results. I'm sure this holds true for the vast majority of "line of business" applications, where LINQ will really pay dividends. Creating a LINQ query to aggregate and manipulate a sales report or a product catalog page entails detailed knowledge of the data and of the requirements of the resulting aggregates. LINQ to SQL, for instance, generates SQL statements to match the runtime LINQ expression execution, and provides a very nice Object Relational Mapping tool that, in the past, would have required extensive infrastructure coding or third party tools.
My application, however, would have neither: the compiled code wouldn't know the structure of the underlying data (other than that it consists of name/value pairs), nor would it know what fields or aggregations were needed, since the exact data manipulation and result structure are defined at runtime, not at compile time. Building up a LINQ statement like:
EnumerableRowCollection<DataRow> query =
from order in orders.AsEnumerable()
where order.Field<bool>("OnlineOrderFlag") == true
orderby order.Field<decimal>("TotalDue")
select order;
requires knowing, at a minimum, the order and number of clauses in the where statement, as well as in the order by statement. So, after some initial reading and investigation into LINQ, I came to the conclusion that I needed to go the other way: SQL to LINQ, rather than LINQ to SQL. That led to some other interesting areas, like needing a SQL parser and dynamic construction of complex expressions, but it did make for a good crash course in learning LINQ.
Using the Code
The structure of the code consists of a number of layers:
- FileDbProvider - This layer contains an IDataAdapter and associated types, like a derived DbConnection and DbCommand. It is the layer that the example user interface interacts with.
- File.PropertyProvider - This contains a reader for System.IO.File properties, as well as the base types for other file property readers.
- Media.PropertyProvider - This layer knows how to access the tags from media files, and uses the SqlLinq layer to retrieve and evaluate SQL statements passed in from the data provider layer.
- SqlLinq - This layer contains the SQL parser, and exposes the resulting parse tree along with an API that allows the evaluation of arbitrary sets of data.
The UI interacts with the data provider layer much like it would with any other provider:
DataSet data = new DataSet();
MusDbConnection connection = new MusDbConnection(
"root='C:\\Music';recurse=true;FileExtensions=wma,mp3");
connection.Open();
IDataAdapter adapter = new MusDataAdapter("SELECT * FROM media", connection);
adapter.Fill(data);
The DataSet can then be used just as an application would use one retrieved from SQL Server, Oracle, Jet, or any other data source.
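From there, consuming the results is ordinary ADO.NET. As a quick sketch, dumping every row of the result table to the console looks like this (column names will vary with the media being queried):

DataTable table = data.Tables[0];
foreach (DataRow row in table.Rows)
{
    foreach (DataColumn column in table.Columns)
        Console.Write("{0}={1}; ", column.ColumnName, row[column]);
    Console.WriteLine();
}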
A Note about the Unit Tests
Most of the unit tests will fail when you run them from the attached code because they are written against a set of media files that, for obvious reasons, I haven't included with the download. They do provide basic examples of the SQL syntax that works, and should be pretty easy to point at some other set of media files and make pass again.
FileDbProvider Layer
This assembly ties the file property readers together with SQL parsing and execution. It operates much like other SQL data providers.
The connection string is of the format name=value, with semicolon separators, like most other database connection strings (a sketch of parsing this format follows the parameter list below).
Supported connection string parameters are:
- root - The rooted path from which queries will search. This parameter is required; queries are evaluated relative to this path. With the connection string above, queries will search under "C:\Music".
- recurse - true | false - Indicates whether to recurse directory structures during query evaluation. Defaults to true if not provided.
- fileExtensions - The list of file extensions to include in the search. Defaults to * if not provided.
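There is nothing exotic about parsing that format. Conceptually, it boils down to something like the following sketch (illustrative only, not the provider's actual parsing code):

static IDictionary<string, string> ParseConnectionString(string connectionString)
{
    IDictionary<string, string> parameters =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    foreach (string pair in connectionString.Split(
        new char[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Each parameter is name=value; values may be quoted, as in root='C:\Music'.
        string[] parts = pair.Split(new char[] { '=' }, 2);
        parameters[parts[0].Trim()] = parts[1].Trim().Trim('\'');
    }
    return parameters;
}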
Aside from using the SqlLinq layer to process SQL statements, the other thing done in this assembly is providing implementations of IDataAdapter and DbDataReader so that the results can be returned to an application as a System.Data.DataSet. That is all pretty straightforward, and there are plenty of examples out there, so I won't delve into the details.
File.PropertyProvider
This assembly contains the base functionality for reading the properties of a file. The interface IFilePropertyReader is the base type that any component needs to implement in order to be included in a file property query.
public interface IFilePropertyReader
{
string Namespace { get; }
bool Match(string filePath);
IEnumerable<KeyValuePair<string, object>> Read(string path,
IEnumerable<string> selectProperties);
IEnumerable<KeyValuePair<string, object>> Read(string path);
IEnumerable<KeyValuePair<string, Type>> GetFields();
}
In addition to implementing IFilePropertyReader, there are some attributes that are used to define what file types a reader supports, and to define the "table name" that the reader type will bind to in a SQL statement.
By way of example, the class in which basic file properties are declared looks like:
[FilePropertyReader("*")]
[PropertyTableName("File")]
[TypePropertyReader(typeof(FileReflectedPropertyReader))]
class FilePropertyReader : PropertyReader
The FilePropertyReader("*") attribute will cause the query engine to use this reader to retrieve properties from files with any extension. The PropertyTableName attribute binds the reader to a logical table by the name of file; e.g., the SQL statement SELECT * FROM file. This class reads any of the properties accessible from the System.IO.File class, such as name, extension, size, etc.
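To illustrate the plug-in model, here is a hypothetical reader for plain-text files written against the interface above; the txt extension, the Text table name, and the lineCount property are all invented for this example:

[FilePropertyReader("txt")]
[PropertyTableName("Text")]
class TextFilePropertyReader : IFilePropertyReader
{
    public string Namespace
    {
        get { return "Text"; }
    }

    public bool Match(string filePath)
    {
        return string.Equals(Path.GetExtension(filePath), ".txt",
            StringComparison.OrdinalIgnoreCase);
    }

    public IEnumerable<KeyValuePair<string, Type>> GetFields()
    {
        // Advertise the queryable fields and their types (e.g., for the Schema View).
        yield return new KeyValuePair<string, Type>("lineCount", typeof(int));
    }

    public IEnumerable<KeyValuePair<string, object>> Read(string path)
    {
        yield return new KeyValuePair<string, object>(
            "lineCount", File.ReadAllLines(path).Length);
    }

    public IEnumerable<KeyValuePair<string, object>> Read(string path,
        IEnumerable<string> selectProperties)
    {
        // A real reader would honor selectProperties; this sketch returns everything.
        return Read(path);
    }
}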
Finally, this assembly dynamically loads other file property readers at runtime by reflecting over any assembly whose name ends with .PropertyProvider.dll and looking for classes that implement IFilePropertyReader. The loading itself is handled by a dynamic TypeLoader library.
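Setting the TypeLoader helper aside, the discovery step conceptually amounts to this sketch using plain reflection:

static IEnumerable<IFilePropertyReader> LoadReaders(string directory)
{
    foreach (string dll in Directory.GetFiles(directory, "*.PropertyProvider.dll"))
    {
        Assembly assembly = Assembly.LoadFrom(dll);
        foreach (Type type in assembly.GetTypes())
        {
            // Instantiate every concrete type that implements the reader interface.
            if (!type.IsAbstract && typeof(IFilePropertyReader).IsAssignableFrom(type))
                yield return (IFilePropertyReader)Activator.CreateInstance(type);
        }
    }
}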
Media.PropertyProvider
This assembly contains an implementation of IFilePropertyReader that can return media properties from metadata tags using the Windows Media Format SDK.
[FilePropertyReader("mp3")]
[FilePropertyReader("wma")]
[FilePropertyReader("asf")]
[FilePropertyReader("wmv")]
[PropertyTableName("Media")]
class MediaFilePropertyReader : IFilePropertyReader
I started with some simple code that could read ID3 v1.1 tags, but wanted something a bit more robust (ASF container formats, other ID3 tag versions, etc.). A bit of searching led to the Windows Media Format SDK, and now the retrieval of the actual tag values is all done using WMVCore.dll and the interfaces IWMMetadataEditor, IWMMetadataEditor2, and IWMHeaderInfo3. The code that performs this retrieval started its life as part of one of the SDK samples; it has since been modified to fit the needs of this application and my particular coding style. As such, I think it may be a bit more generically useful than the original sample, as it doesn't do the same sort of formatting as it deserializes the tags (except for byte arrays, which are only returned as string representations indicating their length). If you are looking for some media tag access code, check out the FileMediaTags class in the attached code. It should be pretty easy to yank out and bend to your needs.
The meat of retrieving the actual tags has a basic Win32 flavor to it: calling methods twice, first to get buffer sizes and then passing buffers back in to be filled, null-terminated strings, etc. If you've worked with Windows APIs at all, there's nothing too surprising in this piece.
private static IWMHeaderInfo3 GetHeaderInfo(string path)
{
    // Create a metadata editor, open the file read-only, and cast up to IWMHeaderInfo3.
    IWMMetadataEditor editor;
    WMFSDKFunctions.WMCreateEditor(out editor);
    IWMMetadataEditor2 editor2 = (IWMMetadataEditor2)editor;
    editor2.OpenEx(path, FILE_ACCESS.GENERIC_READ, FILE_SHARE.FILE_SHARE_READ);
    return (IWMHeaderInfo3)editor2;
}
private static void GetAllMediaTags(IDictionary<string, object> tags, string path)
{
    IWMHeaderInfo3 headerInfo3 = GetHeaderInfo(path);
    try
    {
        ushort wAttributeCount;
        headerInfo3.GetAttributeCountEx(0, out wAttributeCount);
        for (ushort wIndex = 0; wIndex < wAttributeCount; wIndex++)
        {
            WMT_ATTR_DATATYPE wAttribType;
            string pwszName = null;
            ushort wAttribNameLen = 0;
            ushort pwLangIndex = 0;
            uint pdwDataLength = 0;

            // First call with null buffers retrieves the attribute type
            // and the name and data lengths.
            headerInfo3.GetAttributeByIndexEx(0,
                wIndex,
                pwszName,
                ref wAttribNameLen,
                out wAttribType,
                out pwLangIndex,
                null,
                ref pdwDataLength);

            // Allocate the name buffer and read the attribute value for real.
            pwszName = new String('\0', wAttribNameLen);
            ReadAndAddAttribue(tags, headerInfo3, wIndex, pwszName, pdwDataLength);
        }
    }
    finally
    {
        // Close the editor and release the underlying COM object.
        ((IWMMetadataEditor)headerInfo3).Close();
        Marshal.FinalReleaseComObject(headerInfo3);
        headerInfo3 = null;
    }
}
The tags from each media file are returned from the loader in an IDictionary<string, object>. The dictionary keys are case insensitive, to match the case insensitivity of the SQL parsing. As the code iterates all of the files in the specified directory structure, it builds up an IList<IDictionary<string, object>>. This IList represents the result set from the query, with each IDictionary being analogous to a single row. The list of rows is how data is returned to the provider layer.
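In other words, the shape handed to the provider layer is nothing more elaborate than this (the tag names and values are illustrative):

IList<IDictionary<string, object>> rows = new List<IDictionary<string, object>>();

// Each dictionary is one "row"; keys are case insensitive tag names.
IDictionary<string, object> row = new Dictionary<string, object>(
    StringComparer.InvariantCultureIgnoreCase);
row["Title"] = "Some Song";
row["bitrate"] = 128000;
rows.Add(row);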
The SqlLinq Layer
This is where all the good LINQ-y stuff happens! (Well, actually, there is more Linq.Expression stuff going on in here than anything else.)
SQL Parsing
SQL parsing started with a very simplistic approach, using what was little more than a rudimentary tokenizer. That got me as far as the simple SELECT field FROM source part, but not much farther. A day or so of searching came up with nothing usable, until I came upon the Grammar Oriented Language Developer, or GOLD, application. This thing started as Devin Cook's master's thesis, and is really a remarkable tool.
What it does is take in the rules for an arbitrary language in BNF format and spit out a set of parse tables to be used later by a language-specific state machine parse engine. There are even a number of engines available in multiple languages, including C#. This code uses a C# engine by Vladimir Morozova, which is included in the download. There was also a specification for a simple SQL syntax available (extended slightly to add column aliases, boolean literals, and to allow HAVING clauses to contain aggregate expressions). There is also a nice article here on CodeProject that explains the GOLD application and its use in more depth. Basically, this solved my problem perfectly, and kudos to the authors of the GOLD application.
Parsing proceeds from the leaves to the trunk, building up the parse tree along the way. As the parser traverses tokens and rules, it yields up a Reduction corresponding to each part of the language. A custom attribute is used to map C# classes to SQL rules, and to instantiate them as the parser does its work.
[SyntaxNode(RuleConstants.RULE_WHERECLAUSE_WHERE)]
public class WhereClause : NonTerminalNode
{
public WhereClause()
{
}
public IEnumerable<IDictionary<string, T>> Evaluate<T>(
IEnumerable<IDictionary<string, T>> source)
{
return source.Where(CreateEvaluationFunction<T>());
}
private Func<IDictionary<string, T>, bool> CreateEvaluationFunction<T>()
{
PredicateNode predicate = FindChild<PredicateNode>();
if (predicate != null)
return predicate.CreateEvaluationFunction<T>();
LiteralNode node = FindChild<LiteralNode>();
if (node != null)
return node.CreateEvaluationFunction<T>();
Debug.Assert(false);
return null;
}
}
static class SyntaxRuleFactory
{
private static TypeLoader<NonTerminalNode, int> _nodeImplTypeMap = LoadImplTypes();
public static NonTerminalNode CreateNode(Rule rule)
{
Debug.Assert(rule != null);
NonTerminalNode node = null;
if (_nodeImplTypeMap.Contains(rule.Index))
node = _nodeImplTypeMap.CreateInstance(rule.Index);
else
node = new NonTerminalNode();
node.Rule = rule;
return node;
}
private static IEnumerable<int> GetRuleIds(Type t)
{
return t.GetCustomAttributes(typeof(SyntaxNodeAttribute),
false).Select(attr => (int)((SyntaxNodeAttribute)attr).RuleConstant);
}
private static TypeLoader<NonTerminalNode, int> LoadImplTypes()
{
TypeLoader<NonTerminalNode, int> loader = new TypeLoader<NonTerminalNode, int>();
loader.SearchDirectories = false;
loader.LoadMany(GetRuleIds);
return loader;
}
}
A SQL statement that looks like SELECT * FROM soundtrack WHERE bitrate = 128000 results in a parse tree like:
<Select Stm> [Rule Id=RULE_SELECTSTM_SELECT Class=SelectStatement]
   SELECT
   <Columns> [Rule Id=RULE_COLUMNS_TIMES Class=Columns]
      <Restriction> [Rule Id=RULE_RESTRICTION Class=NonTerminalNode]
      *
   <Into Clause> [Rule Id=RULE_INTOCLAUSE Class=NonTerminalNode]
   <From Clause> [Rule Id=RULE_FROMCLAUSE_FROM Class=FromClause]
      FROM
      <Id Member> [Rule Id=RULE_IDMEMBER_ID Class=NodeWithId]
         soundtrack
      <Join Chain> [Rule Id=RULE_JOINCHAIN2 Class=NonTerminalNode]
   <Where Clause> [Rule Id=RULE_WHERECLAUSE_WHERE Class=WhereClause]
      WHERE
      <Pred Exp> [Rule Id=RULE_PREDEXP_EQ Class=EqualityNode]
         <Value> [Rule Id=RULE_VALUE_ID Class=NodeWithId]
            bitrate
         =
         <Value> [Rule Id=RULE_VALUE_INTEGERLITERAL Class=IntegerLiteral]
            128000
   <Group Clause> [Rule Id=RULE_GROUPCLAUSE Class=NonTerminalNode]
   <Having Clause> [Rule Id=RULE_HAVINGCLAUSE Class=NonTerminalNode]
   <Order Clause> [Rule Id=RULE_ORDERCLAUSE Class=NonTerminalNode]
Once the parse tree is built, it is just a matter of traversing it to generate the LINQ constructs or expressions appropriate to each node. For instance, in the SelectStatement (the root node for a SELECT statement), the evaluation function calls into the various sub clauses, each child node further modifying the data passed in according to its SQL semantics and its own branch of the parse tree:
public override IEnumerable<IDictionary<string, object>> Evaluate(
IEnumerable<IDictionary<string, object>> data)
{
if (WhereClause != null)
data = WhereClause.Evaluate(data);
data = EvaluateAggregates(data);
if (HavingClause != null)
data = HavingClause.Evaluate(data);
if (OrderByClause != null)
data = OrderByClause.Evaluate(data);
return PostProcessResults(data);
}
The OrderByClause uses the OrderBy and ThenBy extensions to do the ordering (a sketch of that pattern follows the grouping code below). Similarly, the GroupByClause uses LINQ's grouping constructs to roll up the source data:
public IEnumerable<IDictionary<string, object>>
Evaluate(IEnumerable<IDictionary<string, object>> source,
IEnumerable<AggregateNode> aggregates)
{
IList<IDictionary<string, object>> list =
new List<IDictionary<string, object>>();
foreach (NodeWithId item in GroupByItems)
{
var groupBy = from d in source
group d by d.ContainsKey(item.LookupId) ?
d[item.LookupId] : DBNull.Value
into g
select g;
foreach (var g in groupBy)
{
IDictionary<string, object> dict =
new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);
dict.Add(item.LookupId, g.Key.ToString());
foreach (AggregateNode aggregate in aggregates)
dict.Add(aggregate.LookupId,
aggregate.Evaluate<object>(g.AsEnumerable()));
list.Add(dict);
}
}
return list;
}
public IEnumerable<NodeWithId> GroupByItems
{
get
{
return FindDescendants<NodeWithId>();
}
}
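As promised above, here is a rough sketch of the OrderBy/ThenBy pattern applied to dictionary rows. The OrderRows and GetSortKey helpers are hypothetical, not the actual OrderByClause implementation; converting values to strings keeps mixed value types comparable, and missing keys sort first as null:

static IEnumerable<IDictionary<string, object>> OrderRows(
    IEnumerable<IDictionary<string, object>> source, IList<string> fields)
{
    if (fields.Count == 0)
        return source;

    IOrderedEnumerable<IDictionary<string, object>> ordered =
        source.OrderBy(row => GetSortKey(row, fields[0]));
    for (int i = 1; i < fields.Count; i++)
    {
        string field = fields[i];   // capture a copy for the closure
        ordered = ordered.ThenBy(row => GetSortKey(row, field));
    }
    return ordered;
}

static string GetSortKey(IDictionary<string, object> row, string field)
{
    // Missing tags sort first as null, keeping every row comparable.
    return row.ContainsKey(field) ? Convert.ToString(row[field]) : null;
}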
Building Expressions
In order to implement the WHERE and HAVING clauses, heavy use is made of System.Linq.Expressions. This made implementing that logic pretty straightforward, as most SQL operators have a direct analog in the Expression namespace. The one glaring exception is LIKE, which I've implemented using a Regex (a sketch of the idea appears at the end of this section). The main task for expression evaluation is to generate a Func<IDictionary<string, object>, bool> that can then be used with the LINQ Where<T> extension method. All of this work is done with a class hierarchy that starts with the abstract base class PredicateNode, which implements basic functionality like building expressions to index the input IDictionary, type coercion, and navigation to sub predicates and operands.
public Func<IDictionary<string, T>, bool> CreateEvaluationFunction<T>()
{
    // Build arg => <operator expression>, then compile it into a delegate.
    ParameterExpression param = Expression.Parameter(
        typeof(IDictionary<string, T>), "arg");
    return Expression.Lambda<Func<IDictionary<string, T>, bool>>(
        CreateOperatorExpression(param, GetLeftExpression(param)), param).Compile();
}
protected abstract Expression CreateOperatorExpression(ParameterExpression param,
Expression left);
Type Coercion
Because the data type of each field in the input data is not known when the expression is built, type coercion is somewhat lax. If an expression is being compared to an integer literal, for example, an Expression.Convert will be used to coerce the input to an integer (ultimately using System.Convert). If the type cannot be determined from a literal, there is a simple system of fallbacks: arithmetic expressions default to the real domain, boolean operations default to bool, and most other predicates fall back to string.
private LiteralNode FindCoercionType(int index)
{
    // Coercion is driven by the operand positions (0 = left, 2 = right; 1 is the operator).
    if (index != 0 && index != 2)
        return null;

    // Look for a literal on the opposite side of the operator...
    LiteralNode node = FindChild<LiteralNode>(OppositeSide(index));
    if (node == null && (this.Index == 0 || this.Index == 2))
        node = Parent.FindChild<LiteralNode>(OppositeSide(this.Index));

    // ...and failing that, ask a sub predicate for its expression type.
    if (node == null)
    {
        PredicateNode predicate = FindChild<PredicateNode>(OppositeSide(index));
        if (predicate == null &&
            (this.Index == 0 || this.Index == 2))
            predicate = Parent.FindChild<PredicateNode>(OppositeSide(this.Index));
        if (predicate != null)
            node = predicate.GetExpressionType();
    }
    return node;
}
Indexing the Dictionary
In this example usage of dynamic expression building (media file tag collections), not every row in the input data will have the same set of fields, because not every media file has the same set of tags embedded within it. For this reason, an extra step is taken when indexing each dictionary: first checking ContainsKey, and returning DBNull.Value if it returns false. This allows the same set of fields to be evaluated for every row, effectively making every field NULLABLE. This pattern of first checking ContainsKey shows up in a number of places in the code.
private ConditionalExpression CreateDereferenceExpression(ParameterExpression param,
int index)
{
NodeWithId idNode = FindChild<NodeWithId>(index);
if (idNode != null)
{
MethodInfo indexerMethod = typeof(IDictionary<string, object>).GetMethod(
"get_Item");
MethodInfo containsKeyMethod = typeof(IDictionary<string, object>).GetMethod(
"ContainsKey");
LiteralNode node = GetTypeCoercionNode(index);
Expression key = Expression.Constant(idNode.EvaluationId);
Expression containsKey = Expression.Call(param, containsKeyMethod, key);
Expression returnValue = Expression.Convert(Expression.Call(param,
indexerMethod, key), node.ValueType, node.CoercionDelegate.Method);
Expression returnNull = Expression.Constant(node.NullRepresentation,
node.ValueType);
return Expression.Condition(containsKey, returnValue, returnNull);
}
return null;
}
Putting them all Together
With all of the heavy lifting done in the PredicateNode base class, each derived class just needs to provide the correct Expression in its override of CreateOperatorExpression:
[SyntaxNode(RuleConstants.RULE_PREDEXP_EQ)]
public class EqualityNode : PredicateNode
{
public EqualityNode()
{
}
protected override Expression CreateOperatorExpression(
ParameterExpression param, Expression left)
{
return Expression.Equal(left, GetRightExpression(param));
}
}
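The LIKE operator mentioned earlier has no direct Expression analog, so it is handled with a Regex. The sketch below shows the general idea only: the rule constant name, the GetPatternLiteral accessor, and the assumption that the left side has already been coerced to string are all simplifications, not the article's actual code.

[SyntaxNode(RuleConstants.RULE_PREDEXP_LIKE)] // assumed rule constant name
public class LikeNode : PredicateNode
{
    protected override Expression CreateOperatorExpression(
        ParameterExpression param, Expression left)
    {
        // Fetch the SQL pattern from the right-hand literal (hypothetical accessor).
        string sqlPattern = GetPatternLiteral();

        // '%' matches any run of characters; '_' matches exactly one.
        string regexPattern = "^" + Regex.Escape(sqlPattern)
            .Replace("%", ".*")
            .Replace("_", ".") + "$";
        Regex regex = new Regex(regexPattern, RegexOptions.IgnoreCase);

        // LIKE falls back to string coercion, so 'left' is string-typed here.
        MethodInfo isMatch = typeof(Regex).GetMethod(
            "IsMatch", new Type[] { typeof(string) });
        return Expression.Call(Expression.Constant(regex), isMatch, left);
    }
}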
Retrieving and Evaluating Data
With the current implementation, evaluating a set of data is a two-step process. First, the client code has to retrieve the larger set of data to be evaluated. It can use the parse tree to guide the retrieval, but the actual deserialization out of the source data store exists entirely outside of the SqlLinq assembly. Second, it uses the SqlLinq code to evaluate the data, which applies the logic for everything but the FROM clause. Eventually, I'd like to explore IQueryable<T> in order to eliminate the two-part sequence. The good thing about the current two-step implementation is that the SqlLinq code knows nothing about the nature of the data, other than that it can be represented as an IList<IDictionary<string, object>>. The one place where this is compromised slightly is that, since this test case searches folder hierarchies, the FromClause does not expose a JOIN chain as one would find with a truly relational data source. Instead, the FROM clause contains a comma separated list of "tables" to evaluate; e.g., SELECT * FROM media, file will return all media and file properties.
Points of Interest
I learned a lot in writing this article, from how to get tags out of a media file, to the details of parsing and parse trees, and then on to LINQ and LINQ Expressions. For a procedural/OO guy like myself, LINQ took some time to wrap my head around. I hope to keep extending this as time allows, and to see what other sorts of data stores can be evaluated with textual SQL statements...
Another point of interest is that the pluggable architecture makes it straightforward to add additional types of file queries. For instance, the downloadable code includes an Assembly file property reader that can be used to find assemblies that match custom search criteria.
Without too much more work, this could be turned into a very generically reusable file search and analysis application. Maybe, as I continue to tinker with it, that's what it will become.
Things to Do
- Add support for IQueryable<T>, and provide an implementation that can go directly at the data rather than requiring the client code to retrieve it and then evaluate the results.
- Implement the UPDATE statement.
History
- July 27, 2008 - First posting.
- July 28, 2008 - Added LIKE, NOT LIKE, IN, and NOT IN operators.
- October 20, 2008 - Refactored most of the lower layers to allow for pluggable file property readers.
- November 2, 2008 - Some minor code clean-up and article updates. Also replaced syntax highlighting with an implementation that supports undo/redo.