FYI I'm talking about this....
http://www.youtube.com/embed/ZfdAwV0HlEU
Preamble
So a project required me to take the following statement
"C#" or "Vb.net" and Not "Java"
and create a filter based on it that equates to the following logic.
if((file.Contains("C#")||file.Contains("Vb.net"))&&(!file.Contains("Java")))
{
SearchResult.Add(file);
}
This is what we in the biz refer to as a Mini-Language. Essentially an especially small and (hopefully) easy-to-understand programming language that we use for a very specific purpose, a type of Domain Specific Language(DSL for short). To achieve this we turn to a process called Lexical analysis.
- Breaking down the expression into manageable parts. Referred to as the "Scanner" or "Tokenizer"; This is the first half of the lexical analysis process.
- Transforming those parts into a form can be evaluated by the system that needs to be interacted with a.k.a. a "Processed Value". (be it db access, programming logic, a modeling schema whatever) Referred to as the "Evaluator" or "Parser" This is the second half of the lexical analysis process.
Why This? Why Now?
There are numerous ways that this can be done (and I was back and forth on a couple) and this is not the first time I've had the pleasure of working through them however, its not terribly often so I'm writing this as a means of solidifying it into my memory or at least having something readily available so I don't "Forget to remember" how to do this sort of thing.
Stuff You should know
The programming logic doesn't contain anything that is overtly language specific(Though I do use linq at spots for filtering out results. This can just amount to looping through a series of value the filtered action based on a specific set of criteria). The most specialized tool in my arsenal for this project is Regular Expressions (an exceptionally powerful string search and evaluation DSL that IMO you should really have some familiarity with as a programmer as they have totally saved my ass a ton of extra time and work in many cases). If your existing language does not support them or something equally powerful you are gonna have a bad time.
The Plan
What will it look like?
Hacking away at this in Linqpad the following represents the way this would be envisioned
void Main()
{
var expression = "\"java\" and \"SQL\" and \"C#\" or \"VB.NET\"";
var tokens = Tokenize(expression);
var ConditionNode = Eval(tokens);
var endResult = CreateQueryString(ConditionNode );
}
So that is our path to paradise.
The Tokenizer AKA Scanner's job is breaking down string expressions into manageable parts or "Tokens".
The Tokenizer
So begins the tokenization process. It amounts to taking our users expression
"C#" or "Vb.net" and Not "Java"
and splits it into pieces such that we end up with a collection of strings that looks like the following.
- C#
- or
- Vb.net
- and
- Not
- Java
If you still haven't learned Regular Expressions (seriously go freakin learn them), this is a summary or what it does (more or less).
Looks for an optional not followed by whitespace as a capture group to store the "not" then an expression starting with a double, containing any number of characters and and ending with a double quote " (this is our representation of a string) followed optionally by at least one character of whitespace and a logical operation that identifies the relationship of the condition to it sibling on the right if there is one. This will evaluate the entire length of the string and capture each fitting expression accordingly. See line 12.
public static List<string> Tokenize(string expression)
{
var tempExpression = System.Security.SecurityElement.Escape(expression);
string doubleQuote = """;
var regEx = string.Format("(not\\s+)?{0}(.+?){0}(and|or)?",doubleQuote);
Regex RE = new Regex(regEx);
var result = (RE.Split(tempExpression)).Select (r => r.Trim());
return result.Where (re => !String.IsNullOrWhiteSpace(re)).ToList();
}
and with that we have our collection of tokens.
Make the required preparations. For we shall parse as dawn!
So before we can parse our tokens into something of interest we have to decide what data we need to retrieve and what will we need to do with it. Looking once again at our requirements and text
"C#" or "Vb.net" and Not "Java"
if((file.Contains("C#")||file.Contains("Vb.net"))&&(!file.Contains("Java")))
{
SearchResult.Add(file);
}
So we established before that we were going to need a basic string to hold the content that we were filtering but we also have theNOT (!), OR (||) and AND (&&) to deal with. In computer programming as well as natural language (I.E. the English language in this case) and just about any other Syntax these are referred to as Logical Connectives or more commonly Logical Operators. Logical operators are what allow us communicate multiples of ideas without a lot of additional dialog (i.e. Compound Sentences). Without them the sentence
"Jack and Jill went up the hill."
Becomes
"Jack went up the hill."
"Jill went up the hill."
(UGH.... I think I'd go a bit crazy if people talked like that all the time.)
Logical Operators do a ton more than that but, for our requirements this should be enough to give you an idea of their significance at the core of most any and all forms of language, programming or otherwise.
So with that we have a class with which to parse.
To recap:This is what we have so far is our "condition". What gives our parser its reason for being.
class Condition
- string expression
- bool HasNot
- LogicalOperator LogicalOperator
To elaborate on the "LogicalOperator" type is just an enumeration of all the possible Logical Operators we are concerned with and looks like this.
enum LogicalOperator
{
None = 0,
And = 1,
Or = 2,
}
Now we can probably stop here and just make a collection of "Conditions" and maybe get away with it for now. However that would not accurately reflect the relationship between the items in the collection and will, more than likely, make implementing our parser (and thus maintenance it) much more difficult in the long run should we want to implement more language features (such as additional operators) to make our baby DSL into a strong and able bodied language one day. Because of that we are going to be building a Tree data structure using Nodes
Nodes .... NODES EVERYWHERE!!!!
For our purposes a node is a data structure that can have a concept of its relationship to its neighbor(s). A group of these nodes with relationships to each other become a tree. Now, as with most of the point of discussion, there are tons of other areas of study and topic of discussion in this area of computer science and I encourage you to continue forward in checking out all the neat stuff people do with nodes and trees but, like so many of the things covered in this post, I aim to keep it simple.
So with that I have here a simple node example in C#
public class Node
{
public string Value{get;set;}
public Node Right {get;set;}
}
At its heart that is pretty much it. The more interesting pieces or on how one goes about navigating these things. Not a lot too that either ... in C#
public static Node MoveRight(this Node node)
{
return node.Right;
}
Note* that this does not account for Null nodes. It will break on a null node and throw a null ref exception. If you indented to navigate to the last node in a tree it would just be a matter of recursion. //goes to the last node in the tree
public static Node GetRightMostNode(Node node)
{
while(node.Value!=null)
{
node = node.MoveRight();
}
return node;
}
Same thing if I wanted to get all the values in a tree
public static string GetValues(Node node)
{
StringBuilder sb = new StringBuilder();
while (node.Value!=null)
{
sb.Append(node.Value);
node = node.Right;
}
return sb.ToString();
}
Converting a collection of strings to nodes
public static void RecurseSet(this Node node, List<string> vals)
{
Node tNode;
for(var i=0;i<vals.Count();i++)
{
tNode = new Node();
tNode.Value = vals[i];
node.Right = tNode;
vals.Remove(vals[i]);
RecurseSet(tNode,vals);
}
}
Here is a link to the full example typed up in linqpad.
Taking it Home
Armed with additional node and tree xp lets go back and preform some "augmentations" to our condition class.
We now have...
public class Condition
{
public int Order {get;set;}
public string expression {get;set;}
public LogicalOperator LogicalOperator {get;set;}
public bool HasNot{get;set;}
public Condition Right {get;set;}
}
Note* the order was used as a means of validating the order of the node tree from the token collection and is completely optional. The traversal methods are just about the same.
public static Condition MoveRight(this Condition node)
{
return node.Right;
}
Finally we just need to Transform our tree into whatever form we require. In this case we will be using it to build out our C# expression's if statement .... just because.
public static string CreateCode(Condition node)
{
StringBuilder sb = new StringBuilder();
sb.Append("if(");
while (node!=null)
{
if(node.HasNot)
{
sb.Append("!");
}
sb.Append("val.Contains==");
sb.Append("\""+node.expression+"\" ");
if(node.LogicalOperator!=LogicalOperator.None)
{
sb.Append(GetLogicOperationSymbol(node.LogicalOperator) +" ");
}
node = node.Right;
}
sb.Append(")");
sb.AppendLine("");
sb.AppendLine("{");
sb.AppendLine("//do stuff");
sb.AppendLine("}");
return sb.ToString();
}
public static string GetLogicOperationSymbol(LogicalOperator op)
{
var retval= "";
switch (op)
{
case LogicalOperator.And:
retval= "&&";
break;
case LogicalOperator.Or:
retval= "||";
break;
case LogicalOperator.Not:
retval= "!";
break;
default:
break;
}
return retval;
}
And that ladies and laddies is but one way of building out your own itty-bitty baby DSL.
Additional things I'd like to point out.
- Instead of a condition class the entire class could be split into node tokens with even simpler parts. This implementation works for me but you are more than free to experiment with more tree node goodness.
- There are tools out there such as
that preform many of the parsing and tokenizing related tasks mentioned in this post with ease and then some now that you have a handle on a hand rolled implementation.
I first learned about Mini-Languages, DSLs,among a ton of other awesome stuff reading The Art Of Unix Programming. A freely available book by Eric Steven Raymond(one of the godfathers of the open-source movement).