Introduction
When dealing with large amounts of data (either a website or in other
specialized programs) it is very good to offer the user the possibility to
narrow down the subset of data he needs to look at. The only way to avoid
this and still not confuse the user is to partition the data in small chunks
(such as presenting an address book by the first letter of each name). This
isn't always possible so a search system can be helpful.
Description of the solution
I've developed a class which implements a simple query language. The syntax
is presented below by some elementary examples and their interpretation:
abc cde fgh |
The character sequences "abc", "cde" and "fgh" must be all present in the
string, however their order doesn't matter (so cdeabcfgh is a valid sequence
for this rule). Every blank character is considered a separator (actually I use
the Char.IsSeparator method to determine if a character is separator) |
"abc cde""" |
The sequence abc cde" must be present in the string. Observe that in order to
include the quote character ("), you must double it. Besides of grouping words
the quote character is also useful in inserting special characters ( (, ), [, ]). |
[a b c] |
The string must contain at least one of the elements (OR). |
(a b c) |
The string must contain all the elements (the order is not important!) |
[(a b) (c "[")] |
The parenthesizes can be nested. The expression means:
the string either contains a AND b (the order is not important)
OR
it contains c and [ (observe how the special character [ needs to be between
quotes)
|
A more formal description of the syntax would be:
queryString ::= <queryPart>[<empty characters><queryString>]
queryPart ::= <keyword>|<orComposition>|<andComposition>
orComposition ::= [<queryPart>{<empty characters><queryPart>}]
andComposition ::= (<queryPart>{<empty characters><queryPart>})
emptyCharacters ::= as defined by the Char.IsSeparator
method
keyword ::= "<characters>"|<non empty, non special characters>
This syntax may appear a little strange and has the drawback that it's non-standard.
Because of this it might require a little getting used to from the
users perspective. I don't recommend using it as the default search language
on a website, however it can be presented as an alternative (for advanced
searching for example). Also, I found it to be a little more obvious than the
normal approach (if you write "Anna and Betty", what do you want to search for:
for the exact string or for a string containing both words?) and it's simpler
to implement.
To search using this code, you must do the following simple steps:
-
Create a search tree and remember its root
Dim x As SearchTreeNode = SearchTreeNode.ContructSearchTree("[a b] [c d]")
-
Now call the
.MatchesString(string to match, root)
function for
each element you want to verify. It returns true or false depending whether the
supplied string matches the rule or not.
Implementation details
The SearchTreeNode is a node of a multi-way tree which remembers it's first
child and next sibling as shown below:
The nodes of the tree can be of three types: AND nodes, OR nodes and KEYWORD
nodes (or leafs). An AND node matches a string if all of its children match
the string. An OR node matches a string if at least one of its children match
the string. A keyword node matches a string if it is contained within the
string. For example the following rule is translated to the tree seen below:
"[(a b) (c d)] e"
You probably observed the extra level with an AND node. This is done to avoid
treating a particular case in the routine and has no impact on the run. The above
tree is constructed with the ConstructSearchTree
static (shared)
method. This method throws an exception if it's unable to construct the tree.
The message of the exception tries to hint to a possible cause of the syntax
error.
The MatchesString
static method needs a string which it should
compare to the rule and the root of the tree. It then performs a recursive walk
of the tree using the MatchString
private method and returns
true or false.
Some notes on the performance
I'm actually quite satisfied with the performance of the code. It can search
10000 strings containing 100 "x" characters with the rule "[a b] [c d]" in 16
milliseconds. This represents the worst case scenario for the leafs (where the
actual searching is done using the indexOf method), because no match is found.
Also in this case the full tree must be walked, so with a string that actually
matches the rule the search could be aborted much sooner. The tests were done
on an AMD Sempron 2200+ machine with 256 MB RAM running WinXP SP2.
One possibility to speed up the matching would be to implement a faster string
matching algorithm like Boyer-Moore or TurboSearch. The helper tables for these
could be generated during the creation of the tree. The problem is that the
strings are Unicode encoded and thus a helper table (for the TurboSearch
algorithm for example) could reach very big sizes (64K elements). Also there is
the fact that the keywords will probably be rather small (5-10 charactes) and
the indexOf
method performs quite well.
Additional features
There are two methods: one for building the filter part of an SQL query from the
tree and one to build a regular expression from it. They are both static
(shared) methods and their syntax is as follows: BuildSQLQuery(field name,
root element)
returns a string that looks like: (<field name>
LIKE "%keyword%" AND
... In the keyword all occurrences of the quote character
(") are replaced with backslash quote (\").
The second method builds a regular expression from the search tree: BuildRegEx(root
element)
and returns the regular expression as string. One
limitation of this method is that it doesn't generate a regular expression that
handles all possible orders or appearances of the supplied argument, just the one
defined in the tree. I used this approach because given the limitations of
regular expressions a full implementation would generate very long and
complicated expressions (one would have to explicitly specify all possible
permutations in the expression).
Final notes
I hope that you found this article useful. If you found this article stupid,
annoying, incorrect, etc. express this fact by rating the article as you see
fit. However I would appreciate if you would leave a comment explaining your
reasons so that I can (hopefully) learn from my mistakes.