(untagged)

Advanced text searching

Cd-MaN

0.00/5 (No votes)

4 Mar 2005

Building a simple query language with the OR and AND boolean operators

Download source - 3 Kb

Introduction

When dealing with large amounts of data (either a website or in other specialized programs) it is very good to offer the user the possibility to narrow down the subset of data he needs to look at. The only way to avoid this and still not confuse the user is to partition the data in small chunks (such as presenting an address book by the first letter of each name). This isn't always possible so a search system can be helpful.

Description of the solution

I've developed a class which implements a simple query language. The syntax is presented below by some elementary examples and their interpretation:

`abc cde fgh`	The character sequences "abc", "cde" and "fgh" must be all present in the string, however their order doesn't matter (so cdeabcfgh is a valid sequence for this rule). Every blank character is considered a separator (actually I use the `Char.IsSeparator` method to determine if a character is separator)
`"abc cde"""`	The sequence abc cde" must be present in the string. Observe that in order to include the quote character ("), you must double it. Besides of grouping words the quote character is also useful in inserting special characters ( (, ), [, ]).
`[a b c]`	The string must contain at least one of the elements (OR).
`(a b c)`	The string must contain all the elements (the order is not important!)
`[(a b) (c "[")]`	The parenthesizes can be nested. The expression means: the string either contains a AND b (the order is not important) OR it contains c and [ (observe how the special character [ needs to be between quotes)

A more formal description of the syntax would be:

queryString ::= <queryPart>[<empty characters><queryString>] queryPart ::= <keyword>|<orComposition>|<andComposition> orComposition ::= [<queryPart>{<empty characters><queryPart>}] andComposition ::= (<queryPart>{<empty characters><queryPart>}) emptyCharacters ::= as defined by the Char.IsSeparator method keyword ::= "<characters>"|<non empty, non special characters>

This syntax may appear a little strange and has the drawback that it's non-standard. Because of this it might require a little getting used to from the users perspective. I don't recommend using it as the default search language on a website, however it can be presented as an alternative (for advanced searching for example). Also, I found it to be a little more obvious than the normal approach (if you write "Anna and Betty", what do you want to search for: for the exact string or for a string containing both words?) and it's simpler to implement.

To search using this code, you must do the following simple steps:

Create a search tree and remember its root
Dim x As SearchTreeNode = SearchTreeNode.ContructSearchTree("[a b] [c d]")
Now call the .MatchesString(string to match, root) function for each element you want to verify. It returns true or false depending whether the supplied string matches the rule or not.

Implementation details

The SearchTreeNode is a node of a multi-way tree which remembers it's first child and next sibling as shown below:

The nodes of the tree can be of three types: AND nodes, OR nodes and KEYWORD nodes (or leafs). An AND node matches a string if all of its children match the string. An OR node matches a string if at least one of its children match the string. A keyword node matches a string if it is contained within the string. For example the following rule is translated to the tree seen below: "[(a b) (c d)] e"

You probably observed the extra level with an AND node. This is done to avoid treating a particular case in the routine and has no impact on the run. The above tree is constructed with the ConstructSearchTree static (shared) method. This method throws an exception if it's unable to construct the tree. The message of the exception tries to hint to a possible cause of the syntax error.

The MatchesString static method needs a string which it should compare to the rule and the root of the tree. It then performs a recursive walk of the tree using the MatchString private method and returns true or false.

Some notes on the performance

I'm actually quite satisfied with the performance of the code. It can search 10000 strings containing 100 "x" characters with the rule "[a b] [c d]" in 16 milliseconds. This represents the worst case scenario for the leafs (where the actual searching is done using the indexOf method), because no match is found. Also in this case the full tree must be walked, so with a string that actually matches the rule the search could be aborted much sooner. The tests were done on an AMD Sempron 2200+ machine with 256 MB RAM running WinXP SP2.

One possibility to speed up the matching would be to implement a faster string matching algorithm like Boyer-Moore or TurboSearch. The helper tables for these could be generated during the creation of the tree. The problem is that the strings are Unicode encoded and thus a helper table (for the TurboSearch algorithm for example) could reach very big sizes (64K elements). Also there is the fact that the keywords will probably be rather small (5-10 charactes) and the indexOf method performs quite well.

Additional features

There are two methods: one for building the filter part of an SQL query from the tree and one to build a regular expression from it. They are both static (shared) methods and their syntax is as follows: BuildSQLQuery(field name, root element) returns a string that looks like: (<field name> LIKE "%keyword%" AND ... In the keyword all occurrences of the quote character (") are replaced with backslash quote (\").

The second method builds a regular expression from the search tree: BuildRegEx(root element) and returns the regular expression as string. One limitation of this method is that it doesn't generate a regular expression that handles all possible orders or appearances of the supplied argument, just the one defined in the tree. I used this approach because given the limitations of regular expressions a full implementation would generate very long and complicated expressions (one would have to explicitly specify all possible permutations in the expression).

Final notes

I hope that you found this article useful. If you found this article stupid, annoying, incorrect, etc. express this fact by rating the article as you see fit. However I would appreciate if you would leave a comment explaining your reasons so that I can (hopefully) learn from my mistakes.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here