Introduction
The article describes an algorithm used to build "autocompletion
lists" for a textbox in a GUI:<o:p>
The article has the Demo application attached with an implementation of
the algorithm. The Demo is implemented on C# as a Silverlight Application. The
Silverlight application is already deployed on http://files.rsdn.ru/44022/AutocompleteTextBox.html; you can try it
right now without downloading and compiling the attached sources.
Audience
The possible audience of the article are developers that want to
implement a "smart" autocompletion algorithm in their applications. The word
"smart" means here the following: an algorithm that calculates the "relevancy
rank" for the matched "auto-suggestion strings", and orders found results by decreasing of the
"rank".
Disclaimers and
restrictions
The Demo attached is not a ready-to-use "control" you can
"drag and drop" into your application. It is rather an example how to implement
the algorithm described in the article.
The attached Demo is a Silverlight application; but of course this approach may be used in practically any type
of GUI applications – on a Web Page (with JavaScript
usage), in a desktop application etc.
The algorithm implementation in the Demo has several "hardcoded" facts
that are specific for English language only (a list of the "2nd
class words" like "the"). So the Demo
should be used with English queries only. If you want to use the algorithm for
other languages, then populate the
list with the "2nd class words" specific for your language.<o:p>
Background
The idea of the "autocompletion" may be defined as "to suggest an
"autocompletion list" when a user is typing a text in a text control". This technique has become a
de facto standard to the present
time. Below are the screenshots of several well-known applications with this
feature:
And so on and so forth.<o:p>
Microsoft, for example, recommends
the Autocompletion use in its "Designing great productivity apps for Windows"
article (http://msdn.microsoft.com/en-us/library/windows/apps/hh868273.aspx):<o:p>
The Autocompletion feature has been added into various GUI controls in
various toolkits. For example, .NET Silverlight has a control called AutoCompleteBox (http://msdn.microsoft.com/en-us/library/system.windows.controls.autocompletebox(v=vs.95).aspx). .NET WPF Toolkit contains its own AutoCompleteBox too (http://iserialized.com/using-the-autocompletebox-in-the-wpf-toolkit/). These are good controls; but their "matching
filters" are very simple. I.e.
the filters that "match" the
entered text against the "array of possible values" and make the
"autocompletion list". These filters suggest the following "matching rules":
Contains, Starts With, Equals (with case-sensitivity or case-insensitivity);
and this is all. And these filters, of course, do not try to order the matched
values "by decreasing of relevance".
This is why I tried to implement an algorithm that is "smarter" than the ones used
in typical
"autocompletion controls".
Inputs, outputs and usage scenario of the algorithm
The algorithm supposes there is an application with a text
field (a "textbox" etc). The field has a list of "possible values" for the
auto-completion feature.
The attached Demo is an example of such application. It
issues search queries to a database of movies:
But the nature of the queries does not matter for the algorithm, of course.
When a User of the application starts typing a text in
such a field, the algorithm matches
the entered text (we call the text as Query)
against the Possible Values. The algorithm returns a
list of the "matched" Possible Values. The list is sorted
in descending order by "relevance".
The "relevance" here is a degree of relevance between a Possible Value and given Query.
The application is supposed to display the returned list
as the "auto-completion list" for the given Query:
The Main Ideas
of the algorithm
The "similarity factor"
The main idea of the algorithm
is usage of what we call "similarity
factor".
In other words, the
algorithm considers the following question: "how similar" are given Possible Value and Query to each other? The algorithm calculates a
number called SimilarityRank than
represents a "degree of similarity" between them. The output list of the
algorithm is ordered by the decreasing SimilarityRank. Thus, a SimilarityRank value calculated for a
(PossibleValue,
Query) pair represents "relevance"
of the PossibleValue to the Query.
Of course, the Possible Values that "are not
similar at all" to given Query will not be included into the output list.
An example:
On the picture
above we see that the algorithm has "matched" several of Possible Values
against the entered Query "st". The
value "Streets" has largest SimilarityRank and is at 1st position of
the output list. The value "Streets of Fire" has lower SimilarityRank and is at
2nd position of the output list; and so on.<o:p>
How the SimilarityRank is calculated
The main ideas used for the SimilarityRank calculation are the following:
- The
string comparison (used for the "matching")
is case-insensitive
- The
"matching" of a PossibleValue against a Query is
a "word-wise" operation. In other words, both the PossibleValue and Query will be "split" into words.
We suppose a Possible
Value "matches" against
a Query when all the Query’s words are
found in the Possible
Value, in the same order.
The "found in the Possible
Value" means: a Query’s
word should be equal to a Possible Query word; or to be a prefix of a Possible Value.
An example: suppose we have a Query "leading
spaces". Then a PossibleValue "the leading and trailing Spaces" matches the Query. But a PossibleValue "spaces that are leading or trailing" does not
match the "leading spaces" – both the
Query’s words are found in the Possible Value; but not in the proper order.
Another example: suppose we have a
Query "lead space". Then a PossibleValue
"the leading and trailing spaces" matches the Query. But a PossibleValue "cheerleaders and spaces" does not match the Query – the word "lead"
is found in the word "cheerleaders",
but not in the very begin, so it is not a "prefix" of the word "cheerleaders".
A "not matched" Possible Value will
not be included into the output list of the algorithm.
- For every
pair of a Query word and its
"matching" PossibleValue word we use the following rule: the closer are their lengths to each other, the more "similar" are the
words.
An example:
suppose we have a PossibleValue word
"relativeness" and a Query word "rel". They match each other, all right. But
the "degree of similarity" is not very high: length of the
"rel" is only 25%
of the "relativeness" length.
Now let’s suppose the Query word is
"relative" instead of the "rel". The "degree of similarity" between the "relative"
and "relativeness" is essentially
higher than the "degree of similarity" between the "rel" and "relativeness". Length of the "relative"
is 66% of the "relativeness" length.
We call this as "words length
factor".
- We increase the SimilarityRank
for every pair of a Query word and its
"matching" PossibleValue word if the following situation is met:
- a Query word
contains capital letters;
and
- the
occurrence of the Query word within
the Possible Value is exactly
(case-sensitively) equal to the Query word;
An example: suppose we have a Query "Main" and two
Possible Values: "Maine" and "maine".
They both math the Query; but SimilarityRank of the "Maine"
is higher. The reason is: the Query "Main"
contains a capital letter. And the substring "Main" that is an occurrence of the Query in the "Maine" is exactly (case-sensitively)
equal to the Query; whereas the substring "main"
from the second PossibleValue is not.
We call this as "capital letters factor".
- For every pair of a Query word and its "matching" PossibleValue word
the SimilarityRank depends of a "word
class" the Possible Value word belongs to.
If the PossibleValue
word belongs to "1st class", the SimilarityRank is higher. If the
Possible Value word belongs to "2nd class", the SimilarityRank is
lower.
The "class" of a word is defined by
the following rule (for English only; other languages may require other rules):
the "2nd class" consists of
words like articles, prepositions etc. These words are "less important"
than others. In some search
engines they are called "stop words" and are ignored at all. Our algorithm does
not ignore them completely; but considers them as "less important ones". And
the "1st class" is all the other words except the "2nd
class" ones.
In the current algorithm implementation
the following words belong to the "2nd class": "the",
"a", "at", "in", "on", "of", "off",
"into", "onto", "by".
Of course, the list may be extended.
We call this as "word class factor".
- The closer is
position of a "matched" a PossibleValue word
to begin of the phrase, the higher is the SimilarityRank.
An example:
suppose we have a Query "green" and 2
Possible Values: "green light" and "light green". They both math the Query; but SimilarityRank of the "green
light" is higher. The reason is: the word "green" is located in 1st
position in the "green light" and
only in 2nd position in the "light green".
We call this as "word position factor".
- The closer is
length of a Query to length of a "matched" PossibleValue, the higher is the SimilarityRank.
We call this as "phrase length factor".
An example:
suppose we have a Query "green" and 2
Possible Values: "green light" and "green light in the window tonight". They both
math the Query; but SimilarityRank
of the "green light" is higher. The reason is:
length of the Query "green" is ~50% of length of the "green light"; and is only ~15% of the "green light in the window tonight".
Details of the algorithm
In the previous section I described the main ideas
of the algorithm. In this section I will give the full description – how a numeric
value SimilarityRank is
calculated for a pair (Query, PossibleValue).
I recommend you to look at the attached sources while you
are reading the description below.
- Split
both the Query and PossibleValue on
separate words using the following separators: {space}, {tabulation}, !, ., {comma}, ;, (, ), \, /, +, -, :, ",
[, ], ?, {, }, |, {long dash}, {short dash}.
- Find
all the occurrences of the Query in the PossibleValue. An "occurrence"
happens when:
- every word of a Query is equal to a PossibleValue word; or is a prefix of a PossibleValue word;
and
- the matched
PossibleValue words are situated in the same order as their corresponding Query words
The comparison of the words
is performed case-insensitively.
An example: suppose we have PossibleValue "Aa b c a bb" and Query "a b". The PossibleValue
has two "occurrences" of the "a b": "Aa
b c a bb" and "Aa b c a bb".
An example: suppose we have PossibleValue = "a b" and Query = "b a". The PossibleValue
has no "occurrences" of the "b a", Indeed, the PossibleValue
contains the "a" and "b" words, but not in the order the Query contains them.
- If the PossibleValue
has no occurrences of the Query, the SimilarityRank is 0. Exit the algorithm.
- Calculate a value
called PhraseLengthFactor for the given (Query,
PossibleValue) pair. See below
for the details of the calculation.
- Calculate a value called SimilarityRankOfTheOccurence for each
occurrence in the following way:
5.1. Calculate a value called WordsSimilarityRank for every pair of
(<word of Query>, <word of PossibleValue that matches the word of
Query>). It shows a "degree of
similarity" between the words.
An example: suppose we
have PossibleValue "aa
bbb c d" and Query = "a b". Then we calculate a value WordsSimilarityRank for the
pair ("a", "aa"); and another value WordsSimilarityRank for the pair ("b", "bbb");
See below for the details of the
calculation.
5.2. Calculate a value called WordPositionFactor for every pair of
(<word of Query>, <word of PossibleValue that matches the word of
Query>).
An example: suppose we
have PossibleValue = "aa
bbb c d" and Query = "a b". Then we calculate a value WordPositionFactor for the
pair ("a", "aa"); and another value WordPositionFactor for the pair ("b", "bbb").
See below for the details of the
calculation.
5.3. Calculate the SimilarityRankOfTheOccurence
of the given occurrence as the following:
SimilarityRankOfTheOccurence = WordsSimilarityRank * WordPositionFactor * PhraseLengthFactor
- Now we have a value
SimilarityRankOfTheOccurence calculated for each of the "occurrences" in the
given (Query, PossibleValue) pair. Find the largest one among all the SimilarityRankOfTheOccurence
values. This will be the SimilarityRank of the (Query,
PossibleValue) pair.
How the PhraseLengthFactor is calculated
On the input here we have a pair (Query, PossibleValue). They are already "split" to separate words; and
can be represented as arrays of words.
We calculate the PhraseLengthFactor as the following (a code below is
C#):
double AddendForWordWeightCalculation = 10.0;
double MaxQueryRelativeWeight = 1.0;
double MinQueryRelativeWeight = 0.5;
double weightOfQuery = 0.0;
for (int i = 0; i < Query.Words.Length; ++i)
{
weightOfQuery += Query.Words[i].Length + AddendForWordWeightCalculation;
}
double weightOfPossibleValue = 0.0;
for (int i = 0; i < PossibleValue.Words.Length; ++i)
{
weightOfPossibleValue += PossibleValue.Words[i].Length + AddendForWordWeightCalculation;
}
double prevVal = weightOfQuery / weightOfPossibleValue;
double phraseLengthFactor = MinQueryRelativeWeight + prevVal * (MaxQueryRelativeWeight - MinQueryRelativeWeight);
The resulting PhraseLengthFactor in the given algorithm implementation will be in the [0.5, 1.0] range. You can change the range if you want to increase or decrease the influence of the PhraseLengthFactor on the final SimilarityRank. For example, if you use MaxQueryRelativeWeight = 2.0 instead of the current 1.0, the influence of the PhraseLengthFactor will be higher.
How the WordsSimilarityRank is calculated
On the input here we have a pair of words
(WordOfPossibleValue, WordOfQuery). The WordOfQuery is a word of the given
Query; the WordOfPossibleValue is a word of the given PossibleValue; and this
word "matches" the WordOfQuery. An example: a WordOfQuery is"th" and a "matching"
WordOfPossibleValue word is "That".
As we stated above, a WordOfQuery must be equal to WordOfPossibleValue OR to be
its prefix to "match" it. The comparison here is case-insensitive.
We calculate
the WordsSimilarityRank as the following:
Define a constant called IncreasingForUppercases = 1.1<o:p>
Define a constant called DecreasingFor2ndClassWord = 0.2
- WordsSimilarityRank = WordOfQuery.Length / PossibleValue.Length
- IF the WordOfQuery contain uppercase letter(s)
AND
the WordOfQuery is equal to WordOfPossibleValue OR is its prefix (in case-sensitive manner) THEN
WordsSimilarityRank = WordsSimilarityRank * IncreasingForUppercases
- IF the WordOfPossibleValue belongs to the "2nd class" (see above about the "word classes")
THEN
WordsSimilarityRank = WordsSimilarityRank * DecreasingFor2ndClassWord
NOTE:
The "WordsSimilarityRank = WordOfQuery.Length / PossibleValue.Length"
formula implements what we called above as the "words length
factor".
The IncreasingForUppercases
here implements what we called above as the "capital
letters factor".
The DecreasingFor2ndClassWord here implements what we called above as the "word class factor".
How the WordPositionFactor is calculated
On the input here we have an integer value PositionInPossibleValue. This is a (zero-based)
word-wise position of a Query word in a "matched" PossibleValue. For example,
we have a Query "th" and a PossibleValue "Color of the night". The word "th" is found in the 2nd (using
zero-based numbering) word of the PossibleValue; so the PositionInPossibleValue
is 2 for the given occurrence of the Query.
We calculate the WordsSimilarityRank as the
following:
Define a constant called WordPositionFactorBonusFor1stWord = 2.0
Define a constant called WordPositionFactorAddendForCalculation = 10.0
Define a constant called WordPositionFactorMinValue = 0.3;
- WordPositionFactor = WordPositionFactorAddendForCalculation / (WordPositionFactorAddendForCalculation + positionInPossibleValue)
- IF positionInPossibleValue is 0
THEN
WordPositionFactor = WordPositionFactor * WordPositionFactorBonusFor1stWord
- IF WordPositionFactor < WordPositionFactorMinValue
THEN
WordPositionFactor = WordPositionFactorMinValue
NOTE: The WordPositionFactorBonusFor1stWord plays a role of a "bonus" when the Query word is found in the very begin of a PossibleValue. Of course, a WordPositionFactorBonusFor1stWord value must be more than 1.0.
The value WordPositionFactorBonusFor1stWord = 2.0
used in the given implementation is not a "magic number" that cannot be
changed. You can, for example, increase it if you want the "bonus for a
position in the very begin of the PossibleValue" be more
essential.
The WordPositionFactorMinValue
must be a positive number < 1.0
The implementation
The Demo attached
is a simple implementation of the algorithm on C#. The class AutocompletionEngine.TextSearcher.Processor
comprises all the logic of the algorithm. Create an object of the class,
populate it with the PossibleValues (using the AddPossibleValue method) and then call the method Search with a Query as its argument. The
method returns the "resulting list" of the algorithm – a list of "matched"
PossibleValues ordered by decreasing of their SimilarityRank.
The static list AutocompletionEngine.TextSearcher.Word.ArticlesEtc
contains the hardcoded list of the "second class" words. If you want the
algorithm to work correctly with non-English queries, populate the list with
"second-class words" of your language.
The project UnitTest contains
a set of NUnit-based automated tests of the AutocompletionEngine.TextSearcher.Processor
class. This is why the project requires nunit.framework.dll
library.