Introduction
The article evolves the algorithm described in my previous article "An
algorithm of "phrases similarity calculation" and its usage in "autocompletion"
feature"
The new algorithm described below uses another factor called "popularity factor"
to calculate the "relevancy rank" of the autocompletion list items.
The article has a Demo application with an implementation of the algorithm attached.
The Demo is implemented on C# in form of a Silverlight Application. The Silverlight
application is already deployed on
http://files.rsdn.ru/44022/AutocompleteTextBox2.html; you can work with it there
without the downloading and compiling the attached sources.
Audience
The possible audience of the article are developers that want to implement a
"smart" autocompletion algorithm in their applications. The word "smart" means here
the following: an algorithm that calculates the "relevancy rank" for the matched
"auto-suggestion strings", and orders the found results by decreasing of the "rank".
A good example of such an algorithm is the well-known movies database www.imdb.com
:
As you see on the screenshot above, the www.imdb.com not just displays "all the
results that contain the entered text ‘go’ as a sub-string". It displays the "most
relevant" found results. One of the criterions of the "relevancy" here is popularity
of a found movie/person. Apparently the
www.imdb.com database contains a
lot of movies that have a title starting with "go". But the "The
Godfather" displayed above is a very popular movie. This is why
its "relevancy rank" is considered to be "high enough" to be displayed in the begin
of the "autocompletion list".
Disclaimers and restrictions
It is necessary to read the previous "An algorithm
of "phrases similarity calculation" and its usage in "autocompletion" feature"
article prior to reading the given article.
The Demo attached is not a ready-to-use "control" you can "drag and drop" into
your application. It is rather an example how to implement the algorithm described
in the article.
The attached Demo is a Silverlight application; but of course this approach may
be used in practically any type of GUI applications – on a Web Page (with JavaScript
usage), in a desktop application etc.
The algorithm implementation in the Demo has several "hardcoded" facts that are
specific for English language only (a list of the "2nd class words" like "the").
So the Demo should be used with English queries only. If you want to use the algorithm
for other languages, then populate the list with the "2nd class words" specific
for your language.
Background
In the previous article "An algorithm of "phrases
similarity calculation" and its usage in "autocompletion" feature" I introduced
the algorithm that uses a "degree of similarity
between phrases" to calculate the "Rank" of the autocompletion list items.
In this article I improve the algorithm by introducing another factor that contributes
to the Rank – a "popularity factor". The factor says – "how popular" was
the given "autocompletion list item" among the user(s) of the system. The more "popular"
the item was, the more its Rank is.
A NOTE: the new "popularity factor" does not cancel the "phrases similarity factor"
described in the previous article. No – they both contribute to
the Rank.
Inputs, outputs and usage scenario of the algorithm
The algorithm supposes there is an application with a text field (a "textbox"
etc) in its GUI. The field has a list of "possible values" for the auto-completion
feature. The application users issue some kind of search queries. The attached Demo
is an example of such application that issues the search queries to a database of
movies:
But the nature of the queries does not matter to the algorithm, of course.
When a User of the application starts typing a text in such a field, the algorithm
matches the entered text (we call the text as Query) against the Possible
Values. The algorithm returns a list of the "matched" Possible Values. The list
is sorted in descending order by "relevance". The "relevance" here is a number that
displays a degree of relevance between a Possible Value and given Query.
The application is supposed to display the returned list as the "auto-completion
list" for the given Query:
We suppose the application has some kind of a "persistent storage".
The Storage is used to save the queries issued by the current User. The Storage
can save up to StorageMaxSize
count of the queries per a User. The
StorageMaxSize
value should be large enough to allow the Algorithm
building a "representative sampling of a User’s activity". Let’s say, the
StorageMaxSize
should be 10000 at least.
The word "persistent" in the "persistent storage" above means: the Storage should
save its content after the current User’s session is finished and/or the application
is "closed". The Storage may be implemented as a database, or a plain text file,
or an XML file – it does not matter to the algorithm.
The Main Ideas of the algorithm
Usage of both the "similarity factor" and "popularity factor"
The main idea of the algorithm is the following: it considers BOTH the "similarity
factor" and "popularity factor" while calculating the "relevancy Rank".
The "similarity factor" is exactly the same as one described in the article "An
algorithm of "phrases similarity calculation" and its usage in "autocompletion"
feature".
Let’s explain what the "popularity factor" term means. Suppose a Query is "Sal"
and the algorithm has matched 2 PossibleValues
by the Query: "Sal"
and "Sally". The "Sal" is "more similar" to the Query than the
"Sally" – the previous article explains why.
Let’s also suppose the PossibleValue
"Sal" was issued by a User
only once – say, 1 year ago. Whereas the PossibleValue
"Sally" was
issued by the same User many times – say, 100 times only during last week. So we
can say the PossibleValue
"Sally" is definitely "more popular"
than the PossibleValue
"Sal". This is an example of the "popularity
factor".
And now let’s put these things together. The algorithm takes into account the
both factors; and uses them simultaneously. In other words, the algorithm does not
use a principle like "the similarity factor is primary one and the popularity factor
is considered only when two PossibleValues
have exactly the same similarity
factor". No – none of the factors is a "primary" one. Considering the example above,
the algorithm may decide the PossibleValue
"Sally" should have a Rank
larger than the PossibleValue
"Sal" – because of the "Sally" query
was "much more popular" than the "Sal" was.
"Similarity factor"
Calculation of the number called SimilarityRank
(the number that
shows the "similarity factor") is described in the previous article "An algorithm
of "phrases similarity calculation" and its usage in "autocompletion" feature"
(http://www.codeproject.com/Articles/638280/An-algorithm-of-phrases-similarity-calculation-and)
"Popularity factor"
Calculation of a number called PopularityRank
(the number
that shows the "popularity factor") uses the following ideas:
- The Storage saves up to
StorageMaxSize
(see its definition
above) "usages". A "Usage" is a fact that a Query has been issued by the current
User.
For example:
- At 10:12:23 30-12-2012 the User issues a query "The Dark Knight"
- At 23:59:59 30-12-2012 the User issues a query "The Hangover"
- At 11:15:40 31-12-2012 the User issues a query "The Dark Knight"
again
Then the Storage will contain the following Usages (pairs of
{PossibleValue,
Date}
) ordered by decrease of the Date:
{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}
{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
- While calculating
PopularityRank
for a PossibleValue
:
- We take into account all the
PossibleValue
Usages found
in the Storage.
For example: if we calculate PopularityRank
for a PossibleValue
"The Dark Knight" in the Storage example above, we take into account
the both Usages: the
{PossibleValue: "The Dark Knight", Date: 11:15:40
31-12-2012}
and the
{PossibleValue: "The Dark Knight", Date: 10:12:23
30-12-2012}
- the later Usage Date is, the more the given Usage brings to the
PopularityRank
calculated.
For example: we have 2 usages:
{PossibleValue: "The Dark Knight",
Date: 11:15:40 31-12-2012}
and
{PossibleValue: "The Dark Knight",
Date: 10:12:23 30-12-2012}
. The 1st Usage will bring more to the
PopularityRank
than the second Usage. The reason is: the Date 11:15:40
31-12-2012 is later than the 10:12:23 30-12-2012.
Details of the algorithm
In the previous section I described the main ideas of the algorithm. In this
section I will give the full description.
How the Storage handles the overflows
As we said earlier, the Storage saves up to StorageMaxSize
count
of Usages. But what to do when an "overflow" occurs: a User issues another query
(in other words – new Usage is reported to the Storage), but the Storage already
contains StorageMaxSize
of Usages?
In this situation we remove the oldest Usage – i.e., a Usage with the oldest
Date.
As a consequence of that, the Storage should have some kind of ordering of the
Usages by Date – to effectively implement "the oldest Usage search" operation.
How we look for PossibleValues and calculate their Ranks
Let’s consider how we look for a list of PossibleValues
and their
Ranks for a Query specified:
- Get a list of (
PossibleValue
, SimilarityRank
)
pairs using the previous algorithm "An algorithm of "phrases similarity
calculation" and its usage in "autocompletion" feature". As this is specified
in the previous article, all the SimilarityRank
values in the result
list will be > 0. In other words, PossibleValues that are "not similar at all"
to the Query will not be included into the list. Let’s call the list as
SimilarityResults
. - Calculate
PopularityRank
for every SimilarityResults[i].PossibleValue
found on the previous step. See the details of the calculation below - Now we have
SimilarityRank
and PopularityRank
values for every for every SimilarityResults[i].PossibleValue
.
Calculate the [final] Rank of every SimilarityResults[i].PossibleValue
as the following:
Rank = SimilarityRank * PopularityRank
How we calculate PopularityRank for a list of PossibleValues
The input data here is the list of PossibleValues.
- Define the following values:
Min2NdRank = 1.0
Max2NdRank = 100.0
MinFinalRank = 1.0
MaxFinalRank = 6.0
- Calculate a value called "
1stRank
" for every of the PossibleValues
.
See the details below - Find
min1stRank
– a minimum of all the 1stRank[]
values - Find
max1stRank
– a maximum of all the 1stRank[]
values - For every of the
1stRank[i]
values calculate a value called
"2ndRank
" as the following:
If (min1stRank == max1stRank)
Then
2ndRank[i] = Min2NdRank;
Else
2ndRank[i] = Min(1stRank[i] / min1stRank, Max2NdRank);
- Now we have the list
2ndRank[]
where every 2ndRank[i]
is in [Min2NdRank; Max2NdRank]
range.
Map every 2ndRank[i]
value into [MinFinalRank
,
MaxFinalRank
] range as the following:
delta = (2ndRank[i] - Min2NdRank) / (Max2NdRank - Min2NdRank);
PopularityRank[i] = MinFinalRank + delta*(MaxFinalRank - MinFinalRank);
- Eventually we have the list
PopularityRank[]
where every
PopularityRank[i]
value corresponds to an input PossibleValues[i];
and is in [MinFinalRank
, MaxFinalRank
] range.
How we calculate "1stRank" for a PossibleValue
The input data here is a PossibleValue
and a date/time point called
LatestTime
. The LatestTime
is a maximum of all the
PossibleValues[].Usages[].Date
that are considered.
AN EXAMPLE: suppose we are considering a list of 2 PossibleValues
:
"The Dark Knight" and "The Hangover". 1st one has 2 Usages:
{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
The 2nd one has the only usage:
{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}
Then, the LatestTime
here will be (the same for both the "The Dark
Knight" and "The Hangover" PossibleValues
) the "11:15:40 31-12-2012".
Let’s also define the following constant:
- TimePortion
. This is a time period equal to 7 days (7 x 24 hours)
Then the 1stRank will be calculated as the following (below is a C#-like code):
double 1stRank = 0.0;
for (int i = 0; i < PossibleValue.Usages.Count; ++i)
{
var timeLentgth = LatestTime - PossibleValue.Usages[i].Date;
var timePortionsCount = (int)(timeLentgth.TotalMilliseconds / TimePortion.TotalMilliseconds);
var currUsageRank = 1.0 / (1.0 + timePortionsCount);
1stRank += currUsageRank;
}
The implementation
The Demo attached is a simple implementation of the algorithm on C#.
The folder Similarity comprises all the logic for the Similarity factor.
The folder Popularity comprises all the logic for the Popularity factor.
The class AutocompletionEngine.AutocompletionProvider
joins the
Similarity factor with the Popularity factor and makes the output autocompletion
list.
The Storage in the Popularity folder is, in fact, not "permanent" and exists
only while the Silverlight application is loaded. At all, the Storage is implemented
quite simple - this is just a demo.
Several places in the sources are marked by "TO
BE IMPROVED" comment. These are places where performance could be improved
if this code was used in a real application.
The static list AutocompletionEngine.Processor.Word.ArticlesEtc
contains the hardcoded list of the "second class" words (see the Similarity factor).
If you want the algorithm to work correctly with non-English queries, populate the
list with "second-class words" of your language.
The project UnitTest contains a set of NUnit-based automated tests of the
AutocompletionProvider
class. This is why the project requires
nunit.framework.dll library.
Possible improvements
We may introduce a limit for the count of Usages
per a PossibleValue
in the Storage. Say – to save not more than 500
latest Usages per a certain PossibleValue
. When a PossibleValue
already has 500 Usages stored, and a new Usage arrives – the oldest Usage will be
deleted to give room to the new one.