Introduction
Model is an idealization or abstraction of an actual process. Information Retrieval models can describe the computational process.
For example
- how documents
are ranked
- Note that how
documents or indexes are stored is implemented.
The goal of information retrieval (IR) is to
provide users with those documents that will satisfy their information need.
Retrieval models can attempt to describe the human Process, such as the
information need, interaction.
Retrieval model can be categorize as
- Boolean
retrieval model
- Vector space
model
- Probabilistic
model
- Model based on
belief net
The Boolean model of information
retrieval is a classical information retrieval (IR) model and is the first
and most adopted one. It is used by virtually all commercial IR systems
today.
Exact vs Best match
In exact match a query specifies precise criteria. Each document
either matches or fails to match the query. The results retrieved in exact
match is a set of document (without ranking).
In
best match a query describes good or best matching documents. In this case the
result is a ranked list of document.
The Boolean model here I’m going to deal with is the most common exact match
model.
Basic Assumption of Boolean Model
- An index term is
either present(1) or absent(0) in the document
- All index terms
provide equal evidence with respect to information needs.
- Queries are Boolean
combinations of index terms.
- X AND Y: represents doc that contains both X and Y
- X OR Y: represents doc that contains either X or Y
- NOT X: represents the doc that do not contain X
Boolean Queries
Example
User
information need: Interested
to know about Everest and Nepal
User Boolean query: Everest AND
Nepal
Implementation Part
Example of Input collection
Doc1= English
tutorial and fast track
Doc2 = learning latent semantic indexing
Doc3 = Book on semantic indexing
Doc4 = Advance in structure and semantic indexing
Doc5 = Analysis of latent structures
Query problem: advance
and structure AND NOT analysis
Boolean Model Index Construction
First we build the term-document
incidence matrix which represents a list of all the distinct terms and their
presence on each document (incidence vector). If the document contains the term
than incidence vector is 1 otherwise 0.
Terms/doc | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 |
English | 1 | 0 | 0 | 0 | 0 |
Tutorial | 1 | 0 | 0 | 0 | 0 |
Fast | 1 | 0 | 0 | 0 | 0 |
Track | 1 | 0 | 0 | 0 | 0 |
Books | 0 | 1 | 0 | 0 | 0 |
Semantic | 0 | 1 | 1 | 1 | 0 |
Analysis | 0 | 1 | 0 | 0 | 1 |
Learning | 0 | 0 | 1 | 0 | 0 |
Latent | 0 | 0 | 1 | 0 | 1 |
Indexing | 0 | 0 | 1 | 1 | 0 |
Advance | 0 | 0 | 0 | 1 | 0 |
Structures | 0 | 0 | 0 | 1 | 1 |
So now we have 0/1 vector for each term. To answer the query
we take the vectors for
advance,
structure and
analysis, complement
the last, and then do a bitwise AND.
Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | |
0 | 0 | 0 | 1 | 0 | |
0 | 0 | 0 | 1 | 1 | (AND) |
0 | 0 | 0 | 1 | 0 | |
1 | 0 | 1 | 1 | 0 | (NOT analysis) |
0 | 0 | 0 | 1 | 0 | |
| | | Doc4 | | |
Hence Doc4 is retrieved here.
Diving into the code
Document tokenization and indexing
Tokenization is the task of chopping the input
character sequence into pieces, called tokens. Here I have used termsCollection
to store the retrieve tokens from the input
document and HasSet collection type
distinctTerm
to create the index of tokens which ensures that the collection type contains
only distinct terms .
foreach (string file in Directory.EnumerateFiles("F:\\IR\\", "*.txt"))
{
string contents = File.ReadAllText(file);
String[] termsCollection = RemoveStopsWords(contents.ToUpper().Split(' '));
foreach (string term in termsCollection)
{
if (!stopWords.Contains(term))
{
distinctTerm.Add(term);
}
}
documentCollection.Add(file, termsCollection.ToList());
documentContentList.Add(count, contents);
count++;
}
Removing stop words
Most Search Engines do not consider extremely common words in order to save disk space or to speed up search results.
These filtered words are known as 'Stop Words' here I have used generic collection stopWords
which contains a list of stop word, to remove it from the document.
public static string[] RemoveStopsWords(string[] str)
{
List<string> terms = new List<string>();
foreach (string term in str)
{
if (!stopWords.Contains(term))
{
terms.Add(term);
}
}
return terms.ToArray();
}
Term - Document Incidence matrix creation
If a term appears on a document than its term incidence vector is 1 else 0.
The GetTermDocumentIncidenceMatrix
function returns the term document incidence matrix for all distinct terms which are stored on
HashSet
collection. I have used Dictionary collection type termDocumentIncidenceMatrix
to store the term and its incidence vectors on each document.
public static Dictionary<string, List<int>> GetTermDocumentIncidenceMatrix(HashSet<string>
distinctTerms,Dictionary<string,List<string>> documentCollection)
{
Dictionary<string, List<int>> termDocumentIncidenceMatrix =
new Dictionary<string, List<int>>();
List<int> incidenceVector = new List<int>();
foreach (string term in distinctTerms)
{
incidenceVector = new List<int>();
foreach(KeyValuePair<string ,List<string>> p in documentCollection)
{
if (p.Value.Contains(term))
{
incidenceVector.Add(1);
}
else
{
incidenceVector.Add(0);
}
}
termDocumentIncidenceMatrix.Add(term, incidenceVector);
}
return termDocumentIncidenceMatrix;
}
Retrieving incidence vector of individual term
public static List<int> GetTermIncidenceVector(string term)
{
return termDocumentIncidenceMatrix[term.ToUpper()];
}
Query Processor
The string variable bitWiseOp
holds the Boolean operator that exists on the query. Generally the term but is treated as semantically
equivalent with Boolean AND operator. The collection previousTermIncidenceV
and nextTermIncidenceV
is used
to hold the term incidence vector of each term that appears before and after the Boolean operator on the query.
public static List<int> ProcessQuery(string query)
{
string bitWiseOp = string.Empty;
string[] queryTerm = RemoveStopsWords(query.ToUpper().Split(''));
FilterQueryTerm(ref queryTerm);
List<int> previousTermIncidenceV = null;
List<int> nextTermsIncidenceV = null;
List<int> resultSet = null;
Boolean hasPreviousTerm = false;
Boolean hasNotOperation = false;
foreach (string term in queryTerm)
{
if (!booleanOperator.Contains(term)&&!term.Equals("BUT"))
{
if (hasNotOperation)
{
if (hasPreviousTerm)
{
nextTermsIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncidenc eVector(term), nextTermsIncidenceV);
}
else
{
previousTermIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncid enceVector(term), nextTermsIncidenceV);
resultSet = previousTermIncidenceV;
}
hasNotOperation = false;
}
else if (!hasPreviousTerm)
{
previousTermIncidenceV = GetTermIncidenceVector(term);
resultSet = previousTermIncidenceV;
hasPreviousTerm = true;
}
else
{
nextTermsIncidenceV = GetTermIncidenceVector(term);
}
}
else if (term.Equals("NOT"))
{
hasNotOperation = true;
}
else
{
if (term.Equals("BUT"))
{
bitWiseOp = "AND";
}
else
bitWiseOp = term;
}
if (nextTermsIncidenceV != null && !hasNotOperation)
{
resultSet = ProcessBooleanOperator(bitWiseOp,
previousTermIncidenceV, nextT ermsIncidenceV);
previousTermIncidenceV = resultSet;
hasPreviousTerm = true;
nextTermsIncidenceV = null;
}
}
return resultSet;
}
Filter Query Term
private static void FilterQueryTerm(ref string[] str)
{
List<string> _queryTerm = new List<string>();
foreach (string queryTerm in str)
{
if (queryTerm.ToUpper().Equals("BUT") ||termDocumentIncidenceMatrix.ContainsKey(queryTerm.ToUpper()) ||booleanOperator.Contains(queryTerm))
{
_queryTerm.Add(queryTerm);
}
}
str = _queryTerm.ToArray();
}
Processing Boolean operators
public static List<int> ProcessBooleanOperator(string op,
List<int> previousTermV,List<int> nextTermV)
{
List<int> resultSet = new List<int>();
if(op.Equals("NOT"))
{
foreach(int a in previousTermV)
{
if (a == 1)
{
resultSet.Add(0);
}
else
{
resultSet.Add(1);
}
}
}
else if (op.ToUpper().Equals("AND"))
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 1 && nextTermV[a] == 1)
{
resultSet.Add(1);
}
else
{
resultSet.Add(0);
}
}
}
else if (op.ToUpper().Equals("OR"))
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 0 && nextTermV[a] == 0)
{
resultSet.Add(0);
}
else
{
resultSet.Add(1);
}
}
}
return resultSet;
}