Introduction
This article should help you to get into writing extensions for Strus. Extensions can be separately built and installed as dynamically loadable modules (shared objects). The goal of this article is to enable you to implement, build and install extensions for Strus on your own.
We introduce the interfaces for operations on sets of postings, weighting functions and summarization and show an example module implementing one of each type of extension. The possibility to load an alternative key/value store database implementation as module is not discussed in this article.
Background
Strus is a set of libraries and tools to build scalable search engines. For tutorials that show how to build a search engine with Strus have a look at my previous code project articles.
Remark
The sources in this article do not implement all methods of the interface. The examples are outdated but good enough as documentation. The sources to download are up-to-date though.
Table of Contents
- Description of the example problem to solve
- Error handling (exceptions) in Strus
- Posting join operators
- Posting join operator implementation
- Weighting functions and their implementation
- Summarizers and their implementation
- The assembled module
- Build & installation
- Resume
Glossary
- Posting: For each term, we have a list that records which documents the term occurs in. Each item in the list - which records that a term appeared in a document (and, later, often, the positions in the document) - is conventionally called a posting. [nlp.stanford.edu]
- Feature: Information extracted from an object and used during query processing. [people.ischool.berkeley.edu]
- Forward index: The forward index stores a list of words for each document [en.wikipedia.org].
Prerequisites
For compiling and running the sources explained here, you need either to install the strus packages strusBase,strus,strusAnalyzer and strusModule or to build the software on your own. The dependency to strusAnalyzer is due to the fact that strusModule is dependent on it. Analyzer modules are not subject of this article. There will be another article introducing strus analyzer modules.
For cloning the project or downloading the latest sources, go to strusBase, strus, strusAnalyzer, strusModule
You need gcc or clang and cmake for building the software. The current build status of the master branch of the strus projects can be inpsected here.
Example problem to solve
The example join operator implemented here is finding areas of a given size (range of positions) in documents containing at least a subset of given size of all argument features. The example weighting function is returning the inverse of the minimum size of such a window. The example summarizer is returning all forward index features inside of this window.
The presented example functions are not part of the Strus core. They are also not considered to be worth to get into the core, because ranking of documents based on weighting with the minimal window size is not stable. Not stable in this context means that extending a query with new query features shuffles preceding result lists instead of just adding weight to documents matching the new features. This makes search as a process of query refinement difficult for the user. I have chosen these example functions because they are relatively complex, but easy to understand on the other hand.
Example illustrated
This picture marks a window of size 7 containing the two terms "building" and "Dubai":
Error handling
For writing extension modules for Strus, you have to understand how errors are handled in Strus. Unfortunately for C++ there exists no portable way of throwing exceptions across module boundaries. The only way to have exceptions thrown across module boundaries is to compile all sources involved with the same compiler and with all symbols exported. [See C++ Coding Standards: 101 Rules, Guidelines, and Best Practices, 62. Don't allow exceptions to propagate across module boundaries] But this would impede the deployment of the Strus libraries and modules as binary artifacts. That's why interfaces in Strus are exception-free.
Strus approaches the problem with an interface for reporting errors that is passed down to all objects implementing interfaces that might be called accross module boundaries. This potentially affects any interface of Strus. Therefore any method of an interface has to catch any possible exception thrown and report the error message of the exception via the ErrorBufferInterface declared on object construction. Because all objects of Strus implementing an interface are constructed by another interface object with a create method starting with the root object interface StorageObjectBuilderInterface, the error buffer interface does not appear in Strus interfaces. But you have to be aware of it and pass it down to all objects created by your module. An interface method has to define the try/catch frame in the method and the moving of the exception content to the error buffer object. This is usually done by preprocessor macros, that help to reduce boilerplate code and improve readability. Keep in mind that any code that potentially allocates memory might throw an exception that has to be caught.
The following code snippet shows the definition of 2 macros, one for methods returning a value and one for methods not returning a value:
#define CATCH_ERROR_MAP_RETURN( HND, VALUE)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in ...");\
return VALUE;\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "error in ...: %s", err.what());\
return VALUE;\
}\
#define CATCH_ERROR_MAP( HND)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in window iterator");\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "error in ...: %s", err.what());\
}\
And here we have an example method implementation assuming m_errorhnd
as pointer to the ErrorBufferInterface of the class.
virtual Index skipPos( const Index& posno_)
{
try
{
... do something that might throw a bad alloc or runtime error
return <RESULT>;
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0)
}
You also have to be aware of inspecting the error buffer after calling an interface function when it is needed. Sometimes it is not important to detect an error immediately. It is considered as best practice to check for errors when calling methods that return a value and in methods that have to check for error to ensure consistency (e.g. a transaction commit). To allow late detection of errors helps to avoid network roundtrips when calling methods via a remote interface.
Internationalization
The Strus core modules use gettext for internationalization (error messages in a specific language). This aspect is not handled in this example module. Have a look at the core modules to get an idea of how internationalization could be handled.
Posting join operators
We start with the posting join operator implementation. This is the interface most difficult to implement. There are rare cases, where you really want to implement one on your own. But it is good to read the following section to understand what posting join operators are.
Basic model
The posting join operators implement the set operations on postings. Sets of postings are defined as
where d represents an internal document number and p a term position.
A posting join operator is a set operation on a list of sub expression posting sets. The set operation can be parameterized with two additional arguments:
- The range, specifying a proximity condition of the positions of the argument sets involved
- The cardinality, defining the cardinality of a selection of arguments needed to build a result element.
The parameters range and cardinality are used for a more intiutive construction of complex expressions.
The basic structure of the operations is the Boolean algebra of the set of postings with two additional unary operators. One is the predecessor set:
The other is the successor set:
The predecessor and the successor set construction are used to relate productions of a regular language with terms as alphabet to set operations:
But expressing everything in a language having only the basic regular productions for expressing patterns would be very cumbersome. The range parameter can and should only be used for defining higher level expressions with a more intiutive notion of proximity. Like for example "return me the occurrencies of three terms appearing in the same sentence", where the sentence structure is constructed with a delimiter represented as term and thus as set of postings.
The cardinality parameter on the other hand helps to construct patterns that match for any selection of a given size of the argument terms or expressions. The number of possible selections of subsets tends to grow fast while the result set they could produce may be calculated in a more efficient way.
If you define the parameters range and cardinality for your posting join operations, you should keep this relation to the basic structure in mind. This strict definition of the meaning of these parameters is also the reason why the parameters are fixed positional parameters of the operator constructor.
Representation of the model
The sets of postings are not represented as data, but as iterator with skip operations to the least upperbound document number or position. The two skip operations provided are named skipDoc
and skipPos
.
If you are familiar with Lucene, you will recognice these least upperbound skip operations as methods named advance there.
The fetching of the next element after (d,p) in the set of postings would be implemented as follows (PsuedoCode):
FUNCTION next( d, p )
p = skipPos(p+1)
while (!p) do
d = skipDoc(++d)
if (!d) then
return NULL,NULL
else
p = skipPos(0)
end
end
return d,p
end
But this way of iterating through the set of postings is rarely used. In most cases you need to advance to a documents of interest. Then, in the context of that document, you need the skipPos operation to iterate on the elements.
An posting iterator is implemented as interface that either fetches the postings associated with a term in the storage or an object that implements its upperbound skip operations as function on the argument posting iterators.
The following pseudocode illustrates the implementation of the skipDoc
and skipPos
methods of an iterator representing the intersection of two posting sets arg1
and arg2
. We use a helper method skipDocCandidate
, that does an intersection on the document numbers only, thus returning a candidate that might not contain a valid match. We then use skipPos
to ensure that skipDoc
returns a document number referring to a non empty set of postings.
FUNCTION skipDocCandidate( self, d )
d1 = self->arg1->skipDocCandidate( d)
d2 = self->arg2->skipDocCandidate( d)
while (d1 != d2 && d1 && d2) do
if (d1 < d2) then
d1 = self->arg1->skipDocCandidate( d2 )
else
d2 = self->arg2->skipDocCandidate( d1 )
end
end
return d1
end
FUNCTION skipPos( self, p )
p1 = self->arg1->skipPos( p)
p2 = self->arg2->skipPos( p)
while (p1 != p2 && p1 && p2) do
if (p1 < p2) then
p1 = self->arg1->skipPos( p2 )
else
p2 = self->arg2->skipPos( p1 )
end
end
return p1
end
FUNCTION skipDoc( self, d )
d = skipDocCandidate( self, d)
while (d && !skipPos( self, 0) do
d = skipDocCandidate( self, d+1 )
end
end
The method skipDoc
iterates on candidate documents that have both matches or arg1
and arg2
until it finds a document containing a match with identical positions.
C++ interface
Posting iterator
Iterators on postings are represented as iterator interface with the methods to advance to the least upperbound of the passed argument as already introduced in the previous section.
For the document number:
Index skipDoc( const Index& docno)
and for the position
Index skipPos( const Index& posno)
and the helper for the document candidates
Index skipDocCandidate( const Index& docno)
The method skipDocCanditate
is a function that does not have to return a valid result. What it must not do is to miss a result. It is used to reduce the number of inspected positions and thus reducing reads on data blocks containing position information. If you for example implement an operator that returns occurrences of subsequent terms, you can return the intersection of the skipDocCandidate
results of the arguments as result of your skipDocCandidate
. This can lead to a significant reduction of position block reads. Your skipDoc
implementation that has to return valid results on the other hand, can call its own skipDocCandidate
first, before inspecting the positions to ensure a valid result.
The method skipDocCanditate
is only used internally by iterators for the implementation of skipDoc
. The resulting iterator on the set of postings is implemented with alernating calls of skipDoc
and skipPos
.
Other methods to implement are
const char* featureid() const;
that returns a unique id of the iterator used for debugging and tracing.
Index documentFrequency() const;
that returns the document frequency or an upper bound estimate of it. An upper bound estimate is returned in the case a full table scan on the position blocks would be needed to calculate the real value.
unsigned int frequency();
that returns the number of postings in the current document.
Index docno() const;
that returns the document number of the currently visited document.
Index posno() const;
that returns the position of the currently visited posting.
Index length() const;
that returns the length the currently visited posting. The length is not part of the model but useful for post processing of matches as for example markup of matches. It is not considered in expressions like sequence. This is counterintuitive but still not considered because it cannot be modeled with pure set operations.
You also need to implement a constructor taking the parameters and the error buffer interface as arguments
MyPostingIterator(
const std::vector< strus::Reference< strus::PostingIteratorInterface> >& args,
unsigned int range_,
unsigned int cardinality_,
strus::ErrorBufferInterface* errorhnd_)
Posting join operator
Posting join operators are represented as posting join operator interface implementing a virtual constructor method for creating a posting iterator as iterator on the operation result. You can create an iterator on postings from a list of sub expression posting iterators, the range (proximity) and the cardinality (selector for permutations on input subsets) argument. Passing 0 as cardinality means the default, that all arguments are needed to build a result element. The create method looks as follows:
PostingIteratorInterface* createResultIterator(
const std::vector<Reference<PostingIteratorInterface> >& argitrs,
int range,
unsigned int cardinality) const;
Additionaly the exists a method for getting the description of the operator for introspection:
Description getDescription() const;
The returned description object has a method
const std::string& text() const
to get the description string of this join operator.
Example implementation
The following example implements the posting join operator and its result iterator that returns the start of a window within a maximum range (proximity) containing at least the number of argument postings specified by the cardinality argument. First we define some helper functions:
Macros for error handling
We already introduced the error handling and exception-free interfaces of Strus in a previous section. The following macro is a helper for boilerplate code catching exceptions and reporting them via an error buffer interface, returning an undefined value. The caller of the function can check for a failed operation by inspecting
the ErrorBufferInterface passed to the object (ErrorBufferInterface::hasError()
):
#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in window iterator");\
return VALUE;\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "error in minwin window iterator %s: %s", MSG, err.what());\
return VALUE;\
}
Procedure to get the first/next document match
The following function tries to find a subset of args with a size of at least cardinality with each element having the same upper bound of docno
as match. It returns the smallest fit of such a least upper bound of docno
as result.
With allowEmpty
it is specified, if the document returned should only be a candidate element, that may result in an empty set, or if it should only return documents with real matches.
static Index getFirstAllMatchDocnoSubset(
std::vector<Reference< PostingIteratorInterface> >& args,
Index docno,
bool allowEmpty,
std::size_t cardinality,
std::vector<PostingIteratorInterface*>& candidate_set)
{
if (args.empty()) return 0;
candidate_set.clear();
Index docno_iter = docno;
for (;;)
{
std::vector<Reference< PostingIteratorInterface> >::iterator
ai = args.begin(), ae = args.end();
std::size_t nof_matches = 0;
Index match_docno = 0;
for (; ai != ae; ++ai)
{
Index docno_next = (*ai)->skipDocCandidate( docno_iter);
if (docno_next)
{
if (match_docno)
{
if (match_docno == docno_next)
{
candidate_set.push_back( ai->get());
++nof_matches;
}
else if (match_docno > docno_next)
{
candidate_set.clear();
candidate_set.push_back( ai->get());
match_docno = docno_next;
nof_matches = 1;
}
}
else
{
candidate_set.clear();
candidate_set.push_back( ai->get());
match_docno = docno_next;
nof_matches = 1;
}
}
}
if (nof_matches >= cardinality)
{
if (!allowEmpty)
{
std::vector<PostingIteratorInterface*>::iterator
ci = candidate_set.begin(),
ce = candidate_set.end();
while (ci != ce)
{
if (match_docno != (*ci)->skipDoc( match_docno))
{
--nof_matches;
if (nof_matches < cardinality) break;
candidate_set.erase( ci);
}
else
{
++ci;
}
}
if (nof_matches < cardinality)
{
docno_iter = match_docno+1;
candidate_set.clear();
continue;
}
}
return match_docno;
}
else if (match_docno)
{
docno_iter = match_docno+1;
}
else
{
break;
}
}
return 0;
}
Position window
The following helper class implements an iterator on windows of a given maximum size that contain a subset of a specified maximum size of the argument features. The class called PositionWindow
is used in all functions exported by the module presented in this article. So it will be reused for the weighting function and the summarizer implementation.
Header file
#ifndef _STRUS_POSITION_WINDOW_HPP_INCLUDED
#define _STRUS_POSITION_WINDOW_HPP_INCLUDED
#include "strus/index.hpp"
#include "strus/postingIteratorInterface.hpp"
#include <set>
namespace strus {
class PositionWindow
{
public:
struct Element
{
PostingIteratorInterface* itr;
Index pos;
Element()
:itr(0),pos(0){}
Element( PostingIteratorInterface* itr_, Index pos_)
:itr(itr_),pos(pos_){}
Element( const Element& o)
:itr(o.itr),pos(o.pos){}
bool operator < ( const Element& o) const
{
if (pos == o.pos)
{
return itr < o.itr;
}
return pos < o.pos;
}
};
PositionWindow(
const std::vector<PostingIteratorInterface*>& args,
unsigned int range_,
unsigned int cardinality_,
Index firstpos_);
bool first();
bool next();
unsigned int size() const;
unsigned int pos() const;
private:
std::set<Element>::iterator getWinTopElement();
private:
std::vector<PostingIteratorInterface*> m_args;
std::set<Element> m_set;
unsigned int m_setsize;
unsigned int m_range;
unsigned int m_cardinality;
};
}
#endif
Source file
#include "positionWindow.hpp"
#include "strus/postingJoinOperatorInterface.hpp"
#include "strus/postingIteratorInterface.hpp"
#include "strus/errorBufferInterface.hpp"
#include <cstdio>
using namespace strus;
PositionWindow::PositionWindow(
const std::vector<PostingIteratorInterface*>& args,
unsigned int range_,
unsigned int cardinality_,
Index firstpos_)
:m_args(args)
,m_setsize(0)
,m_range(range_)
,m_cardinality(cardinality_>0?cardinality_:args.size())
{
std::vector<PostingIteratorInterface*>::iterator
ai = m_args.begin(), ae = m_args.end();
for (; ai != ae; ++ai)
{
PostingIteratorInterface* itr = *ai;
Index pos = itr->skipPos( firstpos_);
if (pos)
{
m_set.insert( Element( itr, pos));
++m_setsize;
}
}
}
std::set<PositionWindow::Element>::iterator PositionWindow::getWinTopElement()
{
if (m_cardinality == 0) return m_set.end();
unsigned int ii=0;
std::set<Element>::iterator
si = m_set.begin(), se = m_set.end();
for (; ii<m_cardinality && si != se; ++ii,++si){}
return --si;
}
bool PositionWindow::first()
{
return m_setsize >= m_cardinality;
}
bool PositionWindow::next()
{
PostingIteratorInterface* itr = m_set.begin()->itr;
std::set<Element>::iterator top = getWinTopElement();
Index posdiff = top->pos - m_set.begin()->pos;
Index skipsize = posdiff > (Index)m_range ? (posdiff - (Index)m_range) : 1;
Index pos = m_set.begin()->pos;
m_set.erase( m_set.begin());
Index newpos = itr->skipPos( pos + skipsize);
if (newpos)
{
m_set.insert( Element( itr, newpos));
}
else
{
--m_setsize;
}
return m_setsize >= m_cardinality;
}
unsigned int PositionWindow::size() const
{
std::set<Element>::iterator si = getWinTopElement();
return si->pos - m_set.begin()->pos;
}
unsigned int PositionWindow::pos() const
{
return (m_setsize >= m_cardinality)?m_set.begin()->pos:0;
}
Now we have everything to implement the posting iterator for the window and its join operator:
Posting iterator
The following class implements the iterator on postings created from the posting sets join operator implemented in this module.
class WindowPostingIterator
:public PostingIteratorInterface
{
public:
WindowPostingIterator(
const std::vector<Reference< PostingIteratorInterface> >& args,
unsigned int range_,
unsigned int cardinality_,
ErrorBufferInterface* errorhnd_)
:m_docno(0)
,m_posno(0)
,m_argar(args)
,m_documentFrequency(-1)
,m_range(range_)
,m_cardinality(cardinality_)
,m_errorhnd(errorhnd_)
{
std::vector<Reference< PostingIteratorInterface> >::const_iterator
ai = m_argar.begin(), ae = m_argar.end();
for (int aidx=0; ai != ae; ++ai,++aidx)
{
if (aidx) m_featureid.push_back('=');
m_featureid.append( (*ai)->featureid());
}
m_featureid.append( ".win");
}
virtual ~WindowPostingIterator(){}
virtual Index skipDoc( const Index& docno_)
{
try
{
m_docno = getFirstAllMatchDocnoSubset(
m_argar, docno_, false,
m_cardinality, m_candidate_set);
while (m_docno && skipPos(0) == 0)
{
m_docno = getFirstAllMatchDocnoSubset(
m_argar, m_docno+1, false,
m_cardinality, m_candidate_set);
}
return m_docno;
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_docno=0), "in skip document")
}
virtual Index skipDocCandidate( const Index& docno_)
{
try
{
return m_docno=getFirstAllMatchDocnoSubset(
m_argar, docno_, true,
m_cardinality, m_candidate_set);
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_docno=0), "in skip document candidate")
}
virtual Index skipPos( const Index& posno_)
{
try
{
PositionWindow win( m_candidate_set, m_range, m_cardinality, posno_);
if (!win.first()) return m_posno=0;
return m_posno=win.pos();
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_posno=0), "in skip position")
}
virtual const char* featureid() const
{
return m_featureid.c_str();
}
virtual Index documentFrequency() const
{
try
{
if (m_documentFrequency < 0)
{
std::vector<Reference< PostingIteratorInterface> >::const_iterator
ai = m_argar.begin(), ae = m_argar.end();
if (ai == ae) return 0;
m_documentFrequency = (*ai)->documentFrequency();
for (++ai; ai != ae; ++ai)
{
Index df = (*ai)->documentFrequency();
if (df < m_documentFrequency)
{
m_documentFrequency = df;
}
}
}
return m_documentFrequency;
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0, "in document frequency estimation")
}
virtual unsigned int frequency()
{
Index idx=0;
unsigned int rt = 0;
for (;0!=(idx=skipPos( idx)); ++idx,++rt){}
return rt;
}
virtual Index docno() const
{
return m_docno;
}
virtual Index posno() const
{
return m_posno;
}
virtual Index length() const
{
return 1;
}
private:
Index m_docno;
Index m_posno;
std::vector<Reference< PostingIteratorInterface> > m_argar;
std::string m_featureid;
mutable Index m_documentFrequency;
unsigned int m_range;
unsigned int m_cardinality;
std::vector<PostingIteratorInterface*> m_candidate_set;
ErrorBufferInterface* m_errorhnd;
};
Posting join operator
The following class implements the constructor of a posting iterator from the posting set join operator arguments:
class WindowPostingJoinOperator
:public PostingJoinOperatorInterface
{
public:
WindowPostingJoinOperator( ErrorBufferInterface* errorhnd_)
:m_errorhnd(errorhnd_){}
virtual ~WindowPostingJoinOperator(){}
virtual PostingIteratorInterface* createResultIterator(
const std::vector<Reference<PostingIteratorInterface> >& argitrs,
int range,
unsigned int cardinality) const
{
try
{
if (range < 0)
{
throw std::runtime_error(
"no negative range allowed for window iterator");
}
return new WindowPostingIterator(
argitrs, (unsigned int)range,
cardinality, m_errorhnd);
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0, "in create result iterator")
}
virtual Description getDescription() const
{
try
{
return Description(
"iterator on windows of a maximum size (range) "
"containing a defined subset of features (cardinality)");
}
CATCH_ERROR_MAP_RETURN( *m_errorhnd, Description(), "in get description")
}
private:
ErrorBufferInterface* m_errorhnd;
};
Interface for the module
Header file
The module we will create needs the a constructor function to create the posting set join operator:
#ifndef _STRUS_POSTING_JOIN_OPERATOR_WINDOW_HPP_INCLUDED
#define _STRUS_POSTING_JOIN_OPERATOR_WINDOW_HPP_INCLUDED
namespace strus {
class PostingJoinOperatorInterface;
class ErrorBufferInterface;
PostingJoinOperatorInterface* createWindowJoinOperator(
ErrorBufferInterface* errorhnd);
}
#endif
Source file
PostingJoinOperatorInterface* strus::createWindowJoinOperator( ErrorBufferInterface* errorhnd)
{
try
{
return new WindowPostingJoinOperator( errorhnd);
}
CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create join operator")
}
Weighting functions
Weighting functions are used to assign weights to documents. They use posting set itetators and some numeric parameters as input to calculate the weight of a document. The weight of a document is used as criterion to sort result documents in the query evaluation. The object needed to calculate a weighting function result is created in 3 steps, each implemening one level of parametrization. A module defining a weighting function exports a function that creates a weighting function interface instantiated with the error buffer interface. The next level is to create a weighting function instance interface. This instance of a weighting function can be parameterized using the addNumericParameter
and the addStringParameter
methods. The next and final level is to create an execution context interface of the weighting function. The context is the object is used to parameterize the weighting function with objects related to a specific query issued in a specific storage state. The execution context of the function is used to calculate the weight for a document with the method
double call( const Index& docno);
The following subsection shows the implementation of all parametrization steps of a weighting function calculating the inverse of the minimal window containing at least a defined number of query features.
We start with the execution context class:
Macros for error handling
We already introduced the error handling and exception-free interfaces of Strus and provided a similar example for our posting join operator. The same type of macro we define here:
#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in window weighting function");\
return VALUE;\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "%s (minwin window weighting function): %s", MSG, err.what());\
return VALUE;\
}
#define CATCH_ERROR_MAP( HND, MSG)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in window weighting function");\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "%s (minwin window weighting function): %s", MSG, err.what());\
}
Execution context
The weighting function execution context is parameterized with the features to use for weighting in the query. It gets the global statistics and an interface to access the storage and the metadata table with the constructor called from the function instance. The execution context uses the class PositionWindow
we defined in the section describing the posting iterator.
class MinWinWeightingFunctionContext
:public WeightingFunctionContextInterface
{
public:
MinWinWeightingFunctionContext(
ErrorBufferInterface* errhnd_, int range_, unsigned int cardinality_)
:m_errhnd(errhnd_),m_range(range_),m_cardinality(cardinality_)
{}
virtual ~MinWinWeightingFunctionContext(){}
virtual void addWeightingFeature(
const std::string& name_,
PostingIteratorInterface* postingIterator_,
float ,
const TermStatistics& )
{
try
{
if (name_ == "match")
{
m_arg.push_back( postingIterator_);
}
else
{
throw std::runtime_error( "unknown weighting feature name");
}
}
CATCH_ERROR_MAP( *m_errhnd, "in add weighting feature");
}
virtual double call( const Index& docno)
{
try
{
std::vector<PostingIteratorInterface*>::const_iterator
ai = m_arg.begin(), ae = m_arg.end();
std::vector<PostingIteratorInterface*> matches;
matches.reserve( m_arg.size());
for (; ai != ae; ++ai)
{
if (docno == (*ai)->skipDoc( docno))
{
matches.push_back( *ai);
}
}
PositionWindow win( matches, m_range, m_cardinality, 0);
unsigned int minwinsize = m_range+1;
bool more = win.first();
for (;more; more = win.next())
{
unsigned int winsize = win.size();
if (winsize < minwinsize)
{
minwinsize = winsize;
}
}
if (minwinsize < m_range)
{
return 1.0/(minwinsize+1);
}
else
{
return 0.0;
}
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, 0.0, "in call");
}
virtual std::string debugCall( const Index& docno)
{
return std::string();
}
private:
ErrorBufferInterface* m_errhnd;
std::vector<PostingIteratorInterface*> m_arg;
int m_range;
unsigned int m_cardinality;
};
Function instance
The weighting function execution context is parameterized with numeric and string parameters. It provides the virtual constructor to create a the weigthing function context.
class MinWinWeightingFunctionInstance
:public WeightingFunctionInstanceInterface
{
public:
explicit MinWinWeightingFunctionInstance( ErrorBufferInterface* errhnd_)
:m_errhnd(errhnd_),m_range(1000),m_cardinality(0){}
virtual ~MinWinWeightingFunctionInstance(){}
virtual void addStringParameter( const std::string& name, const std::string&)
{
try
{
throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
}
CATCH_ERROR_MAP( *m_errhnd, "in add string parameter");
}
virtual void addNumericParameter( const std::string& name, const ArithmeticVariant& value)
{
try
{
if (name == "maxwinsize")
{
m_range = value.toint();
if (m_range <= 0)
{
throw std::runtime_error("proximity range parameter negative or null");
}
}
else if (name == "cardinality")
{
if (value.type != ArithmeticVariant::UInt && value.type != ArithmeticVariant::Int)
{
throw std::runtime_error("illegal cardinality parameter");
}
m_cardinality = value.touint();
}
else
{
throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
}
}
CATCH_ERROR_MAP( *m_errhnd, "in add numeric parameter");
}
virtual WeightingFunctionContextInterface* createFunctionContext(
const StorageClientInterface* ,
MetaDataReaderInterface* ,
const GlobalStatistics& ) const
{
try
{
return new MinWinWeightingFunctionContext( m_errhnd, m_range, m_cardinality);
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create function context");
}
virtual std::string tostring() const
{
try
{
std::ostringstream rt;
rt << "maxwinsize=" << m_range << ", cardinality=" << m_cardinality;
return rt.str();
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, std::string(), "in tostring");
}
private:
ErrorBufferInterface* m_errhnd;
std::vector<PostingIteratorInterface*> m_arg;
int m_range;
unsigned int m_cardinality;
};
Weighting function
The weighting function is the unparameterized object that just provides a virtual constructor to create a function instance for parametrization. Additionally it implements a get description for introspection.
class MinWinWeightingFunction
:public WeightingFunctionInterface
{
public:
MinWinWeightingFunction( ErrorBufferInterface* errhnd_)
:m_errhnd(errhnd_)
{}
virtual ~MinWinWeightingFunction(){}
virtual WeightingFunctionInstanceInterface* createInstance(
const QueryProcessorInterface*) const
{
try
{
return new MinWinWeightingFunctionInstance( m_errhnd);
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create instance");
}
virtual FunctionDescription getDescription() const
{
typedef FunctionDescription::Parameter P;
FunctionDescription rt("Calculate the document weight as the inverse "
"of the minimal window size containing a subset of the document features");
rt( P::Feature, "match",
"defines the query features to find in a window");
rt( P::Numeric, "maxwinsize",
"the maximum size of a window to search for");
rt( P::Numeric, "cardinality",
"the number of features to find at least in a window");
return rt;
}
private:
ErrorBufferInterface* m_errhnd;
};
Interface for the module
Header file
The module we will create needs the a constructor function to create the weighting function object instantiated with the error buffer interface of the application.
#ifndef _STRUS_WEIGHTING_FUNCTION_MINWIN_HPP_INCLUDED
#define _STRUS_WEIGHTING_FUNCTION_MINWIN_HPP_INCLUDED
namespace strus {
class WeightingFunctionInterface;
class ErrorBufferInterface;
WeightingFunctionInterface* createWeightingFunction(
ErrorBufferInterface* errorhnd);
}
#endif
Source file
Exported function for creating a weighting function object for calculating the inverse of the minimal window size:
WeightingFunctionInterface* createWeightingFunction(
ErrorBufferInterface* errorhnd)
{
try
{
return new MinWinWeightingFunction( errorhnd);
}
CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create function");
}
Summarizers
Summarizers are used to extract content from document. Content can be referenced by posting iterators from features or subexpression matches (attached to variables) that are passed to the summarizer as feature parameters. The most convenient type of summarizer is the abstract of a query result document. But summarizers are also used for attaching attributes like the document title to a result element. We provide a summarizer here that returns the passage marked by the minimal window containing at least a minimum subset of query features.
The pattern for creating a summarizer context for doing summarization of a document in a query is the same as for weighting functions. The object needed to calculate a summarizer result is created in 3 steps, each implemening one level of parametrization. A module defining a summarizer function exports a function that creates a summarizer function interface instantiated with the error buffer interface. The next level is to create a summarizer function instance interface. This instance of a summarizer function can be parameterized using the addNumericParameter
and the addStringParameter
methods. The next and final level is to create an execution context interface of the summarizer function. The context is the object for parameterizing the summarizer function with objects related to a specific query issued in a specific storage state. The execution context of the function is used to calculate the sumarization elements for a document with the method
std::vector<SummaryElement> getSummary( const Index& docno);
where SummaryElement
is a structure containing a string and a weight.
The following subsection shows the implementation of all parametrization steps of a summarizer function returning the text of the first occurrence of a minimal window containing at least a defined number of query features as summary element.
Macros for error handling
We already introduced the error handling and exception-free interfaces of Strus and provided a similar example for our posting join operator and the weighting function. The same type of macro we define here:
#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in minimal window summarizer function");\
return VALUE;\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "%s (minimal window summarizer function): %s", MSG, err.what());\
return VALUE;\
}
#define CATCH_ERROR_MAP( HND, MSG)\
catch( const std::bad_alloc&)\
{\
(HND).report( "out of memory in minimal window summarizer function");\
}\
catch( const std::runtime_error& err)\
{\
(HND).report( "%s (minimal window summarizer function): %s", MSG, err.what());\
}
Summarizer function context
The summarizer takes similar parameters than the weighting function for the minimal window presented. Additionally we need the feature type of the forward index for content extraction and the storage to access the forward index. The getSummary
method can be compared with the MinWinWeightingFunctionContext::call
method except that we do not calculate a weight from the minimal window size. We concatenate all terms of the defined type in the forward index that are inside the first minimal window found and return the resulting string as summary result. The name of the summary result is set hardcoded to "minwin" for simplicity.
class MinWinSummarizerFunctionContext
:public SummarizerFunctionContextInterface
{
public:
MinWinSummarizerFunctionContext(
ErrorBufferInterface* errorhnd_,
const StorageClientInterface* storage_,
int range_,
unsigned int cardinality_,
const std::string& type_)
:m_errhnd(errorhnd_)
,m_forwardindex(storage_->createForwardIterator( type_))
,m_range(range_)
,m_cardinality(cardinality_)
{}
virtual ~MinWinSummarizerFunctionContext(){}
virtual void addSummarizationFeature(
const std::string& name_,
PostingIteratorInterface* postingIterator_,
const std::vector<SummarizationVariable>& ,
float ,
const TermStatistics& )
{
try
{
if (name_ == "match")
{
m_arg.push_back( postingIterator_);
}
else
{
throw std::runtime_error( "unknown summarization feature name");
}
}
CATCH_ERROR_MAP( *m_errhnd, "in add summarization feature");
}
virtual std::vector<SummaryElement> getSummary( const Index& docno)
{
try
{
m_forwardindex->skipDoc( docno);
std::vector<SummaryElement> rt;
std::vector<PostingIteratorInterface*>::const_iterator
ai = m_arg.begin(), ae = m_arg.end();
std::vector<PostingIteratorInterface*> matches;
matches.reserve( m_arg.size());
for (; ai != ae; ++ai)
{
if (docno == (*ai)->skipDoc( docno))
{
matches.push_back( *ai);
}
}
PositionWindow win( matches, m_range, m_cardinality, 0);
unsigned int minwinsize = m_range+1;
Index minwinpos = 0;
bool more = win.first();
for (;more; more = win.next())
{
unsigned int winsize = win.size();
if (winsize < minwinsize)
{
minwinsize = winsize;
minwinpos = win.pos();
}
}
if (minwinsize < (unsigned int)m_range)
{
std::string text;
Index pos = minwinpos;
while (minwinsize)
{
if (pos == m_forwardindex->skipPos( pos))
{
if (!text.empty()) text.push_back( ' ');
text.append( m_forwardindex->fetch());
}
else
{
text.append( "..");
}
--minwinsize;
++pos;
}
rt.push_back( SummaryElement( "minwin", text));
}
return rt;
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, std::vector<SummaryElement>(), "in call");
}
virtual std::string debugCall( const Index& docno)
{
return std::string();
}
private:
ErrorBufferInterface* m_errhnd;
Reference<ForwardIteratorInterface> m_forwardindex;
std::vector<PostingIteratorInterface*> m_arg;
int m_range;
unsigned int m_cardinality;
};
Summarizer function instance
The summarizer function instance implementation is similar to the weighting function instance, except that we have to handle an additional string parameter "type" (the forward index type of the terms to extract).
class MinWinSummarizerFunctionInstance
:public SummarizerFunctionInstanceInterface
{
public:
MinWinSummarizerFunctionInstance( ErrorBufferInterface* errhnd_)
:m_errhnd(errhnd_),m_range(1000),m_cardinality(0),m_type(){}
virtual ~MinWinSummarizerFunctionInstance(){}
virtual void addStringParameter( const std::string& name, const std::string& value)
{
try
{
if (name == "type")
{
m_type = value;
}
else
{
throw std::runtime_error( std::string( "unknown string parameter ") + name);
}
}
CATCH_ERROR_MAP( *m_errhnd, "in add string parameter");
}
virtual void addNumericParameter( const std::string& name, const ArithmeticVariant& value)
{
try
{
if (name == "maxwinsize")
{
m_range = value.toint();
if (m_range <= 0)
{
throw std::runtime_error("proximity range parameter negative or null");
}
}
else if (name == "cardinality")
{
if (value.type != ArithmeticVariant::UInt && value.type != ArithmeticVariant::Int)
{
throw std::runtime_error("illegal cardinality parameter");
}
m_cardinality = value.touint();
}
else
{
throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
}
}
CATCH_ERROR_MAP( *m_errhnd, "in add numeric parameter");
}
virtual SummarizerFunctionContextInterface* createFunctionContext(
const StorageClientInterface* storage_,
MetaDataReaderInterface* ,
const GlobalStatistics& ) const
{
try
{
return new MinWinSummarizerFunctionContext(
m_errhnd, storage_, m_range, m_cardinality, m_type);
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create function context");
}
virtual std::string tostring() const
{
try
{
std::ostringstream rt;
rt << "type=" << m_type
<< ", maxwinsize=" << m_range
<< ", cardinality=" << m_cardinality;
return rt.str();
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, std::string(), "in tostring");
}
private:
ErrorBufferInterface* m_errhnd;
std::vector<PostingIteratorInterface*> m_arg;
int m_range;
unsigned int m_cardinality;
std::string m_type;
};
Summarizer function
As for the weighting object we have an unparameterized function object with a description and a virtual constructor method to create a parameterizable function instance.
class MinWinSummarizerFunction
:public SummarizerFunctionInterface
{
public:
MinWinSummarizerFunction( ErrorBufferInterface* errhnd_)
:m_errhnd(errhnd_)
{}
virtual ~MinWinSummarizerFunction(){}
virtual SummarizerFunctionInstanceInterface* createInstance(
const QueryProcessorInterface* ) const
{
try
{
return new MinWinSummarizerFunctionInstance( m_errhnd);
}
CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create instance");
}
virtual FunctionDescription getDescription() const
{
typedef FunctionDescription::Parameter P;
FunctionDescription rt("Get the passage of the forward index inside the "
"minimal window containing a subset of the document features");
rt( P::Feature, "match",
"defines the query features to find in a window");
rt( P::Numeric, "maxwinsize",
"the maximum size of a window to search for");
rt( P::Numeric, "cardinality",
"the number of features to find at least in a window");
rt( P::String, "type",
"forward index feature type for building the result");
return rt;
}
private:
ErrorBufferInterface* m_errhnd;
};
Interface for the module
The module we will create needs the a constructor function to create the weighting function object instantiated with the error buffer interface of the application.
Header file
#ifndef _STRUS_SUMMARIZER_FUNCTION_MINWIN_HPP_INCLUDED
#define _STRUS_SUMMARIZER_FUNCTION_MINWIN_HPP_INCLUDED
namespace strus {
class SummarizerFunctionInterface;
class ErrorBufferInterface;
SummarizerFunctionInterface* createMinWinSummarizerFunction(
ErrorBufferInterface* errorhnd);
}
#endif
Source file
SummarizerFunctionInterface* createMinWinSummarizerFunction(
ErrorBufferInterface* errorhnd)
{
try
{
return new MinWinSummarizerFunction( errorhnd);
}
CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create function");
}
The assembled module
We have now shown one example for each of the components posting set join operator, weighting function and summarizer. What is left to introduce is how these example components are assembled to a dynamically loadable module (shared object). The module concept of Strus does not involve any magic macros to create a module. You simply define the module entrypoint as data structure with all your objects defined there. Arrays of object are terminated by a NULL tuple. They can be empty (just containing the NULL tuple termination marker).Here is our module source:
#include "strus/private/dll_tags.hpp"
#include "strus/storageModule.hpp"
#include "window_joinop.hpp"
#include "minwin_weighting.hpp"
#include "minwin_summarizer.hpp"
using namespace strus;
static const PostingIteratorJoinConstructor postingJoinOperators[] =
{
{"window", createWindowJoinOperator},
{0,0}
};
static const WeightingFunctionConstructor weightingFunctions[] =
{
{"minwin", createMinWinWeightingFunction},
{0,0}
};
static const SummarizerFunctionConstructor summarizers[] =
{
{"minwin", createMinWinSummarizerFunction},
{0,0}
};
extern "C" DLL_PUBLIC strus::StorageModule entryPoint;
strus::StorageModule entryPoint( postingJoinOperators, weightingFunctions, summarizers);
Build & Installation
See the section prerequisites for what you need. There is a CMake file provided with the sources. A test program is also there. Strus modules are installed in the subdirectory modules of the strus library path, e.g. /usr/lib/strus/modules/modstrus_storage_example.so.
Modules can be used by any program or language binding that supports modules.
Resume
We introduced three types of possible extensions of the Strus core for query evaluation. The posting join operations, weighting functions and summarizers. The article could not show the whole potential of every type of extension. It did not discuss possible applications that can be built with these elements. But if you read this article you might got a first impression about the expandability of Strus with your own dynamically loadable modules. Maybe you will implement your own query evaluation scheme and some useful summarizers soon.