Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C++

How to extend a Strus search engine with new functions

4.00/5 (5 votes)
27 Sep 2017CPOL18 min read 20.6K   94  
This article shows the expandability of Strus with dynamically loadable modules written in C++.

Introduction

This article should help you to get into writing extensions for Strus. Extensions can be separately built and installed as dynamically loadable modules (shared objects). The goal of this article is to enable you to implement, build and install extensions for Strus on your own.

We introduce the interfaces for operations on sets of postings, weighting functions and summarization and show an example module implementing one of each type of extension. The possibility to load an alternative key/value store database implementation as module is not discussed in this article. 

Background

Strus is a set of libraries and tools to build scalable search engines. For tutorials that show how to build a search engine with Strus have a look at my previous code project articles.

Remark

The sources in this article do not implement all methods of the interface. The examples are outdated but good enough as documentation. The sources to download are up-to-date though. 

Table of Contents

  1. Description of the example problem to solve
  2. Error handling (exceptions) in Strus
  3. Posting join operators
  4. Posting join operator implementation
  5. Weighting functions and their implementation
  6. Summarizers and their implementation
  7. The assembled module
  8. Build & installation
  9. Resume

Glossary

  • Posting: For each term, we have a list that records which documents the term occurs in. Each item in the list - which records that a term appeared in a document (and, later, often, the positions in the document) - is conventionally called a posting. [nlp.stanford.edu]
  • Feature: Information extracted from an object and used during query processing. [people.ischool.berkeley.edu]
  • Forward index: The forward index stores a list of words for each document [en.wikipedia.org].

Prerequisites

For compiling and running the sources explained here, you need either to install the strus packages strusBase,strus,strusAnalyzer and strusModule or to build the software on your own. The dependency to strusAnalyzer is due to the fact that strusModule is dependent on it. Analyzer modules are not subject of this article. There will be another article introducing strus analyzer modules.

For cloning the project or downloading the latest sources, go to strusBasestrus, strusAnalyzer, strusModule

You need gcc or clang and cmake for building the software. The current build status of the master branch of the strus projects can be inpsected here.

Example problem to solve

The example join operator implemented here is finding areas of a given size (range of positions) in documents containing at least a subset of given size of all argument features. The example weighting function is returning the inverse of the minimum size of such a window. The example summarizer is returning all forward index features inside of this window.

The presented example functions are not part of the Strus core. They are also not considered to be worth to get into the core, because ranking of documents based on weighting with the minimal window size is not stable. Not stable in this context means that extending a query with new query features shuffles preceding result lists instead of just adding weight to documents matching the new features. This makes search as a process of query refinement difficult for the user. I have chosen these example functions because they are relatively complex, but easy to understand on the other hand.

Example illustrated

This picture marks a window of size 7 containing the two terms "building" and "Dubai":

The worlds tallest building is the Burj Khalifa in Dubai, United Arab Emirates.

Error handling

For writing extension modules for Strus, you have to understand how errors are handled in Strus. Unfortunately for C++ there exists no portable way of throwing exceptions across module boundaries. The only way to have exceptions thrown across module boundaries is to compile all sources involved with the same compiler and with all symbols exported. [See C++ Coding Standards: 101 Rules, Guidelines, and Best Practices, 62. Don't allow exceptions to propagate across module boundaries] But this would impede the deployment of the Strus libraries and modules as binary artifacts. That's why interfaces in Strus are exception-free.

Strus approaches the problem with an interface for reporting errors that is passed down to all objects implementing interfaces that might be called accross module boundaries. This potentially affects any interface of Strus. Therefore any method of an interface has to catch any possible exception thrown and report the error message of the exception via the ErrorBufferInterface declared on object construction. Because all objects of Strus implementing an interface are constructed by another interface object with a create method starting with the root object interface StorageObjectBuilderInterface, the error buffer interface does not appear in Strus interfaces. But you have to be aware of it and pass it down to all objects created by your module. An interface method has to define the try/catch frame in the method and the moving of the exception content to the error buffer object. This is usually done by preprocessor macros, that help to reduce boilerplate code and improve readability. Keep in mind that any code that potentially allocates memory might throw an exception that has to be caught.

The following code snippet shows the definition of 2 macros, one for methods returning a value and one for methods not returning a value:

#define CATCH_ERROR_MAP_RETURN( HND, VALUE)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in ...");\
        return VALUE;\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "error in ...: %s", err.what());\
        return VALUE;\
    }\

#define CATCH_ERROR_MAP( HND)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in window iterator");\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "error in ...: %s", err.what());\
    }\

And here we have an example method implementation assuming m_errorhnd

as pointer to the ErrorBufferInterface of the class.

virtual Index skipPos( const Index& posno_)
{
    try
    {
        ... do something that might throw a bad alloc or runtime error
        return <RESULT>;
    }
    CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0)
}

You also have to be aware of inspecting the error buffer after calling an interface function when it is needed. Sometimes it is not important to detect an error immediately. It is considered as best practice to check for errors when calling methods that return a value and in methods that have to check for error to ensure consistency (e.g. a transaction commit). To allow late detection of errors helps to avoid network roundtrips when calling methods via a remote interface.

Internationalization

The Strus core modules use gettext for internationalization (error messages in a specific language). This aspect is not handled in this example module. Have a look at the core modules to get an idea of how internationalization could be handled.

Posting join operators

We start with the posting join operator implementation. This is the interface most difficult to implement. There are rare cases, where you really want to implement one on your own. But it is good to read the following section to understand what posting join operators are.

Basic model

The posting join operators implement the set operations on postings. Sets of postings are defined as

{(d,p) | d and p are positive natural numbers}

where d represents an internal document number and p a term position.

A posting join operator is a set operation on a list of sub expression posting sets. The set operation can be parameterized with two additional arguments:

  • The range, specifying a proximity condition of the positions of the argument sets involved
  • The cardinality, defining the cardinality of a selection of arguments needed to build a result element.

The parameters range and cardinality are used for a more intiutive construction of complex expressions.

The basic structure of the operations is the Boolean algebra of the set of postings with two additional unary operators. One is the predecessor set:

pred(A) = {(d,p) | (d,p+1) element of A}

The other is the successor set:

succ(A) = {(d,p)|(d,p-1) element of A}

The predecessor and the successor set construction are used to relate productions of a regular language with terms as alphabet to set operations:

A -> Bc => A = A intersect pred(c)

A -> cB => A= B intersect succ(c)

But expressing everything in a language having only the basic regular productions for expressing patterns would be very cumbersome. The range parameter can and should only be used for defining higher level expressions with a more intiutive notion of proximity. Like for example "return me the occurrencies of three terms appearing in the same sentence", where the sentence structure is constructed with a delimiter represented as term and thus as set of postings.

The cardinality parameter on the other hand helps to construct patterns that match for any selection of a given size of the argument terms or expressions. The number of possible selections of subsets tends to grow fast while the result set they could produce may be calculated in a more efficient way.

If you define the parameters range and cardinality for your posting join operations, you should keep this relation to the basic structure in mind. This strict definition of the meaning of these parameters is also the reason why the parameters are fixed positional parameters of the operator constructor.

Representation of the model

The sets of postings are not represented as data, but as iterator with skip operations to the least upperbound document number or position. The two skip operations provided are named skipDoc and skipPos.

If you are familiar with Lucene, you will recognice these least upperbound skip operations as methods named advance there.

The fetching of the next element after (d,p) in the set of postings would be implemented as follows (PsuedoCode):

FUNCTION next( d, p )
    p = skipPos(p+1)
    while (!p) do
       d = skipDoc(++d)
       if (!d) then
          return NULL,NULL
       else
          p = skipPos(0)
       end
    end
    return d,p
end

But this way of iterating through the set of postings is rarely used. In most cases you need to advance to a documents of interest. Then, in the context of that document, you need the skipPos operation to iterate on the elements.

An posting iterator is implemented as interface that either fetches the postings associated with a term in the storage or an object that implements its upperbound skip operations as function on the argument posting iterators.

The following pseudocode illustrates the implementation of the skipDoc and skipPos methods of an iterator representing the intersection of two posting sets arg1 and arg2. We use a helper method skipDocCandidate, that does an intersection on the document numbers only, thus returning a candidate that might not contain a valid match. We then use skipPos to ensure that skipDoc returns a document number referring to a non empty set of postings.

FUNCTION skipDocCandidate( self, d )
    d1 = self->arg1->skipDocCandidate( d)
    d2 = self->arg2->skipDocCandidate( d)
    while (d1 != d2 && d1 && d2) do
       if (d1 < d2) then 
          d1 = self->arg1->skipDocCandidate( d2 )
       else
          d2 = self->arg2->skipDocCandidate( d1 )
       end        
    end
    return d1
end

FUNCTION skipPos( self, p )
    p1 = self->arg1->skipPos( p)
    p2 = self->arg2->skipPos( p)
    while (p1 != p2 && p1 && p2) do
       if (p1 < p2) then 
          p1 = self->arg1->skipPos( p2 )
       else
          p2 = self->arg2->skipPos( p1 )
       end        
    end
    return p1
end

FUNCTION skipDoc( self, d )
   d = skipDocCandidate( self, d)
   while (d && !skipPos( self, 0) do
      d = skipDocCandidate( self, d+1 )
   end
end

The method skipDoc iterates on candidate documents that have both matches or arg1 and arg2 until it finds a document containing a match with identical positions.

C++ interface

Posting iterator

Iterators on postings are represented as iterator interface with the methods to advance to the least upperbound of the passed argument as already introduced in the previous section.

For the document number:

Index skipDoc( const Index& docno)

and for the position

Index skipPos( const Index& posno)

and the helper for the document candidates

Index skipDocCandidate( const Index& docno)

The method skipDocCanditate is a function that does not have to return a valid result. What it must not do is to miss a result. It is used to reduce the number of inspected positions and thus reducing reads on data blocks containing position information. If you for example implement an operator that returns occurrences of subsequent terms, you can return the intersection of the skipDocCandidate results of the arguments as result of your skipDocCandidate. This can lead to a significant reduction of position block reads. Your skipDoc implementation that has to return valid results on the other hand, can call its own skipDocCandidate first, before inspecting the positions to ensure a valid result.

The method skipDocCanditate is only used internally by iterators for the implementation of skipDoc. The resulting iterator on the set of postings is implemented with alernating calls of skipDoc and skipPos.

Other methods to implement are

const char* featureid() const;

that returns a unique id of the iterator used for debugging and tracing.

Index documentFrequency() const;

that returns the document frequency or an upper bound estimate of it. An upper bound estimate is returned in the case a full table scan on the position blocks would be needed to calculate the real value.

unsigned int frequency();

that returns the number of postings in the current document.

Index docno() const;

that returns the document number of the currently visited document.

Index posno() const;

that returns the position of the currently visited posting.

Index length() const;

that returns the length the currently visited posting. The length is not part of the model but useful for post processing of matches as for example markup of matches. It is not considered in expressions like sequence. This is counterintuitive but still not considered because it cannot be modeled with pure set operations.

You also need to implement a constructor taking the parameters and the error buffer interface as arguments

MyPostingIterator(
            const std::vector< strus::Reference< strus::PostingIteratorInterface> >& args,
            unsigned int range_,
            unsigned int cardinality_,
            strus::ErrorBufferInterface* errorhnd_)

Posting join operator

Posting join operators are represented as posting join operator interface implementing a virtual constructor method for creating a posting iterator as iterator on the operation result. You can create an iterator on postings from a list of sub expression posting iterators, the range (proximity) and the cardinality (selector for permutations on input subsets) argument. Passing 0 as cardinality means the default, that all arguments are needed to build a result element. The create method looks as follows:

PostingIteratorInterface* createResultIterator(
            const std::vector<Reference<PostingIteratorInterface> >& argitrs,
            int range,
            unsigned int cardinality) const;

Additionaly the exists a method for getting the description of the operator for introspection:

Description getDescription() const;

The returned description object has a method

const std::string& text() const

to get the description string of this join operator.

 

Example implementation

The following example implements the posting join operator and its result iterator that returns the start of a window within a maximum range (proximity) containing at least the number of argument postings specified by the cardinality argument. First we define some helper functions:

Macros for error handling

We already introduced the error handling and exception-free interfaces of Strus in a previous section. The following macro is a helper for boilerplate code catching exceptions and reporting them via an error buffer interface, returning an undefined value. The caller of the function can check for a failed operation by inspecting
the ErrorBufferInterface passed to the object (ErrorBufferInterface::hasError()):

#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in window iterator");\
        return VALUE;\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "error in minwin window iterator %s: %s", MSG, err.what());\
        return VALUE;\
    }

Procedure to get the first/next document match

The following function tries to find a subset of args with a size of at least cardinality with each element having the same upper bound of docno as match. It returns the smallest fit of such a least upper bound of docno as result.
With allowEmpty it is specified, if the document returned should only be a candidate element, that may result in an empty set, or if it should only return documents with real matches.

///\param[in] args argument posting iterators
///\param[in] docno result is an upper bound of this document number
///\param[in] allowEmpty allow estimates (skipDocCandidate) that may not be part of a result
///\param[in] cardinality minimum number of matching elements of args to form a result
///\param[out] candidate_set returned bitfield with matching elements of args

static Index getFirstAllMatchDocnoSubset(
        std::vector<Reference< PostingIteratorInterface> >& args,
        Index docno,
        bool allowEmpty,
        std::size_t cardinality,
        std::vector<PostingIteratorInterface*>& candidate_set)
{
    if (args.empty()) return 0;
    candidate_set.clear();

    Index docno_iter = docno;
    for (;;)
    {
        std::vector<Reference< PostingIteratorInterface> >::iterator
            ai = args.begin(), ae = args.end();

        std::size_t nof_matches = 0;
        Index match_docno = 0;

        // Iterate on all arguments:
        for (; ai != ae; ++ai)
        {
            // Try to get another candidate document number:
            Index docno_next = (*ai)->skipDocCandidate( docno_iter);
            if (docno_next)
            {
                // ... if another element found:
                if (match_docno)
                {
                    // ... there are elements in the current candidate:
                    if (match_docno == docno_next)
                    {
                        // ... it is a member of the current candidate set
                        candidate_set.push_back( ai->get());
                        ++nof_matches;
                    }
                    else if (match_docno > docno_next)
                    {
                        // ... it is smaller than the members
                        // of the current candidate set. Recreate
                        // the current candidate set with the element
                        // found as only member:
                        candidate_set.clear();
                        candidate_set.push_back( ai->get());
                        match_docno = docno_next;
                        nof_matches = 1;
                    }
                }
                else
                {
                    // ... there are no elements yet
                    // in the current candidate set:
                    candidate_set.clear();
                    candidate_set.push_back( ai->get());
                    match_docno = docno_next;
                    nof_matches = 1;
                }
            }
        }
        if (nof_matches >= cardinality)
        {
            // ... we have a candidate set big enough:
            if (!allowEmpty)
            {
                // ... we need to check the arguments to be real matches:
                std::vector<PostingIteratorInterface*>::iterator
                    ci = candidate_set.begin(),
                    ce = candidate_set.end();
                while (ci != ce)
                {
                    if (match_docno != (*ci)->skipDoc( match_docno))
                    {
                        --nof_matches;
                        if (nof_matches < cardinality) break;
                        candidate_set.erase( ci);
                    }
                    else
                    {
                        ++ci;
                    }
                }
                if (nof_matches < cardinality)
                {
                    // ... we did not get a result with the cardinality requested -> start again
                    docno_iter = match_docno+1;
                    candidate_set.clear();
                    continue;
                }
            }
            return match_docno;
        }
        else if (match_docno)
        {
            docno_iter = match_docno+1;
        }
        else
        {
            break;
        }
    }
    return 0;
}

Position window

The following helper class implements an iterator on windows of a given maximum size that contain a subset of a specified maximum size of the argument features. The class called PositionWindow is used in all functions exported by the module presented in this article. So it will be reused for the weighting function and the summarizer implementation.

Header file
#ifndef _STRUS_POSITION_WINDOW_HPP_INCLUDED
#define _STRUS_POSITION_WINDOW_HPP_INCLUDED
#include "strus/index.hpp"
#include "strus/postingIteratorInterface.hpp"
#include <set>

namespace strus {

// Sliding window to iterate on the sets of posting iterator elements
// that are withing a defined proximity range:
class PositionWindow
{
public:
    // Element in the sliding position window implemented as set:
    struct Element
    {
        PostingIteratorInterface* itr;    // ... reference this element belongs to
        Index pos;            // ... position of this element

        Element()
            :itr(0),pos(0){}
        Element( PostingIteratorInterface* itr_, Index pos_)
            :itr(itr_),pos(pos_){}
        Element( const Element& o)
            :itr(o.itr),pos(o.pos){}

        bool operator < ( const Element& o) const
        {
            if (pos == o.pos)
            {
                return itr < o.itr;
            }
            return pos < o.pos;
        }
    };

    // Constructor that fills the sliding window implemented as set
    // with the argument element start positions:
    PositionWindow(
        const std::vector<PostingIteratorInterface*>& args,
        unsigned int range_,
        unsigned int cardinality_,
        Index firstpos_);

    // Check if there is any valid window:
    bool first();

    // Skip to the next window and return true, if there is one more:
    bool next();

    // Return the size of the current window:
    unsigned int size() const;

    // Return the starting position of the current window:
    unsigned int pos() const;

private:
    // Get the top element of the current window:
    std::set<Element>::iterator getWinTopElement();

private:
    std::vector<PostingIteratorInterface*> m_args;
    std::set<Element> m_set;       // ... set used for sliding window
    unsigned int m_setsize;        // ... current size of m_set
    unsigned int m_range;          // ... maximum proximity range
    unsigned int m_cardinality;    // ... number of elements for a candidate window
};

}//namespace
#endif
Source file
#include "positionWindow.hpp"
#include "strus/postingJoinOperatorInterface.hpp"
#include "strus/postingIteratorInterface.hpp"
#include "strus/errorBufferInterface.hpp"
#include <cstdio>

using namespace strus;

// Constructor that fills the sliding window implemented as set
// with the argument element start positions:
PositionWindow::PositionWindow(
        const std::vector<PostingIteratorInterface*>& args,
        unsigned int range_,
        unsigned int cardinality_,
        Index firstpos_)
    :m_args(args)
    ,m_setsize(0)
    ,m_range(range_)
    ,m_cardinality(cardinality_>0?cardinality_:args.size())
{
    std::vector<PostingIteratorInterface*>::iterator
        ai = m_args.begin(), ae = m_args.end();
    for (; ai != ae; ++ai)
    {
        PostingIteratorInterface* itr = *ai;
        Index pos = itr->skipPos( firstpos_);
        if (pos)
        {
            m_set.insert( Element( itr, pos));
            ++m_setsize;
        }
    }
}

// Get the top element of the current window:
std::set<PositionWindow::Element>::iterator PositionWindow::getWinTopElement()
{
    if (m_cardinality == 0) return m_set.end();
    unsigned int ii=0;
    std::set<Element>::iterator
        si = m_set.begin(), se = m_set.end();
    for (; ii<m_cardinality && si != se; ++ii,++si){}
    return --si;
}

// Check if there is any valid window:
bool PositionWindow::first()
{
    // Return, if there is valid window left:
    return m_setsize >= m_cardinality;
}

// Skip to the next window and return true, if there is one more:
bool PositionWindow::next()
{
    // Calculate how many positions we can skip with the
    // first element without loosing any window of a size
    // within the defined proximity range:
    PostingIteratorInterface* itr = m_set.begin()->itr;
    std::set<Element>::iterator top = getWinTopElement();

    Index posdiff = top->pos - m_set.begin()->pos;
    Index skipsize = posdiff > (Index)m_range ? (posdiff - (Index)m_range) : 1;
    // Change the position of the first element in the
    // sliding window by the calculated size:
    Index pos = m_set.begin()->pos;

    m_set.erase( m_set.begin());

    Index newpos = itr->skipPos( pos + skipsize);
    if (newpos)
    {
        m_set.insert( Element( itr, newpos));
    }
    else
    {
        --m_setsize;
    }
    // Return, if there is valid window left:
    return m_setsize >= m_cardinality;
}

// Return the size of the current window:
unsigned int PositionWindow::size() const
{
    std::set<Element>::iterator si = getWinTopElement();
    return si->pos - m_set.begin()->pos;
}

// Return the starting position of the current window:
unsigned int PositionWindow::pos() const
{
    return (m_setsize >= m_cardinality)?m_set.begin()->pos:0;
}

Now we have everything to implement the posting iterator for the window and its join operator:

Posting iterator

The following class implements the iterator on postings created from the posting sets join operator implemented in this module.

// Window posting iterator class:
class WindowPostingIterator
    :public PostingIteratorInterface
{
public:
    // Constructor:
    WindowPostingIterator(
            const std::vector<Reference< PostingIteratorInterface> >& args,
            unsigned int range_,
            unsigned int cardinality_,
            ErrorBufferInterface* errorhnd_)
        :m_docno(0)
        ,m_posno(0)
        ,m_argar(args)
        ,m_documentFrequency(-1)
        ,m_range(range_)
        ,m_cardinality(cardinality_)
        ,m_errorhnd(errorhnd_)
    {
        // Initializing the featureid:
        std::vector<Reference< PostingIteratorInterface> >::const_iterator
            ai = m_argar.begin(), ae = m_argar.end();
        for (int aidx=0; ai != ae; ++ai,++aidx)
        {
            if (aidx) m_featureid.push_back('=');
            m_featureid.append( (*ai)->featureid());
        }
        m_featureid.append( ".win");
    }

    // Destructor:
    virtual ~WindowPostingIterator(){}

    // The skip document method uses the getFirstAllMatchDocnoSubset introduced
    // and assures with a skip position call, that the result is a real match:
    virtual Index skipDoc( const Index& docno_)
    {
        try
        {
            m_docno = getFirstAllMatchDocnoSubset(
                    m_argar, docno_, false,
                    m_cardinality, m_candidate_set);
            while (m_docno && skipPos(0) == 0)
            {
                m_docno = getFirstAllMatchDocnoSubset(
                        m_argar, m_docno+1, false,
                        m_cardinality, m_candidate_set);
            }
            return m_docno;
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_docno=0), "in skip document")
    }

    // The skip document candidate method uses the getFirstAllMatchDocnoSubset introduced:
    virtual Index skipDocCandidate( const Index& docno_)
    {
        try
        {
            return m_docno=getFirstAllMatchDocnoSubset(
                        m_argar, docno_, true,
                        m_cardinality, m_candidate_set);
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_docno=0), "in skip document candidate")
    }

    // The skip position method uses a sliding window for finding the next match fulfilling
    // the condition of having a set with size cardinality of element withing a proximity range,
    // at a position bigger or equal to the position index passed as argument:
    virtual Index skipPos( const Index& posno_)
    {
        try
        {
            PositionWindow win( m_candidate_set, m_range, m_cardinality, posno_);
            if (!win.first()) return m_posno=0;
            return m_posno=win.pos();
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, (m_posno=0), "in skip position")
    }

    // Return a reference to the identifier built in the constructor:
    virtual const char* featureid() const
    {
        return m_featureid.c_str();
    }

    // The document frequency cannot be calculated without a fullscan of the
    // index for all matching documents also inspecting position.
    // This is considered as too expensive. So is estimating.
    // So we return the minimum document frequency of the arguments.:
    virtual Index documentFrequency() const
    {
        try
        {
            if (m_documentFrequency < 0)
            {
                std::vector<Reference< PostingIteratorInterface> >::const_iterator
                    ai = m_argar.begin(), ae = m_argar.end();
                if (ai == ae) return 0;
        
                m_documentFrequency = (*ai)->documentFrequency();
                for (++ai; ai != ae; ++ai)
                {
                    Index df = (*ai)->documentFrequency();
                    if (df < m_documentFrequency)
                    {
                        m_documentFrequency = df;
                    }
                }
            }
            return m_documentFrequency;
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0, "in document frequency estimation")
    }

    // The frequency is calculated by iterating on all
    // elements of the current document and counting the number of positions:
    virtual unsigned int frequency()
    {
        Index idx=0;
        unsigned int rt = 0;
        for (;0!=(idx=skipPos( idx)); ++idx,++rt){}
        return rt;
    }

    // Return the current document number:
    virtual Index docno() const
    {
        return m_docno;
    }

    // Return the current position number:
    virtual Index posno() const
    {
        return m_posno;
    }

    // Return the length
    virtual Index length() const
    {
         return 1; //... this value is not correct, we should return the current window size 
    }

private:
    // Current document:
    Index m_docno;
    // Current position:
    Index m_posno;
    // Argument iterators:
    std::vector<Reference< PostingIteratorInterface> > m_argar;
    // Unique id of the feature expression for debugging and tracing:
    std::string m_featureid;
    // Cached Document frequency (of the rarest subexpression):
    mutable Index m_documentFrequency;
    // Required proximity range of the candicate windows:
    unsigned int m_range;
    // Required cardinality of the result:
    unsigned int m_cardinality;
    // Current set of matching document candidates:
    std::vector<PostingIteratorInterface*> m_candidate_set;
    // Buffer for error messages (for exception free interfaces):
    ErrorBufferInterface* m_errorhnd;
};

Posting join operator

The following class implements the constructor of a posting iterator from the posting set join operator arguments:

// Window posting join operator class:
class WindowPostingJoinOperator
    :public PostingJoinOperatorInterface
{
public:
    // Constructor:
    WindowPostingJoinOperator( ErrorBufferInterface* errorhnd_)
        :m_errorhnd(errorhnd_){}

    // Destructor:
    virtual ~WindowPostingJoinOperator(){}

    // Virtual constructor of the resulting posting iterator
    virtual PostingIteratorInterface* createResultIterator(
            const std::vector<Reference<PostingIteratorInterface> >& argitrs,
            int range,
            unsigned int cardinality) const
    {
        try
        {
            if (range < 0)
            {
                throw std::runtime_error(
                    "no negative range allowed for window iterator");
            }
            return new WindowPostingIterator(
                    argitrs, (unsigned int)range,
                    cardinality, m_errorhnd);
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, 0, "in create result iterator")
    }

    // Return the Description of the operator for user help and introspection:
    virtual Description getDescription() const
    {
        try
        {
            return Description(
                "iterator on windows of a maximum size (range) "
                "containing a defined subset of features (cardinality)");
        }
        CATCH_ERROR_MAP_RETURN( *m_errorhnd, Description(), "in get description")
    }

private:
    ErrorBufferInterface* m_errorhnd;
};

Interface for the module

Header file

The module we will create needs the a constructor function to create the posting set join operator:

#ifndef _STRUS_POSTING_JOIN_OPERATOR_WINDOW_HPP_INCLUDED
#define _STRUS_POSTING_JOIN_OPERATOR_WINDOW_HPP_INCLUDED

namespace strus {

// Forward declarations:
class PostingJoinOperatorInterface;
class ErrorBufferInterface;

// Window iterator constructor:
PostingJoinOperatorInterface* createWindowJoinOperator(
    ErrorBufferInterface* errorhnd);

}//namespace
#endif
Source file
// Exported function constructing the PostingJoinOperatorInterface realizing an
// iterator on windows of a given maximum size:
PostingJoinOperatorInterface* strus::createWindowJoinOperator( ErrorBufferInterface* errorhnd)
{
    try
    {
         return new WindowPostingJoinOperator( errorhnd);
    }    
    CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create join operator")
}

Weighting functions

Weighting functions are used to assign weights to documents. They use posting set itetators and some numeric parameters as input to calculate the weight of a document. The weight of a document is used as criterion to sort result documents in the query evaluation. The object needed to calculate a weighting function result is created in 3 steps, each implemening one level of parametrization. A module defining a weighting function exports a function that creates a weighting function interface instantiated with the error buffer interface. The next  level is to create a weighting function instance interface. This instance of a weighting function can be parameterized using the addNumericParameter and the addStringParameter methods. The next and final level is to create an execution context interface of the weighting function. The context is the object is used to parameterize the weighting function with objects related to a specific query issued in a specific storage state. The execution context of the function is used to calculate the weight for a document with the method

double call( const Index& docno);

The following subsection shows the implementation of all parametrization steps of a weighting function calculating the inverse of the minimal window containing at least a defined number of query features.

We start with the execution context class:

Macros for error handling

We already introduced the error handling and exception-free interfaces of Strus and provided a similar example for our posting join operator. The same type of macro we define here:

#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in window weighting function");\
        return VALUE;\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "%s (minwin window weighting function): %s", MSG, err.what());\
        return VALUE;\
    }
#define CATCH_ERROR_MAP( HND, MSG)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in window weighting function");\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "%s (minwin window weighting function): %s", MSG, err.what());\
    }

Execution context

The weighting function execution context is parameterized with the features to use for weighting in the query. It gets the global statistics and an interface to access the storage and the metadata table with the constructor called from the function instance. The execution context uses the class PositionWindow we defined in the section describing the posting iterator.

class MinWinWeightingFunctionContext
    :public WeightingFunctionContextInterface
{
public:
    MinWinWeightingFunctionContext(
            ErrorBufferInterface* errhnd_, int range_, unsigned int cardinality_)
        :m_errhnd(errhnd_),m_range(range_),m_cardinality(cardinality_)
    {}

    virtual ~MinWinWeightingFunctionContext(){}

    virtual void addWeightingFeature(
            const std::string& name_,
            PostingIteratorInterface* postingIterator_,
            float /*weight_*/,
            const TermStatistics& /*stats_*/)
    {
        try
        {
            if (name_ == "match")
            {
                m_arg.push_back( postingIterator_);
            }
            else
            {
                throw std::runtime_error( "unknown weighting feature name");
            }
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add weighting feature");
    }

    virtual double call( const Index& docno)
    {
        try
        {
            // Initialize the matching features to weight for this document:
            std::vector<PostingIteratorInterface*>::const_iterator
                  ai = m_arg.begin(), ae = m_arg.end();
            std::vector<PostingIteratorInterface*> matches;
            matches.reserve( m_arg.size());
            for (; ai != ae; ++ai)
            {
                if (docno == (*ai)->skipDoc( docno))
                {
                    matches.push_back( *ai);
                }
            }
            // Calculate the minimal window size:
            PositionWindow win( matches, m_range, m_cardinality, 0);
            unsigned int minwinsize = m_range+1;
            bool more = win.first();
            for (;more; more = win.next())
            {
                unsigned int winsize = win.size();
                if (winsize < minwinsize)
                {
                    minwinsize = winsize;
                }
            }
            // Return the weight depending on the minimal window size:
            if (minwinsize < m_range)
            {
                return 1.0/(minwinsize+1);
            }
            else
            {
                return 0.0;
            }
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, 0.0, "in call");
    }

    virtual std::string debugCall( const Index& docno)
    {
        return std::string();
        // ... the source to download contains a debug function implementation.
        //     we do not provide one here, leaving it out makes things easier.
    }

private:
    ErrorBufferInterface* m_errhnd;
    std::vector<PostingIteratorInterface*> m_arg;
    int m_range;
    unsigned int m_cardinality;
};

Function instance

The weighting function execution context is parameterized with numeric and string parameters. It provides the virtual constructor to create a the weigthing function context.

class MinWinWeightingFunctionInstance
    :public WeightingFunctionInstanceInterface
{
public:
    explicit MinWinWeightingFunctionInstance( ErrorBufferInterface* errhnd_)
        :m_errhnd(errhnd_),m_range(1000),m_cardinality(0){}
    virtual ~MinWinWeightingFunctionInstance(){}

    virtual void addStringParameter( const std::string& name, const std::string&)
    {
        try
        {
            throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add string parameter");
    }

    virtual void addNumericParameter( const std::string& name, const ArithmeticVariant& value)
    {
        try
        {
            if (name == "maxwinsize")
            {
                m_range = value.toint();
                if (m_range <= 0)
                {
                     throw std::runtime_error("proximity range parameter negative or null");
                }
            }
            else if (name == "cardinality")
            {
                if (value.type != ArithmeticVariant::UInt && value.type != ArithmeticVariant::Int)
                {
                    throw std::runtime_error("illegal cardinality parameter");
                }
                m_cardinality = value.touint();
            }
            else
            {
                throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
            }
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add numeric parameter");
    }

    virtual WeightingFunctionContextInterface* createFunctionContext(
            const StorageClientInterface* /*storage_*/,
            MetaDataReaderInterface* /*metadata_*/,
            const GlobalStatistics& /*stats_*/) const
    {
        try
        {
            return new MinWinWeightingFunctionContext( m_errhnd, m_range, m_cardinality);
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create function context");
    }

    virtual std::string tostring() const
    {
        try
        {
            std::ostringstream rt;
            rt << "maxwinsize=" << m_range << ", cardinality=" << m_cardinality;
            return rt.str();
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, std::string(), "in tostring");
    }

private:
    ErrorBufferInterface* m_errhnd;
    std::vector<PostingIteratorInterface*> m_arg;
    int m_range;
    unsigned int m_cardinality;
};

Weighting function

The weighting function is the unparameterized object that just provides a virtual constructor to create a function instance for parametrization. Additionally it implements a get description for introspection.

class MinWinWeightingFunction
    :public WeightingFunctionInterface
{
public:
    MinWinWeightingFunction( ErrorBufferInterface* errhnd_)
        :m_errhnd(errhnd_)
    {}

    virtual ~MinWinWeightingFunction(){}

    virtual WeightingFunctionInstanceInterface* createInstance(
                    const QueryProcessorInterface*) const
    {
        try
        {
            return new MinWinWeightingFunctionInstance( m_errhnd);
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create instance");
    }

    virtual FunctionDescription getDescription() const
    {
        typedef FunctionDescription::Parameter P;
        FunctionDescription rt("Calculate the document weight as the inverse "
                       "of the minimal window size containing a subset of the document features");
        rt( P::Feature, "match",
                       "defines the query features to find in a window");
        rt( P::Numeric, "maxwinsize",
                       "the maximum size of a window to search for");
        rt( P::Numeric, "cardinality",
                       "the number of features to find at least in a window");
        return rt;
    }

private:
    ErrorBufferInterface* m_errhnd;
};

Interface for the module

Header file

The module we will create needs the a constructor function to create the weighting function object instantiated with the error buffer interface of the application.

#ifndef _STRUS_WEIGHTING_FUNCTION_MINWIN_HPP_INCLUDED
#define _STRUS_WEIGHTING_FUNCTION_MINWIN_HPP_INCLUDED

namespace strus {

// Forward declarations:
class WeightingFunctionInterface;
class ErrorBufferInterface;

// Weighting function constructor:
WeightingFunctionInterface* createWeightingFunction(
    ErrorBufferInterface* errorhnd);

}//namespace
#endif
Source file

Exported function for creating a weighting function object for calculating the inverse of the minimal window size:

WeightingFunctionInterface* createWeightingFunction(
    ErrorBufferInterface* errorhnd)
{
    try
    {
        return new MinWinWeightingFunction( errorhnd);
    }
    CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create function");
}

Summarizers

Summarizers are used to extract content from document. Content can be referenced by posting iterators from features or subexpression matches (attached to variables) that are passed to the summarizer as feature parameters. The most convenient type of summarizer is the abstract of a query result document. But summarizers are also used for attaching attributes like the document title to a result element. We provide a summarizer here that returns the passage marked by the minimal window containing at least a minimum subset of query features.

The pattern for creating a summarizer context for doing summarization of a document in a query is the same as for weighting functions. The object needed to calculate a summarizer result is created in 3 steps, each implemening one level of parametrization. A module defining a summarizer function exports a function that creates a summarizer function interface instantiated with the error buffer interface. The next  level is to create a summarizer function instance interface. This instance of a summarizer function can be parameterized using the addNumericParameter and the addStringParameter methods. The next and final level is to create an execution context interface of the summarizer function. The context is the object for parameterizing the summarizer function with objects related to a specific query issued in a specific storage state. The execution context of the function is used to calculate the sumarization elements for a document with the method

std::vector<SummaryElement> getSummary( const Index& docno);

where SummaryElement is a structure containing a string and a weight.

The following subsection shows the implementation of all parametrization steps of a summarizer function returning the text of the first occurrence of a minimal window containing at least a defined number of query features as summary element.

Macros for error handling

We already introduced the error handling and exception-free interfaces of Strus and provided a similar example for our posting join operator and the weighting function. The same type of macro we define here:

#define CATCH_ERROR_MAP_RETURN( HND, VALUE, MSG)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in minimal window summarizer function");\
        return VALUE;\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "%s (minimal window summarizer function): %s", MSG, err.what());\
        return VALUE;\
    }
#define CATCH_ERROR_MAP( HND, MSG)\
    catch( const std::bad_alloc&)\
    {\
        (HND).report( "out of memory in minimal window summarizer function");\
    }\
    catch( const std::runtime_error& err)\
    {\
        (HND).report( "%s (minimal window summarizer function): %s", MSG, err.what());\
    }

Summarizer function context

The summarizer takes similar parameters than the weighting function for the minimal window presented. Additionally we need the feature type of the forward index for content extraction and the storage to access the forward index. The getSummary method can be compared with the MinWinWeightingFunctionContext::call method except that we do not calculate a weight from the minimal window size. We concatenate all terms of the defined type in the forward index that are inside the first minimal window found and return the resulting string as summary result. The name of the summary result is set hardcoded to "minwin" for simplicity.

class MinWinSummarizerFunctionContext
    :public SummarizerFunctionContextInterface
{
public:
    MinWinSummarizerFunctionContext(
            ErrorBufferInterface* errorhnd_,
            const StorageClientInterface* storage_,
            int range_,
            unsigned int cardinality_,
            const std::string& type_)
        :m_errhnd(errorhnd_)
        ,m_forwardindex(storage_->createForwardIterator( type_))
        ,m_range(range_)
        ,m_cardinality(cardinality_)
    {}

    virtual ~MinWinSummarizerFunctionContext(){}

    virtual void addSummarizationFeature(
            const std::string& name_,
            PostingIteratorInterface* postingIterator_,
            const std::vector<SummarizationVariable>& /*variables_*/,
            float /*weight_*/,
            const TermStatistics& /*stats_*/)
    {
        try
        {
            if (name_ == "match")
            {
                m_arg.push_back( postingIterator_);
            }
            else
            {
                throw std::runtime_error( "unknown summarization feature name");
            }
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add summarization feature");
    }

    virtual std::vector<SummaryElement> getSummary( const Index& docno)
    {
        try
        {
            // Initialize the features to weight and the forward index:
            m_forwardindex->skipDoc( docno);
            std::vector<SummaryElement> rt;
            std::vector<PostingIteratorInterface*>::const_iterator
                ai = m_arg.begin(), ae = m_arg.end();
            std::vector<PostingIteratorInterface*> matches;
            matches.reserve( m_arg.size());
            for (; ai != ae; ++ai)
            {
                if (docno == (*ai)->skipDoc( docno))
                {
                    matches.push_back( *ai);
                }
            }
            // Calculate the minimal window size and position:
            PositionWindow win( matches, m_range, m_cardinality, 0);
            unsigned int minwinsize = m_range+1;
            Index minwinpos = 0;
            bool more = win.first();
            for (;more; more = win.next())
            {
                unsigned int winsize = win.size();
                if (winsize < minwinsize)
                {
                    minwinsize = winsize;
                    minwinpos = win.pos();
                }
            }
            // Build the summary phrase and append it to the result:
            if (minwinsize < (unsigned int)m_range)
            {
                std::string text;
                Index pos = minwinpos;
                while (minwinsize)
                {
                    if (pos == m_forwardindex->skipPos( pos))
                    {
                        if (!text.empty()) text.push_back( ' ');
                        text.append( m_forwardindex->fetch());
                    }
                    else
                    {
                        text.append( "..");
                    }
                    --minwinsize;
                    ++pos;
                }
                rt.push_back( SummaryElement( "minwin", text));
            }
            return rt;
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, std::vector<SummaryElement>(), "in call");
    }

    virtual std::string debugCall( const Index& docno)
    {
        //... we do not provide a debug function implementation here
        //    you'll find an implementation in the sources to download
        return std::string();
    }

private:
    ErrorBufferInterface* m_errhnd;
    Reference<ForwardIteratorInterface> m_forwardindex;
    std::vector<PostingIteratorInterface*> m_arg;
    int m_range;
    unsigned int m_cardinality;
};

Summarizer function instance

The summarizer function instance implementation is similar to the weighting function instance, except that we have to handle an additional string parameter "type" (the forward index type of the terms to extract).

class MinWinSummarizerFunctionInstance
    :public SummarizerFunctionInstanceInterface
{
public:
    MinWinSummarizerFunctionInstance( ErrorBufferInterface* errhnd_)
        :m_errhnd(errhnd_),m_range(1000),m_cardinality(0),m_type(){}

    virtual ~MinWinSummarizerFunctionInstance(){}

    virtual void addStringParameter( const std::string& name, const std::string& value)
    {
        try
        {
            if (name == "type")
            {
                m_type = value;
            }
            else
            {
                throw std::runtime_error( std::string( "unknown string parameter ") + name);
            }
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add string parameter");
    }

    virtual void addNumericParameter( const std::string& name, const ArithmeticVariant& value)
    {
        try
        {
            if (name == "maxwinsize")
            {
                m_range = value.toint();
                if (m_range <= 0)
                {
                    throw std::runtime_error("proximity range parameter negative or null");
                }
            }
            else if (name == "cardinality")
            {
                if (value.type != ArithmeticVariant::UInt && value.type != ArithmeticVariant::Int)
                {
                    throw std::runtime_error("illegal cardinality parameter");
                }
                m_cardinality = value.touint();
            }
            else
            {
                throw std::runtime_error( std::string( "unknown numeric parameter ") + name);
            }
        }
        CATCH_ERROR_MAP( *m_errhnd, "in add numeric parameter");
    }

    virtual SummarizerFunctionContextInterface* createFunctionContext(
            const StorageClientInterface* storage_,
            MetaDataReaderInterface* /*metadata_*/,
            const GlobalStatistics& /*stats*/) const
    {
        try
        {
            return new MinWinSummarizerFunctionContext(
                    m_errhnd, storage_, m_range, m_cardinality, m_type);
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create function context");
    }

    virtual std::string tostring() const
    {
        try
        {
            std::ostringstream rt;
            rt << "type=" << m_type
               << ", maxwinsize=" << m_range
               << ", cardinality=" << m_cardinality;
            return rt.str();
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, std::string(), "in tostring");
    }

private:
    ErrorBufferInterface* m_errhnd;
    std::vector<PostingIteratorInterface*> m_arg;
    int m_range;
    unsigned int m_cardinality;
    std::string m_type;
};

Summarizer function

As for the weighting object we have an unparameterized function object with a description and a virtual constructor method to create a parameterizable function instance.

class MinWinSummarizerFunction
    :public SummarizerFunctionInterface
{
public:
    MinWinSummarizerFunction( ErrorBufferInterface* errhnd_)
        :m_errhnd(errhnd_)
    {}

    virtual ~MinWinSummarizerFunction(){}

    virtual SummarizerFunctionInstanceInterface* createInstance(
            const QueryProcessorInterface* /*processor*/) const
    {
        try
        {
            return new MinWinSummarizerFunctionInstance( m_errhnd);
        }
        CATCH_ERROR_MAP_RETURN( *m_errhnd, 0, "in create instance");
    }

    virtual FunctionDescription getDescription() const
    {
        typedef FunctionDescription::Parameter P;
        FunctionDescription rt("Get the passage of the forward index inside the "
                "minimal window containing a subset of the document features");
        rt( P::Feature, "match",
                "defines the query features to find in a window");
        rt( P::Numeric, "maxwinsize",
                "the maximum size of a window to search for");
        rt( P::Numeric, "cardinality",
                "the number of features to find at least in a window");
        rt( P::String, "type",
                "forward index feature type for building the result");
        return rt;
    }

private:
    ErrorBufferInterface* m_errhnd;
};

Interface for the module

The module we will create needs the a constructor function to create the weighting function object instantiated with the error buffer interface of the application.

Header file
#ifndef _STRUS_SUMMARIZER_FUNCTION_MINWIN_HPP_INCLUDED
#define _STRUS_SUMMARIZER_FUNCTION_MINWIN_HPP_INCLUDED

namespace strus {

// Forward declarations:
class SummarizerFunctionInterface;
class ErrorBufferInterface;

// Weighting function constructor:
SummarizerFunctionInterface* createMinWinSummarizerFunction(
    ErrorBufferInterface* errorhnd);

}//namespace
#endif
Source file
SummarizerFunctionInterface* createMinWinSummarizerFunction(
    ErrorBufferInterface* errorhnd)
{
    try
    {
        return new MinWinSummarizerFunction( errorhnd);
    }
    CATCH_ERROR_MAP_RETURN( *errorhnd, 0, "in create function");
}

The assembled module

We have now shown one example for each of the components posting set join operator, weighting function and summarizer. What is left to introduce is how these example components are assembled to a dynamically loadable module (shared object). The module concept of Strus does not involve any magic macros to create a module. You simply define the module entrypoint as data structure with all your objects defined there. Arrays of object are terminated by a NULL tuple. They can be empty (just containing the NULL tuple termination marker).Here is our module source:

#include "strus/private/dll_tags.hpp"

#include "strus/storageModule.hpp"
#include "window_joinop.hpp"
#include "minwin_weighting.hpp"
#include "minwin_summarizer.hpp"

using namespace strus;

static const PostingIteratorJoinConstructor postingJoinOperators[] =
{
    {"window", createWindowJoinOperator},
    {0,0}
};

static const WeightingFunctionConstructor weightingFunctions[] =
{
    {"minwin", createMinWinWeightingFunction},
    {0,0}
};

static const SummarizerFunctionConstructor summarizers[] =
{
    {"minwin", createMinWinSummarizerFunction},
    {0,0}
};

extern "C" DLL_PUBLIC strus::StorageModule entryPoint;

strus::StorageModule entryPoint( postingJoinOperators, weightingFunctions, summarizers);

Build & Installation

See the section prerequisites for what you need. There is a CMake file provided with the sources. A test program is also there. Strus modules are installed in the subdirectory modules of the strus library path, e.g. /usr/lib/strus/modules/modstrus_storage_example.so.

Modules can be used by any program or language binding that supports modules.

Resume

We introduced three types of possible extensions of the Strus core for query evaluation. The posting join operations, weighting functions and summarizers. The article could not show the whole potential of every type of extension. It did not discuss possible applications that can be built with these elements. But if you read this article you might got a first impression about the expandability of Strus with your own dynamically loadable modules. Maybe you will implement your own query evaluation scheme and some useful summarizers soon.

 

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)