A heterogeneous container is a container that can store elements of different types. In statically typed languages like C++, such a container isn't a natural or built-in feature. Many solutions exist to simulate this heterogeneous property, but they usually involve memory space or runtime speed trade-offs.
This article presents a fixed-size heterogeneous container, named tek::record
, that tries to achieve the best possible performance in terms of both size and speed. The goal of the article isn't only to list the tek::record
functionalities, but rather to explain how such a solution is implemented, going from basic template metaprogramming concepts to advanced optimization techniques.
This chapter lists the best-known solutions for implementing a heterogeneous container, and their main drawbacks.
In the classical polymorphism solution, the container holds pointers to a base class from which several classes are derived. The heterogeneous property is then achieved through the dynamic type of each element of the collection.
Limitations:
- Virtual functions can't be inlined in this case
- Impossibility to directly use built-in types
- Loss of type identity and traits specific to the derived classes
A container of unions
simulates a heterogeneous behavior, since the value of a union can be interpreted through multiple types. However, mere unions have severe limitations:
- They only accept a restricted set of types
- They aren't type-safe
- They must reserve space at least equal to the size of the largest type of the union, resulting in wasted memory in a heterogeneous container
Union-like implementations, such as discriminated unions, made their appearance to overcome some of those limitations. One example of such implementation is boost::variant
. Those union-like implementations also come with some limitations:
- Waste of space still present
- Slight runtime overhead, depending on the specific implementation.
A tuple is a finite collection of elements. In C++, the implementation of a tuple is a fixed-size container that can hold elements of any type. In such an implementation, element access is resolved statically resulting in no runtime overhead.
Limitations:
- Fixed size
- No dynamic access to elements
We reviewed some popular implementations of a heterogeneous container. This list isn't exhaustive, there are other solutions ( boost::any
, void*
, ...) that share some of the limitations described above. While examining the drawbacks of each solution, we can notice some sort of gap between the tuple and the other solutions:
- A tuple is the best in terms of both speed and memory but has very few functionalities and provides no dynamic access. Hence, it has limited uses, such as simulating multiple return values.
- Other solutions are slower but allow standard and dynamic container operations.
The tek::record
container tries to partially fill the gap by providing good performance and dynamic access to its elements.
Before explaining the implementation of the heterogeneous container, the concept of typelist
must be briefly reviewed. If you already know what typelist
s are, you may jump directly to the next section.
A typelist
is a type containing a certain number of types. Its implementation is based on the following template structure:
template < typename T, typename U >
struct typelist {
typedef T element_type;
typedef U next_type;
};
To define a variable number of different types, the typelist
structure is declared recursively as follows:
typedef
typelist< int,
typelist < char,
typelist < MyClass,
end_of_list > > > MyTypelist;
The above declaration defines a list of three types: int
, char
, MyClass
. The list is terminated by a special structure, end_of_list
, that acts as an end marker. At first sight, a simple access to a type is done as follows:
MyTypelist::element_type
MyTypelist::next_type::element_type
MyTypelist::next_type::next_type::element_type
However, to be useful and convenient, accessing a type must be done in a generic way, using an index. The implementation of such access is done recursively: going through the next_type
type the number of times given by the index.
template < class T, int INDEX >
struct get_type {
typedef typename get_type < typename T::next_type,
INDEX-1 >::type type;
};
template < class T >
struct get_type < T, 0 > {
typedef typename T::element_type type;
};
With the recursive get_type
structure, we can now access a type using a static index:
get_type < MyTypelist, 0 >::type;
get_type < MyTypelist, 2 >::type;
We now know enough about typelist
s to continue. Note that we left out a number of their capabilities. For further details, check out [Alexandrescu] in the References.
As we saw in the Existing Solutions chapter, tuples have some performance advantages over the other heterogeneous containers. They don't waste memory space 1, and accessing an element is very fast. To achieve such performance, the classical implementation of a tuple relies on a template structure that follows the typelist
philosophy.
Example:
template < typename T, typename U >
struct tuple {
T element;
U next;
typedef T element_type;
typedef U next_type;
};
An arbitrary number of types can be specified to the structure:
tuple< int,
tuple < std::string,
tuple < MyClass,
end_of_list > > >
Such declaration will be roughly translated to the following (without including the typedefs
):
struct tuple<...> {
    int element;
    struct tuple<...> {
        std::string element;
        struct tuple<...> {
            MyClass element;
            end_of_list next;
        } next;
    } next;
};
A tuple defines two kinds of indexed access, both of which are performed statically:
- Indexed access to an element type
- Indexed access to an element value
Indexed access to an element type in the list is identical to the get_type
structure used in the typelist
implementation we saw earlier. Indexed access to an element value is done using a recursive function:
template< int INDEX, typename return_type >
struct get_element {
template< class T >
static return_type Do( T& tup )
{
return get_element< INDEX-1, return_type >::Do( tup.next );
}
};
template< typename return_type >
struct get_element< 0, return_type > {
template< class T >
static return_type Do( T& tup )
{
return tup.element;
}
};
Modern compilers are able to inline the recursive function, as if the value were directly accessed. The return_type
template parameter is computed by using the get_type
structure. Actually, computing the return
type is more complex, since we may add or remove reference and constant qualifiers. For example, if the tek::record
is declared as constant, we should return
a constant.
The implementation of a tuple is a good base for a high-performance, fixed-size heterogeneous container. However, a tuple doesn't provide dynamic access to its elements. That's quite logical: what code should the compiler generate for an expression like myTuple[x]
, where x
is an index given at runtime? The type of the accessed element is determined at runtime along with x
, and the generated code will probably differ depending on whether the element is an int
, a std::string
, or a user-defined class. Since a simple expression like myTuple[x]
can't be evaluated by the compiler, the dynamic access will be done using a function, named apply
, which takes two parameters:
- A functor, provided by the user, which will be applied to the indexed element
- An index
The functor acts as a visitor 2 and should have an overloaded function for each element type.
Example:
myTuple.apply( Print(),index );
Functor definition:
struct Print {
void operator()( int element )
{
std::cout << "int : " << element << endl;
}
void operator()( double element )
{
std::cout << "double : " << element << endl;
}
void operator()( char element )
{
std::cout << "char : " << element << endl;
}
};
The element
parameter will be fed by the apply
function. Of course, template functions can be used to factor out common source code when it is the same for some or all types.
The implementation of such dynamic access must generate every possible code for each type and select the right one according to the runtime value of the index. The selection can be simply done using an if then else
chain or a switch case
statement.
Example using switch
(pseudo-code):
template < class T >
void apply ( int index, T functor )
{
switch ( index ) {
case 0 : functor ( tuple[0] ); break;
case 1 : functor ( tuple[1] ); break;
...
}
}
Note: tuple[x]
is pseudo-code. In reality, we have to use the get_element
structure.
This implementation is no more than a kind of type switching and, in the OO world, type switching is considered evil because it introduces coupling: each time you add a new type, you must modify the switch
accordingly. However, this drawback doesn't apply to our implementation, since the type switching can be automatically adapted to the types specified to our container. This is indeed done easily if we implement the type switching with cascading if then else
. A "recursive" template structure will go through the typelist
and add an if
condition for each type with the right code. In the end, the compiler generates something like this:
if (index == 0)
    ...
else if (index == 1)
    ...
else if (index == 2)
    ...
Finding the right index this way induces a complexity of O(n). A search algorithm could be generated instead to reduce the complexity but the performance still depends on the number of types handled by the heterogeneous container. The alternative solution is the use of the switch case
statement. While the behavior of such statement is the same as the if then else
one for our purposes, its implementation, generated by the compiler, may differ greatly depending on the case values.
For example, suppose we want to test a range of values from 0 to 3. We can test all the values within a switch
:
switch (x) {
case 0 : ...
case 1 : ...
case 2 : ...
case 3 : ...
}
The above switch
statement can be optimized by a decent C++ compiler with the construction of a jump
table. Such table contains the instruction address of each case
statement. The x
value is then used as an index to the jump
table. Consequently, the code generated by the compiler is just a jump
whose operand is the value pointed by an index in the jump
table, which gives roughly the following code instruction:
JUMP ADDR_TABLE[ x ];
The performance of a jump
table is constant in terms of complexity, and is better than the performance of the if then else
approach most of the time 3. To make the construction of a jump
table possible, the case values must be contiguous 4. Testing every possible index of our heterogeneous container fulfills this condition. Hence, the switch case
approach is used in our container to implement dynamic access. However, adapting the number of cases to the typelist
given to the record is more difficult (see section Pseudo-variadic Behavior Implications).
Since we are in the context of a fixed-size container, out-of-boundary checking can be sometimes superfluous and should be left to the user (see the is_valid_index
function in chapter Usage). However, the compiler always generates a bounds check to verify that the index is valid for the jump
table. Visual C++ proposes the compiler-specific keyword __assume
so the programmer can specify that the index is guaranteed to be one of the case values. The compiler can then decide not to make bounds checking.
switch (x) {
case 0 : ...
case 1 : ...
case 2 : ...
default : __assume(0);
}
I don't know of other compilers supporting this kind of keyword. At the time of writing, GCC doesn't support such compile-time assumptions, so __assume(0)
is only used for Visual C++; other compilers will probably keep the bounds check. As you can see, the title of this subsection is "Bounds Checking Elimination", which refers to a compiler optimization, found in some languages like Java, that eliminates bounds checking when accessing an element in an array, without modifying the program's integrity and behavior in any way. 5 Consequently, the apply
function should then be used when the index is without any doubt valid. If the index validity can't be certified, the is_valid_index
function shall be called before the apply
function.
Variadic templates aren't supported in the current version of the C++ language, though they are a desirable feature in our case. Users should be able to write with no hassle:
record < int, char > a;
record < int, char, double, std::string > b;
instead of using the recursive way:
record < int, record < char, record < double , etc.
To simulate variadic templates, default template parameters are used. The default template argument is null_type
, a "dummy type" that contains no data and also acts as an end marker for the typelist
.
template < class T0 = null_type,
class T1 = null_type,
class T2 = null_type,
class T3 = null_type >
struct record {};
The user can then specify any number of parameters up to a certain limit, 4 in the example. This limit is parameterizable with a macro (see chapter Usage). Internally, all those template parameters are converted to a tuple:
tuple< T0, tuple < T1, tuple < T2, T3 > > >
Using such kind of limit leads to a compilation problem concerning the switch
statement in our apply
function. Unlike cascading if then else
, we can't use a recursive template structure to generate the right number of case
statements. Because of this, we have to implement the apply
function so that the number of cases
will always be equal to the maximal number of types one record
can handle. Suppose we set this limit to 4, our apply
function will be something like this:
template < class T >
void apply ( int index, T functor )
{
switch ( index ) {
case 0 : functor ( tuple[0] ); break;
case 1 : functor ( tuple[1] ); break;
case 2 : functor ( tuple[2] ); break;
case 3 : functor ( tuple[3] );
}
}
If the user declares a record< int , char >
, the last 2 cases
will cause compilation errors. Here are two ways of dealing with this problem.
We could use a structure that encapsulates each case
and generates either functor( tuple[x] )
if x
is less than the number of types the record handles, and nothing otherwise.
Example:
case 0 : SilentCase< 0 > ( tuple, functor );
...
case 3 : SilentCase< 3 > ( tuple, functor );
But there are still two performance problems:
- The jump table, generated by the compiler, always contains 4 entries (in our example) whereas it could contain fewer, depending on the number of types specified to the record. This is a minor but real waste of space.
- Because the number of cases could be lower, the compiler may miss some optimizations: when there are very few cases, a better strategy than a jump table can be used, and the whole apply function might also be easier to inline.
To overcome the last two problems, we must adapt the number of cases in the switch
statement to the number of types handled by the record
. But, as said earlier, we can't use a recursive structure that adds the right number of case
statements, as with an if then else
chain. The trick is to generate every possible apply
function, each one with a different number of cases. To select the correct function at compile-time, we wrap all those functions in a structure. The latter accepts an int
template parameter that will be used to indicate the right number of cases
.
Example (pseudo-code):
template < int I >
struct apply_select;

template <>
struct apply_select < 2 >
{
    static void apply ( int index, T functor )
    {
        switch ( index ) {
            case 0 : functor ( tuple[0] ); break;
            case 1 : functor ( tuple[1] ); break;
        }
    }
};

template <>
struct apply_select < 3 >
{
    static void apply ( int index, T functor )
    {
        switch ( index ) {
            case 0 : functor ( tuple[0] ); break;
            case 1 : functor ( tuple[1] ); break;
            case 2 : functor ( tuple[2] ); break;
        }
    }
};
...
In the record
implementation, the right apply
function is called this way:
apply_select< record_size >::apply( index, functor );
This way, only the apply
function with the right number of cases will be instantiated and generated by the compiler.
We saw earlier that a functor is used as an argument to the apply
function. This functor can define a specific operator()
for each type handled by the record
, but sometimes, you want a generic process and use a template operator()
. For example, suppose that each type of the heterogeneous container has a data member called data
. A generic process can be defined by the following functor:
struct MyFunctor {
template < class T >
void operator()( T& element )
{
process( element.data );
}
};
Generic processes are interesting because they may allow the compiler to do an additional optimization concerning the jump
table in the apply
function. This optimization consists in eliminating branches by merging cases.
For example, consider the following code:
switch (x) {
case 0 : foo(); break;
case 1 : foo(); break;
case 2 : bar(); break;
case 3 : bar(); break;
case 4 : bar(); break;
}
We saw that such code will lead the compiler to generate a jump
table:
Pseudo generated code:
jump JUMP_TABLE[x]
label_case_0 : call foo ...
label_case_1 : call foo ...
label_case_2 : call bar ...
label_case_3 : call bar ...
label_case_4 : call bar ...
JUMP_TABLE
contents:
index | label address |
0 | &label_case_0 |
1 | &label_case_1 |
2 | &label_case_2 |
3 | &label_case_3 |
4 | &label_case_4 |
A compiler applying branch elimination will generate the following instead:
Pseudo generated code:
jump JUMP_TABLE[x]
label_case_foo : call foo ... label_case_bar : call bar ...
JUMP_TABLE
contents:
index | label address |
0 | &label_case_foo |
1 | &label_case_foo |
2 | &label_case_bar |
3 | &label_case_bar |
4 | &label_case_bar |
This has several advantages:
- It reduces the code size and favors inlining of the apply function.
- It reduces the number of branch targets, which might lead to a performance increase.
- Since it reduces the number of branches, it might allow the compiler to use a better optimization than a jump table.
The template operator()
of MyFunctor
seems to apply the same code for each element of a record
. However, the code generated by the compiler needs to include the address of each element, therefore it may effectively differ.
To understand the issue, let's look at how the apply
function works. The latter internally calls another function that takes as an additional parameter the record
data. Here is the declaration of this internal
function:
template < class T, class U >
void apply_internal( T& record_data, int index, U& functor)
{
switch( index ) {
...
}
}
record_data
actually contains all of the elements and has basically the same structure as a tuple. Since we pass the address of a tuple to the apply_internal
function, the generated code will reference, for each case
, the corresponding element by applying a different offset from the record_data
argument:
template < class T, class U >
void apply_internal( T& record_data, int index, U& functor)
{
    switch( index ) {
        case 0 : &record_data
        case 1 : &record_data + 12
        case 2 : &record_data + 26
        ...
    }
}
But if we pass the address of the concerned element directly, instead of the whole tuple, the compiler can use the same reference for each case:
template < class T, class U >
void apply_internal( T& element, int index, U& functor)
{
switch( index ) {
case 0 : &element
case 1 : &element
case 2 : &element
}
}
To know the address of the element pointed by the index before going to the apply_internal
function, we have to store each element address in a void*
array, at the construction of the record
. Then the apply
function loads the address of the element and uses it as a parameter of the apply_internal
function:
template < class U >
void apply (int index, U& functor )
{
apply_internal< record_type >( addrTable[index], index, functor );
}
with apply_internal
being defined as follows:
template < class record_type, class U >
void apply_internal( void* element, int index, U& functor)
{
switch( index ) {
case 0 : functor ( * (typename get_type< record_type, 0 >::type *) element );
break;
case 1 : functor ( * (typename get_type< record_type, 1 >::type *) element );
break;
}
}
However, this method has several drawbacks:
- It induces overhead: constructing the address table at construction time and looking up the element's address on each access
- It is beneficial in only a few contexts, where the code is the same for some or all elements
- It can be impossible to know whether the code is the same for some or all types if they aren't known while writing the code. Even if they are known, some misleading cases can occur. For example, at the beginning of the chapter, we defined the functor
MyFunctor
that does a generic processing on a data member of the element: process(element.data)
. Even if element.data
has the same type for all elements, the data member may have a different offset for each element type, which prevents case merging.
That's why this optimization is disabled by default. It can be enabled by defining the TEK_RECORD_MERGE_APPLY
macro (see chapter Usage), but it should be reserved for experimental or profiling purposes.
This section lists the tek::record
functionalities. First, here is a simple example to show the basic usage of a tek::record
:
#include "tek/record.h"
#include <iostream>
using namespace std;
struct Inc {
template < class T >
void operator()( T& element ) const
{
element++;
}
};
int main()
{
tek::record< int, char, double > myRecord( 2, 'r', 5.25 );
cout << myRecord;
cout << myRecord.get<1>();
myRecord.get<0>() = 10;
myRecord.for_each( Inc() );
cout << "Enter the index of the element you want to increment :"
<< endl;
int index ;
cin >> index;
if ( myRecord.is_valid_index( index ) )
myRecord.apply( Inc(), index );
else
cout << "Invalid index entered";
return 0;
}
Here is the complete list of operations supported by tek::record
. Note that no operation will throw an exception.
template < class T0 ,..., class Ti >
tek::record()
Constructs a record
of uninitialized elements.
template < class T0 ,..., class Ti >
tek::record( const T0& a0 ,..., aj )
Constructs a record
with the first j
elements initialized.
Example:
tek::record< char, int > myRecord( 'e', 3 );
tek::record< char, int > myRecord( 'w' );
template< int I >
return_type& at< I >()
Returns a reference to the element located at the index I
. Since the index is given via a template argument, it cannot be determined at runtime. Use the apply
function to use a dynamic index.
template< class T >
void for_each( T& f )
Applies the functor f
to each element of the record
.
template< class T >
void apply( T& f, const unsigned int x )
Applies the functor f
to the element located at the index x
. Note: undefined behavior if x
is an incorrect index.
static bool is_valid_index( const unsigned int x )
Returns true
if x
is a valid index to use with the apply
function, false
otherwise.
template< class B, class T >
void for_interface( T& f )
Applies the functor f
to the elements whose type is B
, or is derived from the base class B
.
Those operations are very basic and don't use manipulators. They are provided mainly to test the other record functionalities. I/O operations might be fully supported in the future.
record< ... >::size
Returns the number of types handled by the record
.
Example:
typedef tek::record< int, char, double > MyRecordType;
MyRecordType myRecord;
myRecord.size
MyRecordType::size
record< ... >::element< I >::type
Returns one of the types of the record
given by the index I
.
Example:
typedef tek::record< int, char, double > MyRecordType;
MyRecordType::element< 1 >::type
The record
implementation is configurable by three macros located in the file tek/config/record_config.h.
#define TEK_RECORD_MAX_PARAMS
Defines the maximum number of types tek::record
can handle
-> defined to 20 by default
#define TEK_RECORD_NO_IO
Disables iostream operators support, prevents the inclusion of the <iostream>
header
-> not defined by default
#define TEK_RECORD_MERGE_APPLY
Gives access to the merge_apply
function. The merge_apply
function has the same interface
as the apply
function, but with a different implementation that might lead to a performance gain if the supplied functor does the same processing for some or all elements, and to a performance degradation otherwise. Defining the macro also degrades the performance of the record constructor and increases the record size.
-> not defined by default
If the record
type is a dependent name, the member functions and structures that need an explicit template argument must be prefixed by the template
keyword.
Example:
template < class T>
void f( T& myRecord )
{
    myRecord.template get<1>();
}
Only compilers that implement the two-phase name look-up, such as GCC, will issue an error if the template
keyword isn't present. Such compilers parse the template function f
, even if f
isn't instantiated. While parsing, the compiler doesn't know the type of myRecord
and then doesn't know that get
is a template function, so you must put the template
keyword so that it can correctly disambiguate the < >
(which then will be identified as delimiters for a template argument list, and not as a less than
/ greater than
operator). Some compilers, such as Visual C++ 8, look at the template
function only when it is instantiated and, at that time, it knows that get
is a template
member function, so the template
keyword isn't needed. The template
keyword is easily forgotten, especially if the program that uses a tek::record
needs to be portable across several compilers.
That's why each of the member functions and structures has a non-member equivalent, making it easier to write portable code. For consistency, all member functions have a non-member counterpart with the same name and the same parameter list, plus a record
parameter at the beginning of the list:
myRecord.get< 0 >();
tek::get< 0 >( myRecord );
myRecord.for_each( myFunctor() );
tek::for_each( myRecord, myFunctor() );
myRecord.apply( myFunctor(), index );
tek::apply( myRecord, myFunctor(), index );
Most of the time, the functor used as an argument for the apply
function is a temporary object:
tek::apply( myRecord, myFunctor() );
If so, the const
qualifier of the operator()
function must not be forgotten:
struct MyFunctor {
template < class T >
void operator()( T& element ) const
{
....
}
};
To simulate argument passing, you can use the constructor of the functor:
struct MyFunctor {
MyFunctor(int arg) : arg_(arg) {}
template < class T >
void operator()( T& element ) const
{
}
int arg_;
};
...
tek::apply( myRecord, MyFunctor( 2 ) );
To return a value from the apply
function, the functor must derive from the tek::return_type
structure, which takes as a template parameter the wanted return
type. Every operator()
must then return
the same type:
struct Inc : public tek::return_type< int > {
    template < class T >
    int operator()( T& element ) const
    {
        ....
    }
};
I measured the dynamic dispatching performance of three kinds of heterogeneous containers:
- tek::record
- Array of boost::variant
- Classic polymorphic array using a pointer to a base class
Every benchmark is misleading in some way. In this section, I try to limit this aspect by providing details about how I proceeded.
First of all, since the performance may vary with the number of elements handled by the container, I benchmarked each container with 2 to 20 elements. Each element type is a different class inheriting from a common abstract
base class which declares a pure virtual function named foo
. Each class implements the foo
virtual function. I measured the performance of the call to foo
, which gives different statements for each solution:
polymorphism
:
container[index]->foo();
boost::variant
:
boost::apply_visitor( Visitor(), container[index] );
tek::record
:
container.apply( Functor(), index );
For the variant
and record
solutions, we need to use a functor, respectively called Visitor
and Functor
in our example, that implements operator()
in exactly the same way for both solutions:
template < class T >
void operator()(const T& element) const
{
element.T::foo();
}
Note: Because, in our case, passing by reference is faster than passing by value, the parameter is const T& element
. However, if we use directly element.foo()
, the compiler will probably generate a virtual lookup, element
being a reference. We then use the form element.T::foo()
so that the call to foo
will be statically bound.
For each class, the foo
function does a few arithmetic operations. Because the size of foo
is very small, it will be systematically inlined, if the dispatch mechanism of the solution allows it.
I tested two configurations:
- The implementation of foo is different for each class.
- The implementation of foo is the same for half of the classes. The merge_apply function can then apply its optimization.
The index is determined at runtime. Since the call to foo
is executed in a loop, I decided to test two kinds of index:
- A stable index that stays the same through all iterations. Note that this stable index is not a constant, it is determined at runtime once and for all.
- A random index that changes in each iteration. Random indexes are prefetched in an array, so that the loop just needs to iterate through that array to obtain a random index for each loop.
We don't use only a random index because the performance of such an index depends greatly on the CPU and its branch prediction unit, and important variations can occur. For example, my machine is a Pentium 4 Prescott, which has a 31-stage pipeline. A mispredicted branch results in a flush of this pipeline, which considerably degrades performance. A stable index eliminates those variations since the taken branch is always the same. Furthermore, a stable index doesn't really favor one solution over another, since all three solutions implement their dispatch mechanism with an unconditional jump to a non-immediate branch target.
Finally, in order to reduce the number of graphs, I used only two configurations for measuring performance:
- Stable configuration: stable index + unique implementation of foo for each class
- Unstable configuration: random index + same implementation of foo for half of the classes
The benchmarks were performed on an Intel Pentium 4 3GHz HT with 1 GB RAM. Two "platforms" were tested:
The boost::variant
solution is constantly the slowest solution, because Visual C++ isn't able to inline the boost::apply_visitor
function. The latter uses a switch
statement that generates a jump
table, like the tek::record
solution, but doesn't adapt the number of cases (see section Pseudo-variadic Behavior Implications). Consequently, it always has 20 cases, which corresponds to the default maximum number of types a boost::variant
can handle.
The tek::apply
solution is the best until 9 types are handled. At 9 types, the Visual C++ compiler considers that the tek::apply
function becomes too big and chooses not to inline it any longer, leading to performance degradation. From that point on, virtual calls are the best performing solution by a small margin.
As a general tendency, GCC inlines more aggressively than Visual C++ with the given optimization flags. The boost::variant
solution has some performance problems when only a few types are involved. In fact, as we saw, boost::variant
always uses a 20-case switch
statement, but the unused cases perform some default processes which have a larger code size than the virtual function foo
. So the fewer the cases used, the larger the boost::apply_visitor
code size will be. Therefore, up to 5 elements, GCC doesn't inline the apply_visitor
function. The tek::apply
function is always inlined. From 2 to 4 types, GCC doesn't use a jump
table to implement the switch case
statement, preferring an if then else
approach. However, I must say that using this approach leads to an additional optimization, which is a side-effect of having a stable index. Indeed, the tek::apply
function is executed in a loop, and when an if then else
approach is used, the algorithm looks like this:
for ( ... )
{
if (stable_index == 0)
apply_to_the_first_element
else
apply_to_the_second_element
...
}
However the if
condition is invariant since stable_index
itself is an invariant. Therefore, GCC is able to transform the code to the following:
if (stable_index == 0)
for ( ... ) {
apply_to_the_first_element
}
else
for ( ... ) {
apply_to_the_second_element
}
...
Even without this optimization, which further increases performance when up to four types are handled, tek::record
is always the best solution here.
As expected, the performance of each solution is unstable with a random index. Because the merge_apply
function merges some cases, it can stay inlined longer than the apply
function, resulting in a performance improvement over the latter from 8 to 12 types.
The code generated by GCC for the tek::merge_apply
function includes case merging, but doesn't seem to be efficient on my system. It is slower than boost::variant
which also allows branch elimination via case merging (because the boost::apply_visitor
accepts as an argument the indexed element). The tek::apply
function is overall the best here.
Concerning dynamic dispatch performance, tek::record
is faster than a container of boost::variant
in about every tested context. This is mainly because of some implementation details, as both use a switch case
statement, but, intrinsically tek::record
should also be a little faster than a container of discriminated unions, because it has direct access to its elements. Compared to the virtual call mechanism implemented by the tested compilers, tek::record
is faster most of the time, except with Visual C++ in a megamorphic 6 situation. However, the benchmarks didn't test every possible situation that may favor one solution over the others. In particular, virtual functions of large size weren't considered, which may sometimes favor plain virtual calls if their size prevents inlining. The implementation of virtual calls may also be optimized depending on the compiler, but such optimizations are very scarce in the current world of C++ compilers. However, although they are limited by their static nature, profile-guided compilations allow new optimizations in this area, such as virtual call speculation 7. The benchmarks just measured the performance of basic implementations of virtual calls.
tek::record
is a fixed size heterogeneous container that provides quite limited functionalities. However, it exhibits the following features simultaneously:
- High performance dynamic dispatch
- Low memory space consumption
- Type identity conservation
[Alexandrescu]: "Modern C++ Design: Generic Programming and Design Patterns Applied", Andrei Alexandrescu - Addison Wesley, 2001
[1] ^ Without considering data structure padding
[2] ^ The visitor pattern
[3] ^ If there are very few cases in the switch
, jump
tables aren't the fastest implementation any longer: the compiler may use another approach to implement a switch
case in this situation.
[4] ^ Some compilers are more or less tolerant towards how contiguous the values must be.
[5] ^ For instance, if the index, used to access an element in some array, is computed earlier in such a way that it will be mathematically a valid index under any circumstance, the compiler can then apply bounds checking elimination and use the index directly: it doesn't affect the program's correctness.
[6] ^ Megamorphism = polymorphism with a high number of types.
[7] ^ Virtual call speculation is one of the profile guided optimizations proposed by Visual C++ 8.
- 24-02-2008:
- License change from GPL to CPOL
- [Article] small explanation added about the inlining policy of Visual C++ ( section Platform Visual C++ )
- [Article] correction in the get_type
code ( section Typelist )
- [Article] correction in the functor code to illustrate the return_type
feature ( section Functor Definition )
- 30-01-2008: Initial version