This article presents a new ultrafast timeseries storage engine. The solution includes several examples that write 1 000 000 000 timeseries values to the timeseries storage, demonstrating its performance for various workloads.
Introduction
A timeseries is a table indexed in time order, and the information contained within timeseries tables is the main ingredient of many machine learning algorithms.
The information within a timeseries is anything that changes over time, such as:
- Stock prices and volume
- Precipitation
By analyzing stock prices, you can attempt to predict future stock prices; and by analyzing precipitation, you can make qualified guesses about future floods or droughts, or the price of electricity if the main source of electricity is hydro power.
Ignoring cat pictures, there is probably more data stored as timeseries than any other kind of data.
In their book, Time Series Databases: New Ways to Store and Access Data, Ellen Friedman and Ted Dunning make a good argument for using a NoSQL database with a hybrid design to implement a high-performance timeseries database engine, using OpenTSDB as a reference case.
The concept is simple: Use blobs to store the timeseries data.
My intention for this article, and the accompanying source code, is to show that the concept works – and that it can be implemented in easily understood C++. You can easily follow the code for this article if you have a background as a C# developer; deep C++ knowledge is not required.
Building the Code
The boost C++ libraries are required to build and run the code. The provided Visual Studio projects expect the environment variable BOOST_ROOT to point to the directory where you unpacked the distribution, and that the library files can be found in $(BOOST_ROOT)\stage\lib.
Make sure you compile for x64, and if you plan to test the code on a system that does not support AVX2, you need to change the Enable Enhanced Instruction Set setting under C/C++ ⇒ Code Generation to a value suitable for your system.
The performance tests use an environment variable HCC_TEST_DATA_ROOT, and this must be set to the full path of a directory where the tests can create a subdirectory for the database files. Make sure there is at least 40 GB of free space on the drive containing this directory.
Performance
Since I call this an ultrafast timeseries storage engine, I should have some numbers to back that claim.
So, before going into the details of the solution, I would like to present the results of some performance tests that I have used to get an idea about how well this works. The source code for the programs is included with the download for this article, and when evaluating the results, please keep in mind that these tests were executed on a laptop, not a high-performance server.
The timings cover the period from just before the first timeseries point is written until the last one is written and the transaction is committed to the database file. Reading is done in a new transaction scope.
One of the programs, HExPerf01.exe, writes one billion timeseries values to a single timeseries, and then reads back all the data:
Database directory:F:\Database\LMDB
Wrote 1000000000 timeseries points in 43.384699 seconds - points pr. second: 23049600.966461
Read 1000000000 timeseries points in 1.232314 seconds - points pr. second: 811481423.445532
Sum inserted:500000000067108992.000000,
Sum read: 500000000067108992.000000
This is certainly promising, as the program
- inserted more than 23 million timeseries points per second
- read more than 811 million timeseries points per second
1 000 000 000 timeseries points is enough data to demonstrate that the timeseries storage engine is something more than a toy.
The timeseries point is a record with three fields:
- Timestamp: DateTime
- Flags: Int64
- Value: double
The size of each record is 24 bytes, so the size of 1 000 000 000 timeseries points is nearly 24 GB.
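If you want to convince yourself of the 24-byte figure, a minimal stand-in for the record layout can be checked at compile time. The struct below is only an illustration of the sizes involved, assuming DateTime is stored as a 64-bit tick count as described later; it is not the actual TimeseriesPoint class:

#include <cstdint>

// Stand-in for the point layout described above: 8 + 8 + 8 bytes, no padding.
struct ExamplePoint
{
    std::int64_t timestampTicks_;
    std::int64_t flags_;
    double value_;
};
static_assert( sizeof( ExamplePoint ) == 24, "24 bytes per timeseries point" );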
Sum inserted is the sum of all the values for the timeseries points written to the storage, and Sum read is the sum of all the values for the timeseries points read from the storage. The identical sums verify that we read back the same data that was written.
Run it again, and it writes the timeseries points even faster:
Database directory:F:\Database\LMDB
Wrote 1000000000 timeseries points in 23.693139 seconds - points pr. second: 42206310.574444
Read 1000000000 timeseries points in 1.181736 seconds - points pr. second: 846212833.697684
Sum inserted:500000000067108992.000000,
Sum read: 500000000067108992.000000
This time, the program
- inserted more than 42 million timeseries points per second
- read more than 846 million timeseries points per second
The write performance went up because the timeseries storage engine reuses the storage file created during the first run.
The performance looks good, but this is hardly a normal use case for a timeseries storage engine.
Over the years, I have looked at several timeseries engine benchmarks, and they usually write timeseries points in batches. HExPerf03.exe writes one billion timeseries values, spreading the data over 10 000 timeseries, writing the timeseries points to the timeseries in batches of 250 values:
for ( size_t i = 0; i < BatchCount; ++i )
{
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor1.ChangeTimeseries( timeseriesId );
for ( size_t j = 0; j < BatchSize; ++j )
{
Int64 value = static_cast<Int64>( ( i * BatchSize ) + j ) + 1;
sumWritten += value;
timeseriesCursor1.Insert( Point( DateTime( value ),
0, static_cast<double>( value ) ) );
}
}
}
Reading, and calculating the simple checksum for the data:
size_t totalRows = 0;
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor2.ChangeTimeseries( timeseriesId );
totalRows += timeseriesCursor2.ForEach( []( const Point& point, double& sumRead )
{
sumRead += point.Value( );
}, sumRead );
}
Output:
Database directory:F:\Database\LMDB
Inserted 1000000000 timeseries points into 10000 timeseries in 27.847767 seconds
- points pr. second: 35909521.937612
Read 1000000000 timeseries points from 10000 timeseries in 2.563846 seconds
- points pr. second: 390039103.370308
Sum inserted:50000500000000.000000,
Sum read: 50000500000000.000000
As expected, the performance dropped, but the timeseries engine is still able to
- insert more than 35 million timeseries points per second
- read more than 390 million timeseries points per second
The most interesting test for a timeseries storage engine is how well it handles writing one value at a time to the timeseries storage. If you have 10 000 sensors reporting changes at the same time, you get a very different workload from the ones demonstrated so far. Like the previous program, HExPerf04.exe writes 1 000 000 000 values spreading the data across 10 000 timeseries, but writes one value to a timeseries before moving on to the next one:
for ( size_t i = 0; i < NumberOfPoints; ++i )
{
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor1.ChangeTimeseries( timeseriesId );
Int64 value = static_cast<Int64>( i + 1 );
sumWritten += value;
timeseriesCursor1.Insert( DateTime( value ), 0, static_cast<double>( value ) );
}
}
Output:
Database directory:F:\Database\LMDB
Inserted 1000000000 timeseries points into 10000 timeseries in 76.781371 seconds
- points pr. second: 13023992.431581
Read 1000000000 timeseries points from 10000 timeseries in 2.515501 seconds
- points pr. second: 397535075.811728
Again, the performance dropped, and this program
- inserted more than 13 million timeseries points per second
- read more than 397 million timeseries points per second
Note that the read performance varies greatly, and for this workload, it varies between 300 and 400 million timeseries points per second, while the write performance is fairly stable. This is probably because the read performance relies heavily on the cache mechanisms provided by Windows for memory mapped IO, which try to balance the needs of all the applications running on the system.
Since each of the tests above wrote all the data just before reading it back again, the cache is hot, and this is the main reason for the rather extraordinary read performance. To demonstrate the read performance of the timeseries storage engine when the cache is cold, HExPerf05.exe opens an existing storage and reads all the timeseries points once:
Database directory:F:\Database\LMDB
Read 1000000000 timeseries points from 10000 timeseries in 10.476408 seconds
- points pr. second: 95452568.067823
Sum read: 50000500000000.000000
As expected, the read performance dropped, but it is still exceptional: reading more than 95 million timeseries points per second with little benefit from cached information.
To put these numbers into perspective: there were 24 441 833 trades on Nasdaq on October 1st, 2020 – and the timeseries storage engine can store 1 000 000 000 trades in 96 seconds when the record for each trade contains:
- Timestamp: DateTime
- BidId: 64-bit bid identifier
- OfferId: 64-bit offer identifier
- Volume: 64-bit floating point number for the traded volume
- Price: Currency, 64-bit fixed point for price per volume unit
Each record is 40 bytes, and the timeseries storage engine stored close to 40 GB in 96 seconds. So, if this were the information stored for each trade, it would be able to store all the trades for one trading day in less than 3 seconds.
The HExPerf06 project also demonstrates how easy it is to use a custom timeseries point class with the timeseries storage engine.
If you try out these tests on your own system, be aware that your anti-virus solution may have a negative impact on the performance of the tests. Performance should still be outstanding, but not as good as it will be if you turn off the anti-virus monitoring for the folder containing the datafiles.
I guess Ellen and Ted knew what they were writing about, and of course, the timeseries storage engine takes advantage of the nature of timeseries data, and it is optimized for the most common usage patterns.
Timeseries Data
An efficient timeseries storage engine processes data differently from a regular database engine since:
- data is mostly appended to a timeseries
- updates to existing data are comparatively rare
- the record for a timeseries point usually has a fixed size
- large volumes of data, sometimes billions of records, are read and written each day
A typical windmill used for power production has multiple eddy current and displacement sensors, accelerometers, and wind and temperature sensors; all continuously reporting information. The information needs to be analyzed in real-time, often using complex operations on timeseries to detect changes that are critical to safe and optimal operation of a wind park:
- Trend analysis can be used to compare values from normal operating conditions with the current values.
- Time synchronous averaging is commonly used for gear condition monitoring, resampling the vibration data synchronously with shaft rotation in order to extract periodic waveforms from noisy data. This technique can be used to identify rotating bearing or gearbox defects.
- Amplitude demodulation is used to detect defects that produce impacting, such as rolling contacts in bearings and tooth-to-tooth contacts in the gear meshes.
- Fast Fourier transform and spectrum analysis make it possible to distinguish between normal rotation and characteristic defect frequencies.
These are just examples of the kind of analysis that can make use of large amounts of high resolution timeseries information, where much more data will be read than written. I think this makes a good case for a timeseries storage engine with extraordinary read performance and good write performance.
The Timeseries Engine
Using an existing noSQL database engine simplifies things a lot, but it also means that the design must be tailored to the strengths and weaknesses of that engine.
The design I am going to present here heavily favors reading over writing, and I chose to use the Lightning Memory-Mapped Database (LMDB), https://symas.com/lmdb/, for the blob storage because it is small enough to be included with the source code for the article.
LMDB is a remarkably compact NoSQL storage engine, and developing for this engine is probably a bit like driving a Formula 1 racing car: use it right and you will get blazing performance, but this can drop rather quickly if you don’t play to its strengths.
The timeseries storage engine is implemented as a set of C++ classes and templates, which allows you to specify your own type for each timeseries point, and the number of timeseries points that can be stored inside a single blob. The only requirement for a timeseries point is that it must satisfy this concept:
template<typename T>
concept PointType = requires(T t)
{
{ t.Timestamp() } -> std::convertible_to<DateTime>;
std::is_default_constructible_v<T>;
std::is_standard_layout_v<T>;
};
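For example, a custom point type like the trade record used by HExPerf06 only needs to be default constructible, have standard layout, and expose a Timestamp( ) accessor. The sketch below is hypothetical; the member names are illustrative and are not taken from the HExPerf06 sources:

// A hypothetical trade point satisfying the PointType concept.
class TradePoint
{
    DateTime timestamp_;
    Int64 bidId_ = 0;
    Int64 offerId_ = 0;
    double volume_ = 0.0;
    Int64 price_ = 0;   // 64-bit fixed point price per volume unit
public:
    TradePoint( ) = default;
    DateTime Timestamp( ) const noexcept { return timestamp_; }
};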
The library provides one class, TimeseriesPoint, which satisfies these requirements, and this class will be used throughout the series. It has only three data members:
class TimeseriesPoint
{
DateTime timestamp_;
Int64 flags_;
double value_;
public:
...
};
Timestamps are stored as 64-bit integers, compatible with .NET System.DateTime ticks in UTC.
This is a fairly common representation of a timeseries point used to store measurement data, and the initial storage overhead for the B+ tree is just above 1%.
The strategy for the blob storage is simple:
- Timeseries points are bundled together in fixed size segments which are stored in the blobs.
- A timeseries is identified by a Guid.
- The key for a segment in the database has two fields (see the sketch after this list):
  - the Guid for the timeseries
  - the timestamp of the first timeseries point in the segment
- All segments for a timeseries, except the last one, are completely filled with data.
- Empty segments will be deleted; there should be no empty segments in the database.
- Each segment holds data for an interval, and data for that interval can only be present in that segment.
Timeseries points can be unevenly distributed over time, which is good for solutions that mainly handle events that happen at irregular intervals.
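The SegmentKey class itself is not listed in this article, so here is a minimal sketch of a key with the layout described above; the struct name and field names are assumptions, not the actual implementation:

// Hypothetical layout of a segment key: the Guid identifying the timeseries,
// followed by the timestamp of the first timeseries point in the segment.
// The real SegmentKey/KeyData types in the download may differ.
struct ExampleSegmentKey
{
    Guid timeseriesId_;       // 16 bytes
    DateTime firstTimestamp_; // 8 bytes, DateTime ticks
};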
Since Guid is used extensively by the project, I have given the class a workover, and the Guid class now relies on SSE 4.1. Comparison has a slight performance edge over boost::uuids::uuid, although there is no good reason for this as they do the same thing for basic comparison. It is when Guid is used as a key with std::unordered_map<K,V> that its strength becomes apparent, as it makes searches in the map more than twice as fast as using std::unordered_map<> with boost::uuids::uuid and boost::hash<boost::uuids::uuid>.
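To illustrate the kind of SSE 4.1 code involved, here is a minimal sketch of a 16-byte equality check using the PTEST instruction; this shows the general technique, not necessarily the exact code used by the Guid class:

#include <smmintrin.h>  // SSE 4.1 intrinsics

// Compare two 16 byte values for equality using PTEST (SSE 4.1).
inline bool Equal128( const void* lhs, const void* rhs ) noexcept
{
    __m128i a = _mm_loadu_si128( reinterpret_cast<const __m128i*>( lhs ) );
    __m128i b = _mm_loadu_si128( reinterpret_cast<const __m128i*>( rhs ) );
    __m128i difference = _mm_xor_si128( a, b );   // all zero when a == b
    return _mm_testz_si128( difference, difference ) != 0;
}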
The value of a Guid created by CoCreateGuid is fairly random, so there is really no need to perform any calculation in the specialization of std::hash<> to get a good distribution for hash keys:
namespace std
{
template<> struct hash<Harlinn::Common::Core::Guid>
{
constexpr std::size_t operator()( const Harlinn::Common::Core::Guid& guid ) const noexcept
{
return guid.Lo( );
}
};
}
Since a Guid is just a 16 byte value, guid.Lo( ) returns the 8 bytes starting at the 9th byte of the Guid, and this strategy works very well with std::unordered_map<> where the performance of the hash function plays a major role.
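A minimal sketch of what an accessor like Lo( ) could look like, assuming the Guid keeps its value in a 16 byte buffer; the actual implementation may differ:

#include <cstdint>
#include <cstring>

class ExampleGuid
{
    unsigned char data_[ 16 ];
public:
    // Returns the 8 bytes starting at the 9th byte of the Guid,
    // used directly as the hash value.
    std::uint64_t Lo( ) const noexcept
    {
        std::uint64_t result;
        std::memcpy( &result, data_ + 8, sizeof( result ) );
        return result;
    }
};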
Using the Engine
Using the timeseries storage engine is easy:
int main()
{
using Engine = Timeseries::Engine<>;
using Point = typename Engine::Point;
using TimeseriesCursor = typename Engine::TimeseriesCursor;
DateTime timestamp1( 2020, 1, 1 ), timestamp2( 2020, 1, 2 ),
timestamp3( 2020, 1, 3 ), timestamp4( 2020, 1, 4 );
First, we just accept the default parameters for the Timeseries::Engine template, which specify that TimeseriesPoint will be used as the representation for timeseries points and that the segments will have room for 8100 entries, and then we create a couple of aliases for the timeseries point type and the timeseries cursor. The four timestamps hardly need an explanation.
auto DatabaseDir = GetDatabaseDir( );
printf( "Database directory:%s\n", DatabaseDir.c_str( ) );
TimeseriesEngineOptions options( DatabaseDir, true );
auto timeseriesId1 = Test::Ids[0];
GetDatabaseDir() retrieves the value of the HCC_TEST_DATA_ROOT environment variable, appends "\\LMDB" and makes sure the resulting path exists. The second parameter to the TimeseriesEngineOptions constructor specifies whether we want to create a new environment or not. Test::Ids is just an array of 10 000 predefined GUIDs, and we pick the first one to identify the timeseries.
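For reference, a minimal sketch of what a helper like GetDatabaseDir( ) might do using std::filesystem; the actual helper in the download may be implemented differently:

#include <cstdlib>
#include <filesystem>
#include <stdexcept>
#include <string>

std::string GetDatabaseDir( )
{
    // Read the root directory for test data from the environment.
    const char* root = std::getenv( "HCC_TEST_DATA_ROOT" );
    if ( root == nullptr )
    {
        throw std::runtime_error( "HCC_TEST_DATA_ROOT is not set" );
    }
    // Append the LMDB subdirectory and make sure it exists.
    std::filesystem::path databaseDir = std::filesystem::path( root ) / "LMDB";
    std::filesystem::create_directories( databaseDir );
    return databaseDir.string( );
}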
Engine engine( options );
auto transaction = engine.BeginTransaction( );
auto timeseriesCursor = transaction.OpenTimeseries( timeseriesId1 );
The engine constructor initializes LMDB, and now the engine is ready for use. engine.BeginTransaction( ) creates a new write transaction, and we open a TimeseriesCursor in that transaction using OpenTimeseries(…), passing the Guid identifying the timeseries as the argument. Finally, we are ready to insert some data:
timeseriesCursor.Insert( timestamp1, 1 );
timeseriesCursor.Insert( timestamp2, 2 );
timeseriesCursor.Insert( timestamp3, 3 );
timeseriesCursor.Insert( timestamp4, 4 );
And then, we iterate over the timeseries:
if ( timeseriesCursor.MoveFirst( ) )
{
do
{
auto& current = timeseriesCursor.Current( );
std::cout << "Timestamp: " << current.Timestamp( )
<< ", Flags: " << current.Flags( )
<< ", Value: " << current.Value( ) << std::endl;
} while ( timeseriesCursor.MoveNext( ) );
}
Searching is a bit more interesting:
DateTime timestamp( 2020, 1, 2, 12, 0, 0 );
There are no timeseries points in the storage with this timestamp, but it is still relevant to be able to use this timestamp to search for the previous timeseries point:
auto found = timeseriesCursor.Search( timestamp );
if ( found )
{
auto& current = timeseriesCursor.Current( );
std::cout << "Found Timestamp: " << current.Timestamp( )
<< ", Flags: " << current.Flags( )
<< ", Value: " << current.Value( ) << std::endl;
}
At this point, we are done with what we want to do, so it is time to clean things up:
timeseriesCursor.Close( );
transaction.Commit( );
}
Neither of the last two lines of code is strictly needed, but without the commit, the destructor will roll back any changes made to the timeseries storage.
Database directory:F:\Database\LMDB
Timestamp: 01.01.2020 00:00:00, Flags: 0, Value: 1
Timestamp: 02.01.2020 00:00:00, Flags: 0, Value: 2
Timestamp: 03.01.2020 00:00:00, Flags: 0, Value: 3
Timestamp: 04.01.2020 00:00:00, Flags: 0, Value: 4
Found Timestamp: 02.01.2020 00:00:00, Flags: 0, Value: 2
This covers the most important operations for a timeseries storage engine. The tricky part is to make sure these operations can be performed quickly; everything else is ancillary.
The timeseries engine has three main parts:
- the Engine, Transaction and TimeseriesCursor C++ template classes
- the Segment and SegmentKey C++ template classes
- the LMDB C++ classes
The Engine, Transaction and TimeseriesCursor C++ Template Classes
The Engine C++ template class is the entry point for working with the timeseries storage engine. By default, the size of a segment, which is the maximum number of timeseries points that can be written to a blob, is set to 8100, and the type used to represent a single timeseries point is TimeseriesPoint.
The constructor takes a single argument, a reference to an EngineOptions object. EngineOptions lets you specify the directory that contains the LMDB datafiles, and whether to create a new storage or open an existing one. Apart from the constructor and destructor, the Engine template exposes a single function:
Transaction<TP, segmentSize>
BeginTransaction( TransactionFlags transactionFlags = TransactionFlags::None ) const
{
auto lmdbTransaction = environment_.BeginTransaction( transactionFlags );
Transaction<TP, segmentSize> result( const_cast<Engine*>(this),
std::move( lmdbTransaction ) );
return result;
}
BeginTransaction(…) creates a new Transaction object, which wraps an LMDB::Transaction object and caches information on behalf of the cursors.
This cache is implemented as an instance of:
std::unordered_map<Guid, std::unique_ptr<TimeseriesInfoType>> timeseriesMap_;
where TimeseriesInfoType is an instantiation of the following template:
template<Segments::PointType TP, size_t segmentSize>
class TimeseriesInfo
{
public:
using SegmentType = Segments::Segment< TP, segmentSize>;
private:
DateTime minTimestamp_;
DateTime maxTimestamp_;
SegmentType modificationBuffer_;
size_t changes_ = 0;
std::optional<DateTime> lastSegmentTimestamp_;
DateTime loadedTimestamp_;
Guid id_;
public:
...
};
Here, we keep a few details about each of the timeseries, such as the timestamp of the first, and the last, timeseries point in the timeseries. Then, we have the modification buffer and its change count, and the timestamp of the last segment stored in LMDB. loadedTimestamp_ contains the value of the timestamp for the first timeseries point in the modification buffer, at the time it was loaded.
This is enough information to maintain a cache for updates to the storage, boosting the write performance significantly. Commit flushes any buffered modifications to storage:
void Commit( )
{
SaveCachedUpdates( );
lmdbTransaction_.Commit( );
timeseriesMap_.clear( );
}
and SaveCachedUpdates( ) just iterates over the entries in the timeseriesMap_:
void SaveCachedUpdates( )
{
auto database = TimeseriesDataTable( );
for ( auto& entry : timeseriesMap_ )
{
TimeseriesInfoType* timeseriesInfo = entry.second.get( );
if ( timeseriesInfo->Changes( ) )
{
timeseriesInfo->SetChanges( 0 );
auto modificationBuffer = timeseriesInfo->ModificationBuffer( );
const auto& timeseries = timeseriesInfo->Id( );
auto loadedTimestamp = timeseriesInfo->LoadedTimestamp( );
auto& first = modificationBuffer->front( );
if ( loadedTimestamp && ( loadedTimestamp != first.Timestamp( ) ) )
{
KeyData deleteKey( timeseries, loadedTimestamp );
lmdbTransaction_.Delete( database, deleteKey );
}
KeyData keyData( timeseries, first );
lmdbTransaction_.Write( database, keyData, *modificationBuffer );
timeseriesInfo->SetLoadedTimestamp( first.Timestamp( ) );
}
}
}
When writing a segment to the storage, there is one thing that must be handled properly: the first timeseries point may no longer have the same timestamp as the first point in the modification buffer had when it was loaded. If this is the case, then we must delete the original key/value pair from the storage before storing the modification buffer and updating the loadedTimestamp_ field of the TimeseriesInfo object. Implementing a write cache does not have to be complicated; it just has to be efficient.
The TimeseriesCursor class relies on the GetTimeseriesInfo(…) function of the Transaction class to retrieve the TimeseriesInfo for a timeseries:
TimeseriesInfoType* GetTimeseriesInfo( const Guid& timeseries )
{
auto it = timeseriesMap_.find( timeseries );
if ( it != timeseriesMap_.end( ) )
{
return it->second.get( );
}
else
{
auto timeseriesInfo = std::make_unique<TimeseriesInfoType>( timeseries );
auto* result = timeseriesInfo.get( );
timeseriesMap_.emplace( timeseries, std::move( timeseriesInfo ) );
return result;
}
}
and if this is the first time the function is called for a particular timeseries, it creates a new instance and transfers ownership of the object to the timeseriesMap_.
TimeseriesCursor has few fields on its own:
template<Segments::PointType TP = TimeseriesPoint,
size_t segmentSize = TimeseriesCursorSegmentSize>
class TimeseriesCursor
: public Segments::SegmentContainer<
TimeseriesCursor<TP, segmentSize>, Guid, TP, segmentSize>
{
public:
static constexpr size_t SegmentSize = segmentSize;
using Base = Segments::SegmentContainer<
TimeseriesCursor<TP, segmentSize>, Guid, TP, segmentSize>;
using Point = typename Base::Point;
using Overflow = std::optional<Point>;
using SegmentType = typename Base::DataType;
using KeyData = typename Base::KeyType;
using TransactionType = Transaction<TP, segmentSize>;
private:
using TimeseriesInfoType = Internal::TimeseriesInfo<TP, segmentSize>;
size_t position_ = 0;
TimeseriesCursorState state_ = TimeseriesCursorState::Unknown;
TimeseriesInfoType* timeseriesInfo_;
SegmentType* currentSegmentData_;
LMDB::Cursor cursor_;
TransactionType* transaction_;
public:
...
};
position_ is currently 1-based, which was probably not a good idea, as I have to subtract 1 all the time to get to the current timeseries point within the modification buffer, or the segment pointed to by currentSegmentData_. Which one is used is determined by the state_ of the cursor. currentSegmentData_ usually points to memory managed by LMDB, and may point to memory that is mapped to the datafile.
The TimeseriesCursor class implements operations for searching and navigating through the data for a timeseries:
- const Point& Current( ) const noexcept: returns a reference to the timeseries point at the current position of the cursor
- void Flush( ): flushes the modification buffer to storage
- void Insert( const Point& point ): inserts or overwrites a timeseries point in the timeseries
- SearchResult Search( const DateTime& timestamp ): searches the timeseries for a timeseries point with a timestamp less than or equal to the given timestamp
- bool MoveFirst( ): moves the cursor to the first timeseries point in the timeseries
- bool MoveLast( ): moves the cursor to the last timeseries point in the timeseries
- bool MoveNext( ): positions the cursor on the next timeseries point in the timeseries
- bool MovePrevious( ): positions the cursor on the previous timeseries point in the timeseries
- size_t ForEach( Func&& func, Args&& ...args ): provides fast access to every timeseries point in the timeseries
- size_t ForEach( const DateTime& start, Func&& func, Args&& ...args ): provides fast access to the timeseries points in the timeseries, starting with the timeseries point with a timestamp equal to start, or the timeseries point with the greatest timestamp less than start if there is none with an equal timestamp
- size_t ForEach( const DateTime& start, const DateTime& end, Func&& func, Args&& ...args ): similar to the previous overload, but stops the iteration at the last timeseries point with a timestamp less than end (see the usage sketch after this list)
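A short usage sketch for the range overload, assuming the same cursor, Point and DateTime setup as in the earlier examples:

// Sum the values for points from rangeStart (or the nearest earlier point)
// up to, but not including, rangeEnd.
DateTime rangeStart( 2020, 1, 2 );
DateTime rangeEnd( 2020, 1, 4 );
double sum = 0.0;
size_t count = timeseriesCursor.ForEach( rangeStart, rangeEnd,
    []( const Point& point, double& sum )
    {
        sum += point.Value( );
    }, sum );
std::cout << "Points: " << count << ", Sum: " << sum << std::endl;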
One of the complicating factors for the implementation of the TimeseriesCursor is that a search using LMDB positions the LMDB cursor either at the entry that matches the search key, or at the next entry in the B+ tree. The downside of this is that when this mechanism is used to locate a timeseries point, the search will nearly always place the cursor on the next key/value pair.
The key for a segment has two parts:
- the Guid identifying the timeseries
- the timestamp for the first timeseries point in the segment
This means that if the timestamp is earlier than any timestamp for a timeseries point in the timeseries, simply moving to the previous entry is not good enough, as that would put the cursor on an entry with a segment of timeseries points belonging to another timeseries. In this case, the search has to move back to the entry located by LMDB and set the compareResult_ field of the SearchResult to CompareResult::Greater, indicating that the cursor is positioned on the first timeseries point of the timeseries, with a greater timestamp than the one that was passed to Search(…).
Since a timeseries point can be inserted anywhere in a timeseries, the solution must be able to handle situations where it will replace the first timeseries point in a segment. In this case, it must look at the end of the previous segment and determine that the timestamp of the new timeseries point is greater than the timestamp of the last timeseries point in the previous segment. If this is the case, the cursor is moved to the segment located by the search functionality of LMDB and the timeseries point is inserted in front of the segment. This causes the key for the segment to change, and the entry with the old key must be deleted when flushing the changes to the storage.
The LMDB C++ Classes
LMDB is a compact implementation of a transactional key/value store, and HCCLMDB.h contains a set of classes that wrap the relevant parts of the LMDB C API. The Visual Studio 2019 solution, provided with the source code for this article, includes the libLMDB project which creates a DLL for LMDB.
In LMDB terminology, an environment provides access to a data file and an accompanying lock file, while a database provides access to a key/value store kept in the data file of the environment. It is tempting to call this a table, and multiple named databases can reside in one environment.
Nearly all LMDB operations use transactions, and LMDB provides support for two kinds of transactions, one for reading and one for writing. Readers do not block writers, and writers do not block readers. An environment can be opened by multiple processes running on the same machine, and each environment supports one concurrent write transaction.
By default, keys are ordered lexicographically, but you can supply your own comparison function.
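As a sketch of what a custom key ordering can look like at the LMDB C API level (the C++ wrappers in HCCLMDB.h may expose this differently), a comparison function receives two MDB_val pointers and is registered with mdb_set_compare:

#include "lmdb.h"
#include <cstdint>
#include <cstring>

// Order keys as 64-bit unsigned integers instead of lexicographically.
static int CompareUInt64Keys( const MDB_val* a, const MDB_val* b )
{
    std::uint64_t lhs, rhs;
    std::memcpy( &lhs, a->mv_data, sizeof( lhs ) );
    std::memcpy( &rhs, b->mv_data, sizeof( rhs ) );
    return lhs < rhs ? -1 : ( lhs > rhs ? 1 : 0 );
}

// Registered on an open database inside a transaction:
// mdb_set_compare( txn, dbi, CompareUInt64Keys );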
Named Databases
To open more than one database in an environment, each must be named, and the maximum number of databases that can be opened in the environment must be specified. This must be done by the first process or thread creating or opening the environment.
This is handled transparently by the LMDB::Environment constructor, which takes the maximum number of databases as its third argument.
explicit Environment( const char* path,
size_t newMemoryMapSize = 10485760,
unsigned int maxDatabases = 0,
EnvironmentFlags environmentFlags = EnvironmentFlags::WriteMemMap );
The LMDB C++ classes make it easier to work with LMDB, primarily by wrapping the handle types from the LMDB C API using a C++ class for each handle type:
- Environment
- Transaction
- Database
- Cursor
Errors reported by the LMDB C API are turned into C++ exceptions, and the library tries to provide sensible default values for many function parameters.
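A minimal sketch of this error handling strategy; the helper name and the exception type used here are assumptions, not the library's actual choices:

#include "lmdb.h"
#include <stdexcept>

// Turn an LMDB status code into a C++ exception.
inline void CheckLMDBResult( int result )
{
    if ( result != MDB_SUCCESS )
    {
        throw std::runtime_error( mdb_strerror( result ) );
    }
}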
Environment
The application must create an LMDB::Environment object before any work can be performed using LMDB.
An environment contains at most one anonymous database, or it can contain multiple named databases, residing in a single memory mapped file.
The simplest way to create a new LMDB::Environment object only requires the path to the directory that contains, or will contain, the database files:
LMDB::Environment environment( DatabaseDir );
Transaction
All database operations require a transaction, and transactions can be read-write or read-only. Write transactions may not span threads. A new read-write transaction is created by calling:
auto transaction = environment.BeginTransaction( );
While a read-only transaction is created with the following statement:
transaction = environment.BeginTransaction( LMDB::TransactionFlags::ReadOnly );
The changes performed in a transaction are committed to the database using:
transaction.Commit( );
while:
transaction.Abort( );
is used to roll back all the changes made in the transaction. The destructor of a Transaction object will also roll back any changes not committed to the environment.
Before you can make any changes to a database, it must be opened:
auto database = transaction.OpenDatabase( );
and once a database is opened, it can be modified:
transaction.Write( database, 1, 1);
or you can open a cursor on the database in a transaction:
auto cursor = transaction.OpenCursor( database );
Database
The LMDB::Database class wraps a database handle in an LMDB environment.
A database is opened using:
class Transaction
{
...
public:
Database OpenDatabase( const char* name, DatabaseFlags databaseFlags = DatabaseFlags::None );
Database OpenDatabase( DatabaseFlags databaseFlags = DatabaseFlags::None )
...
};
The database handle will be private to the current transaction until the transaction is committed. After a successful commit, the handle resides in the environment, ready for use by other transactions.
Transaction::OpenDatabase(…) must not be called from multiple concurrent transactions in the same process, and the transaction must either commit or abort before any other transaction in the process can call Transaction::OpenDatabase(…). If the transaction is aborted, the database handle will be closed automatically.
Transaction::OpenDatabase(…) returns the existing database handle when it is called for a database that is already open in the environment. Database handles can only be closed once, by calling Environment::CloseDatabase(…).
Cursor
LMDB::Cursor objects provide functionality that can be used to navigate, search and modify key/value pairs in the database:
- constexpr bool IsPositioned( ) const noexcept: returns true if the cursor is positioned on a key/value pair
- const LMDB::Value& Key( ) const noexcept: returns a reference to the current key
- template<typename T> const T& Key( ) const noexcept: template used to cast the contents of the key to the specified type
- bool SetKey( const LMDB::Value& key ): moves the cursor to the specified key, returns false if the specified key/value pair does not exist
- const LMDB::Value& Value( ) const noexcept: returns a reference to the current value
- template<typename T> const T& Value( ) const noexcept: template used to cast the contents of the current value to the specified type
- bool SetValue( const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): updates the value at the current cursor position
- bool Write( const LMDB::Value& key, const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): stores a key/value pair in the database using the cursor
- template<ValueType T1, ValueType T2> bool Write( const T1& key, const T2& value, WriteFlags writeFlags = WriteFlags::None ): creates and initializes LMDB::Value objects for the key and value, simplifying the API
- bool Write( const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): updates the value at the current cursor position
- template<ValueType T> bool Write( const T& value, WriteFlags writeFlags = WriteFlags::None ): simplifies updating the value at the current cursor position by initializing the LMDB::Value object for the argument value
- bool Search( const LMDB::Value& key ): searches the database for a key/value pair with a key equal to, or greater than, the specified search key
- template<ValueType T> bool Search( const T& key ): simplifies the Search(…) API by creating and initializing an LMDB::Value object for the key
- bool MoveTo( const LMDB::Value& key ): moves the cursor to the key/value pair for the argument key, returns false if the key does not exist
- template<ValueType T> bool MoveTo( const T& key ): simplifies the MoveTo(…) API by creating and initializing an LMDB::Value object for the key
- bool MoveFirst( ): moves the cursor to the first key/value pair in the database, returns false if the database is empty
- bool MoveFirstDup( ): positions the cursor at the first data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNext( ): moves the cursor to the next key/value pair in the database, returns false if the cursor was positioned on the last entry or the database is empty
- bool MoveNextDup( ): moves the cursor to the next data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNextNoDup( ): moves the cursor to the next key/value pair; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNextMultiple( ): returns up to a page of duplicate data items from the next cursor position and moves the cursor to prepare for the next MoveNextMultiple( ) call; only for databases opened with the DatabaseFlags::DupFixed flag
- bool MoveLast( ): moves the cursor to the last key/value pair in the database, returns false if the database is empty
- bool MoveLastDup( ): moves the cursor to the last data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePrevious( ): moves the cursor to the previous key/value pair in the database, returns false if the cursor was positioned on the first entry or the database is empty
- bool MovePreviousDup( ): moves the cursor to the previous data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePreviousNoDup( ): moves the cursor to the last data item of the previous key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePreviousMultiple( ): moves the cursor to the previous page and returns up to a page of duplicate data items
- void Delete( bool noDupData = false ): deletes the current key/value pair; set noDupData to true to delete all the data items for the current key if the database was opened with the DatabaseFlags::DupSort flag
Value
LMDB::Value objects are used to hold keys and values. LMDB::Value is derived from the MDB_val structure of the LMDB C API, and its role is to make sure that the two fields of the structure are always properly initialized.
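A minimal sketch of the idea; the class name and constructor set shown here are assumptions, not the actual HCCLMDB.h interface:

#include "lmdb.h"
#include <cstddef>

// A value wrapper derived from MDB_val that always initializes both fields.
class ExampleValue : public MDB_val
{
public:
    ExampleValue( ) noexcept
        : MDB_val{ 0, nullptr }
    {
    }
    ExampleValue( const void* data, std::size_t size ) noexcept
        : MDB_val{ size, const_cast<void*>( data ) }
    {
    }
};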
The Segment and SegmentKey C++ Template Classes
The Segment and SegmentKey C++ template classes represent the data that will be stored inside the database.
Segment
The Segment class represents the data that is stored as a value, where the timeseries points are held in an std::array. That array may not be completely filled, and size_ is used to tell us how many of the slots in the array are actually filled with timeseries points.
template<PointType TP, size_t maxSegmentSize >
class Segment
{
public:
constexpr static size_t MaxSize = maxSegmentSize;
using Point = TP;
using ArrayType = std::array<Point, MaxSize>;
using iterator = typename ArrayType::iterator;
using const_iterator = typename ArrayType::const_iterator;
...
private:
size_t size_;
ArrayType points_;
public:
Segment( )
: size_( 0 )
{
}
...
};
Segment implements front( ), back( ), size( ), empty( ), data( ), begin( ), end( ), cbegin( ), cend( ), find(…) and operator[](…), which perform the operations you would normally expect.
The find(…) function uses std::lower_bound to locate the timeseries point with the specified timestamp, or the position in the segment where a timeseries point with that timestamp should be inserted:
iterator find( const DateTime& timestamp )
{
return std::lower_bound( begin( ), end( ), timestamp,
[]( const Point& point, const DateTime& timestamp )
{
return point.Timestamp( ) < timestamp;
} );
}
Here, I used the overload that allows us to specify the predicate (or comparison) used for the search. This demonstrates a nice feature of std::lower_bound: I can pass the timestamp to the function without creating a temporary Point object, and the timestamp becomes the second argument to the binary predicate. This is both simple and efficient.
The most interesting, and performance critical, function of the Segment class is insert(…).
The timeseries points inside a segment are sorted according to the timestamp of the timeseries point, and since this is a class that will be used for timeseries data, the function tries to favor appends. The second parameter receives the overflow value, if any. This will only happen when the segment is full, and the timestamp of the timeseries point to insert does not match an existing timeseries point in the segment, which would then be overwritten.
The function also tries to make prepends efficient, as an insert into a timeseries in any but the last segment will cause overflow values to propagate from segment to segment, making this worth the effort.
Since find(…) will never be called for a timestamp greater than or equal to the timestamp of the last timeseries point, we can skip testing it == end() before using the iterator.
This is not a particularly complex or advanced function; it just demonstrates that by adding a few extra steps, optimizing for the common cases, we can make significant performance gains:
void insert( const Point& point, std::optional<Point>& overflow )
{
if ( size_ )
{
Point& last = points_[size_ - 1];
if ( size_ < MaxSize )
{
if ( last.Timestamp( ) < point.Timestamp( ) )
{
points_[size_] = point;
size_++;
}
else if ( last.Timestamp( ) == point.Timestamp( ) )
{
points_[size_ - 1] = point;
}
else if ( point.Timestamp( ) < points_[0].Timestamp( ) )
{
std::copy_backward( begin(), end( ), end( ) + 1 );
points_[0] = point;
++size_;
}
else if ( point.Timestamp( ) == points_[0].Timestamp( ) )
{
points_[0] = point;
}
else
{
auto it = find( point.Timestamp( ) );
if ( it->Timestamp( ) > point.Timestamp( ) )
{
std::copy_backward( it, end( ), end( ) + 1 );
++size_;
}
*it = point;
}
}
else
{
if ( last.Timestamp( ) < point.Timestamp( ) )
{
overflow = point;
}
else if ( last.Timestamp( ) == point.Timestamp( ) )
{
points_[size_ - 1] = point;
}
else if ( point.Timestamp( ) < points_[0].Timestamp( ) )
{
overflow = last;
std::copy_backward( begin(), end( ) - 1, end( ) );
points_[0] = point;
}
else if ( point.Timestamp( ) == points_[0].Timestamp( ) )
{
points_[0] = point;
}
else
{
auto it = find( point.Timestamp( ) );
if ( it->Timestamp( ) > point.Timestamp( ) )
{
overflow = last;
std::copy_backward( it, end( ) - 1, end( ) );
}
*it = point;
}
}
}
else
{
points_[0] = point;
size_++;
}
}
I believe the overall performance of the timeseries storage engine proves that these extra steps are well worth the extra consideration given to appends and prepends.
The End (for now)
This concludes the first article about the timeseries storage engine. I am very open to suggestions for improvements to the API, as long as they are not detrimental to the performance of the engine.
My plan for this series is to create a server on top of the engine, with a .NET client library and example web apps implemented on top of .NET Core.
So, until next time... happy coding!
History
- 5th October, 2020
- 6th October, 2020
  - Bug fix + cleaning up most of the unit tests
- 7th October, 2020
  - More unit tests for the Harlinn.Common.Core library
- 11th February, 2021
  - Bug fixes
  - C++ ODBC support
  - Added the ability to create complex keys that can be sorted using memcmp, which is useful when working with LMDB
- 25th February, 2021
  - Updated LMDB
  - Updated xxHash
  - Added the initial implementation of very fast hash based indexes for large complex keys using LMDB
  - Fast asynchronous logging - nearly done :-)
- 3rd March, 2021
  - New authorization related classes:
    - SecurityId: Wrapper for SID and related operations
    - ExplicitAccess: Wrapper for EXPLICIT_ACCESS
    - Trustee: Wrapper for TRUSTEE
    - SecurityIdAndDomain: Holds the result from LookupAccountName
    - LocalUniqueId: Wrapper for LUID
    - AccessMask: Makes it easy to inspect the rights assigned to an ACCESS_MASK
    - AccessMaskT<>:
      - EventWaitHandleAccessMask: Inspect and manipulate the rights of an EventWaitHandle
      - MutexAccessMask: Inspect and manipulate the rights of a Mutex
      - SemaphoreAccessMask: Inspect and manipulate the rights of a Semaphore
      - WaitableTimerAccessMask: Inspect and manipulate the rights of a WaitableTimer
      - FileAccessMask: Inspect and manipulate file related rights
      - DirectoryAccessMask: Inspect and manipulate directory related rights
      - PipeAccessMask: Inspect and manipulate pipe related rights
      - ThreadAccessMask: Inspect and manipulate thread related rights
      - ProcessAccessMask: Inspect and manipulate process related rights
    - GenericMapping: Wrapper for GENERIC_MAPPING
    - AccessControlEntry: A set of tiny classes that wraps the ACE structures:
      - AccessControlEntryBase<,>
        - AccessAllowedAccessControlEntry
        - AccessDeniedAccessControlEntry
        - SystemAuditAccessControlEntry
        - SystemAlarmAccessControlEntry
        - SystemResourceAttributeAccessControlEntry
        - SystemScopedPolicyIdAccessControlEntry
        - SystemMandatoryLabelAccessControlEntry
        - SystemProcessTrustLabelAccessControlEntry
        - SystemAccessFilterAccessControlEntry
        - AccessDeniedCallbackAccessControlEntry
        - SystemAuditCallbackAccessControlEntry
        - SystemAlarmCallbackAccessControlEntry
      - ObjectAccessControlEntryBase<,>
        - AccessAllowedObjectAccessControlEntry
        - AccessDeniedObjectAccessControlEntry
        - SystemAuditObjectAccessControlEntry
        - SystemAlarmObjectAccessControlEntry
        - AccessAllowedCallbackObjectAccessControlEntry
        - AccessDeniedCallbackObjectAccessControlEntry
        - SystemAuditCallbackObjectAccessControlEntry
        - SystemAlarmCallbackObjectAccessControlEntry
    - AccessControlList: Wrapper for ACL
    - PrivilegeSet: Wrapper for PRIVILEGE_SET
    - SecurityDescriptor: Early stage implementation of wrapper for SECURITY_DESCRIPTOR
    - SecurityAttributes: Very early stage implementation of wrapper for SECURITY_ATTRIBUTES
    - Token: Early stage implementation of wrapper for an access token
    - DomainObject:
      - User: Information about a local, workgroup or domain user
      - Computer: Information about a local, workgroup or domain computer
      - Group: Local, workgroup or domain group
    - Users: Vector of User objects
    - Groups: Vector of Group objects
- 14th of March, 2021 - more work on security related stuff:
  - Token: A wrapper for a Windows access token with a number of supporting classes like:
    - TokenAccessMask: An access mask implementation for the access rights of a Windows access token.
    - TokenGroups: A wrapper/binary compatible replacement for the Windows TOKEN_GROUPS type with a C++ container style interface.
    - TokenPrivileges: A wrapper/binary compatible replacement for the TOKEN_PRIVILEGES type with a C++ container style interface.
    - TokenStatistics: A binary compatible replacement for the Windows TOKEN_STATISTICS type using types implemented by the library such as LocalUniqueId, TokenType and ImpersonationLevel.
    - TokenGroupsAndPrivileges: A wrapper/binary compatible replacement for the Windows TOKEN_GROUPS_AND_PRIVILEGES type.
    - TokenAccessInformation: A wrapper/binary compatible replacement for the Windows TOKEN_ACCESS_INFORMATION type.
    - TokenMandatoryLabel: A wrapper for the Windows TOKEN_MANDATORY_LABEL type.
  - SecurityPackage: Provides access to information about a Windows security package.
  - SecurityPackages: An std::unordered_map of information about the security packages installed on the system.
  - CredentialsHandle: A wrapper for the Windows CredHandle type.
  - SecurityContext: A wrapper for the Windows CtxtHandle type.
  - Crypto::Blob and Crypto::BlobT: C++ style _CRYPTOAPI_BLOB replacement.
  - CertificateContext: A wrapper for the Windows PCCERT_CONTEXT type, provides access to a X.509 certificate.
  - CertificateChain: A wrapper for the Windows PCCERT_CHAIN_CONTEXT type which contains an array of simple certificate chains and a trust status structure that indicates summary validity data on all of the connected simple chains.
  - ServerOcspResponseContext: Contains an encoded OCSP response.
  - ServerOcspResponse: Represents a handle to an OCSP response associated with a server certificate chain.
  - CertificateChainEngine: Represents a chain engine for an application.
  - CertificateTrustList: A wrapper for the Windows PCCTL_CONTEXT type which contains both the encoded and decoded representations of a CTL. It also contains an opened HCRYPTMSG handle to the decoded, cryptographically signed message containing the CTL_INFO as its inner content.
  - CertificateRevocationList: Contains both the encoded and decoded representations of a certificate revocation list (CRL).
  - CertificateStore: A storage for certificates, certificate revocation lists (CRLs), and certificate trust lists (CTLs).
- 23rd of March, 2021:
  - Updated to Visual Studio 16.9.2
  - Build fixes
  - SecurityDescriptor: Implemented serialization for security descriptors, enabling persistence of authorization data