This article presents a new ultrafast timeseries storage engine. The solution includes several examples that write 1 000 000 000 timeseries values to the timeseries storage, demonstrating its performance for various workloads.
Introduction
A timeseries is a table indexed in time order, and the information contained within timeseries tables is the main ingredient of many machine learning algorithms.
The information within a timeseries is anything that changes over time, such as:
- Stock prices and volume
- Precipitation
By analyzing stock prices, you can attempt to predict future stock prices; and by analyzing precipitation, you can make qualified guesses about future floods or droughts, or the price of electricity if the main source of electricity is hydro power.
Ignoring cat pictures, there is probably more data stored as timeseries than any other kind of data.
In their book, Time Series Databases: New Ways to Store and Access Data, Ellen Friedman and Ted Dunning make a good argument for using a NoSQL database with a hybrid design to implement a high-performance timeseries database engine, using OpenTSDB as a reference case.
The concept is simple: Use blobs to store the timeseries data.
My intention for this article, and the accompanying source code, is to show that the concept works – and that it can be implemented in easily understood C++. You can easily follow the code for this article if you have a background as a C# developer; deep C++ knowledge is not required.
Building the Code
The boost C++ libraries are required to build and run the code. The provided Visual Studio projects expect the environment variable BOOST_ROOT to point to the directory where you unpacked the distribution, and that the library files can be found in $(BOOST_ROOT)\stage\lib.
Make sure you compile for x64, and if you plan to test the code on a system that does not support AVX2, you need to change the Enable Enhanced Instruction Set setting under C/C++ ⇒ Code Generation to a value suitable for your system.
The performance tests use an environment variable HCC_TEST_DATA_ROOT, and this must be set to the full path of a directory where the tests can create a subdirectory for the database files. Make sure there is at least 40 GB of free space on the drive containing this directory.
Performance
Since I call this an ultrafast timeseries storage engine, I should have some numbers to back that claim.
So, before going into the details of the solution, I would like to present the results of some performance tests that I have used to get an idea about how well this works. The source code for the programs is included with the download for this article, and when evaluating the results, please keep in mind that these tests were executed on a laptop, not a high-performance server.
The timings cover the period from just before the first timeseries point is written until the last one is written and the transaction is committed to the database file. Reading is done in a new transaction scope.
One of the programs, HExPerf01.exe, writes one billion timeseries values to a single timeseries, and then reads back all the data:
Database directory:F:\Database\LMDB
Wrote 1000000000 timeseries points in 43.384699 seconds - points pr. second: 23049600.966461
Read 1000000000 timeseries points in 1.232314 seconds - points pr. second: 811481423.445532
Sum inserted:500000000067108992.000000,
Sum read: 500000000067108992.000000
This is certainly promising, as the program
- inserted more than 23 million timeseries points per second
- read more than 811 million timeseries points per second
1 000 000 000 timeseries points is enough data to demonstrate that the timeseries storage engine is something more than a toy.
The timeseries point is a record with three fields:
- Timestamp: DateTime
- Flags: Int64
- Value: double
The size of each record is 24 bytes, so the size of 1 000 000 000 timeseries points is nearly 24 GB.
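If you want to convince yourself of the 24-byte figure, a minimal stand-in for the record layout can be checked at compile time. The struct below is only an illustration of the sizes involved, assuming DateTime is stored as a 64-bit tick count as described later; it is not the actual TimeseriesPoint class:

#include <cstdint>

// Stand-in for the point layout described above: 8 + 8 + 8 bytes, no padding.
struct ExamplePoint
{
    std::int64_t timestampTicks_;
    std::int64_t flags_;
    double value_;
};
static_assert( sizeof( ExamplePoint ) == 24, "24 bytes per timeseries point" );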
Sum inserted is the sum of all the values for the timeseries points written to the storage, and Sum read is the sum of all the values for the timeseries points read from the storage. The identical sums verify that we read back the same data that was written.
Run it again, and it writes the timeseries points even faster:
Database directory:F:\Database\LMDB
Wrote 1000000000 timeseries points in 23.693139 seconds - points pr. second: 42206310.574444
Read 1000000000 timeseries points in 1.181736 seconds - points pr. second: 846212833.697684
Sum inserted:500000000067108992.000000,
Sum read: 500000000067108992.000000
This time, the program
- inserted more than 42 million timeseries points per second
- read more than 846 million timeseries points per second
The write performance went up because the timeseries storage engine reuses the storage file created during the first run.
The performance looks good, but this is hardly a normal use case for a timeseries storage engine.
Over the years, I have looked at several timeseries engine benchmarks, and they usually write timeseries points in batches. HExPerf03.exe writes one billion timeseries values, spreading the data over 10 000 timeseries, writing the timeseries points to the timeseries in batches of 250 values:
for ( size_t i = 0; i < BatchCount; ++i )
{
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor1.ChangeTimeseries( timeseriesId );
for ( size_t j = 0; j < BatchSize; ++j )
{
Int64 value = static_cast<Int64>( ( i * BatchSize ) + j ) + 1;
sumWritten += value;
timeseriesCursor1.Insert( Point( DateTime( value ),
0, static_cast<double>( value ) ) );
}
}
}
Reading, and calculating the simple checksum for the data:
size_t totalRows = 0;
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor2.ChangeTimeseries( timeseriesId );
totalRows += timeseriesCursor2.ForEach( []( const Point& point, double& sumRead )
{
sumRead += point.Value( );
}, sumRead );
}
Output:
Database directory:F:\Database\LMDB
Inserted 1000000000 timeseries points into 10000 timeseries in 27.847767 seconds
- points pr. second: 35909521.937612
Read 1000000000 timeseries points from 10000 timeseries in 2.563846 seconds
- points pr. second: 390039103.370308
Sum inserted:50000500000000.000000,
Sum read: 50000500000000.000000
As expected, the performance dropped, but the timeseries engine is still able to
- insert more than 35 million timeseries points per second
- read more than 390 million timeseries points per second
The most interesting test for a timeseries storage engine is how well it handles writing one value at a time to the timeseries storage. If you have 10 000 sensors reporting changes at the same time, you get a very different workload from the ones demonstrated so far. Like the previous program, HExPerf04.exe writes 1 000 000 000 values spreading the data across 10 000 timeseries, but writes one value to a timeseries before moving on to the next one:
for ( size_t i = 0; i < NumberOfPoints; ++i )
{
for ( auto& timeseriesId : timeseriesIds )
{
timeseriesCursor1.ChangeTimeseries( timeseriesId );
Int64 value = static_cast<Int64>( i + 1 );
sumWritten += value;
timeseriesCursor1.Insert( DateTime( value ), 0, static_cast<double>( value ) );
}
}
Output:
Database directory:F:\Database\LMDB
Inserted 1000000000 timeseries points into 10000 timeseries in 76.781371 seconds
- points pr. second: 13023992.431581
Read 1000000000 timeseries points from 10000 timeseries in 2.515501 seconds
- points pr. second: 397535075.811728
Again, the performance dropped, and this program
- inserted more than 13 million timeseries points per second
- read more than 397 million timeseries points per second
Note that the read performance varies greatly, and for this workload, it varies between 300 and 400 million timeseries points per second, while the write performance is fairly stable. This is probably because the read performance relies heavily on the cache mechanisms provided by Windows for memory mapped IO, which try to balance the needs of all the applications running on the system.
Since each of the tests above wrote all the data just before reading it back again, the cache is hot, and this is the main reason for the rather extraordinary read performance. To demonstrate the read performance of the timeseries storage engine when the cache is cold, HExPerf05.exe opens an existing storage and reads all the timeseries points once:
Database directory:F:\Database\LMDB
Read 1000000000 timeseries points from 10000 timeseries in 10.476408 seconds
- points pr. second: 95452568.067823
Sum read: 50000500000000.000000
As expected, the read performance dropped, but it is still exceptional: reading more than 95 million timeseries points per second with little benefit from cached information.
To put these numbers into perspective: there were 24 441 833 trades on Nasdaq on October 1st, 2020 – and the timeseries storage engine can store 1 000 000 000 trades in 96 seconds when the record for each trade contains:
- Timestamp: DateTime
- BidId: 64-bit bid identifier
- OfferId: 64-bit offer identifier
- Volume: 64-bit floating point number for the traded volume
- Price: Currency, 64-bit fixed point for price per volume unit
Each record is 40 bytes, and the timeseries storage engine stored close to 40 GB in 96 seconds. So, if this were the information stored for each trade, it would be able to store all the trades for one trading day in less than 3 seconds.
The HExPerf06 project also demonstrates how easy it is to use a custom timeseries point class with the timeseries storage engine.
If you try out these tests on your own system, be aware that your anti-virus solution may have a negative impact on the performance of the tests. Performance should still be outstanding, but not as good as it will be if you turn off the anti-virus monitoring for the folder containing the datafiles.
I guess Ellen and Ted knew what they were writing about, and of course, the timeseries storage engine takes advantage of the nature of timeseries data, and it is optimized for the most common usage patterns.
Timeseries Data
An efficient timeseries storage engine processes data differently from a regular database engine since:
- data is mostly appended to a timeseries
- updates to existing data are comparatively rare
- the record for a timeseries point usually has a fixed size
- large volumes of data, sometimes billions of records, are read and written each day
A typical windmill used for power production has multiple eddy current and displacement sensors, accelerometers, and wind and temperature sensors; all continuously reporting information. The information needs to be analyzed in real-time, often using complex operations on timeseries to detect changes that are critical to safe and optimal operation of a wind park:
- Trend analysis can be used to compare values from normal operating conditions with the current values.
- Time synchronous averaging is commonly used for gear condition monitoring, resampling the vibration data synchronously with shaft rotation in order to extract periodic waveforms from noisy data. This technique can be used to identify rotating bearing or gearbox defects.
- Amplitude demodulation is used to detect defects that produce impacting, such as rolling contacts in bearings and tooth-to-tooth contacts in the gear meshes.
- Fast Fourier transform and spectrum analysis make it possible to distinguish between normal rotation and characteristic defect frequencies.
These are just examples of the kind of analysis that can make use of large amounts of high resolution timeseries information, where much more data will be read than written. I think this makes a good case for a timeseries storage engine with extraordinary read performance and good write performance.
The Timeseries Engine
Using an existing noSQL database engine simplifies things a lot, but it also means that the design must be tailored to the strengths and weaknesses of that engine.
The design I am going to present here heavily favors reading over writing, and I chose to use the Lightning Memory-Mapped Database (LMDB), https://symas.com/lmdb/, for the blob storage because it is small enough to be included with the source code for the article.
LMDB is a remarkably compact NoSQL storage engine, and developing for this engine is probably a bit like driving a Formula 1 racing car: use it right and you will get blazing performance, but this can drop rather quickly if you don’t play to its strengths.
The timeseries storage engine is implemented as a set of C++ classes and templates, which allows you to specify your own type for each timeseries point, and the number of timeseries points that can be stored inside a single blob. The only requirement for a timeseries point is that it must satisfy this concept:
template<typename T>
concept PointType = requires(T t)
{
{ t.Timestamp() } -> std::convertible_to<DateTime>;
std::is_default_constructible_v<T>;
std::is_standard_layout_v<T>;
};
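For example, a custom point type like the trade record used by HExPerf06 only needs to be default constructible, have standard layout, and expose a Timestamp( ) accessor. The sketch below is hypothetical; the member names are illustrative and are not taken from the HExPerf06 sources:

// A hypothetical trade point satisfying the PointType concept.
class TradePoint
{
    DateTime timestamp_;
    Int64 bidId_ = 0;
    Int64 offerId_ = 0;
    double volume_ = 0.0;
    Int64 price_ = 0;   // 64-bit fixed point price per volume unit
public:
    TradePoint( ) = default;
    DateTime Timestamp( ) const noexcept { return timestamp_; }
};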
The library provides one class, TimeseriesPoint, which satisfies these requirements, and this class will be used throughout the series. It has only three data members:
class TimeseriesPoint
{
DateTime timestamp_;
Int64 flags_;
double value_;
public:
...
};
Timestamps are stored as 64-bit integers, compatible with .NET System.DateTime ticks in UTC.
This is a fairly common representation of a timeseries point used to store measurement data, and the initial storage overhead for the B+ tree is just above 1%.
The strategy for the blob storage is simple:
- Timeseries points are bundled together in fixed size segments which are stored in the blobs.
- A timeseries is identified by a Guid.
- The key for a segment in the database has two fields (see the sketch after this list):
  - the Guid for the timeseries
  - the timestamp of the first timeseries point in the segment
- All segments for a timeseries, except the last one, are completely filled with data.
- Empty segments will be deleted; there should be no empty segments in the database.
- Each segment holds data for an interval, and data for that interval can only be present in that segment.
Timeseries points can be unevenly distributed over time, which is good for solutions that mainly handle events that happen at irregular intervals.
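The SegmentKey class itself is not listed in this article, so here is a minimal sketch of a key with the layout described above; the struct name and field names are assumptions, not the actual implementation:

// Hypothetical layout of a segment key: the Guid identifying the timeseries,
// followed by the timestamp of the first timeseries point in the segment.
// The real SegmentKey/KeyData types in the download may differ.
struct ExampleSegmentKey
{
    Guid timeseriesId_;       // 16 bytes
    DateTime firstTimestamp_; // 8 bytes, DateTime ticks
};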
Since Guid is used extensively by the project, I have given the class a workover, and the Guid class now relies on SSE 4.1. Comparison has a slight performance edge over boost::uuids::uuid, although there is no good reason for this as they do the same thing for basic comparison. It is when Guid is used as a key with std::unordered_map<K,V> that its strength becomes apparent, as it makes searches in the map more than twice as fast as using std::unordered_map<> with boost::uuids::uuid and boost::hash<boost::uuids::uuid>.
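To illustrate the kind of SSE 4.1 code involved, here is a minimal sketch of a 16-byte equality check using the PTEST instruction; this shows the general technique, not necessarily the exact code used by the Guid class:

#include <smmintrin.h>  // SSE 4.1 intrinsics

// Compare two 16 byte values for equality using PTEST (SSE 4.1).
inline bool Equal128( const void* lhs, const void* rhs ) noexcept
{
    __m128i a = _mm_loadu_si128( reinterpret_cast<const __m128i*>( lhs ) );
    __m128i b = _mm_loadu_si128( reinterpret_cast<const __m128i*>( rhs ) );
    __m128i difference = _mm_xor_si128( a, b );   // all zero when a == b
    return _mm_testz_si128( difference, difference ) != 0;
}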
The value of a Guid created by CoCreateGuid is fairly random, so there is really no need to perform any calculation in the specialization of std::hash<> to get a good distribution for hash keys:
namespace std
{
template<> struct hash<Harlinn::Common::Core::Guid>
{
constexpr std::size_t operator()( const Harlinn::Common::Core::Guid& guid ) const noexcept
{
return guid.Lo( );
}
};
}
Since a Guid is just a 16 byte value, guid.Lo( ) returns the 8 bytes starting at the 9th byte of the Guid, and this strategy works very well with std::unordered_map<> where the performance of the hash function plays a major role.
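A minimal sketch of what an accessor like Lo( ) could look like, assuming the Guid keeps its value in a 16 byte buffer; the actual implementation may differ:

#include <cstdint>
#include <cstring>

class ExampleGuid
{
    unsigned char data_[ 16 ];
public:
    // Returns the 8 bytes starting at the 9th byte of the Guid,
    // used directly as the hash value.
    std::uint64_t Lo( ) const noexcept
    {
        std::uint64_t result;
        std::memcpy( &result, data_ + 8, sizeof( result ) );
        return result;
    }
};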
Using the Engine
Using the timeseries storage engine is easy:
int main()
{
using Engine = Timeseries::Engine<>;
using Point = typename Engine::Point;
using TimeseriesCursor = typename Engine::TimeseriesCursor;
DateTime timestamp1( 2020, 1, 1 ), timestamp2( 2020, 1, 2 ),
timestamp3( 2020, 1, 3 ), timestamp4( 2020, 1, 4 );
First, we just accept the default parameters for the Timeseries::Engine template, which specify that TimeseriesPoint will be used as the representation for timeseries points and that the segments will have room for 8100 entries, and then we create a couple of aliases for the timeseries point type and the timeseries cursor. The four timestamps hardly need an explanation.
auto DatabaseDir = GetDatabaseDir( );
printf( "Database directory:%s\n", DatabaseDir.c_str( ) );
TimeseriesEngineOptions options( DatabaseDir, true );
auto timeseriesId1 = Test::Ids[0];
GetDatabaseDir() retrieves the value of the HCC_TEST_DATA_ROOT environment variable, appends "\\LMDB" and makes sure the resulting path exists. The second parameter to the TimeseriesEngineOptions constructor specifies whether we want to create a new environment or not. Test::Ids is just an array of 10 000 predefined GUIDs, and we pick the first one to identify the timeseries.
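For reference, a minimal sketch of what a helper like GetDatabaseDir( ) might do using std::filesystem; the actual helper in the download may be implemented differently:

#include <cstdlib>
#include <filesystem>
#include <stdexcept>
#include <string>

std::string GetDatabaseDir( )
{
    // Read the root directory for test data from the environment.
    const char* root = std::getenv( "HCC_TEST_DATA_ROOT" );
    if ( root == nullptr )
    {
        throw std::runtime_error( "HCC_TEST_DATA_ROOT is not set" );
    }
    // Append the LMDB subdirectory and make sure it exists.
    std::filesystem::path databaseDir = std::filesystem::path( root ) / "LMDB";
    std::filesystem::create_directories( databaseDir );
    return databaseDir.string( );
}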
Engine engine( options );
auto transaction = engine.BeginTransaction( );
auto timeseriesCursor = transaction.OpenTimeseries( timeseriesId1 );
The engine constructor initializes LMDB, and now the engine is ready for use. engine.BeginTransaction( ) creates a new write transaction, and we open a TimeseriesCursor in that transaction using OpenTimeseries(…), passing the Guid identifying the timeseries as the argument. Finally, we are ready to insert some data:
timeseriesCursor.Insert( timestamp1, 1 );
timeseriesCursor.Insert( timestamp2, 2 );
timeseriesCursor.Insert( timestamp3, 3 );
timeseriesCursor.Insert( timestamp4, 4 );
And then, we iterate over the timeseries:
if ( timeseriesCursor.MoveFirst( ) )
{
do
{
auto& current = timeseriesCursor.Current( );
std::cout << "Timestamp: " << current.Timestamp( )
<< ", Flags: " << current.Flags( )
<< ", Value: " << current.Value( ) << std::endl;
} while ( timeseriesCursor.MoveNext( ) );
}
Searching is a bit more interesting:
DateTime timestamp( 2020, 1, 2, 12, 0, 0 );
There are no timeseries points in the storage with this timestamp, but it is still relevant to be able to use this timestamp to search for the previous timeseries point:
auto found = timeseriesCursor.Search( timestamp );
if ( found )
{
auto& current = timeseriesCursor.Current( );
std::cout << "Found Timestamp: " << current.Timestamp( )
<< ", Flags: " << current.Flags( )
<< ", Value: " << current.Value( ) << std::endl;
}
At this point, we are done with what we want to do, so it is time to clean things up:
timeseriesCursor.Close( );
transaction.Commit( );
}
Neither of the last two lines of code is strictly needed, but without the commit, the destructor will roll back any changes made to the timeseries storage.
Database directory:F:\Database\LMDB
Timestamp: 01.01.2020 00:00:00, Flags: 0, Value: 1
Timestamp: 02.01.2020 00:00:00, Flags: 0, Value: 2
Timestamp: 03.01.2020 00:00:00, Flags: 0, Value: 3
Timestamp: 04.01.2020 00:00:00, Flags: 0, Value: 4
Found Timestamp: 02.01.2020 00:00:00, Flags: 0, Value: 2
This covers the most important operations for a timeseries storage engine. The tricky part is to make sure these operations can be performed quickly; everything else is ancillary.
The timeseries engine has three main parts:
- the Engine, Transaction and TimeseriesCursor C++ template classes
- the Segment and SegmentKey C++ template classes
- the LMDB C++ classes
The Engine, Transaction and TimeseriesCursor C++ Template Classes
The Engine C++ template class is the entry point for working with the timeseries storage engine. By default, the size of a segment, which is the maximum number of timeseries points that can be written to a blob, is set to 8100, and the type used to represent a single timeseries point is TimeseriesPoint.
The constructor takes a single argument, a reference to an EngineOptions object. EngineOptions lets you specify the directory that contains the LMDB datafiles, and whether to create a new storage or open an existing one. Apart from the constructor and destructor, the Engine template exposes a single function:
Transaction<TP, segmentSize>
BeginTransaction( TransactionFlags transactionFlags = TransactionFlags::None ) const
{
auto lmdbTransaction = environment_.BeginTransaction( transactionFlags );
Transaction<TP, segmentSize> result( const_cast<Engine*>(this),
std::move( lmdbTransaction ) );
return result;
}
BeginTransaction(…) creates a new Transaction object, which wraps an LMDB::Transaction object and caches information on behalf of the cursors.
This cache is implemented as an instance of:
std::unordered_map<Guid, std::unique_ptr<TimeseriesInfoType>> timeseriesMap_;
where TimeseriesInfoType is an instantiation of the following template:
template<Segments::PointType TP, size_t segmentSize>
class TimeseriesInfo
{
public:
using SegmentType = Segments::Segment< TP, segmentSize>;
private:
DateTime minTimestamp_;
DateTime maxTimestamp_;
SegmentType modificationBuffer_;
size_t changes_ = 0;
std::optional<DateTime> lastSegmentTimestamp_;
DateTime loadedTimestamp_;
Guid id_;
public:
...
};
Here, we keep a few details about each of the timeseries, such as the timestamp of the first, and the last, timeseries point in the timeseries. Then, we have the modification buffer and its change count, and the timestamp of the last segment stored in LMDB. loadedTimestamp_ contains the value of the timestamp for the first timeseries point in the modification buffer, at the time it was loaded.
This is enough information to maintain a cache for updates to the storage, boosting the write performance significantly. Commit flushes any buffered modifications to storage:
void Commit( )
{
SaveCachedUpdates( );
lmdbTransaction_.Commit( );
timeseriesMap_.clear( );
}
and SaveCachedUpdates( ) just iterates over the entries in the timeseriesMap_:
void SaveCachedUpdates( )
{
auto database = TimeseriesDataTable( );
for ( auto& entry : timeseriesMap_ )
{
TimeseriesInfoType* timeseriesInfo = entry.second.get( );
if ( timeseriesInfo->Changes( ) )
{
timeseriesInfo->SetChanges( 0 );
auto modificationBuffer = timeseriesInfo->ModificationBuffer( );
const auto& timeseries = timeseriesInfo->Id( );
auto loadedTimestamp = timeseriesInfo->LoadedTimestamp( );
auto& first = modificationBuffer->front( );
if ( loadedTimestamp && ( loadedTimestamp != first.Timestamp( ) ) )
{
KeyData deleteKey( timeseries, loadedTimestamp );
lmdbTransaction_.Delete( database, deleteKey );
}
KeyData keyData( timeseries, first );
lmdbTransaction_.Write( database, keyData, *modificationBuffer );
timeseriesInfo->SetLoadedTimestamp( first.Timestamp( ) );
}
}
}
When writing a segment to the storage, there is one thing that must be handled properly: the first timeseries point may no longer have the same timestamp as the first point in the modification buffer had when it was loaded. If this is the case, then we must delete the original key/value pair from the storage before storing the modification buffer and updating the loadedTimestamp_ field of the TimeseriesInfo object. Implementing a write cache does not have to be complicated; it just has to be efficient.
The TimeseriesCursor class relies on the GetTimeseriesInfo(…) function of the Transaction class to retrieve the TimeseriesInfo for a timeseries:
TimeseriesInfoType* GetTimeseriesInfo( const Guid& timeseries )
{
auto it = timeseriesMap_.find( timeseries );
if ( it != timeseriesMap_.end( ) )
{
return it->second.get( );
}
else
{
auto timeseriesInfo = std::make_unique<TimeseriesInfoType>( timeseries );
auto* result = timeseriesInfo.get( );
timeseriesMap_.emplace( timeseries, std::move( timeseriesInfo ) );
return result;
}
}
and if this is the first time the function is called for a particular timeseries, it creates a new instance and transfers ownership of the object to the timeseriesMap_.
TimeseriesCursor has few fields on its own:
template<Segments::PointType TP = TimeseriesPoint,
size_t segmentSize = TimeseriesCursorSegmentSize>
class TimeseriesCursor
: public Segments::SegmentContainer<
TimeseriesCursor<TP, segmentSize>, Guid, TP, segmentSize>
{
public:
static constexpr size_t SegmentSize = segmentSize;
using Base = Segments::SegmentContainer<
TimeseriesCursor<TP, segmentSize>, Guid, TP, segmentSize>;
using Point = typename Base::Point;
using Overflow = std::optional<Point>;
using SegmentType = typename Base::DataType;
using KeyData = typename Base::KeyType;
using TransactionType = Transaction<TP, segmentSize>;
private:
using TimeseriesInfoType = Internal::TimeseriesInfo<TP, segmentSize>;
size_t position_ = 0;
TimeseriesCursorState state_ = TimeseriesCursorState::Unknown;
TimeseriesInfoType* timeseriesInfo_;
SegmentType* currentSegmentData_;
LMDB::Cursor cursor_;
TransactionType* transaction_;
public:
...
};
position_ is currently 1-based, which was probably not a good idea, as I have to subtract 1 all the time to get to the current timeseries point within the modification buffer, or the segment pointed to by currentSegmentData_. Which one is used is determined by the state_ of the cursor. currentSegmentData_ usually points to memory managed by LMDB, and may point to memory that is mapped to the datafile.
The TimeseriesCursor class implements operations for searching and navigating through the data for a timeseries:
- const Point& Current( ) const noexcept: returns a reference to the timeseries point at the current position of the cursor
- void Flush( ): flushes the modification buffer to storage
- void Insert( const Point& point ): inserts or overwrites a timeseries point in the timeseries
- SearchResult Search( const DateTime& timestamp ): searches the timeseries for a timeseries point with a timestamp less than or equal to the given timestamp
- bool MoveFirst( ): moves the cursor to the first timeseries point in the timeseries
- bool MoveLast( ): moves the cursor to the last timeseries point in the timeseries
- bool MoveNext( ): positions the cursor on the next timeseries point in the timeseries
- bool MovePrevious( ): positions the cursor on the previous timeseries point in the timeseries
- size_t ForEach( Func&& func, Args&& ...args ): provides fast access to every timeseries point in the timeseries
- size_t ForEach( const DateTime& start, Func&& func, Args&& ...args ): provides fast access to the timeseries points in the timeseries, starting with the timeseries point with a timestamp equal to start, or the timeseries point with the greatest timestamp less than start if there is none with an equal timestamp
- size_t ForEach( const DateTime& start, const DateTime& end, Func&& func, Args&& ...args ): similar to the previous overload, but stops the iteration at the last timeseries point with a timestamp less than end (see the usage sketch after this list)
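A short usage sketch for the range overload, assuming the same cursor, Point and DateTime setup as in the earlier examples:

// Sum the values for points from rangeStart (or the nearest earlier point)
// up to, but not including, rangeEnd.
DateTime rangeStart( 2020, 1, 2 );
DateTime rangeEnd( 2020, 1, 4 );
double sum = 0.0;
size_t count = timeseriesCursor.ForEach( rangeStart, rangeEnd,
    []( const Point& point, double& sum )
    {
        sum += point.Value( );
    }, sum );
std::cout << "Points: " << count << ", Sum: " << sum << std::endl;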
One of the complicating factors for the implementation of the TimeseriesCursor is that a search using LMDB positions the LMDB cursor either at the entry that matches the search key, or at the next entry in the B+ tree. The downside of this is that when this mechanism is used to locate a timeseries point, the search will nearly always place the cursor on the next key/value pair.
The key for a segment has two parts:
- the Guid identifying the timeseries
- the timestamp for the first timeseries point in the segment
This means that if the timestamp is earlier than any timestamp for a timeseries point in the timeseries, simply moving to the previous entry is not good enough, as that would put the cursor on an entry with a segment of timeseries points belonging to another timeseries. In this case, the search has to move back to the entry located by LMDB and set the compareResult_ field of the SearchResult to CompareResult::Greater, indicating that the cursor is positioned on the first timeseries point of the timeseries, with a greater timestamp than the one that was passed to Search(…).
Since a timeseries point can be inserted anywhere in a timeseries, the solution must be able to handle situations where it will replace the first timeseries point in a segment. In this case, it must look at the end of the previous segment and determine that the timestamp of the new timeseries point is greater than the timestamp of the last timeseries point in the previous segment. If this is the case, the cursor is moved to the segment located by the search functionality of LMDB and the timeseries point is inserted in front of the segment. This causes the key for the segment to change, and the entry with the old key must be deleted when flushing the changes to the storage.
The LMDB C++ Classes
LMDB is a compact implementation of a transactional key/value store, and HCCLMDB.h contains a set of classes that wrap the relevant parts of the LMDB C API. The Visual Studio 2019 solution, provided with the source code for this article, includes the libLMDB project which creates a DLL for LMDB.
In LMDB terminology, an environment provides access to a data file and an accompanying lock file, while a database provides access to a key/value store kept in the data file of the environment. It is tempting to call this a table, and multiple named databases can reside in one environment.
Nearly all LMDB operations use transactions, and LMDB provides support for two kinds of transactions, one for reading and one for writing. Readers do not block writers, and writers do not block readers. An environment can be opened by multiple processes running on the same machine, and each environment supports one concurrent write transaction.
By default, keys are ordered lexicographically, but you can supply your own comparison function.
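As a sketch of what a custom key ordering can look like at the LMDB C API level (the C++ wrappers in HCCLMDB.h may expose this differently), a comparison function receives two MDB_val pointers and is registered with mdb_set_compare:

#include "lmdb.h"
#include <cstdint>
#include <cstring>

// Order keys as 64-bit unsigned integers instead of lexicographically.
static int CompareUInt64Keys( const MDB_val* a, const MDB_val* b )
{
    std::uint64_t lhs, rhs;
    std::memcpy( &lhs, a->mv_data, sizeof( lhs ) );
    std::memcpy( &rhs, b->mv_data, sizeof( rhs ) );
    return lhs < rhs ? -1 : ( lhs > rhs ? 1 : 0 );
}

// Registered on an open database inside a transaction:
// mdb_set_compare( txn, dbi, CompareUInt64Keys );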
Named Databases
To open more than one database in an environment, each must be named, and the maximum number of databases that can be opened in the environment must be specified. This must be done by the first process or thread creating or opening the environment.
This is handled transparently by the LMDB::Environment constructor, which takes the maximum number of databases as its third argument.
explicit Environment( const char* path,
size_t newMemoryMapSize = 10485760,
unsigned int maxDatabases = 0,
EnvironmentFlags environmentFlags = EnvironmentFlags::WriteMemMap );
The LMDB C++ classes make it easier to work with LMDB, primarily by wrapping the handle types from the LMDB C API using a C++ class for each handle type:
- Environment
- Transaction
- Database
- Cursor
Errors reported by the LMDB C API are turned into C++ exceptions, and the library tries to provide sensible default values for many function parameters.
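A minimal sketch of this error handling strategy; the helper name and the exception type used here are assumptions, not the library's actual choices:

#include "lmdb.h"
#include <stdexcept>

// Turn an LMDB status code into a C++ exception.
inline void CheckLMDBResult( int result )
{
    if ( result != MDB_SUCCESS )
    {
        throw std::runtime_error( mdb_strerror( result ) );
    }
}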
Environment
The application must create an LMDB::Environment object before any work can be performed using LMDB.
An environment contains at most one anonymous database, or it can contain multiple named databases, residing in a single memory mapped file.
The simplest way to create a new LMDB::Environment object only requires the path to the directory that contains, or will contain, the database files:
LMDB::Environment environment( DatabaseDir );
Transaction
All database operations require a transaction, and transactions can be read-write or read-only. Write transactions may not span threads. A new read-write transaction is created by calling:
auto transaction = environment.BeginTransaction( );
While a read-only transaction is created with the following statement:
transaction = environment.BeginTransaction( LMDB::TransactionFlags::ReadOnly );
The changes performed in a transaction are committed to the database using:
transaction.Commit( );
while:
transaction.Abort( );
is used to roll back all the changes made in the transaction. The destructor of a Transaction object will also roll back any changes not committed to the environment.
Before you can make any changes to a database, it must be opened:
auto database = transaction.OpenDatabase( );
and once a database is opened, it can be modified:
transaction.Write( database, 1, 1);
or you can open a cursor on the database in a transaction:
auto cursor = transaction.OpenCursor( database );
Database
The LMDB::Database class wraps a database handle in an LMDB environment.
A database is opened using:
class Transaction
{
...
public:
Database OpenDatabase( const char* name, DatabaseFlags databaseFlags = DatabaseFlags::None );
Database OpenDatabase( DatabaseFlags databaseFlags = DatabaseFlags::None )
...
};
The database handle will be private to the current transaction until the transaction is committed. After a successful commit, the handle resides in the environment, ready for use by other transactions.
Transaction::OpenDatabase(…) must not be called from multiple concurrent transactions in the same process, and the transaction must either commit or abort before any other transaction in the process can call Transaction::OpenDatabase(…). If the transaction is aborted, the database handle will be closed automatically.
Transaction::OpenDatabase(…) returns the existing database handle when it is called for a database that is already open in the environment. Database handles can only be closed once, by calling Environment::CloseDatabase(…).
Cursor
LMDB::Cursor objects provide functionality that can be used to navigate, search and modify key/value pairs in the database:
- constexpr bool IsPositioned( ) const noexcept: returns true if the cursor is positioned on a key/value pair
- const LMDB::Value& Key( ) const noexcept: returns a reference to the current key
- template<typename T> const T& Key( ) const noexcept: template used to cast the contents of the key to the specified type
- bool SetKey( const LMDB::Value& key ): moves the cursor to the specified key, returns false if the specified key/value pair does not exist
- const LMDB::Value& Value( ) const noexcept: returns a reference to the current value
- template<typename T> const T& Value( ) const noexcept: template used to cast the contents of the current value to the specified type
- bool SetValue( const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): updates the value at the current cursor position
- bool Write( const LMDB::Value& key, const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): stores a key/value pair in the database using the cursor
- template<ValueType T1, ValueType T2> bool Write( const T1& key, const T2& value, WriteFlags writeFlags = WriteFlags::None ): creates and initializes LMDB::Value objects for the key and value, simplifying the API
- bool Write( const LMDB::Value& value, WriteFlags writeFlags = WriteFlags::None ): updates the value at the current cursor position
- template<ValueType T> bool Write( const T& value, WriteFlags writeFlags = WriteFlags::None ): simplifies updating the value at the current cursor position by initializing the LMDB::Value object for the argument value
- bool Search( const LMDB::Value& key ): searches the database for a key/value pair with a key equal to, or greater than, the specified search key
- template<ValueType T> bool Search( const T& key ): simplifies the Search(…) API by creating and initializing an LMDB::Value object for the key
- bool MoveTo( const LMDB::Value& key ): moves the cursor to the key/value pair for the argument key, returns false if the key does not exist
- template<ValueType T> bool MoveTo( const T& key ): simplifies the MoveTo(…) API by creating and initializing an LMDB::Value object for the key
- bool MoveFirst( ): moves the cursor to the first key/value pair in the database, returns false if the database is empty
- bool MoveFirstDup( ): positions the cursor at the first data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNext( ): moves the cursor to the next key/value pair in the database, returns false if the cursor was positioned on the last entry or the database is empty
- bool MoveNextDup( ): moves the cursor to the next data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNextNoDup( ): moves the cursor to the next key/value pair; only for databases opened with the DatabaseFlags::DupSort flag
- bool MoveNextMultiple( ): returns up to a page of duplicate data items from the next cursor position and moves the cursor to prepare for the next MoveNextMultiple( ) call; only for databases opened with the DatabaseFlags::DupFixed flag
- bool MoveLast( ): moves the cursor to the last key/value pair in the database, returns false if the database is empty
- bool MoveLastDup( ): moves the cursor to the last data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePrevious( ): moves the cursor to the previous key/value pair in the database, returns false if the cursor was positioned on the first entry or the database is empty
- bool MovePreviousDup( ): moves the cursor to the previous data item of the current key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePreviousNoDup( ): moves the cursor to the last data item of the previous key; only for databases opened with the DatabaseFlags::DupSort flag
- bool MovePreviousMultiple( ): moves the cursor to the previous page and returns up to a page of duplicate data items
- void Delete( bool noDupData = false ): deletes the current key/value pair; set noDupData to true to delete all the data items for the current key if the database was opened with the DatabaseFlags::DupSort flag
Value
LMDB::Value objects are used to hold keys and values. LMDB::Value is derived from the MDB_val structure of the LMDB C API, and its role is to make sure that the two fields of the structure are always properly initialized.
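A minimal sketch of the idea; the class name and constructor set shown here are assumptions, not the actual HCCLMDB.h interface:

#include "lmdb.h"
#include <cstddef>

// A value wrapper derived from MDB_val that always initializes both fields.
class ExampleValue : public MDB_val
{
public:
    ExampleValue( ) noexcept
        : MDB_val{ 0, nullptr }
    {
    }
    ExampleValue( const void* data, std::size_t size ) noexcept
        : MDB_val{ size, const_cast<void*>( data ) }
    {
    }
};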
The Segment and SegmentKey C++ Template Classes
The Segment and SegmentKey C++ template classes represent the data that will be stored inside the database.
Segment
The Segment class represents the data that is stored as a value, where the timeseries points are held in an std::array. That array may not be completely filled, and size_ is used to tell us how many of the slots in the array are actually filled with timeseries points.
template<PointType TP, size_t maxSegmentSize >
class Segment
{
public:
constexpr static size_t MaxSize = maxSegmentSize;
using Point = TP;
using ArrayType = std::array<Point, MaxSize>;
using iterator = typename ArrayType::iterator;
using const_iterator = typename ArrayType::const_iterator;
...
private:
size_t size_;
ArrayType points_;
public:
Segment( )
: size_( 0 )
{
}
...
};
Segment implements front( ), back( ), size( ), empty( ), data( ), begin( ), end( ), cbegin( ), cend( ), find(…) and operator[](…), which perform the operations you would normally expect.
The find(…) function uses std::lower_bound to locate the timeseries point with the specified timestamp, or the position in the segment where a timeseries point with that timestamp should be inserted:
iterator find( const DateTime& timestamp )
{
return std::lower_bound( begin( ), end( ), timestamp,
[]( const Point& point, const DateTime& timestamp )
{
return point.Timestamp( ) < timestamp;
} );
}
Here, I used the overload that allows us to specify the predicate (or comparison) used for the search. This demonstrates a nice feature of std::lower_bound: I can pass the timestamp to the function without creating a temporary Point object, and the timestamp becomes the second argument to the binary predicate. This is both simple and efficient.
The most interesting, and performance critical, function of the Segment class is insert(…).
The timeseries points inside a segment are sorted according to the timestamp of the timeseries point, and since this is a class that will be used for timeseries data, the function tries to favor appends. The second parameter receives the overflow value, if any. This will only happen when the segment is full, and the timestamp of the timeseries point to insert does not match an existing timeseries point in the segment, which would then be overwritten.
The function also tries to make prepends efficient, as an insert into a timeseries in any but the last segment will cause overflow values to propagate from segment to segment, making this worth the effort.
Since find(…) will never be called for a timestamp greater than or equal to the timestamp of the last timeseries point, we can skip testing it == end() before using the iterator.
This is not a particularly complex or advanced function; it just demonstrates that by adding a few extra steps, optimizing for the common cases, we can make significant performance gains:
void insert( const Point& point, std::optional<Point>& overflow )
{
if ( size_ )
{
Point& last = points_[size_ - 1];
if ( size_ < MaxSize )
{
if ( last.Timestamp( ) < point.Timestamp( ) )
{
points_[size_] = point;
size_++;
}
else if ( last.Timestamp( ) == point.Timestamp( ) )
{
points_[size_ - 1] = point;
}
else if ( point.Timestamp( ) < points_[0].Timestamp( ) )
{
std::copy_backward( begin(), end( ), end( ) + 1 );
points_[0] = point;
++size_;
}
else if ( point.Timestamp( ) == points_[0].Timestamp( ) )
{
points_[0] = point;
}
else
{
auto it = find( point.Timestamp( ) );
if ( it->Timestamp( ) > point.Timestamp( ) )
{
std::copy_backward( it, end( ), end( ) + 1 );
++size_;
}
*it = point;
}
}
else
{
if ( last.Timestamp( ) < point.Timestamp( ) )
{
overflow = point;
}
else if ( last.Timestamp( ) == point.Timestamp( ) )
{
points_[size_ - 1] = point;
}
else if ( point.Timestamp( ) < points_[0].Timestamp( ) )
{
overflow = last;
std::copy_backward( begin(), end( ) - 1, end( ) );
points_[0] = point;
}
else if ( point.Timestamp( ) == points_[0].Timestamp( ) )
{
points_[0] = point;
}
else
{
auto it = find( point.Timestamp( ) );
if ( it->Timestamp( ) > point.Timestamp( ) )
{
overflow = last;
std::copy_backward( it, end( ) - 1, end( ) );
}
*it = point;
}
}
}
else
{
points_[0] = point;
size_++;
}
}
I believe the overall performance of the timeseries storage engine proves that these extra steps are well worth the extra consideration given to appends and prepends.
The End (for now)
This concludes the first article about the timeseries storage engine. I am very open to suggestions for improvements to the API, as long as they are not detrimental to the performance of the engine.
My plan for this series is to create a server on top of the engine, with a .NET client library and example web apps implemented on top of .NET Core.
So, until next time... happy coding!
History
- 5th October, 2020
- 6th October, 2020
  - Bug fix + cleaning up most of the unit tests
- 7th October, 2020
  - More unit tests for the Harlinn.Common.Core library
- 11th February, 2021
  - Bug fixes
  - C++ ODBC support
  - Added the ability to create complex keys that can be sorted using memcmp, which is useful when working with LMDB
- 25th February, 2021
  - Updated LMDB
  - Updated xxHash
  - Added the initial implementation of very fast hash based indexes for large complex keys using LMDB
  - Fast asynchronous logging - nearly done :-)
- 3rd March, 2021
  - New authorization related classes:
    - SecurityId: Wrapper for SID and related operations
    - ExplicitAccess: Wrapper for EXPLICIT_ACCESS
    - Trustee: Wrapper for TRUSTEE
    - SecurityIdAndDomain: Holds the result from LookupAccountName
    - LocalUniqueId: Wrapper for LUID
    - AccessMask: Makes it easy to inspect the rights assigned to an ACCESS_MASK
    - AccessMaskT<>:
      - EventWaitHandleAccessMask: Inspect and manipulate the rights of an EventWaitHandle
      - MutexAccessMask: Inspect and manipulate the rights of a Mutex
      - SemaphoreAccessMask: Inspect and manipulate the rights of a Semaphore
      - WaitableTimerAccessMask: Inspect and manipulate the rights of a WaitableTimer
      - FileAccessMask: Inspect and manipulate file related rights
      - DirectoryAccessMask: Inspect and manipulate directory related rights
      - PipeAccessMask: Inspect and manipulate pipe related rights
      - ThreadAccessMask: Inspect and manipulate thread related rights
      - ProcessAccessMask: Inspect and manipulate process related rights
    - GenericMapping: Wrapper for GENERIC_MAPPING
    - AccessControlEntry: A set of tiny classes that wraps the ACE structures:
      - AccessControlEntryBase<,>
        - AccessAllowedAccessControlEntry
        - AccessDeniedAccessControlEntry
        - SystemAuditAccessControlEntry
        - SystemAlarmAccessControlEntry
        - SystemResourceAttributeAccessControlEntry
        - SystemScopedPolicyIdAccessControlEntry
        - SystemMandatoryLabelAccessControlEntry
        - SystemProcessTrustLabelAccessControlEntry
        - SystemAccessFilterAccessControlEntry
        - AccessDeniedCallbackAccessControlEntry
        - SystemAuditCallbackAccessControlEntry
        - SystemAlarmCallbackAccessControlEntry
      - ObjectAccessControlEntryBase<,>
        - AccessAllowedObjectAccessControlEntry
        - AccessDeniedObjectAccessControlEntry
        - SystemAuditObjectAccessControlEntry
        - SystemAlarmObjectAccessControlEntry
        - AccessAllowedCallbackObjectAccessControlEntry
        - AccessDeniedCallbackObjectAccessControlEntry
        - SystemAuditCallbackObjectAccessControlEntry
        - SystemAlarmCallbackObjectAccessControlEntry
    - AccessControlList: Wrapper for ACL
    - PrivilegeSet: Wrapper for PRIVILEGE_SET
    - SecurityDescriptor: Early stage implementation of wrapper for SECURITY_DESCRIPTOR
    - SecurityAttributes: Very early stage implementation of wrapper for SECURITY_ATTRIBUTES
    - Token: Early stage implementation of wrapper for an access token
    - DomainObject:
      - User: Information about a local, workgroup or domain user
      - Computer: Information about a local, workgroup or domain computer
      - Group: Local, workgroup or domain group
    - Users: Vector of User objects
    - Groups: Vector of Group objects
- 14th of March, 2021 - more work on security related stuff:
  - Token: A wrapper for a Windows access token with a number of supporting classes like:
    - TokenAccessMask: An access mask implementation for the access rights of a Windows access token.
    - TokenGroups: A wrapper/binary compatible replacement for the Windows TOKEN_GROUPS type with a C++ container style interface.
    - TokenPrivileges: A wrapper/binary compatible replacement for the TOKEN_PRIVILEGES type with a C++ container style interface.
    - TokenStatistics: A binary compatible replacement for the Windows TOKEN_STATISTICS type using types implemented by the library such as LocalUniqueId, TokenType and ImpersonationLevel.
    - TokenGroupsAndPrivileges: A wrapper/binary compatible replacement for the Windows TOKEN_GROUPS_AND_PRIVILEGES type.
    - TokenAccessInformation: A wrapper/binary compatible replacement for the Windows TOKEN_ACCESS_INFORMATION type.
    - TokenMandatoryLabel: A wrapper for the Windows TOKEN_MANDATORY_LABEL type.
  - SecurityPackage: Provides access to information about a Windows security package.
  - SecurityPackages: An std::unordered_map of information about the security packages installed on the system.
  - CredentialsHandle: A wrapper for the Windows CredHandle type.
  - SecurityContext: A wrapper for the Windows CtxtHandle type.
  - Crypto::Blob and Crypto::BlobT: C++ style _CRYPTOAPI_BLOB replacement.
  - CertificateContext: A wrapper for the Windows PCCERT_CONTEXT type, provides access to a X.509 certificate.
  - CertificateChain: A wrapper for the Windows PCCERT_CHAIN_CONTEXT type which contains an array of simple certificate chains and a trust status structure that indicates summary validity data on all of the connected simple chains.
  - ServerOcspResponseContext: Contains an encoded OCSP response.
  - ServerOcspResponse: Represents a handle to an OCSP response associated with a server certificate chain.
  - CertificateChainEngine: Represents a chain engine for an application.
  - CertificateTrustList: A wrapper for the Windows PCCTL_CONTEXT type which contains both the encoded and decoded representations of a CTL. It also contains an opened HCRYPTMSG handle to the decoded, cryptographically signed message containing the CTL_INFO as its inner content.
  - CertificateRevocationList: Contains both the encoded and decoded representations of a certificate revocation list (CRL).
  - CertificateStore: A storage for certificates, certificate revocation lists (CRLs), and certificate trust lists (CTLs).
- 23rd of March, 2021:
  - Updated to Visual Studio 16.9.2
  - Build fixes
  - SecurityDescriptor: Implemented serialization for security descriptors, enabling persistence of authorization data