Storing Time Series in Microsoft Azure Table Storage

Cybermaxs

5.00/5 (4 votes)

5 May 2015CPOL6 min read

15.1K

How to store time series in Microsoft Azure Table Storage

This article is an entry in our Microsoft Azure IoT Contest. Articles in this section are not required to be full articles so care should be taken when voting.

In a recent project, I needed to store time series into Microsoft Azure Table Storage. As for most of NoSQL databases, the design is very important and you have to mainly think about your access patterns.

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Time series data often arise when monitoring an application or tracking business metric, but occur naturally in many application areas like finance, economics, medicine …

Time series are very frequently plotted via –beautiful- charts. Here is an example on the availability of Application Insight. In this example, the end user can view response time for several time ranges. Data is aggregated depending on the selected time range (every second for last hour, every minute for last day, every 5 minutes for last 48 hours, …) to keep it easy to understand.

If you’re unfamiliar with Table Service on Azure, I highly recommend you read the Introduction, then the Design Guide; after reading both articles, I hope you will understand why storing time series is not so easy and needs this kind of article.

To illustrate the following designs, I use random generated data, basically one point/second. Having one data point per second makes it very clear and easy to read. In your implementation, you can have more than one point per second; if doesn’t matter.

Basic Design

A first design could be to create a basic entity for each data point, like this one:

public class DataPoint : TableEntity
{
    public long Value { get; set; }
    public DataPoint() { }
    public DataPoint(string source, long timeStamp, long value)
    {
        this.PartitionKey = source;
        this.RowKey = timeStamp.ToString();
        this.Value = value;
    }
}

Why a timestamp for the time property? It’s simply a better choice because RowKey is a string (constraint by Table Service). Ticks could be an idea but I prefer a common value (Unix timestamp) for time axis, so any client could request this table via REST endpoint.

This will produce this kind of results:

The RowKey equals here to the Unix timestamp in seconds (due to my test data) but it could be the traditional Unit timestamp (ms). Now if you want to query entities for a specific time range (two hours), you need to use a range query on the RowKey. This is the 2^nd most efficient query type on table storage. Here is an example for the first two hours of 2015:

// time range : first two hours of 2015
var from = new DateTime(2015, 1, 1, 0, 0, 0);
var to = new DateTime(2015, 1, 1, 2, 0, 0);

// generate the range query
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "mysource"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, 
        ToSecondsTimestamp(from).ToString()),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, 
        ToSecondsTimestamp(to).ToString()))
        );

TableQuery<DataPoint> query = new TableQuery<DataPoint>().Where(filter);
var results = table.ExecuteQuery<DataPoint>(query);

The generated filter is:

$filter= (PartitionKey eq ‘mysource’) and ((RowKey ge ‘1420070400’) and (RowKey le ‘1420077600’))

This design seems to be good but it isn’t for real world scenarios. First, the rule #1 of Table Service (scalability) is not followed here. Performance is not the same for 1 000 rows and 1 000 000 000 rows. Having a hot partition (always the same PartitionKey) is an anti-scalability on Table Service. Second, the number of data points included in the desired time range affects performance: in my example, for 2 hours, we will receive 7200 entities (one entity for every second). The REST service limits the number of returned entities to 1000 per request. So, we will have to execute several requests to get all entities for the selected time range. Fortunately, the .NET will silently do this job for us but it’s not the case for all clients. And what about last 24 hours ? What about data every ms?

Pros	Cons
Natural representation	No Scalability (hot partition)
Easy to query with Range queries	Inefficient queries (limited to 1000 entities per request)
	Slow inserts (limited to 100 points by EGT)

A better approach could be to store row in reverse order. New data points are automatically added at the beginning of the table. Queries should be a little more efficient, but not so many in facts.

Advanced Design

The key idea with time series is that one dimension is well-known: the time. Every request contains a lower bound (from) and an upper bound (to). So, we will use the Compound key pattern to compute smart Partition & Row Keys thus enabling a client to lookup related data with an efficient query. A very common technique when working with time series is to round the time value with a magic factor. To fully understand these magic factors, you need one of my gists. ToRoundedSecondsTimestamp() is heavily used in the my examples. For example Tuesday, 28 April 2015 12:04:35 UTC could be rounded to:

Factor	Rounded timestamp	Datetime
1 (no factor)	1430222735	Tuesday, 28 April 2015 12:04:35 UTC
60 (one hour)	1430222700	Tuesday, 28 April 2015 12:04:00 UTC
3600 (one hour)	1430222400	Tuesday, 28 April 2015 12:00:00 UTC
86400 (one day)	1430179200	Tuesday, 28 April 2015 00:00:00 UTC

In this design, each row contains more than one data points; basically all points included in the current step (between the current rounded timestamp and the next rounded timestamp). This is possible thanks to DynamicTableEntity in Table Service. Just to illustrate the concept, here is an example with PartitionFactor = 60 secs and RowFactor = 5 secs.

Notice how PartitionKey/RowKey are now rounded timestamps. This is exactly the same data in my first design. In this example, each row contains 5 data points and each partition contains 12 rows. Of course, in a real scenario, our objective is to maximize the number of points in a single row.

To get the first two hours of 2015, the query is a little bit more complicated but not less efficient.

var from_ts = ToRoundedSecondsTimestamp(new DateTime(2015, 1, 1, 0, 0, 0), RowFactor);
var to_ts = ToRoundedSecondsTimestamp(new DateTime(2015, 1, 1, 2, 0, 0).AddSeconds(-1), RowFactor);

// generate all row keys in the time range
var nbRows = (to_ts - from_ts) / RowFactor + 1;
var rowKeys = new long[nbRows];

for (var i = 0; i < nbRows; i++)
{
    rowKeys[i] = from_ts + i * RowFactor;
}

// group row keys by partition
var partitionKeys = rowKeys.GroupBy(r => ToRoundedTimestamp(r, PartitionFactor));

var partitionFilters = new List<string>();
foreach (var part in partitionKeys)
{
    // PartitionKey = X and (RowKey=Y or RowKey=Y+1 or RowKey=Y+2 ...)
    string partitionFilter = TableQuery.GenerateFilterCondition
    ("PartitionKey", QueryComparisons.Equal, part.Key.ToString());
    string rowsFilter = string.Join(" " + TableOperators.Or + " ", 
    part.Select(r => TableQuery.GenerateFilterCondition
    ("RowKey", QueryComparisons.Equal, r.ToString())));
    string combinedFilter = TableQuery.CombineFilters(partitionFilter, TableOperators.And, rowsFilter);
    partitionFilters.Add(combinedFilter);
}

// combine all filters
string final = string.Join(" " + TableOperators.Or + " ", partitionFilters);
var query = new TableQuery<DynamicTableEntity>().Where(final);
var res = table.ExecuteQuery(query);

//do something with results...

The generated filter (with partition Factor = one hour, row Factor =4 minutes) is:

$filter= (PartitionKey eq ‘1420070400’) and (RowKey eq ‘1420070400’ or RowKey eq ‘1420070640’ 
or RowKey eq ‘1420070880’ or RowKey eq ‘1420071120’ or RowKey eq ‘1420071360’ or RowKey eq ‘1420071600’ 
or RowKey eq ‘1420071840’ or RowKey eq ‘1420072080’ or RowKey eq ‘1420072320’ or RowKey eq ‘1420072560’ 
or RowKey eq ‘1420072800’ or RowKey eq ‘1420073040’ or RowKey eq ‘1420073280’ or RowKey eq ‘1420073520’ 
or RowKey eq ‘1420073760’) or (PartitionKey eq ‘1420074000’) and (RowKey eq ‘1420074000’ 
or RowKey eq ‘1420074240’ or RowKey eq ‘1420074480’ or RowKey eq ‘1420074720’ or RowKey eq ‘1420074960’ 
or RowKey eq ‘1420075200’ or RowKey eq ‘1420075440’ or RowKey eq ‘1420075680’ 
or RowKey eq ‘1420075920’ or RowKey eq ‘1420076160’ or RowKey eq ‘1420076400’ 
or RowKey eq ‘1420076640’ or RowKey eq ‘1420076880’ or RowKey eq ‘1420077120’ or RowKey eq ‘1420077360’)

So, what are good PartitionKey/RowKey factors? It’s up to you and depends on your context but there are two constraints. First, a table service entity is limited to 1 MB with a maximum of 255 properties (including the PartitionKey, RowKey, and Timestamp). If you have data every second, 4 mins (60*4 = 240) seems to be a good RowKey Factor, but if you have several points per second it won’t be pertinent. The second problem is the filter length. Having too many partitions and rows will create a very long filter that can be rejected by the table service (HTTP 414 “Request URI too long”). In this case, you can execute partitions scans by removing row keys in the produced filter but this will be less efficient (depends on the number of rows in each partition)

Pros	Cons
Highly scalable	Point queries are not always possible (Uri length limits)
Fast inserts (if required)	Granularity should be defined at the beginning (how many rows/partition ? how many points ?)
	Dev tools can’t be used with dynamic columns

To conclude, Microsoft Azure Table Storage is a fast and very powerful service that you should not forget to use for your application, even if they are not hosted on Azure. However, you should use proper designs for your tables, very far from traditional relational designs. Choosing a good couple of PartitionKey/RowKey is very important. We covered in this article a simple use case (time series) but there are so many scenarios… In terms of pricing, it’s very cheap; but your choice should not be driven by costs here, but by scalability and performance.

Source code is available here: Basic and Advanced

CodeProject

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)