(untagged)

Speed Dating - Using Time Compression Instead of MDX Cubes to Deliver Rapid Data Analysis

Sanjay_Bhatia

2 Apr 2012

With the advent of abundant memory, the majority of user requirements can be delivered without having to build separate data warehouses or using modeling technologies like MDX. This article shows how to use time compression to deliver rapid analysis of enormous data volumes using simple SQL views.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

For Ad Hoc reporting that is more than crystal clear, click here.

The term “data warehouse” may have originated because it once took a literal warehouse to house all the servers needed to store a few gigabytes of information. Now, we live in extraordinary times where usb drives are given away as promotional items which can store the equivalent amount of data. The approaches taken by data professionals have not caught up to the unfathomable progress hardware has made. This article presents a simple approach to leverage all this horsepower in order to provide rapid analysis of terabytes of information with relatively little effort. If you are on relatively new hardware, this will work on your existing SQL database without the need for any additional har dware, software, ETL or replication technologies.

OLAP vs. OLTP on SQL Databases

Traditional systems architects liked to separate analytical OLAP queries and OLTP transactional queries. This was often necessary due to the limitations of hardware and software. While many third party frameworks exist for analyzing data, the SQL language can actually handle the bulk of business requirements needed by the average knowledge worker. This approach is generally called Relational OLAP or ROLAP. There is no need to copy the data into a separate piece of s oftware such as MDX for the majority of today's business requirements and databases.

Speed Dating

While we are privileged to live in a world of cheap, abundant memory, we also have the responsibility of having to deal with vast quantities of data. We are going to use date aggregation to compress data so it can fit in system memory, which is still many orders of magnitude faster than spinning hard disks. With 64-bit systems now commonplace, a system capable of addressing more than 32 GB of memory can be had for a few thousand dollars. Using date based aggregation, the compressed results can fit into system memory.

The Temptation of a Faster Model

Faced with the challenges of hiding users from database complexity many BI vendors and consultants decided that the boring old SQL model wasn’t jazzy enough to justify their price points. They came up with proprietary modeling technologies that maximized query performance as well as billable hours. Today MDX has emerged as a standard way to model complex data. While MDX is a wonderful technology, very few IT folk have the patience or the mental fortitude to master MDX. If you have one of these gems, treat them well because they are expensive and very hard to replace. Cubes technology was the only option when servers came with as little as 32 MB of memory while today's systems have nearly a thousand times that amount available. Now that commodity h ardware can run billions of records in seconds without specialized hardware, much of the appeal of Cubes technology gets eliminated while the enormous costs make the investment questionable.

Speed Dating Details

Doing aggregate compression will accomplish the same results as using MDX or cubes technologies but with significantly less investment in time and resources. Rather than using complex data models to cache data in lots of specific ways that have to be continually defined and maintained, we make a grand overarching aggregation. We pick a level of granularity (such as days or hours) and we compress all our reportable data the same way. I have many thousands of transactions each day, but may only care about reporting by days, months or quarters. If I aggregate down to the daily level, an excellent reporting front end like Izenda Reports can let users visualize and navigate the information easily. This code shows how you would apply a yearly granularity to all your da tes.

DATEPART(year, OrderDate) AS Year

Since a year is usually not granular enough, but a day is, we’ll have to do a little bit more refinement. Since none of the DATEPART methods support getting the entire date without the time, we do a little trick using DATEDIFF and DATEADD. We don’t want anything that is smaller than our daily granularity as that would reduce how much we compress the data through aggregation. Note that this approach does not compress the actual data. Instead it creates a view which takes up a lot less space and can potentially be a big help in memory.

DATEADD(D, 0, DATEDIFF(D, 0, OrderDate))

Writing the SQL Code

Let’s use the Northwind orders table as an example. We’ll create a view off this table to simulate a much larger data set (see other article). We create a view to aggregate all of the records down to a single row for each dimension for each day.

So the original data looks like this. We have ShipCountry and ShipCity as geographic dimensions, OrderDate as a temporal dimension and OrderID and Freight as attributes which we will turn into measures.

Izenda-Speed-Dating/OrdersOriginalData.png

We'll use the following SQL to create a view that does temporal compression on our OrderDate field. The DATEADD is needed because there is no DATEPART format that does the date without the time. If we averaged 100 orders per customer, per month, this would give you 99% data compression and would allow a 1TB table to squeeze into 10 GB of memory. A system with more than 10 GB of memory could essentially cache an indexed or materialized view in memory to deliver exceptional performance.

CREATE VIEW OrdersView
AS SELECT ShipCountry,
ShipCity ,
DATEADD(D, 0, DATEDIFF(D, 0, OrderDate)),
COUNT(OrderID), SUM(Freight) FROM dbo.Orders
GROUP BY ShipCountry, ShipCity, CAST(FLOOR( CAST( OrderDate AS FLOAT ) )AS DATETIME)

The results look like this and can serve as a high performance data source for reporting tools like Izenda Reports.

Izenda-Speed-Dating/OrderDateAggregateResults.png

A self-service tool like Izenda Reports could then power an advanced high performance

We now have a much smaller data set to work with and our indexed view, combined with gigabytes of memory, will deliver high performance real-time results on this data with daily granularity. If each customer places 100 orders per day, we’ve reduced our working set 99%. A full TB of data could be cached and rapidly analyzed in 10 GB of system memory. We can now give users access to it with a self-service reporting tool like Izenda and go on vacation!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here