Introduction
I've always liked tree structures. I have worked with them in many situations, and I have searched efficient ways to manage them in SQL Server. As the limit of 32 nested levels in the SQL Server (both 2000 and 2005 versions) impedes us to attack this issue recursively, we have to allot an overhead in storing and accessing nested structures in database tables. In this article, I will try to explain how to efficiently store a tree structure in a SQL Server table, access it easily from T-SQL and clients, and transform it in to a structure recognized by OLAP Dimensions in Microsoft Analysis Services.
The question is how to efficiently and easily store and access tree structures, and how to transform them as in the pictures below:
Tree structure storage
The well-known method to store a tree in a relational table is to have a link between a node and its parent. So, the ID
and P_ID
fields are enough for this approach, where P_ID
is NULL
for root nodes. The access method for this structure could be: storing the absolute path (starting from the root) for a node, storing the left ID and right ID in a different table if the structure is a binary tree, storing once again the ID
and P_ID
fields values and a LEVEL
value in another table, storing left and right index values for every node in the same table etc. In my opinion, the most efficient way is the last one, because the access will be very easy using just a simple SELECT
statement and, for large structures, the storage size is minimum. More information about trees can be obtained from the CodeProject article, General trees persisted in relational databases, and the MSDN technical article "Hierarchical Data and Scope Checking".
Using the last technique, we will have two supplementary columns in the table: NLEFT
, storing values for the left index, and NRIGHT
for right index. We have to carefully set the appropriate values for these columns, because the "left" value for a node must always be between the "left" value and the "right" value of its ancestors. Having these conditions, a simple self join for the table with a filter for a node will return the entire hierarchy:
SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1
The CREATE TABLE
statement is simple:
CREATE TABLE TREE (
ID int NOT NULL ,
P_ID int NULL ,
NAME varchar (100),
NLEFT int NOT NULL CONSTRAINT DF_TREE_NLEFT DEFAULT (-1),
NRIGHT int NOT NULL CONSTRAINT DF_TREE_NRIGHT DEFAULT (-1),
NLEVEL int NULL CONSTRAINT DF_TREE_NLEVEL DEFAULT (-1),
CONSTRAINT PK_TREE PRIMARY KEY CLUSTERED (ID)
)
I have added the NLEVEL
column to store the level of the nodes useful in further needs. The level starts from 1 for the root.
The access will not be so difficult, but how can we set the appropriate values for NLEFT
and NRIGHT
? I have chosen the trigger implementation, to be sure that every time a node is inserted, the hierarchical structure is stored correctly. We could update NLEFT
and NRIGHT
values on deletion (even that is not necessary), and if the P_ID
field value is changed.
I present only the trigger for insert action, but it can be done in the same manner for all update P_ID
s and delete as well. The trigger checks if the new inserted node is a root or a descendant node. If it is a root, it generates a new NLEFT
value (it allocates a maximum of 10000 values per hierarchy) and sets the NLEVEL
to 1. If it is a child node, set values depending on if the node is the first (there are no other children on the same level) or last (there are other children for corresponding levels and the parent) child added:
IF @P_ID IS NULL
BEGIN
SELECT @MAX_VALUE = ISNULL(MAX(NLEFT), 0) + 10000
FROM TREE WHERE P_ID IS NULL
SELECT @NLEVEL = 1
END
ELSE
BEGIN
SELECT @MAX_VALUE = ISNULL(MAX(NRIGHT), 0)
FROM TREE WHERE P_ID = @P_ID AND ID < @ID
SELECT @NLEVEL = ISNULL(NLEVEL, 0)
FROM TREE WHERE P_ID = @P_ID AND ID < @ID
IF @MAX_VALUE = 0
BEGIN
SELECT @MAX_VALUE = ISNULL(NLEFT, 0) FROM TREE WHERE ID = @P_ID
SELECT @NLEVEL = ISNULL(NLEVEL, 0) + 1 FROM TREE WHERE ID = @P_ID
END
END
UPDATE TREE SET NLEFT = @MAX_VALUE + 1,
NRIGHT = @MAX_VALUE + 2,
NLEVEL = @NLEVEL WHERE ID = @ID
After the values are stored, the trigger ensures that the corresponding ancestors have appropriate NLEFT
and NRIGHT
values. This is done from the root to the leaves, in Depth-First Traversal. First, it identifies the root node, and then it applies the recalculation algorithm:
IF @P_ID IS NOT NULL
BEGIN
SELECT @ROOT_ID = ISNULL(P.ID, -1)
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE C.ID = @P_ID AND P.P_ID IS NULL
EXEC TREE_RECALC_ITER @ROOT_ID
END
The recalculation or numbering algorithm is:
- get the
NLEFT
value of the root;
- move down the hierarchy, and find the first child (from left to right) of the topmost parent, and give it a
NLEFT
value of 2;
- continue down the hierarchy, setting the
NLEFT
value with last NLEFT
+ 1, and if a leaf node is found, count its NLEFT
and NRIGHT
values (NRIGHT
= NLEFT
+ 1);
- continue this numbering for all next possible child nodes at the current level of the hierarchy;
- after all leaf children nodes are numbered, go back to their parent, and set its
NRIGHT
values with the maximum NRIGHT
children value + 1;
- continue numbering the siblings for the current node;
- continue walking the hierarchy until all nodes are numbered;
The algorithm is implemented in two manners: recursive and iterative.
The recursive method is simple, and it is found in the TREE_RECALC_REC
stored procedure. The stored procedure checks if the node currently processed has children. If it has, apply itself for all the children, setting the @NRIGHT
output parameter with the corresponding value. If the node has no children (it is a leaf node), simply set the NLEFT
and NRIGHT
values:
SELECT @NCOUNT = COUNT(*), @NLEVEL2 = @NLEVEL + 1
FROM TREE
WHERE P_ID = @NPARENT
IF @NCOUNT > 0
BEGIN
SELECT @NNEWLEFT = @NLEFT + 1
SELECT @NTEMPID = 0
SELECT @NTEMPID = MIN(ID), @NCOUNTER = COUNT(*)
FROM TREE
WHERE P_ID = @NPARENT AND ID > @NTEMPID
WHILE (@NCOUNTER > 0)
BEGIN
EXEC TREE_RECALC_REC @NNEWLEFT,
@NTEMPID, @NLEVEL2, @NRIGHT OUTPUT
SELECT @NNEWLEFT = @NRIGHT + 1
SELECT @NTEMPID = MIN(ID), @NCOUNTER = COUNT(*)
FROM TREE
WHERE P_ID = @NPARENT AND ID > @NTEMPID
END
UPDATE TREE SET NLEFT = @NLEFT,
NRIGHT = @NRIGHT + 1, NLEVEL = @NLEVEL
WHERE ID = @NPARENT
SELECT @NRIGHT = @NRIGHT + 1
END
ELSE
BEGIN
SELECT @NRIGHT = @NLEFT + 1
UPDATE TREE SET NLEFT = @NLEFT,
NRIGHT = @NRIGHT,
NLEVEL = @NLEVEL WHERE ID = @NPARENT
END
As it is known, the 32 nested levels limit forces us to not store more than 32 levels on a hierarchy.
The iterative manner solves the 32 nested levels limit issue. The workaround is to use a "stack" as it is explained in the Expanding Hierarchies MSDN article. The "stack" is a temporary table which stores all the nodes which belong to the same parent, starting with the root. The nodes are processed in a loop using a level variable, and when there is no node with the current level in the "stack", the level is decreased with 1 (the previous parent level is processed). The level variable takes values form 1 to MAX(hierarchy level)
. The same algorithm may be applied for a child when you want to retrieve its ancestors, but the level variable will start from the child level to 1. When the last leaf node in a sub-hierarchy (the node with the greatest ID
value on its level and parent) is processed, the numbering algorithm should be applied over the ancestors. For this, we will need another "stack" table which allows numbering from child to ancestors. The implementation takes much overhead indeed, but the nested levels are limitless. The stored procedure TREE_RECALC_ITER
contains the iterative recalculation algorithm:
WHILE @NLEVEL > 0
BEGIN
IF EXISTS(SELECT * FROM #TPC WHERE NLEVEL = @NLEVEL)
BEGIN
SELECT TOP 1 @ID = ID, @P_ID = P_ID,
@NAME = NAME FROM #TPC WHERE NLEVEL = @NLEVEL ORDER BY ID
INSERT INTO #T VALUES(@ID, @P_ID, @NAME, @NLEFT, @NRIGHT, @NLEVEL)
DELETE FROM #TPC WHERE NLEVEL = @NLEVEL AND ID = @ID
INSERT INTO #TPC SELECT ID, P_ID, NAME,
@NLEVEL + 1 FROM TREE WHERE P_ID = @ID
IF @@ROWCOUNT > 0
SET @NLEVEL = @NLEVEL + 1
IF EXISTS(SELECT ID FROM TREE WHERE P_ID = @ID)
BEGIN
SET @NLEFT = @NLEFT + 1
SET @NRIGHT = @NLEFT + 1
END
ELSE
BEGIN
IF @ID = (SELECT MAX(ID) FROM TREE WHERE
NLEVEL = @NLEVEL AND P_ID = @P_ID)
BEGIN
PRINT LTRIM(RTRIM(STR(@ID))) + ' ' +
@NAME + ' ' + LTRIM(RTRIM(STR(@NRIGHT)))
SET @NLEVEL2 = 1
TRUNCATE TABLE #TCP
INSERT INTO #TCP SELECT ID, P_ID, NAME,
@NLEVEL2 FROM #T WHERE ID = @P_ID
WHILE @NLEVEL2 > 0
BEGIN
IF EXISTS(SELECT * FROM #TCP WHERE NLEVEL = @NLEVEL2)
BEGIN
SELECT TOP 1 @ID2 = ID, @P_ID2 = P_ID,
@NAME2 = NAME FROM #TCP
WHERE NLEVEL = @NLEVEL2 ORDER BY ID
SET @NRIGHT = @NRIGHT + 1
UPDATE #T SET NRIGHT = @NRIGHT WHERE ID = @ID2
DELETE FROM #TCP WHERE NLEVEL = @NLEVEL2 AND ID = @ID2
INSERT INTO #TCP SELECT ID, P_ID, NAME,
@NLEVEL2 + 1 FROM #T WHERE ID = @P_ID2
IF @@ROWCOUNT > 0
SET @NLEVEL2 = @NLEVEL2 + 1
END
ELSE
BEGIN
SET @NLEVEL2 = @NLEVEL2 - 1
END
END
END
ELSE
BEGIN
SET @NLEFT = @NRIGHT + 1
SET @NRIGHT = @NLEFT + 1
END
END
END
ELSE
BEGIN
SET @NLEVEL = @NLEVEL - 1
SELECT @NLEFT = MAX(NRIGHT) + 1 FROM #T WHERE NLEVEL = @NLEVEL
SET @NRIGHT = @NLEFT + 1
END
END
Tree structure access
The hierarchy can now be accessed easily using simple SQL statements:
- select an entire hierarchy:
SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1
- select all the ancestors for a node:
SELECT P.ID, P.NAME, P.P_ID
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE C.ID = 5
- select all the leaf nodes in a hierarchy:
SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1 AND C.NRIGHT - C.NLEFT = 1
- retrieve all the non-leaf nodes in a hierarchy:
SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1 AND C.NRIGHT – C.NLEFT <> 1
- select an entire structure in XML mode (with two nested levels, as it is implemented in the
TREE_GET_XML
stored procedure):
SELECT
1 AS TAG
, NULL AS PARENT
, '' AS [root!1]
, NULL AS [node!2!id]
, NULL AS [node!2!p_id]
, T.NAME AS [node!2!name]
FROM TREE T
WHERE T.ID = @ROOT_ID
UNION
SELECT 2, 1, NULL, C.ID, C.P_ID, C.NAME
FROM TREE P
INNER JOIN TREE C ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = @ROOT_ID
FOR XML EXPLICIT
The last XML output is not so useful for the clients. They probably need the underlying tree structure which SQL Server XML features can't obtain, and not this "fake" ID-parent ID implementation. So, the next idea will be to re-arrange the "fake" tree structure in a XML nested hierarchy. This will be done using the TREE_GET_XML
stored procedure, an XML web service, and a general XSL transform stylesheet. Using id
and p_id
attributes, the stylesheet will put the nodes in a proper tree order. The stylesheet is incredibly simple. It applies a template starting from the root node (//*[not(@p_id)]
), and this template applies itself recursively on all the nodes that has the p_id
attribute value equal to its id
attribute value (//*[@p_id = $id]
):
="1.0"="UTF-8"
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<xsl:apply-templates
select="//*[name() = 'node' and not(@p_id)]"/>
</xsl:template>
<xsl:template match="*">
<xsl:variable name="id" select="@id"/>
<xsl:element name="node">
<xsl:apply-templates select="@*"/>
<xsl:apply-templates select="//*[@p_id = $id]"/>
</xsl:element>
</xsl:template>
<xsl:template match="@*">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
The tree_util
XML web service contains two web methods:
GetDirectTree
– returns the XML output as it is gathered by TREE_GET_XML
;
GetStructuredTree
– returns the XML output gathered by TREE_GET_XML
and transformed in an underlying tree structure.
Tree to OLAP transformation
Suppose that this tree structure will serve as a data source for an OLAP cube dimension. The dimensions and the cube can be created easily using Analysis Services. But what happens when the tree structure has changed, having one level more? Is it enough just to refresh the dimensions and the cube? The answer is no, the cube will not reflect the changes.
Let's imagine that we have a data cube which has a TREE_OLAP
dimension (doesn't matter what it contains – countries, regions, and cities; departments; geographical areas etc.) and a TIME
dimension. A problem that I consider important in OLAP cubes creation and update is, what happens if the main structure of a dimension is changing and I have to add or delete some levels? How easily can I update the dimension and the cube? And, if possible, do that automatically.
TREE_OLAP
dimension has as data source the TREE_OLAP
table which contains the tree nodes in a totally different structure. The ID-parent ID tree implementation doesn't help the OLAP dimensions to get the data and to process it. They need a more redundant structure to arrange tree nodes on different levels. They need a column in the data source for every dimension level. The structure they need is the one shown in the second picture at the beginning of this article.
The TREE_OLAP
table must be recreated every time the TREE
hierarchy is changed. The columns in this table are the tree ID
taken from the TREE
table and, for every level of the hierarchy, a varchar
column named "N", and the level value (e.g.: if the tree structure has three levels, the TREE_OLAP
table will have 4 columns: ID
int
, N1
varchar(100)
, N2
varchar(100)
, and N3
varchar(100)
).
The TREE_OLAP
table is re-created and filled with the appropriate values using the TREE_2_OLAP
stored procedure. The procedure checks the maximum level of the hierarchy and re-creates the TREE_OLAP
table. To populate this table, it must traverse the tree in a similar manner as it performs the TREE_RECALC_ITER
stored procedure (with two "stack" temporary tables). For the currently processed node, the procedure inserts a record in the TREE_OLAP
table at the corresponding "N" + LEVEL
column:
SET @SQL = N'INSERT INTO TREE_OLAP VALUES(@ID'
SET @I = 1
WHILE @I <= @MAX_LEVEL
BEGIN
IF @I = @NLEVEL
SET @SQL = @SQL + N', @NAME'
ELSE
SET @SQL = @SQL + N', ''<none>'''
SET @I = @I + 1
END
SET @SQL = @SQL + N')'
EXEC sp_executesql @SQL, N'@ID INT,
@NAME VARCHAR(100)', @ID = @ID, @NAME = @NAME
After that, for each ancestor, perform the update for the inserted record on the corresponding ancestor level column:
SET @SQL = N'UPDATE TREE_OLAP SET N' +
LTRIM(RTRIM(STR(@NLEVEL2))) + ' = @NAME WHERE ID = @ID'
EXEC sp_executesql @SQL, N'@ID INT, @NAME VARCHAR(100)',
@ID = @ID, @NAME = @NAME2
Creating the Sales cube and dimensions
If we add some records in the TREE
table, add the daily records for 2005 and 2006 in the TIME_BY_DAY
table, and import data from the sales.txt file (which contains the TREE_ID
, TIME_ID
, and SALES_VALUE
fields) into the SALES
table, we can create the Sales
cube. The sales.txt file contains records related only for 2005.
The time dimension is general, and implemented as it is in the Analysis Services "FoodMart 2000" sample database. The fnIsLeapYear
, fnGetDaysNumber
functions and the TIME_BY_DAY_GEN
stored procedure generates daily records for a specified year in the TIME_BY_DAY
table which will serve as the data source for the TIME
dimension. The TREE_OLAP
dimension is based on a star schema (only one table per dimension). The fact table called SALES
has the following structure: ID
int
, TREE_ID
int
(foreign key to the ID
column in the TREE
table), TIME_ID
int
(foreign key to the ID
column in the TIME_BY_DAY
table), SALES_VALUES
float
(the measure column).
The cube data visualization will be:
The RefreshSalesCube application
The changes in the dimension and in the cube database can be done manually, using Analysis Services user interfaces, and programmatically using the DSO (Decision Support Objects – interop assembly) namespace for .NET framework 1.1, and AMO namespace for .NET framework 2.0.
The RefreshSalesCube
is a simple console application which:
- connects to the local Analysis Server:
DSO.ServerClass dsoServer = new ServerClass();
dsoServer.Connect("localhost");
- finds the
TREE_DB
database:
DSO.OlapCollection coll = (DSO.OlapCollection)dsoServer.MDStores;
DSO.Database dsoDB = (DSO.Database)coll.Item("TREE_DB");
- retrieves the
Sales
data cube:
coll = (DSO.OlapCollection)dsoDB.MDStores;
DSO.Cube dsoCube = (DSO.Cube)coll.Item("Sales");
- retrieves the
TREE_OLAP
dimension:
coll = (DSO.OlapCollection)dsoDB.Dimensions;
DSO.Dimension dsoDim = (DSO.Dimension)coll.Item("TREE_OLAP");
- re-creates and re-processes the dimension, detaching the shared dimension from the cube, removing dimension levels, and re-creating the new levels based on the columns of the new
TREE_OLAP
table:
dsoCube.Dimensions.Remove(dsoDim.Name);
int i = dsoDim.Levels.Count;
while(i > 0)
{
dsoDim.Levels.Remove(i);
i--;
}
SqlConnection cnn = new SqlConnection(
ConfigurationSettings.AppSettings["ConnectionString"]);
cnn.Open();
SqlDataAdapter da = new SqlDataAdapter("SELECT TOP" +
" 0 * FROM TREE_OLAP", cnn);
DataTable dt = new DataTable();
da.Fill(dt);
cnn.Close();
int nColumns = dt.Columns.Count;
int adVarChar = 200;
for(i = 1; i < nColumns; i++)
{
DSO.Level lvl = (DSO.Level)dsoDim.Levels.AddNew(
dt.Columns[i].ColumnName,
DSO.SubClassTypes.sbclsRegular);
lvl.MemberKeyColumn = "\"dbo\".\"TREE_OLAP\".\"" +
dt.Columns[i].ColumnName + "\"";
lvl.ColumnType = (short)adVarChar;
lvl.ColumnSize = 100;
}
dsoDim.Process(DSO.ProcessTypes.processFull);
- re-processes the cube, adding the dimension and re-creating the join between the fact table and the dimension tables:
DSO.Dimension dim = (DSO.Dimension)
dsoCube.Dimensions.AddNew("TREE_OLAP",
DSO.SubClassTypes.sbclsRegular);
dim = dsoDim;
dsoCube.JoinClause = "\"dbo\".\"Sales\".\"TREE_ID\" =" +
" \"dbo\".\"TREE_OLAP\".\"ID\" AND" +
" \"dbo\".\"Sales\".\"TIME_ID\" =" +
" \"dbo\".\"TIME_BY_DAY\".\"ID\"";
dsoCube.Update();
dsoCube.Process(DSO.ProcessTypes.processFull);
Updating the Sales cube and the TREE_OLAP dimension
Let's add a new record in the TREE
table, which will increase the level of the hierarchy:
INSERT INTO TREE(ID, P_ID, NAME) VALUES(21, 20, 'MamaW')
The new record will have a correspondent in the fact table, in a day in 2006, let's say, '2006-07-18', with a value for SALES_VALUES
of 1000:
DECLARE @TIME_ID INT
SELECT @TIME_ID = ID FROM TIME_BY_DAY WHERE T_DATE = '2006-07-18'
INSERT INTO SALES VALUES(21, @TIME_ID, 1000)
We can just re-create the TREE_OLAP
table:
EXEC TREE_2_OLAP 1
and run the RefreshSalesCube application.
The new cube data visualisation will reflect the changes without a manual update:
Conclusion
Using the implementation on the OLTP database, and a few lines of code to re-create and re-process the cube, the structure transformation and data changing is transparent for the reporting clients which consume the cube data. The process can be automated and scheduled using SQL Server jobs, and the time and work performance can be improved.