Tree utilities in SQL Server 2000 and 2005, and OLAP implementations

Dan Radu

4.72/5 (25 votes)

19 Jul 2006CPOL11 min read

3.3K

This article describes how to efficiently store and access tree structures in a SQL Server database, and how to use them in OLAP implementations.

Introduction

I've always liked tree structures. I have worked with them in many situations, and I have searched efficient ways to manage them in SQL Server. As the limit of 32 nested levels in the SQL Server (both 2000 and 2005 versions) impedes us to attack this issue recursively, we have to allot an overhead in storing and accessing nested structures in database tables. In this article, I will try to explain how to efficiently store a tree structure in a SQL Server table, access it easily from T-SQL and clients, and transform it in to a structure recognized by OLAP Dimensions in Microsoft Analysis Services.

The question is how to efficiently and easily store and access tree structures, and how to transform them as in the pictures below:

Tree structure storage

The well-known method to store a tree in a relational table is to have a link between a node and its parent. So, the ID and P_ID fields are enough for this approach, where P_ID is NULL for root nodes. The access method for this structure could be: storing the absolute path (starting from the root) for a node, storing the left ID and right ID in a different table if the structure is a binary tree, storing once again the ID and P_ID fields values and a LEVEL value in another table, storing left and right index values for every node in the same table etc. In my opinion, the most efficient way is the last one, because the access will be very easy using just a simple SELECT statement and, for large structures, the storage size is minimum. More information about trees can be obtained from the CodeProject article, General trees persisted in relational databases, and the MSDN technical article "Hierarchical Data and Scope Checking".

Using the last technique, we will have two supplementary columns in the table: NLEFT, storing values for the left index, and NRIGHT for right index. We have to carefully set the appropriate values for these columns, because the "left" value for a node must always be between the "left" value and the "right" value of its ancestors. Having these conditions, a simple self join for the table with a filter for a node will return the entire hierarchy:

SQL

SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1

The CREATE TABLE statement is simple:

SQL

CREATE TABLE TREE (
    ID int NOT NULL ,
    P_ID int NULL ,
    NAME varchar (100),
    NLEFT int NOT NULL CONSTRAINT DF_TREE_NLEFT DEFAULT (-1),
    NRIGHT int NOT NULL CONSTRAINT DF_TREE_NRIGHT DEFAULT (-1),
    NLEVEL int NULL CONSTRAINT DF_TREE_NLEVEL DEFAULT (-1),
    CONSTRAINT PK_TREE PRIMARY KEY  CLUSTERED (ID) 
)

I have added the NLEVEL column to store the level of the nodes useful in further needs. The level starts from 1 for the root.

The access will not be so difficult, but how can we set the appropriate values for NLEFT and NRIGHT? I have chosen the trigger implementation, to be sure that every time a node is inserted, the hierarchical structure is stored correctly. We could update NLEFT and NRIGHT values on deletion (even that is not necessary), and if the P_ID field value is changed.

I present only the trigger for insert action, but it can be done in the same manner for all update P_IDs and delete as well. The trigger checks if the new inserted node is a root or a descendant node. If it is a root, it generates a new NLEFT value (it allocates a maximum of 10000 values per hierarchy) and sets the NLEVEL to 1. If it is a child node, set values depending on if the node is the first (there are no other children on the same level) or last (there are other children for corresponding levels and the parent) child added:

SQL

IF @P_ID IS NULL 
BEGIN
    SELECT @MAX_VALUE = ISNULL(MAX(NLEFT), 0) + 10000 
           FROM TREE WHERE P_ID IS NULL
    SELECT @NLEVEL = 1
END
ELSE
BEGIN
    SELECT @MAX_VALUE = ISNULL(MAX(NRIGHT), 0) 
           FROM TREE WHERE P_ID = @P_ID AND ID < @ID
    SELECT @NLEVEL = ISNULL(NLEVEL, 0) 
           FROM TREE WHERE P_ID = @P_ID AND ID < @ID
    IF @MAX_VALUE = 0
    BEGIN
        SELECT @MAX_VALUE = ISNULL(NLEFT, 0) FROM TREE WHERE ID = @P_ID
        SELECT @NLEVEL = ISNULL(NLEVEL, 0) + 1 FROM TREE WHERE ID = @P_ID
    END
END
UPDATE TREE SET NLEFT = @MAX_VALUE + 1, 
       NRIGHT = @MAX_VALUE + 2, 
       NLEVEL = @NLEVEL WHERE ID = @ID

After the values are stored, the trigger ensures that the corresponding ancestors have appropriate NLEFT and NRIGHT values. This is done from the root to the leaves, in Depth-First Traversal. First, it identifies the root node, and then it applies the recalculation algorithm:

SQL

IF @P_ID IS NOT NULL
BEGIN
    SELECT @ROOT_ID = ISNULL(P.ID, -1)
    FROM TREE C 
    INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
    WHERE C.ID = @P_ID AND P.P_ID IS NULL
--        EXEC TREE_RECALC @ROOT_ID -- RECURSIVE APPROACH
    EXEC TREE_RECALC_ITER @ROOT_ID -- ITERATIVE APPROACH
END

The recalculation or numbering algorithm is:

get the NLEFT value of the root;
move down the hierarchy, and find the first child (from left to right) of the topmost parent, and give it a NLEFT value of 2;
continue down the hierarchy, setting the NLEFT value with last NLEFT + 1, and if a leaf node is found, count its NLEFT and NRIGHT values (NRIGHT = NLEFT + 1);
continue this numbering for all next possible child nodes at the current level of the hierarchy;
after all leaf children nodes are numbered, go back to their parent, and set its NRIGHT values with the maximum NRIGHT children value + 1;
continue numbering the siblings for the current node;
continue walking the hierarchy until all nodes are numbered;

The algorithm is implemented in two manners: recursive and iterative.

The recursive method is simple, and it is found in the TREE_RECALC_REC stored procedure. The stored procedure checks if the node currently processed has children. If it has, apply itself for all the children, setting the @NRIGHT output parameter with the corresponding value. If the node has no children (it is a leaf node), simply set the NLEFT and NRIGHT values:

SQL

SELECT @NCOUNT = COUNT(*), @NLEVEL2 = @NLEVEL + 1
FROM TREE
WHERE P_ID = @NPARENT

IF @NCOUNT > 0
BEGIN
    SELECT @NNEWLEFT = @NLEFT + 1
    SELECT @NTEMPID = 0
 
    SELECT @NTEMPID = MIN(ID), @NCOUNTER = COUNT(*) 
    FROM TREE 
    WHERE P_ID = @NPARENT AND ID > @NTEMPID
  
    WHILE (@NCOUNTER > 0)
    BEGIN
        EXEC TREE_RECALC_REC @NNEWLEFT, 
             @NTEMPID, @NLEVEL2, @NRIGHT OUTPUT
        SELECT @NNEWLEFT = @NRIGHT + 1
        SELECT @NTEMPID = MIN(ID), @NCOUNTER = COUNT(*) 
        FROM TREE 
        WHERE P_ID = @NPARENT AND ID > @NTEMPID
    END
 
    UPDATE TREE SET NLEFT = @NLEFT, 
           NRIGHT = @NRIGHT + 1, NLEVEL = @NLEVEL 
    WHERE ID = @NPARENT
    SELECT @NRIGHT = @NRIGHT + 1
END
ELSE
BEGIN
    SELECT @NRIGHT = @NLEFT + 1
    UPDATE TREE SET NLEFT = @NLEFT, 
           NRIGHT = @NRIGHT, 
           NLEVEL = @NLEVEL WHERE ID = @NPARENT
END

As it is known, the 32 nested levels limit forces us to not store more than 32 levels on a hierarchy.

The iterative manner solves the 32 nested levels limit issue. The workaround is to use a "stack" as it is explained in the Expanding Hierarchies MSDN article. The "stack" is a temporary table which stores all the nodes which belong to the same parent, starting with the root. The nodes are processed in a loop using a level variable, and when there is no node with the current level in the "stack", the level is decreased with 1 (the previous parent level is processed). The level variable takes values form 1 to MAX(hierarchy level). The same algorithm may be applied for a child when you want to retrieve its ancestors, but the level variable will start from the child level to 1. When the last leaf node in a sub-hierarchy (the node with the greatest ID value on its level and parent) is processed, the numbering algorithm should be applied over the ancestors. For this, we will need another "stack" table which allows numbering from child to ancestors. The implementation takes much overhead indeed, but the nested levels are limitless. The stored procedure TREE_RECALC_ITER contains the iterative recalculation algorithm:

SQL

WHILE @NLEVEL > 0
BEGIN
    IF EXISTS(SELECT * FROM #TPC WHERE NLEVEL = @NLEVEL)
    BEGIN
        SELECT TOP 1 @ID = ID, @P_ID = P_ID, 
               @NAME = NAME FROM #TPC WHERE NLEVEL = @NLEVEL ORDER BY ID

        INSERT INTO #T VALUES(@ID, @P_ID, @NAME, @NLEFT, @NRIGHT, @NLEVEL)

        DELETE FROM #TPC WHERE NLEVEL = @NLEVEL AND ID = @ID

        INSERT INTO #TPC SELECT ID, P_ID, NAME, 
               @NLEVEL + 1 FROM TREE WHERE P_ID = @ID
        IF @@ROWCOUNT > 0
            SET @NLEVEL = @NLEVEL + 1

        IF EXISTS(SELECT ID FROM TREE WHERE P_ID = @ID)
        BEGIN
            SET @NLEFT = @NLEFT + 1
            SET @NRIGHT = @NLEFT + 1
        END
        ELSE
        BEGIN
            IF @ID = (SELECT MAX(ID) FROM TREE WHERE 
                      NLEVEL = @NLEVEL AND P_ID = @P_ID)
            BEGIN
                PRINT LTRIM(RTRIM(STR(@ID))) + '    ' + 
                      @NAME + '    ' + LTRIM(RTRIM(STR(@NRIGHT)))
                SET @NLEVEL2 = 1
                TRUNCATE TABLE #TCP
                INSERT INTO #TCP SELECT ID, P_ID, NAME, 
                            @NLEVEL2 FROM #T WHERE ID = @P_ID
                WHILE @NLEVEL2 > 0
                BEGIN
                    IF EXISTS(SELECT * FROM #TCP WHERE NLEVEL = @NLEVEL2)
                    BEGIN
                        SELECT TOP 1 @ID2 = ID, @P_ID2 = P_ID, 
                                     @NAME2 = NAME FROM #TCP 
                                     WHERE NLEVEL = @NLEVEL2 ORDER BY ID
                        SET @NRIGHT = @NRIGHT + 1
                        UPDATE #T SET NRIGHT = @NRIGHT WHERE ID = @ID2
                        DELETE FROM #TCP WHERE NLEVEL = @NLEVEL2 AND ID = @ID2
                        INSERT INTO #TCP SELECT ID, P_ID, NAME, 
                               @NLEVEL2 + 1 FROM #T WHERE ID = @P_ID2
                        IF @@ROWCOUNT > 0
                            SET @NLEVEL2 = @NLEVEL2 + 1
                    END
                    ELSE
                    BEGIN
                        SET @NLEVEL2 = @NLEVEL2 - 1
                    END
                END

            END
            ELSE
            BEGIN
                SET @NLEFT = @NRIGHT + 1
                SET @NRIGHT = @NLEFT + 1
            END
        END

    END
    ELSE
    BEGIN
        SET @NLEVEL = @NLEVEL - 1
        SELECT @NLEFT = MAX(NRIGHT) + 1 FROM #T WHERE NLEVEL = @NLEVEL
        SET @NRIGHT = @NLEFT + 1
    END
END

Tree structure access

The hierarchy can now be accessed easily using simple SQL statements:

select an entire hierarchy:

SQL

SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1

select all the ancestors for a node:

SQL

SELECT P.ID, P.NAME, P.P_ID
FROM TREE C 
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE C.ID = 5

select all the leaf nodes in a hierarchy:

SQL

SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1 AND C.NRIGHT - C.NLEFT = 1

retrieve all the non-leaf nodes in a hierarchy:

SQL

SELECT C.ID, C.P_ID, C.NAME
FROM TREE C
INNER JOIN TREE P ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = 1 AND C.NRIGHT – C.NLEFT <> 1

select an entire structure in XML mode (with two nested levels, as it is implemented in the TREE_GET_XML stored procedure):

SQL

SELECT 
    1 AS TAG
    , NULL AS PARENT
    , '' AS [root!1]
    , NULL AS [node!2!id]
    , NULL AS [node!2!p_id]
    , T.NAME AS [node!2!name]
FROM TREE T
WHERE T.ID = @ROOT_ID
UNION
SELECT 2, 1, NULL, C.ID, C.P_ID, C.NAME
FROM TREE P
INNER JOIN TREE C ON C.NLEFT BETWEEN P.NLEFT AND P.NRIGHT
WHERE P.ID = @ROOT_ID
FOR XML EXPLICIT

The last XML output is not so useful for the clients. They probably need the underlying tree structure which SQL Server XML features can't obtain, and not this "fake" ID-parent ID implementation. So, the next idea will be to re-arrange the "fake" tree structure in a XML nested hierarchy. This will be done using the TREE_GET_XML stored procedure, an XML web service, and a general XSL transform stylesheet. Using id and p_id attributes, the stylesheet will put the nodes in a proper tree order. The stylesheet is incredibly simple. It applies a template starting from the root node (//*[not(@p_id)]), and this template applies itself recursively on all the nodes that has the p_id attribute value equal to its id attribute value (//*[@p_id = $id]):

XML

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
    <xsl:apply-templates 
        select="//*[name() = 'node' and not(@p_id)]"/>
</xsl:template>

<xsl:template match="*">
    <xsl:variable name="id" select="@id"/>
    <xsl:element name="node">
        <xsl:apply-templates select="@*"/>
        <xsl:apply-templates select="//*[@p_id = $id]"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>

The tree_util XML web service contains two web methods:

GetDirectTree – returns the XML output as it is gathered by TREE_GET_XML;
GetStructuredTree – returns the XML output gathered by TREE_GET_XML and transformed in an underlying tree structure.

Tree to OLAP transformation

Suppose that this tree structure will serve as a data source for an OLAP cube dimension. The dimensions and the cube can be created easily using Analysis Services. But what happens when the tree structure has changed, having one level more? Is it enough just to refresh the dimensions and the cube? The answer is no, the cube will not reflect the changes.

Let's imagine that we have a data cube which has a TREE_OLAP dimension (doesn't matter what it contains – countries, regions, and cities; departments; geographical areas etc.) and a TIME dimension. A problem that I consider important in OLAP cubes creation and update is, what happens if the main structure of a dimension is changing and I have to add or delete some levels? How easily can I update the dimension and the cube? And, if possible, do that automatically.

TREE_OLAP dimension has as data source the TREE_OLAP table which contains the tree nodes in a totally different structure. The ID-parent ID tree implementation doesn't help the OLAP dimensions to get the data and to process it. They need a more redundant structure to arrange tree nodes on different levels. They need a column in the data source for every dimension level. The structure they need is the one shown in the second picture at the beginning of this article.

The TREE_OLAP table must be recreated every time the TREE hierarchy is changed. The columns in this table are the tree ID taken from the TREE table and, for every level of the hierarchy, a varchar column named "N", and the level value (e.g.: if the tree structure has three levels, the TREE_OLAP table will have 4 columns: ID int, N1 varchar(100), N2 varchar(100), and N3 varchar(100)).

The TREE_OLAP table is re-created and filled with the appropriate values using the TREE_2_OLAP stored procedure. The procedure checks the maximum level of the hierarchy and re-creates the TREE_OLAP table. To populate this table, it must traverse the tree in a similar manner as it performs the TREE_RECALC_ITER stored procedure (with two "stack" temporary tables). For the currently processed node, the procedure inserts a record in the TREE_OLAP table at the corresponding "N" + LEVEL column:

SQL

SET @SQL = N'INSERT INTO TREE_OLAP VALUES(@ID'
SET @I = 1
WHILE @I <= @MAX_LEVEL
BEGIN
    IF @I = @NLEVEL
    SET @SQL = @SQL + N', @NAME'
    ELSE
        SET @SQL = @SQL + N', ''<none>'''
    SET @I = @I + 1
END
SET @SQL = @SQL + N')'
EXEC sp_executesql @SQL, N'@ID INT, 
     @NAME VARCHAR(100)', @ID = @ID, @NAME = @NAME

After that, for each ancestor, perform the update for the inserted record on the corresponding ancestor level column:

SQL

SET @SQL = N'UPDATE TREE_OLAP SET N' + 
           LTRIM(RTRIM(STR(@NLEVEL2))) + ' = @NAME WHERE ID = @ID'
EXEC sp_executesql @SQL, N'@ID INT, @NAME VARCHAR(100)', 
                   @ID = @ID, @NAME = @NAME2

Creating the Sales cube and dimensions

If we add some records in the TREE table, add the daily records for 2005 and 2006 in the TIME_BY_DAY table, and import data from the sales.txt file (which contains the TREE_ID, TIME_ID, and SALES_VALUE fields) into the SALES table, we can create the Sales cube. The sales.txt file contains records related only for 2005.

The time dimension is general, and implemented as it is in the Analysis Services "FoodMart 2000" sample database. The fnIsLeapYear, fnGetDaysNumber functions and the TIME_BY_DAY_GEN stored procedure generates daily records for a specified year in the TIME_BY_DAY table which will serve as the data source for the TIME dimension. The TREE_OLAP dimension is based on a star schema (only one table per dimension). The fact table called SALES has the following structure: ID int, TREE_ID int (foreign key to the ID column in the TREE table), TIME_ID int (foreign key to the ID column in the TIME_BY_DAY table), SALES_VALUES float (the measure column).

The cube data visualization will be:

The RefreshSalesCube application

The changes in the dimension and in the cube database can be done manually, using Analysis Services user interfaces, and programmatically using the DSO (Decision Support Objects – interop assembly) namespace for .NET framework 1.1, and AMO namespace for .NET framework 2.0.

The RefreshSalesCube is a simple console application which:

connects to the local Analysis Server:

DSO.ServerClass dsoServer = new ServerClass();
dsoServer.Connect("localhost");

finds the TREE_DB database:

DSO.OlapCollection coll = (DSO.OlapCollection)dsoServer.MDStores;
DSO.Database dsoDB = (DSO.Database)coll.Item("TREE_DB");

retrieves the Sales data cube:

coll = (DSO.OlapCollection)dsoDB.MDStores;
DSO.Cube dsoCube = (DSO.Cube)coll.Item("Sales");

retrieves the TREE_OLAP dimension:

coll = (DSO.OlapCollection)dsoDB.Dimensions;
DSO.Dimension dsoDim = (DSO.Dimension)coll.Item("TREE_OLAP");

re-creates and re-processes the dimension, detaching the shared dimension from the cube, removing dimension levels, and re-creating the new levels based on the columns of the new TREE_OLAP table:

dsoCube.Dimensions.Remove(dsoDim.Name);

int i = dsoDim.Levels.Count;
while(i > 0)
{
    dsoDim.Levels.Remove(i);
    i--;
}

SqlConnection cnn = new SqlConnection(
  ConfigurationSettings.AppSettings["ConnectionString"]);
cnn.Open();
SqlDataAdapter da = new SqlDataAdapter("SELECT TOP" + 
                         " 0 * FROM TREE_OLAP", cnn);
DataTable dt = new DataTable();
da.Fill(dt);
cnn.Close();

int nColumns = dt.Columns.Count;
int adVarChar = 200;
for(i = 1; i < nColumns; i++)
{
    DSO.Level lvl = (DSO.Level)dsoDim.Levels.AddNew(
                     dt.Columns[i].ColumnName, 
                     DSO.SubClassTypes.sbclsRegular);
    lvl.MemberKeyColumn = "\"dbo\".\"TREE_OLAP\".\"" + 
                          dt.Columns[i].ColumnName + "\"";
    lvl.ColumnType = (short)adVarChar;
    lvl.ColumnSize = 100;
}
dsoDim.Process(DSO.ProcessTypes.processFull);

re-processes the cube, adding the dimension and re-creating the join between the fact table and the dimension tables:

DSO.Dimension dim = (DSO.Dimension)
   dsoCube.Dimensions.AddNew("TREE_OLAP", 
   DSO.SubClassTypes.sbclsRegular);
dim = dsoDim;

dsoCube.JoinClause = "\"dbo\".\"Sales\".\"TREE_ID\" =" + 
                     " \"dbo\".\"TREE_OLAP\".\"ID\" AND" + 
                     " \"dbo\".\"Sales\".\"TIME_ID\" =" + 
                     " \"dbo\".\"TIME_BY_DAY\".\"ID\"";

dsoCube.Update();

dsoCube.Process(DSO.ProcessTypes.processFull);

Updating the Sales cube and the TREE_OLAP dimension

Let's add a new record in the TREE table, which will increase the level of the hierarchy:

SQL

INSERT INTO TREE(ID, P_ID, NAME) VALUES(21, 20, 'MamaW')

The new record will have a correspondent in the fact table, in a day in 2006, let's say, '2006-07-18', with a value for SALES_VALUES of 1000:

SQL

DECLARE @TIME_ID INT
SELECT @TIME_ID = ID FROM TIME_BY_DAY WHERE T_DATE = '2006-07-18'
INSERT INTO SALES VALUES(21, @TIME_ID, 1000)

We can just re-create the TREE_OLAP table:

SQL

EXEC TREE_2_OLAP 1

and run the RefreshSalesCube application.

The new cube data visualisation will reflect the changes without a manual update:

Conclusion

Using the implementation on the OLTP database, and a few lines of code to re-create and re-process the cube, the structure transformation and data changing is transparent for the reporting clients which consume the cube data. The process can be automated and scheduled using SQL Server jobs, and the time and work performance can be improved.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)