(untagged)

Optimizing building trees from a database

NielsHoldijk

0.00/5 (No votes)

20 Jan 2005

How a different way of looking at a problem can result in better performance.

Download source code - 24.9 Kb

Introduction

Some time ago, I had to write some code that created an object tree from a DataView. Pretty straight forward job, the code worked and I didn't bother with it for a while.

Then recently I started using the ANTS Profiler and found that the tree construction was actually rather slow. I examined my old code again and came up with a way of speeding it up significantly...

This article explains how the recursive character of the DataView can be used to get a big performance improvement.

Note: the method I present doesn't rely on incremental sorting of the DataView. Sorted backwards or even a mess will work. This is a very welcome property if the tree data is from an external, uncontrollable source.

What's in the ZIP?

The ZIP-file contains the tree building library code, and the test application code. Also included is the tree DataSet in XML format. The tree consists of 745 nodes, going about five levels deep. The test application is a simple console application that creates TreeNodes using the slow and the fast way and shows the loading time in seconds. It does this five times to get a better idea of the performance boost.

From the tree's point of view

In my original code, I handled the problem from the tree's point of view. The tree always has a root node, so I started with that node, and searched for its child nodes in the DataView. For every child node, I searched for their respective child nodes as well, and so on.

The DataSet consists of a collection of records with three columns: NodeID, ParentNodeID and NodeText. The data is recursive: The ParentNodeID refers to the NodeID of the parent node. The database makes sure that a node can't be attached to a non-existent parent node. Because of this rule, a root node is made by having the node point to itself.

  public class TreeNode
  {
    // Declarations

    private int id;               // The ID of the node

    private TreeNode parentNode;  // The parent node of the node

    private string text;          // The payload of the node

    private ArrayList childNodes; // Contains the childnodes of the node


    // Constructor

    public TreeNode(int id, TreeNode parentNode)
    {
      // Initialize the FastTreeNode

      this.id = id;
      this.parentNode = parentNode;
      this.text = "";
      childNodes = new ArrayList();

      // Check if a parent was supplied

      if (parentNode != null)
      {
        //Yes, then let the parentnode know it has a new child

        parentNode.ChildNodes.Add(this);
      }
    }

    // Properties

    public int ID
    {
      get {return id;}
    }

    public TreeNode ParentNode
    {
      get {return parentNode;}
      set
      {
        parentNode = value;
        // Let the parent node know it has a new child,

        // if possible, and if necessary

        if (parentNode != null && !parentNode.ChildNodes.Contains(this))
          parentNode.ChildNodes.Add(this);
      }
    }

    public string Text
    {
      get {return text;}
      set {text = value;}
    }

    public ArrayList ChildNodes
    {
      get {return childNodes;}
    }
  }

The TreeNode class basically reflects the database structure. It has an ID, and it refers to a parent. A root node has a null parent. As a simple payload, the node has a Text property.

public TreeNode LoadTree(DataView dataTree)
    {
      // Prepare the resulting rootnode

      TreeNode rootNode = null;

      // Loop through the dataview

      foreach (DataRowView dataNode in dataTree)
      {
        // Store the current record in typed variables

        int nodeID = (int)dataNode["NodeID"];
        int parentNodeID = (int)dataNode["ParentNodeID"];
        string text = dataNode["NodeText"].ToString();

        // Check if the current node is the rootnode

        if (nodeID == parentNodeID)
        {
          // Yes, so create the node, and start

          // searching for it's child nodes

          rootNode = new TreeNode(nodeID, null);
          rootNode.Text = text;
          LoadChildNodes(rootNode, dataTree);
        }
      }

      // Return the result

      return rootNode;
    }

    private void LoadChildNodes(TreeNode parentNode, DataView dataTree)
    {
      // Loop throught the dataview

      foreach (DataRowView dataNode in dataTree)
      {
        // Store the current record in typed variables

        int nodeID = (int)dataNode["NodeID"];
        int parentNodeID = (int)dataNode["ParentNodeID"];
        string text = dataNode["NodeText"].ToString();

        // Check if the current node is not the root node,

        // and if the current node

        // is a child node of the supplied parent node

        if (nodeID != parentNodeID && parentNodeID == parentNode.ID)
        {
          // Yes, create the node

          // and start searching for it's child nodes.

          TreeNode node = new TreeNode(nodeID, parentNode);
          node.Text = text;
          LoadChildNodes (node, dataTree);
        }
      }
    }

The LoadTree method calls the recursive LoadChildNodes method. This makes the code quite compact, however, there's one big problem with it: for every node, it scans the entire DataView, including already handled nodes. So with the test DataSet, 555770 records will be read. Combined with the fact that the DataView doesn't read too fast, the total time required to place all nodes is considerable.

Also, if you try to load a very deeply branched tree, it might cause stack overflow problems.

From the DataView's point of view

The DataView has a linear character: it starts at record 0 and ends at record x. Just going through the records from 0 to x to create each node doesn't seem possible at first: if its parent node was not created earlier, it's impossible to create the current node. Or is it?

The trick is to let go of the idea that each object you create should be complete. I don't know anything about the parent node, but I do know the ID it will have. So I can create an new node with only the parent's ID and no parent node, and store it in a Hashtable. If I find the ID of the empty node later on in the DataSet, I don't create it but get the stored node from the Hashtable instead and fill in the empty properties.

public TreeNode LoadTree(DataView dataTree)
    {
      // Loop through all records in the dataview

      foreach (DataRowView dataNode in dataTree)
      {
        // Store the current record in some typed variables

        int nodeID = (int)dataNode["NodeID"];
        int parentNodeID = (int)dataNode["ParentNodeID"];
        string text = dataNode["NodeText"].ToString();

        // Check if the node was already created

        if (nodeList.Contains(nodeID))
        {
          // Yes, fill in the missing properties

          TreeNode node = (TreeNode)nodeList[nodeID];
          node.Text = text;
          // If the node is not a root node,

          // and there's no parent yet, look up or create

          if (nodeID != parentNodeID && node.ParentNode == null)
          {
            TreeNode parentNode = null;
            // Check if the parentnode was already created

            if (nodeList.Contains(parentNodeID))
            {
              // Yes, so use that one

              parentNode = (TreeNode)nodeList[parentNodeID];
            }
            else
            {
              // The parentnode doesn't exist yet, so create a partial parentnode.

              parentNode = new TreeNode(parentNodeID, null);
              nodeList.Add(parentNodeID, parentNode);
            }
            node.ParentNode = parentNode;
          }
        }
        else
        {
          // New node, check if it's a root node

          if (nodeID == parentNodeID)
          {
            // Yes, so no need for looking up or creating a parentnode

            TreeNode node = new TreeNode(nodeID, null);
            node.Text = text;
            nodeList.Add(nodeID, node);
          }
          else
          {
            // New child node

            TreeNode parentNode = null;
            // Check if the parentnode was already created

            if (nodeList.Contains(parentNodeID))
            {
              // Yes, so use that one

              parentNode = (TreeNode)nodeList[parentNodeID];
            }
            else
            {
              // No, so create a partial parentnode.

              parentNode = new TreeNode(parentNodeID, null);
              nodeList.Add(parentNodeID, parentNode);
            }

            // Create the new node

            TreeNode node = new TreeNode(nodeID, parentNode);
            node.Text = text;
            nodeList.Add(nodeID, node);
          }
        }
      } 
     
      // Find the rootnode

      IDictionaryEnumerator nodeEnumerator = nodeList.GetEnumerator();
      while (nodeEnumerator.MoveNext())
      {
        TreeNode node = (TreeNode)nodeEnumerator.Value;
        if (node.ParentNode == null)
          return node;
      }

      // Return nothing if no rootnode was found

      return null;
    }

The LoadTree method here is a bit longer but it reads every record only once. Instead of searching nodes in the DataView, they're searched using the much faster Hashtable. The only additional thing that needs to be done at the end is locating the root node, whereas the SlowTree version has it ready at hand: it was the starting point. However, the extra time this extra code requires isn't really noticeable because of the speed with which the tree was built.

The results

On my 1.8GHz laptop, the test application requires just a little above two seconds to get the tree using the slow method. Using the fast method, it only takes about 0.004 seconds to create the same tree. That's about 500 times faster!

Note: of course, the test tree is a very simple tree. In real-life applications, each tree node might have a bigger payload than just a text. This will probably slow it down a little, but it will still be much faster than the original slow tree. I've seen improvements up to 400 times faster in my applications.

Conclusion

Applications can have a bad performance because:

it calls (fast) methods very often, or
it calls methods that are very slow, or, worst of all,
it calls methods that are very slow very often.

The solution offered in this article avoids this by keeping the calls to the slowest method (reading records from a DataView) to an absolute minimum. Admittedly, this is only possible because of the close relationship between the child- and parent objects, and some clever using of the little information we have about the objects.

Acknowledgements

I'd like to thank:

Daniel Strigl for his easy-to-use HighPerformanceTimer class[^] , which I used to take the measurements.
My colleague Jan, who originally came up with the described idea.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here