See more:
(untagged)
For many varied reasons, I decided to write my own balanced binary tree library. One feature of my library that I haven't seen in other freely available libraries is "clusters", and a "recluster" function.
The problem that I am attacking is that as elements are randomly added to the tree, the tree rotations that automatically happen to keep the tree balanced tend to spread the edge pointers all over the place. The end result is that in a large tree, a typical traversal can require reading nodes from all over physical memory (or disk pages): a very inefficient use of cache. By clustering several subtree "near the root nodes" together in physical memory, tree traversals can be sped-up significantly by keeping nodes that always get visited near one another within physical memory (or on the same disk page). For example, in a properly clustered tree, with clusters large enough to hold 4 levels of records (1+2+4+8 = 15 records), a tree of height 16 (>45000 records) can be traversed in only 4 cluster reads instead of 16 reads that might be required of a non-clustered tree.
What I don't know is a good way to measure how "out-of-cluster" a tree is. I'd like to be able to determine when a tree needs to be reclustered. I'm sure there is good theory out there for this; and the idea doesn't seem to be that difficult, but I haven't run onto the solution yet. I've spent a little bit of time googling for a method, and probably a couple of hours thinking about it. Nothing yet.
Any ideas or helpful links?