Records and Clusters

Question

5.00/5 (1 vote)

See more: (untagged)

For many varied reasons, I decided to write my own balanced binary tree library. One feature of my library that I haven't seen in other freely available libraries is "clusters", and a "recluster" function.

The problem that I am attacking is that as elements are randomly added to the tree, the tree rotations that automatically happen to keep the tree balanced tend to spread the edge pointers all over the place. The end result is that in a large tree, a typical traversal can require reading nodes from all over physical memory (or disk pages): a very inefficient use of cache. By clustering several subtree "near the root nodes" together in physical memory, tree traversals can be sped-up significantly by keeping nodes that always get visited near one another within physical memory (or on the same disk page). For example, in a properly clustered tree, with clusters large enough to hold 4 levels of records (1+2+4+8 = 15 records), a tree of height 16 (>45000 records) can be traversed in only 4 cluster reads instead of 16 reads that might be required of a non-clustered tree.

What I don't know is a good way to measure how "out-of-cluster" a tree is. I'd like to be able to determine when a tree needs to be reclustered. I'm sure there is good theory out there for this; and the idea doesn't seem to be that difficult, but I haven't run onto the solution yet. I've spent a little bit of time googling for a method, and probably a couple of hours thinking about it. Nothing yet.

Any ideas or helpful links?

Posted 5-Nov-08 4:40am

DQNOK

Add a Solution

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

Member 4194593 · Accepted Answer · 2008-11-05T06:28:00

Have you ever considered B+ trees as implemented by "The C Database Tool Chest", Mix Software. Mix is still on the web - Google "Mix Software". Note that the term "database" is, in fact, just a B+ tree. The database is a tree of memory or file blocks which contain multiple keys and records.

I got this in '85 and have used it ever since. I got the source software (don't know if this is still available). Not Open Source, but any programs that are released can be used without any royalty, only restriction is that it must be linked into an object, no DLLs, and the neither the source nor the library can be released - only the executable with the embedded code. With the source, you can make any modifications you want.

I modified it (originally 16bit - used Borland C to compile) to now use VS2008 thus 32 bit and supports >4GB files. Major rewrite to unwind the INT usage which now is 32 bit - in 16 bit an INT was 16 bit. Many conflicts with structure layout, and old 16 bit databases couldn't be read with the 32 bit version, nor the other way around. I built my version where I could declare which size of INT I wanted so I had only one converter to work with. I created two converters (16 bit and 32 bit) which could read the appropriate data base (in ascending order) and output a flat file with the data it contained, and then read in the flat file and write the other bit sized database.

Dave.