I have a set of nodes (N) arranged into one or more networks. Each network has one or more root nodes, and each node has a unique ID (UID) that is a 32-bit integer. For scale: 100,000 nodes would be a large problem, and 1,000,000 would be almost infeasibly gigantic. I need to check that every node is connected to at least one root node, and I'm trying to design a hash function to help me.
My current code starts at each root node and traces all paths, recording each connected node. It does this without hashing, since previously I could rely on the UIDs being relatively small and mostly contiguous: along with a starting offset, it just uses the UID directly to index into a bitmap array. The bitmap is then compared against the members of N to determine which nodes are not connected.
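To make the setup concrete, here is a rough sketch of what my current approach does. The names and the adjacency structure are illustrative, not my actual code; the point is the visited-bitmap indexed by (UID - offset):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch: walk from every root, marking reached nodes in a
// bitmap indexed by (uid - offset). 'adjacency' maps a UID to the UIDs it
// links to; 'offset' is the smallest UID in N, 'max_uid' the largest.
std::vector<bool> mark_connected(
    const std::unordered_map<uint32_t, std::vector<uint32_t>>& adjacency,
    const std::vector<uint32_t>& roots,
    uint32_t offset, uint32_t max_uid)
{
    std::vector<bool> bitmap(max_uid - offset + 1, false);
    std::vector<uint32_t> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        uint32_t uid = stack.back();
        stack.pop_back();
        if (bitmap[uid - offset]) continue;   // already visited
        bitmap[uid - offset] = true;
        auto it = adjacency.find(uid);
        if (it != adjacency.end())
            for (uint32_t next : it->second) stack.push_back(next);
    }
    return bitmap;  // false entries that correspond to members of N are unconnected
}
```

This is fine while `max_uid - offset` stays small, but the bitmap's size is proportional to the UID *range*, not to the number of nodes, which is exactly the problem described below.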
Now, however, I'm coming across situations where there are large gaps in the UIDs (e.g., jumping from 100,000 to 100,000,000). This makes simply increasing the size of the bitmap more and more impractical, hence the need for a hash. In the general case, provided the user hasn't completely messed up, I would expect the number of unconnected nodes to be smaller than the number of connected ones (but this isn't guaranteed).
How do I, for instance, pick how many buckets I should use for hashing to balance memory use against the likelihood of collisions? Can this (should this?) be done dynamically, i.e., should I allocate my bucket array based on the size of N? Is it more efficient to start by hashing all my node UIDs into buckets, then clearing the 'connected' hashes as they are discovered? How can I determine the best way to handle collisions? Is it possible to create a perfect hash dynamically?
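For context, here is the kind of thing I'm imagining for the "insert everything, then clear as discovered" idea. This sketch leans on `std::unordered_set`, which already grows its bucket array dynamically and resolves collisions internally, so part of my question is whether rolling my own hash would even buy me anything over this:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative sketch, not working code: seed a hash set with every UID in N,
// then erase UIDs as the traversal reaches them. Whatever remains at the end
// was never reached from any root.
std::unordered_set<uint32_t> find_unconnected(
    const std::vector<uint32_t>& all_uids,
    const std::unordered_map<uint32_t, std::vector<uint32_t>>& adjacency,
    const std::vector<uint32_t>& roots)
{
    std::unordered_set<uint32_t> unvisited(all_uids.begin(), all_uids.end());
    std::vector<uint32_t> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        uint32_t uid = stack.back();
        stack.pop_back();
        if (unvisited.erase(uid) == 0) continue;  // already visited
        auto it = adjacency.find(uid);
        if (it != adjacency.end())
            for (uint32_t next : it->second) stack.push_back(next);
    }
    return unvisited;  // the unconnected nodes
}
```

Memory here scales with the number of nodes rather than the UID range, which is what I'm after; I just don't know how to reason about the bucket-count and collision trade-offs the standard container is making for me.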
What I have tried:
Nothing much more than Googling. I'm not a computer scientist, so this is a bit new to me, and I'm finding the masses of online information hard to digest - hence the appeal to the experts.