I have a set of nodes (N) arranged into one or more networks. Each network has one or more root nodes, and each node has a unique ID (UID) that is a 32-bit integer. For scale: 100,000 nodes would be a large problem, and 1,000,000 would be almost infeasibly gigantic. I need to check that every node is connected to at least one root node, and I'm trying to design a hash function to help me.
My current code starts at each root node and traces all paths, recording each connected node. It does this without hashing, since previously I could rely on the UIDs being relatively small and mostly contiguous: along with a starting offset, it just uses the UID directly to index into a bitmap array. The bitmap is then compared against the members of N to determine which nodes are not connected.
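To make the setup concrete, here is a rough sketch of what my current approach does. The names and the adjacency structure are illustrative, not my actual code; the point is the visited-bitmap indexed by (UID - offset):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch: walk from every root, marking reached nodes in a
// bitmap indexed by (uid - offset). 'adjacency' maps a UID to the UIDs it
// links to; 'offset' is the smallest UID in N, 'max_uid' the largest.
std::vector<bool> mark_connected(
    const std::unordered_map<uint32_t, std::vector<uint32_t>>& adjacency,
    const std::vector<uint32_t>& roots,
    uint32_t offset, uint32_t max_uid)
{
    std::vector<bool> bitmap(max_uid - offset + 1, false);
    std::vector<uint32_t> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        uint32_t uid = stack.back();
        stack.pop_back();
        if (bitmap[uid - offset]) continue;   // already visited
        bitmap[uid - offset] = true;
        auto it = adjacency.find(uid);
        if (it != adjacency.end())
            for (uint32_t next : it->second) stack.push_back(next);
    }
    return bitmap;  // false entries that correspond to members of N are unconnected
}
```

This is fine while `max_uid - offset` stays small, but the bitmap's size is proportional to the UID *range*, not to the number of nodes, which is exactly the problem described below.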
Now, however, I'm coming across situations where there are large gaps in the UIDs (e.g., jumping from 100,000 to 100,000,000). This makes simply increasing the size of the bitmap more and more impractical, hence the need for a hash. In the general case, provided the user hasn't completely messed up, I would expect the number of unconnected nodes to be smaller than the number of connected ones (but this isn't guaranteed).
How do I, for instance, pick how many buckets I should use for hashing to balance memory use against the likelihood of collisions? Can this (should this?) be done dynamically, i.e., should I allocate my bucket array based on the size of N? Is it more efficient to start by hashing all my node UIDs into buckets, then clearing the 'connected' hashes as they are discovered? How can I determine the best way to handle collisions? Is it possible to create a perfect hash dynamically?
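For context, here is the kind of thing I'm imagining for the "insert everything, then clear as discovered" idea. This sketch leans on `std::unordered_set`, which already grows its bucket array dynamically and resolves collisions internally, so part of my question is whether rolling my own hash would even buy me anything over this:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative sketch, not working code: seed a hash set with every UID in N,
// then erase UIDs as the traversal reaches them. Whatever remains at the end
// was never reached from any root.
std::unordered_set<uint32_t> find_unconnected(
    const std::vector<uint32_t>& all_uids,
    const std::unordered_map<uint32_t, std::vector<uint32_t>>& adjacency,
    const std::vector<uint32_t>& roots)
{
    std::unordered_set<uint32_t> unvisited(all_uids.begin(), all_uids.end());
    std::vector<uint32_t> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        uint32_t uid = stack.back();
        stack.pop_back();
        if (unvisited.erase(uid) == 0) continue;  // already visited
        auto it = adjacency.find(uid);
        if (it != adjacency.end())
            for (uint32_t next : it->second) stack.push_back(next);
    }
    return unvisited;  // the unconnected nodes
}
```

Memory here scales with the number of nodes rather than the UID range, which is what I'm after; I just don't know how to reason about the bucket-count and collision trade-offs the standard container is making for me.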
What I have tried:
Nothing much more than Googling. I'm not a computer scientist, so this is a bit new to me, and I'm finding the masses of online information hard to digest - hence the appeal to the experts.