|
Hi Alan,
I am interested in this algorithm, but I cannot find a reference implementation in C# from the link. Any ideas?
regards,
George
|
|
|
|
|
Might as well use it.
Need a C# Consultant? I'm available.
Happiness in intelligent people is the rarest thing I know. -- Ernest Hemingway
|
|
|
|
|
Hi Ennis,
The standard MD5 result is 16 bytes, but an Int32 is only 4 bytes. Could you advise how to store an MD5 result in an Int32 while keeping the result just as random?
regards,
George
|
|
|
|
|
I was mistaken. It is 128 bits not 32.
|
|
|
|
|
Well, Ennis, that's OK. Any ideas on how to map it to an Int32?
regards,
George
|
|
|
|
|
The only real choice is to locate a hash function that hashes to 32 bits. Personally, I don't see the problem with the GetHashCode method in .NET, it seems like a good option.
|
|
|
|
|
Thanks Ennis,
My concern is that the function does not return the same value across different CLR versions, which makes it hard to keep data consistent. Any comments?
regards,
George
|
|
|
|
|
The method is provided exclusively for use with hash tables, not for any sort of persistence. While I can't speak to why the results differ, I'd guess MS chose a more efficient algorithm with fewer collisions.
|
|
|
|
|
Thanks Ennis,
My scenario is like this.
1. Using .NET 2.0, the code generates a hash code for the name "abc" with the GetHashCode function;
2. We then use the hash value as the unique identifier for "abc", e.g. checking whether that hash value exists is treated the same as checking whether "abc" exists;
3. After upgrading to .NET 3.5, we calculate a new hash value for "abc". Since the new value does not match the old hash value, the system will believe "abc" does not exist and that some other value exists.
Do you think this is an issue?
regards,
George
|
|
|
|
|
Like I said, you can't use GetHashCode for persistence. In fact, there is no guarantee that within the same .NET version it will generate the same hash.
Your scenario looks like a rewrite of some common database functionality, and the data would be best stored in a database.
There is one other option: load the names into a hash table at start-up, keyed by the names themselves. Then you can check for existence using a hash look-up without needing to persist the hash code.
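A minimal sketch of that last option, assuming the names can be read into memory once at start-up. `NameIndex` and `Load` are hypothetical names, and `HashSet<string>` (available from .NET 3.5) stands in for the hash table. The ordinal comparer is an assumption about how the names should be compared.

```csharp
using System;
using System.Collections.Generic;

public static class NameIndex
{
    // Build the in-memory set once at start-up.
    // StringComparer.Ordinal is an assumption: use whichever comparer
    // matches the way names are compared elsewhere in the system.
    public static HashSet<string> Load(IEnumerable<string> names)
    {
        return new HashSet<string>(names, StringComparer.Ordinal);
    }
}
```

After `var index = NameIndex.Load(names);`, a call such as `index.Contains("abc")` does the existence check. The set uses GetHashCode internally, but only within the running process, so nothing version-dependent is ever stored.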
|
|
|
|
|
Yes, Ennis!
My purpose is to replace a database VARCHAR with something small enough (an INT32) to save memory, since the number of strings is large. It is good to see you understand my needs without looking at my code. Magic.
Now I have created an INT32 column and a VARCHAR column, and I am using the INT32 as the hash code.
"load the names into a hash table at start-up using the names" -- since the number of strings is large, I think the load time would be long.
regards,
George
|
|
|
|
|
George_George wrote: 2. We can further using the hash value as the unique identifier for "abc", e.g. check whether such hash value exists is the same as checking "abc" existence;
No, that is not correct. Checking the hash code is not the same as checking the actual value.
A unique 32-bit hash code is only possible if all your strings are very short, i.e. no more than four characters, because an Int32 only holds four bytes. If character codes outside the regular ASCII character set are used, the strings can't be longer than two characters, as each character then needs 16 bits. For longer strings, it's not possible to guarantee a unique hash code.
George_George wrote: 3. Upgrade to .Net 3.5, and calculate new hash value for "abc", since the new value does not match the old hash value, the system will believe "abc" does not exist and some other value exists.
That's definitely an issue (although not with .NET 3.5, as it still runs on version 2.0 of the runtime). The hash code provided by the GetHashCode method is not intended for persistent storage.
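One way to keep a 32-bit value as an index while avoiding false matches is to treat it only as a bucket key and always compare the actual string afterwards. A rough sketch, where `HashedLookup`, `Add` and `Exists` are hypothetical names and the dictionary is just an in-memory stand-in for a table with an INT32 column and a VARCHAR column. The hash is passed in by the caller, on the assumption that it comes from some stable function rather than GetHashCode.

```csharp
using System;
using System.Collections.Generic;

public static class HashedLookup
{
    // Store a name under its hash. Several names may share one hash,
    // so each hash maps to a list of names rather than to a single name.
    public static void Add(Dictionary<int, List<string>> table, int hash, string name)
    {
        List<string> bucket;
        if (!table.TryGetValue(hash, out bucket))
        {
            bucket = new List<string>();
            table[hash] = bucket;
        }
        if (!bucket.Contains(name))
        {
            bucket.Add(name);
        }
    }

    // A matching hash only narrows the search. The string comparison
    // inside the bucket is what actually decides whether the name exists.
    public static bool Exists(Dictionary<int, List<string>> table, int hash, string name)
    {
        List<string> bucket;
        return table.TryGetValue(hash, out bucket) && bucket.Contains(name);
    }
}
```

With this shape, two different names that happen to share a hash are both stored, and a look-up for a third name with the same hash still correctly reports that it does not exist.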
Despite everything, the person most likely to be fooling you next is yourself.
|
|
|
|
|
Thanks Guffa,
Two more comments,
1. "i.e. not more than four characters. If character codes outside the regular ASCII character set is used, you can't have longer strings than two characters. If you have longer strings than that, it's not possible to guarantee a unique hash code." -- I am confused. Could you show me an example, please?
2. "That's definitely an issue (although not with .NET 3.5 as it still uses framework 2.0). The hash code provided by the GetHashCode method is not intended for persistent storage." -- great to learn this from you! Do you have any MSDN or other official documentation that supports this point? I tried (and failed) to find a link to send to interested friends.
regards,
George
|
|
|
|
|
George_George wrote: " i.e. not more than four characters. If character codes outside the regular ASCII character set is used, you can't have longer strings than two characters. If you have longer strings than that, it's not possible to guarantee a unique hash code." -- I am confused. Could you show me a sample please?
To make a simple example, let's say that we have a number of strings that only contain printable ASCII characters, i.e. each character can only have 96 different values. For a ten-character string, there are 96^10 = 66483263599150104576 possible combinations of characters.
As an Int32 can only have 2^32 = 4294967296 different values, there are far from enough values to give each possible string a unique hash code. On average, 15479341056 different ten-character strings get the exact same hash code.
(As strings are Unicode, in reality each character can have a lot more than 96 different values, which of course greatly increases the number of possible combinations.)
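If you want to check the arithmetic yourself, something along these lines should do it. This is only a sketch: `CollisionCount` and its methods are hypothetical names, and it assumes `System.Numerics.BigInteger` is available (.NET 4.0 or later), since the string count does not fit in a 64-bit integer.

```csharp
using System;
using System.Numerics;

public static class CollisionCount
{
    // Number of distinct strings of the given length over an alphabet
    // of alphabetSize characters: alphabetSize ^ length.
    public static BigInteger PossibleStrings(int alphabetSize, int length)
    {
        return BigInteger.Pow(alphabetSize, length);
    }

    // Average number of strings that must share each 32-bit hash value:
    // the string count divided by 2^32 (integer division).
    public static BigInteger AverageStringsPerHash(int alphabetSize, int length)
    {
        BigInteger hashValues = BigInteger.Pow(2, 32);
        return PossibleStrings(alphabetSize, length) / hashValues;
    }
}
```

Calling `CollisionCount.PossibleStrings(96, 10)` and `CollisionCount.AverageStringsPerHash(96, 10)` gives the two figures for the ten-character example.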
George_George wrote: "That's definitely an issue (although not with .NET 3.5 as it still uses framework 2.0). The hash code provided by the GetHashCode method is not intended for persistent storage." -- great to learn this from you! Do you have any MSDN or official document support the point? I want to find a link (but failed) to send to my friends interested.
MSDN Library: String.GetHashCode method
"The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode."
|
|
|
|
|
Great analysis, Guffa!
"(As strings are unicode, in reality each character can have a lot more than 96 different values, which of course greatly increases the number of possible combinations.)" -- do you mean unicode strings are more prone to have the same hash value (i.e. conflicting) for the same input?
regards,
George
|
|
|
|
|
George_George wrote: do you mean unicode strings are more prone to have the same hash value (i.e. conflicting) for the same input?
In .NET all strings are Unicode.
No, I mean that if you consider the full Unicode character set, there are a lot more possible combinations of characters in a ten-character string. That means that there are more combinations that share the same hash code, but that doesn't mean that any two strings have a higher chance of sharing the same hash code.
|
|
|
|
|
Thanks Guffa!
I agree. Cool!
regards,
George
|
|
|
|
|
If the distribution is good, you can just use any 32 bits from the 128 bits.
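A minimal sketch of what I mean, not a reference implementation. `StableHash` and `Md5ToInt32` are hypothetical names, and two choices here are assumptions: the string is encoded as UTF-8 before hashing, and the first four bytes of the digest are the ones kept. Any other fixed four bytes would work just as well, as long as the same choice is used everywhere.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StableHash
{
    // Hash the text with MD5 and keep 32 of the 128 bits.
    public static int Md5ToInt32(string text)
    {
        // Assumption: UTF-8. The encoding must never change once
        // values have been stored, or old and new hashes won't match.
        byte[] input = Encoding.UTF8.GetBytes(text);
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(input); // 16 bytes
            // Combine the first four bytes in a fixed (little-endian) order,
            // so the result does not depend on the machine's byte order.
            return digest[0]
                 | (digest[1] << 8)
                 | (digest[2] << 16)
                 | (digest[3] << 24);
        }
    }
}
```

Unlike GetHashCode, the result depends only on the MD5 algorithm and the chosen encoding, so it stays the same across runtime versions. It is still only 32 bits, though, so different strings can end up with the same value.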
|
|
|
|
|
Thanks Guffa,
Two more questions,
1. "distribution is good" -- the distribution of the strings being hashed, or the distribution of the MD5 result?
2. "distribution is good" -- how do you define "good" in general?
regards,
George
|
|
|
|
|
Is there something wrong with the GetHashCode method in the string class?
|
|
|
|
|
It's not enterprisey.
|
|
|
|
|
Ennis Ray Lynch, Jr. wrote: It's not enterprisey.
Oh. I thought that it perhaps was too NIH.
|
|
|
|
|
NIH is short for?
regards,
George
|
|
|
|
|
Not Invented Here
|
|
|
|
|
Ok, Guffa!
Any existing technology is fine. Is there any solution you could suggest for my original question?
regards,
George
|
|
|
|