Introduction
.NET Framework provides the 'GetHashcode()
' function returning a 32bit Integer. You can convert 'string
' to 32bit Integer but it doesn't guarantee a unique number. Here is the function that converts string
to a unique 64bit integer, avoids collision domain level string
store, compare and trace.
Background
If you need to store a large amount of URLs in your database, you must define 'character' index for manipulation. If you can generate a matching 'one-string
-to-one-numeric-value' that can use as a numeric 'Key' instead of a variant length of string
.
Using the Code
- Convert a variable length of
string
to fixed length of hashcode, and it must have fast hashing speed, so use .NET provided System.Security.Cryptography.SHA256CryptoServiceProvider
. - Convert 32 byte hashcode to 8 byte integer, avoiding making a collision.
static Int64 GetInt64HashCode(string strText)
{
Int64 hashCode = 0;
if (!string.IsNullOrEmpty(strText))
{
byte[] byteContents = Encoding.Unicode.GetBytes(strText);
System.Security.Cryptography.SHA256 hash =
new System.Security.Cryptography.SHA256CryptoServiceProvider();
byte[] hashText = hash.ComputeHash(byteContents);
Int64 hashCodeStart = BitConverter.ToInt64(hashText, 0);
Int64 hashCodeMedium = BitConverter.ToInt64(hashText, 8);
Int64 hashCodeEnd = BitConverter.ToInt64(hashText, 24);
hashCode = hashCodeStart ^ hashCodeMedium ^ hashCodeEnd;
}
return (hashCode);
}
Collision and Performance Test
Tested platform: Core2Duo, Windows 2003 Server SP2, .NET Framework 3.5 SP1
10,000,000 Times generate GetInt64HashCode
Collision: Not found
100,000 Times generate ElapsedTime: 830 milliseconds
10,000,000 Times generate .NET Framework provided object.GetHashCode
Collision: 4,150 found
100,000 Times generate ElapsedTime: 35 milliseconds
Points of Interest
I know that Cryptography.SHA256
does not provide perfect collision avoided hash value and compressed 32Byte to 8Byte can increase collision probability, but I think the above function shows enough performance and avoids collision.
Your reply will make the function more efficient and reliable.
This function now uses and collects large amount of URLs. There is no collision for 40,000,000 unique URLs.
History
- 20th March, 2009: Initial post