Introduction
Hashing transforms a value into a usually shorter, fixed-length value or key that represents the original value. A few days ago, we had to use hash comparison to sync data between two systems via an API (obviously, polling an API is not the most efficient way to sync data, but we had no option to change anything at the source end).
Background
What we were doing:
- Creating a hash string at our end after deserializing the JSON into an object
- Comparing that hash string with an existing DB row found by a unique identifier (primary key)
- If no row was found by the unique identifier (primary key), adding a new row to the DB
- If the hash string wasn't the same, updating the existing row with the new values
- And a few other sync log processes
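The decision steps above can be sketched roughly as follows. This is only an illustration, not the original sync code: the SyncAction enum, the SyncSketch class, and the in-memory dictionary standing in for the DB table are all hypothetical names introduced here.

```csharp
using System.Collections.Generic;

// Hypothetical names for illustration; not part of the original sample.
public enum SyncAction { Insert, Update, Skip }

public static class SyncSketch
{
    // Decides what to do with an incoming record.
    // storedHashes stands in for the DB table: primary key -> stored hash string.
    public static SyncAction Decide(
        IDictionary<long, string> storedHashes, long id, string incomingHash)
    {
        string existingHash;
        if (!storedHashes.TryGetValue(id, out existingHash))
        {
            storedHashes[id] = incomingHash;   // no row for this key: add a new row
            return SyncAction.Insert;
        }

        if (existingHash != incomingHash)
        {
            storedHashes[id] = incomingHash;   // hash differs: update the existing row
            return SyncAction.Update;
        }

        return SyncAction.Skip;                // same hash: nothing to do
    }
}
```

The first call for a given key returns Insert, a repeated call with the same hash returns Skip, and a call with a different hash returns Update.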
Everything was working as expected until we refactored the existing code (renamed a few models and properties). The hash string was being generated from the entire serialized object, structure and values together, rather than from specific property values, so a structural change alone was enough to change the hash. The way we were creating the hash string was actually wrong. Let's check a few hash string examples.
Hash Helper Class
This is the utility class that manages hash-related operations.
using System.IO;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;
using System.Text;

public class HashHelper
{
    // Serializes any object to bytes; a null value is serialized as the string "null".
    // Note: BinaryFormatter is marked obsolete in .NET 5 and later.
    public byte[] Byte(object value)
    {
        using (var ms = new MemoryStream())
        {
            BinaryFormatter bf = new BinaryFormatter();
            bf.Serialize(ms, value ?? "null");
            return ms.ToArray();
        }
    }

    // Computes the MD5 hash of the given bytes.
    public byte[] Hash(byte[] value)
    {
        using (MD5 md5 = MD5.Create())
        {
            return md5.ComputeHash(value);
        }
    }

    // Concatenates all byte arrays, in order, into a single array.
    public byte[] Combine(params byte[][] values)
    {
        byte[] rv = new byte[values.Sum(a => a.Length)];
        int offset = 0;
        foreach (byte[] array in values)
        {
            System.Buffer.BlockCopy(array, 0, rv, offset, array.Length);
            offset += array.Length;
        }
        return rv;
    }

    // Converts hash bytes to a lowercase hexadecimal string.
    public string String(byte[] hash)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hash.Length; i++)
        {
            sb.Append(hash[i].ToString("x2"));
        }
        return sb.ToString();
    }

    // Serializes each value, concatenates the bytes, and hashes the result.
    public byte[] Hash(params object[] values)
    {
        byte[][] bytes = new byte[values.Length][];
        for (int i = 0; i < values.Length; i++)
        {
            bytes[i] = Byte(values[i]);
        }
        byte[] combined = Combine(bytes);
        return Hash(combined);
    }

    // Hashes a string using the given encoding (ASCII when none is supplied).
    public string HashString(string value, Encoding encoding = null)
    {
        if (encoding == null)
        {
            encoding = Encoding.ASCII;
        }
        byte[] bytes = encoding.GetBytes(value);
        byte[] hash = Hash(bytes);
        return String(hash);
    }

    // Hashes one or more objects and returns the hash as a hex string.
    public string HashString(params object[] values)
    {
        var hash = Hash(values);
        return String(hash);
    }
}
Consideration
- Using MD5 hash: Hash(byte[] value)
- Any null value is treated as the string 'null': Byte(object value)
Object to Hash String Process
- Create bytes of that object: Byte(object value)
- Create hash bytes from the object bytes: Hash(byte[] value)
- Create a string from the hash bytes: String(byte[] hash)
A Combined Hash of Multiple Objects
- Create bytes of each object: Byte(object value)
- Combine (concatenate) the bytes: Combine(params byte[][] values)
- Create hash bytes from the combined bytes: Hash(byte[] value)
- Create a string from the hash bytes: String(byte[] hash)
Alternatively:
- Create combined hash bytes: Hash(params object[] values)
- Create a string from the hash bytes: String(byte[] hash)
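As a rough, self-contained sketch of the combine-then-hash steps, the snippet below works on raw byte arrays directly and leaves out the BinaryFormatter serialization step. CombinedHashSketch, Concat, and HashHex are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; it mirrors Combine, Hash and String.
public static class CombinedHashSketch
{
    // Concatenates the arrays in the order given.
    public static byte[] Concat(params byte[][] parts)
    {
        byte[] result = new byte[parts.Sum(p => p.Length)];
        int offset = 0;
        foreach (byte[] part in parts)
        {
            Buffer.BlockCopy(part, 0, result, offset, part.Length);
            offset += part.Length;
        }
        return result;
    }

    // MD5 over the concatenated bytes, rendered as lowercase hex.
    public static string HashHex(params byte[][] parts)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Concat(parts));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2"));
            }
            return sb.ToString();
        }
    }
}
```

Because the bytes are joined in sequence, the order of the arguments matters: HashHex(a, b) and HashHex(b, a) generally produce different strings. An MD5 hash is 16 bytes, so the result is always 32 hex characters.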
Methods We Are Going to Use More Frequently
- Create a hash string of any string: HashString(string value, Encoding encoding = null)
- Create a hash string of a single object, or a combined hash string of a group of objects: HashString(params object[] values)
Hash of Entire Object
The data class or model:
[Serializable]
class PeopleModel
{
    public long? Id { get; set; }
    public string Name { get; set; }
    public bool? IsActive { get; set; }
    public DateTime? CreatedDateTime { get; set; }
}
Creating a hash of the model:
var peopleModelHashString = hashHelper.HashString(new PeopleModel()
{
    Id = 1,
    Name = "Anders Hejlsberg",
    IsActive = true,
    CreatedDateTime = new DateTime(1960, 12, 2)
});
Important to Remember
This hash depends on both the object's structure and its assigned values. Even if we assign the same values to the properties, the generated hash will not be the same after a change to the model such as:
- Class/model name change
- Property name change
- Namespace name change
- Change in the number of properties (adding or removing any property)
And in a development environment, refactoring can take place at any time.
Hash of Data Values
Let's make a hash using only values. First, create an interface IHash.
public interface IHash
{
    string HashString();
}
Then implement IHash in a model and use the hash helper inside the HashString() method.
class People : IHash
{
    public long? Id { get; set; }
    public string Name { get; set; }
    public bool? IsActive { get; set; }
    public DateTime? CreatedDateTime { get; set; }

    // Only the listed property values take part in the hash.
    public string HashString()
    {
        var value = new HashHelper().HashString(Name, IsActive, CreatedDateTime);
        return value;
    }
}
This way, the model structure does not take part in the hash generation process; only specific property values (Name, IsActive, CreatedDateTime) are considered.
The hash will remain the same as long as none of those properties is assigned a different value. A structural change to the model (renaming, adding or removing other properties, etc.) will not affect the hash string.
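For readers who cannot use BinaryFormatter (it is marked obsolete in .NET 5 and later), one possible alternative is to turn each value into a culture-invariant string and hash the UTF-8 bytes. This is a different technique from the helper above, shown only as a sketch: ValueHashSketch and Of are hypothetical names, and the text format (length prefix, round-trip dates) is an assumption of this example. The resulting hash strings will not match those produced by HashHelper.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Hypothetical alternative for illustration; not the article's HashHelper.
public static class ValueHashSketch
{
    public static string Of(params object[] values)
    {
        var sb = new StringBuilder();
        foreach (object value in values)
        {
            string text;
            if (value == null)
            {
                text = "null";   // same convention as the helper: null becomes "null"
            }
            else if (value is DateTime date)
            {
                // Round-trip format keeps full precision and ignores the current culture.
                text = date.ToString("o", CultureInfo.InvariantCulture);
            }
            else
            {
                text = Convert.ToString(value, CultureInfo.InvariantCulture);
            }

            // The length prefix keeps ("ab", "c") distinct from ("a", "bc").
            sb.Append(text.Length).Append(':').Append(text).Append(';');
        }

        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            var hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return hex.ToString();
        }
    }
}
```

A model could then call ValueHashSketch.Of(Name, IsActive, CreatedDateTime) inside its HashString() method. As with the original approach, only the listed values influence the result, so renaming the class or its properties leaves the hash unchanged.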
Hash Result
people = new People()
{
    Id = 1,
    Name = "Dennis Ritchie",
    IsActive = false,
    CreatedDateTime = new DateTime(1941, 9, 9)
};
hashString = people.HashString();
Other Tests
It works fine with null property values:
string hashString;
var people = new People();
hashString = people.HashString();
We will not be able to hash an entire People object, because the class is not marked [Serializable]:
var hashHelper = new HashHelper();
hashString = hashHelper.HashString(people); // throws SerializationException
BONUS: String Hash
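It is quite common to create a hash of a string, so here we have it. Note that a plain MD5 hash like this is fine for comparing or fingerprinting strings, but it is not suitable for storing passwords.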
string name = "Dipon Roy";
string value = new HashHelper().HashString(name);
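HashString(string value, Encoding encoding = null) falls back to Encoding.ASCII, and ASCII encoding replaces every non-ASCII character with '?'. The sketch below illustrates the effect with a standalone helper; StringHashSketch and Md5Hex are hypothetical names introduced for this example.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; it mirrors HashString(string, Encoding).
public static class StringHashSketch
{
    // MD5 of the encoded string, rendered as lowercase hex.
    public static string Md5Hex(string value, Encoding encoding)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(encoding.GetBytes(value));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2"));
            }
            return sb.ToString();
        }
    }
}
```

With Encoding.ASCII, Md5Hex("é", Encoding.ASCII) and Md5Hex("?", Encoding.ASCII) return the same string, because both inputs encode to the same single byte. With Encoding.UTF8 the two hashes differ, so passing Encoding.UTF8 explicitly is probably the safer choice when the text may contain non-ASCII characters.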
Conclusion
- If we need to compare values only, or only specific values, then Hash of Data Values is the best option.
- But if we need to compare both object structure and values together, go for Hash of Entire Object.
Limitations
I may not have considered all worst-case scenarios, and the code may throw unexpected errors for untested inputs. If you find any, just let me know.
The sample code is attached as a Visual Studio 2017 console application.
History
- 26th June, 2019: Initial version