In this post, you will learn about an easy to digest hashing algorithm which has advanced usages with parsing, compression, encryption inter alia.
Preface
The year was 2018, the month was close to that of the month known as December. The wind was cold and bitter; The problem was at hand. A particular beast it was; as pure as the driven snow; whole; robust.
Having dealt with such problems on or about that occasion, I was readily equipped to deal with it yet again. This time, I had help!
Introduction
Consider the following string
s:
"ab", "a b", " ab", " ab ", " a b ", ....
Are they different? How so? I would argue they are more than the same - they not so different especially given observation of the frequency domain...
Character Frequence Analysis
We should easily be able to see that the ordinal of substance varies in other type the magnitude is not what changes.
Let us see how we can utilize such a duality astutely to obtain a stable hash code with respect to such white space or other such characters which can be ignored.
Background
I have a bunch of [strings] which diffen only by such white space characters " ", " ", "\t", "\r", "\n", "\f", "\v", potentially also [sic] their ordinals, etc; I would otherwise like to think of such data as the same integral notion.
To do this, we would typically generate an integral value which allows such data to be uniquely identified; this is known as a Hash function. Such value given by such function is known as the hash code. Such a hash code is otherwise referred to as the identity or hash of the given value.
For example, consider the following code:
GetHashCode("a b").Dump(); //Some identity I
GetHashCode("ab").Dump(); //Some identity I
GetHashCode(" a b").Dump(); //Some identity I
GetHashCode("a b ").Dump(); //Some identity I
GetHashCode("").Dump(); //The seed value
GetHashCode("A b ").Dump(); //Some identity i, character case is important
GetHashCode("A B ").Dump(); //Some identity ii, character case is important
GetHashCode("A b ", true).Dump(); //Some identity H, character case not important
GetHashCode("A B ", true).Dump(); //Some identity H, character case not important
For a given example, I have the set of chars {'t','e','s','t'}
, I would like to generate a view of the contents of this set such that the resulting integral is a value which can be found no matter if I go forwards / backwards or side to side whilst calculating said hash code.
This algorithm is agnostic of whitespace allowing for sets like {'t','','e','','s','','t',''}
OR even sets which are similar like {'T','','e','','s','','','T',''}
.
Using the Code
Here, you will find implementations of the algorithm in C(++)/(#) ECMAScript, Scala, Python, Go, Kotlin & Swift.
function FMHash(obj, CaseInsensitive)
{
let hash1 = 42; let hash2 = 42; for(let i = 0; i < obj.length; ++i)
{
const ch = CaseInsensitive ? obj.charAt(i).toLowerCase() : obj.charAt(i);
if (/\s/.test(ch))
{
++hash2;
continue;
}
hash1 += ch.charCodeAt(0) * (1 + i - hash2);
}
return hash1 + hash2 - obj.length;
};
int FMHash(char * obj, int CaseInsensitive)
{
int hash1 = 42, hash2 = hash1, i = 0;
for(; *obj != '\0'; ++obj, ++i)
{
char ch = CaseInsensitive ? std::tolower(*obj) : *obj;
if (std::isspace(ch))
{
++hash2;
}else{
hash1 += ch * (1 + i - hash2);
}
}
return hash1 + hash2 - i;
}
def FMHash(obj, case):
alen = len(obj)
hash1 = 42
hash2 = hash1
if case : obj = obj.lower()
for i in range(0, alen):
if obj[i].isspace() : hash2 = hash2 + 1
else : hash1 += ord(obj[i]) * (1 + i - hash2)
return hash1 + hash2 - alen
func FMHash(obj string, caseInsensitive bool) int {
hash1 := 42
hash2 := hash1
for i, c := range obj {
if unicode.IsSpace(c) {
hash2 = hash2 + 1
} else {
if caseInsensitive {
c = unicode.ToLower(c)
}
hash1 += int(c) * (1 + i - hash2)
}
}
return hash1 + hash2 - len(obj)
}
def fm_hash(obj: String, caseInsensitive: Boolean): Int = {
var hash1 = 42
var hash2 = hash1
var newS = if (caseInsensitive) obj.toLowerCase else obj
for((c,i) <- newS.zipWithIndex) {
if (c.isWhitespace) {
hash2 = hash2 + 1 }else {
hash1 += c * (1 + i - hash2) } }
hash1 + hash2 - obj.length - 1;
}
fun fm_hash(obj: String, caze: Boolean): Int {
var hash1 = 42
var hash2 = hash1;
var newS = if(caze) obj.toLowerCase() else obj;
for (i in newS.indices) {
var c = obj[i]
if(c.isWhitespace()){
hash2 = hash2 + 1
}else{
hash1 += c.toInt() * (1 + i - hash2)
}
}
return hash1 + hash2 - obj.length
}
extension Character {
var asciiValue: UInt32? {
return String(self).unicodeScalars.filter{$0.isASCII}.first?.value
}
}
extension String {
var asciiArray: [UInt32] {
return unicodeScalars.filter{$0.isASCII}.map{$0.value}
}
func fm_hash() -> Int {
var h = 42 as Int!
var h2 = h
for i in 0..<asciiArray.count {
let c = Int(asciiArray[i]);
if(c > 0x20){
h = h! + c * (1 + i - h2!)
}else{
h2 = h2! + 1;
}
}
return h!
}
}
public static int FMHash(string obj, bool CaseInsensitive = false)
{
unchecked
{
int hash1 = 42; int wsc = 42; for (int i = 0; i < obj.Length; ++i)
{
char ch = CaseInsensitive ? char.ToLowerInvariant(obj[i]) : obj[i];
if (char.IsWhiteSpace(ch))
{
++wsc;
continue;
}
int c = ch.GetHashCode() * (1 + i - wsc);
hash1 += c;
}
return hash1 + wsc - obj.Length; }
}
public class WhitespaceAgnosticStringComparer : System.StringComparer
{
public bool CaseInsensitive;
public override int Compare(string x, string y) => GetHashCode(x) - GetHashCode(y);
public override bool Equals(string x, string y) => GetHashCode(x) == GetHashCode(y);
public int GetHashCode(string obj, bool CaseInsensitive = false)=>
FMHash(x, CaseInsensitive);
public override int GetHashCode(string obj) => GetHashCode(obj, CaseInsensitive);
}
Points of Interest
When you think something is too hard or you need to split or substring or do some crazy manipulation, think twice, this is what we advise.
This code can easily handle patterns of case where one may have used snake_Case in one scenario and Pascalcase in another. You can also handle White Space variations of the same WhiteSpace textual convention.
Be careful with surrogate characters and the essence of the communitive property combined with the overflow property which in certain [niche] cases would allow for hash code collisions.
You can use this algorithm to provide encryptions or compressions:
["This","Is","Text"], ["This","Is","Text","Also"],["So","Is","This","Text"], etc.
[0,1,2],[0,1,2,3], [4,1,0,2]
You can use this algorithm to parse padded or incomplete textual conventions:
"0x 00 00 00 01","0x_00_00_00_01", " 1", etc.
Try It Online
Academia / Acknowledgements
- FMHash - Another hash function using the same name with a different problem domain
- String Hashing - Other attempts at string hashing
- Minimal Perfect Hash Function - A regular hash function turns a key (a string or a number) into an integer.
- Code Golf - A code golfing experiment
- nVidia - How to use strings in Kernel / Device Functions
- See also - A Simple Hash Table on the GPU
History
- 31st January, 2021: Initial post