Hash Code Generatrix et the Frequency Domain

jfriedman

2.17/5 (3 votes)

1 Feb 2021CPOL3 min read

When you need a janky [unsafe] way to generate communitive hash codes

In this post, you will learn about an easy to digest hashing algorithm which has advanced usages with parsing, compression, encryption inter alia.

Preface

The year was 2018, the month was close to that of the month known as December. The wind was cold and bitter; The problem was at hand. A particular beast it was; as pure as the driven snow; whole; robust.

Having dealt with such problems on or about that occasion, I was readily equipped to deal with it yet again. This time, I had help!

Introduction

Consider the following strings:

"ab", "a b", " ab", " ab ", " a b ", ....

Are they different? How so? I would argue they are more than the same - they not so different especially given observation of the frequency domain...

Character Frequence Analysis

Character	Weight
a	1
b	1

We should easily be able to see that the ordinal of substance varies in other type the magnitude is not what changes.

Let us see how we can utilize such a duality astutely to obtain a stable hash code with respect to such white space or other such characters which can be ignored.

Background

I have a bunch of [strings] which diffen only by such white space characters " ", " ", "\t", "\r", "\n", "\f", "\v", potentially also [sic] their ordinals, etc; I would otherwise like to think of such data as the same integral notion.

To do this, we would typically generate an integral value which allows such data to be uniquely identified; this is known as a Hash function. Such value given by such function is known as the hash code. Such a hash code is otherwise referred to as the identity or hash of the given value.

For example, consider the following code:

GetHashCode("a b").Dump();        //Some identity I
GetHashCode("ab").Dump();         //Some identity I
GetHashCode(" a b").Dump();       //Some identity I 
GetHashCode("a b ").Dump();       //Some identity I
GetHashCode("").Dump();           //The seed value
GetHashCode("A b ").Dump();       //Some identity i, character case is important
GetHashCode("A B ").Dump();       //Some identity ii, character case is important
GetHashCode("A b ", true).Dump(); //Some identity H, character case not important
GetHashCode("A B ", true).Dump(); //Some identity H, character case not important

For a given example, I have the set of chars {'t','e','s','t'}, I would like to generate a view of the contents of this set such that the resulting integral is a value which can be found no matter if I go forwards / backwards or side to side whilst calculating said hash code.

This algorithm is agnostic of whitespace allowing for sets like {'t','','e','','s','','t',''} OR even sets which are similar like {'T','','e','','s','','','T',''}.

Using the Code

Here, you will find implementations of the algorithm in C(++)/(#) ECMAScript, Scala, Python, Go, Kotlin & Swift.

C++

/*ECMAScript*/
function FMHash(obj, CaseInsensitive)
        {            
            let hash1 = 42;//prime
            let hash2 = 42;//prime            
            for(let i = 0; i < obj.length; ++i)
            {
                const ch = CaseInsensitive ?  obj.charAt(i).toLowerCase() : obj.charAt(i);
                if (/\s/.test(ch))
                {
                    ++hash2;
                    continue;
                }                
                hash1 += ch.charCodeAt(0) * (1 + i - hash2);
            }
            return hash1 + hash2 - obj.length;
        };
//C(++)
int FMHash(char * obj, int CaseInsensitive)
{
    //prime
    int hash1 = 42, hash2 = hash1, i = 0;
    for(; *obj != '\0'; ++obj, ++i)
    {
     char ch = CaseInsensitive ? std::tolower(*obj) : *obj;
     if (std::isspace(ch))
     {
     ++hash2;
     }else{
     hash1 += ch * (1 + i - hash2);
    }
   }
   return hash1 + hash2 - i;
}
//Python
def FMHash(obj, case):    
    alen = len(obj)
    hash1 = 42
    hash2 = hash1
    if case : obj = obj.lower()
    for i in range(0, alen):
        if obj[i].isspace() :  hash2 = hash2 + 1
        else : hash1 += ord(obj[i]) * (1 + i - hash2)
    return hash1 + hash2 - alen
//Go
func FMHash(obj string, caseInsensitive bool) int {
    hash1 := 42
    hash2 := hash1
    for i, c := range obj {
        if unicode.IsSpace(c) { 
            hash2 = hash2 + 1
        } else { 
            if caseInsensitive { 
                c = unicode.ToLower(c)
            }                 
            hash1 += int(c) * (1 + i - hash2) 
        }
    }
    return hash1 + hash2 - len(obj)
}
//Scala
    def fm_hash(obj: String, caseInsensitive: Boolean): Int = { 
    var hash1 = 42
    var hash2 = hash1 
    var newS = if (caseInsensitive) obj.toLowerCase else obj
    for((c,i) <- newS.zipWithIndex) { 
    if (c.isWhitespace) { 
    hash2 = hash2 + 1 }else { 
    hash1 += c * (1 + i - hash2) } } 
    hash1 + hash2 - obj.length - 1; 
}
//Kotlin
fun fm_hash(obj: String, caze: Boolean): Int {
    var hash1 = 42
    var hash2 = hash1;
    var newS = if(caze) obj.toLowerCase() else obj;
    for (i in newS.indices) {
        var c = obj[i]
        if(c.isWhitespace()){
            hash2 = hash2 + 1
        }else{
            hash1 += c.toInt() * (1 + i - hash2)
        }
    }
    return hash1 + hash2 - obj.length
}
//Swift
 extension Character {
    var asciiValue: UInt32? {
        return String(self).unicodeScalars.filter{$0.isASCII}.first?.value
    }
}
extension String {
    var asciiArray: [UInt32] {
        return unicodeScalars.filter{$0.isASCII}.map{$0.value}
    }
    func fm_hash() -> Int {
        var h = 42 as Int!
        var h2 = h
        for i in 0..<asciiArray.count {
            let c = Int(asciiArray[i]);
            if(c > 0x20){
             h = h! + c * (1 + i - h2!) 
            }else{
             h2 = h2! + 1;       
            }
        }
        return h!
    }
}
//C#
 public static int FMHash(string obj, bool CaseInsensitive = false)
        {
            unchecked
            {
                int hash1 = 42;//could be 7 or 5 or 3 or 1
                int wsc = 42;//prime and count of whitespace
                for (int i = 0; i < obj.Length; ++i)
                {
                    char ch = CaseInsensitive ? char.ToLowerInvariant(obj[i]) : obj[i];
                    if (char.IsWhiteSpace(ch))
                    {
                        ++wsc;
                        continue;
                    }
                    int c = ch.GetHashCode() * (1 + i - wsc);
                    hash1 += c;
                }
                return hash1 + wsc - obj.Length;//normalize for the length of the string
            }
        }
//Use me in a Dictionary<string, T> etc
 public class WhitespaceAgnosticStringComparer : System.StringComparer
    {
        public bool CaseInsensitive;

        public override int Compare(string x, string y) => GetHashCode(x) - GetHashCode(y);

        public override bool Equals(string x, string y) => GetHashCode(x) == GetHashCode(y);

        public int GetHashCode(string obj, bool CaseInsensitive = false)=> 
                               FMHash(x, CaseInsensitive);

        public override int GetHashCode(string obj) => GetHashCode(obj, CaseInsensitive);
    }

Points of Interest

When you think something is too hard or you need to split or substring or do some crazy manipulation, think twice, this is what we advise.

This code can easily handle patterns of case where one may have used snake_Case in one scenario and Pascalcase in another. You can also handle White Space variations of the same WhiteSpace textual convention.

Be careful with surrogate characters and the essence of the communitive property combined with the overflow property which in certain [niche] cases would allow for hash code collisions.

You can use this algorithm to provide encryptions or compressions:

["This","Is","Text"], ["This","Is","Text","Also"],["So","Is","This","Text"], etc.

[0,1,2],[0,1,2,3], [4,1,0,2]

You can use this algorithm to parse padded or incomplete textual conventions:

"0x 00 00 00 01","0x_00_00_00_01", "                   1", etc.

Try It Online

Academia / Acknowledgements

FMHash - Another hash function using the same name with a different problem domain
String Hashing - Other attempts at string hashing
Minimal Perfect Hash Function - A regular hash function turns a key (a string or a number) into an integer.
Code Golf - A code golfing experiment
nVidia - How to use strings in Kernel / Device Functions
See also - A Simple Hash Table on the GPU

History

31^st January, 2021: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)