Arabic Soundex

Tammam Koujan

3 Dec 2012, licensed under CPOL

An article about an Arabic version of the Soundex algorithm.

Introduction

Soundex is a phonetic algorithm for indexing names by sound. Many applications use algorithms like it to provide features such as Google's spelling corrections and Microsoft Word's AutoCorrect.

Background

Most phonetic algorithms were developed for English; consequently, applying their rules to words in other languages might not give meaningful results. Only a few programs (such as Google's spelling corrections) support this feature for Arabic. You can find many examples of English Soundex here on CodeProject or elsewhere on the Internet, but resources for using Soundex with Arabic are rare. After a long search and study, I found some academic research, and I think this is the first article illustrating an Arabic Soundex. The Soundex algorithm works by grouping similar-sounding letters according to their phonetic features and assigning each group a digit.

To encode a word, the algorithm keeps the first letter and replaces each following consonant with the digit of its sound group, as listed in the steps below. Vowels and the characters h, w, and y are ignored because they cause confusion or ambiguity when they accompany other characters.

The algorithm then collapses adjacent identical digits into a single digit. This is the simplest description of Soundex; other versions improve the encoding (for example, some replace the character “x” with “ecs” before encoding). So, what we will do is:

Hold the first letter.
Replace the characters (a, e, i, o, u, h, w, y) with the value 0.
Replace the characters (b, f, p, v) with the value 1.
Replace the characters (c, g, j, k, q, s, x, z) with the value 2.
Replace the characters (d, t) with the value 3.
Replace the character (l) with the value 4.
Replace the characters (m, n) with the value 5.
Replace the character (r) with the value 6.
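
The steps above can be sketched in C# as follows. This is a minimal illustration of the simplified description given here, not the full standard algorithm; the class name SimpleSoundex, the method name Encode, and the default code length of 4 are assumptions made for the example. The letters m and n map to 5, as in standard Soundex.

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating the simplified English Soundex steps.
public static class SimpleSoundex
{
    // Digit for each letter a..z; 0 marks the ignored letters
    // (a, e, i, o, u, h, w, y).
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word))
            return "";

        string w = word.ToLowerInvariant();
        var buffer = new StringBuilder();

        // Hold the first letter as it is (upper-cased for consistency)
        buffer.Append(char.ToUpperInvariant(w[0]));

        char prev = '0';
        for (int i = 1; i < w.Length && buffer.Length < length; i++)
        {
            char c = w[i];
            if (c < 'a' || c > 'z')
                continue;                    // skip anything that is not a..z

            char code = Codes[c - 'a'];
            // Drop zeros and collapse adjacent identical digits
            if (code != '0' && code != prev)
                buffer.Append(code);
            prev = code;
        }

        // Pad with zeros up to the requested length
        return buffer.ToString().PadRight(length, '0');
    }
}
```

With this sketch, Encode("Robert") and Encode("Rupert") both give R163, and Encode("Smith") and Encode("Smyth") both give S530.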

After that, we save the code for each word, and when the user enters a word to search for, we look up the words that have the same sound code.
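
One way to store and look up the codes is a dictionary keyed by sound code. The sketch below assumes a hypothetical SoundIndex class that accepts any encoder function; neither the class nor its method names come from the original code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index that groups words by their sound code.
public class SoundIndex
{
    private readonly Func<string, string> encode;
    private readonly Dictionary<string, List<string>> byCode =
        new Dictionary<string, List<string>>();

    // The encoder is supplied by the caller (for example, a Soundex function)
    public SoundIndex(Func<string, string> encode)
    {
        this.encode = encode;
    }

    // Save the word under its sound code
    public void Add(string word)
    {
        string code = encode(word);
        if (!byCode.TryGetValue(code, out List<string> words))
        {
            words = new List<string>();
            byCode[code] = words;
        }
        words.Add(word);
    }

    // Return every stored word whose code matches the query's code
    public IReadOnlyList<string> Find(string query)
    {
        if (byCode.TryGetValue(encode(query), out List<string> words))
            return words;
        return new List<string>();
    }
}
```

Keying the dictionary by code makes each search a single lookup instead of a scan over all stored words.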

For Arabic, I searched extensively for an Arabic version of Soundex until I found a research paper by a team of five professors at the Illinois Institute of Technology. That paper was the basis for this Arabic Soundex.

What they do is:

Hold the first letter.
Replace the characters (ا, أ, إ, آ, ح, ع, غ, ش, و, ي) with the value 0.
Replace the characters (ف, ب) with the value 1.
Replace the characters (خ, ج, ز, س, ص, ظ, ق, ك) with the value 2.
Replace the characters (ت, ث, د, ذ, ض, ط) with the value 3.
Replace the character (ل) with the value 4.
Replace the characters (م, ن) with the value 5.
Replace the character (ر) with the value 6.

But this strategy still does not give results as good as the English version, so I made some improvements to the research and got better results. My changes were:

Remove the characters (ا, أ, آ, إ) from the beginning of the word, if present, because I noticed that they add confusion.
Ignore the first-character handling: in English it is important to keep the first character, but I noticed that in Arabic many words have the same sound but different first characters, so I dropped the first-letter handling.
Update the character sound categories by adding or moving some characters (in the code below, خ moves from group 2 to group 0, and ه is added to group 0).

Using the code

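// Requires: using System.Text; (for StringBuilder)
public static string ArComputeintial(string word, int length)
{
    // Value to return
    string value = "";

    // Guard against null or empty input; word[0] below would otherwise throw
    if (string.IsNullOrEmpty(word))
        return value;

    // Drop a leading alef variant (ا, أ, إ, آ) before encoding
    switch (word[0])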
    {
        case 'ا':
        case 'أ':
        case 'إ':
        case 'آ':
            {
                word = word.Substring(1, word.Length - 1);
            }
            break;

    }

    // Size of the word to process
    int size = word.Length;
    // Make sure the word is at least two characters in length
    if (size > 1)
    {

        // Convert the word to character array for faster processing
        char[] chars = word.ToCharArray();
        // Buffer to build up with character codes
        StringBuilder buffer = new StringBuilder();
        // The current and previous character codes
        int prevCode = 0;
        int currCode = 0;
        // Ignore first character and replace it with fixed value

        buffer.Append('x');
       
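        // Loop through the remaining characters and convert them to the proper character code.
        // Characters not listed in the switch leave currCode unchanged, so they are
        // skipped as if they repeated the previous code.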
        for (int i = 1; i < size; i++)
        {
            switch (chars[i])
            {
                case 'ا':
                case 'أ':
                case 'إ':
                case 'آ':
                case 'ح':
                case 'خ':
                case 'ه':
                case 'ع':
                case 'غ':
                case 'ش':
                case 'و':
                case 'ي':
                    currCode = 0;
                    break;
                case 'ف':
                case 'ب':
                    currCode = 1;
                    break;
                
                case 'ج':
                case 'ز':
                case 'س':
                case 'ص':
                case 'ظ':
                case 'ق':
                case 'ك':
                    currCode = 2;
                    break;
                case 'ت':
                case 'ث':
                case 'د':
                case 'ذ':
                case 'ض':
                case 'ط':
                    currCode = 3;
                    break;
                case 'ل':
                    currCode = 4;
                    break;
                case 'م':
                case 'ن':
                    currCode = 5;
                    break;
                case 'ر':
                    currCode = 6;
                    break;
            }

            // Check to see if the current code is the same as the last one
            if (currCode != prevCode)
            {
                // Check to see if the current code is 0 (a vowel or another ignored character); do not emit it
                if (currCode != 0)
                    buffer.Append(currCode);
            }
            // Set the new previous character code
            prevCode = currCode;
            // If the buffer size meets the length limit, then exit the loop
            if (buffer.Length == length)
                break;
        }
        // Pad the buffer, if required
        size = buffer.Length;
        if (size < length)
            buffer.Append('0', (length - size));
        // Set the value to return
        value = buffer.ToString();
    }
    // Return the value
    return value;
}
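
To show how the function above behaves, here is a compact, table-driven restatement of the same sound groups with a small usage example. It is a sketch, not a replacement: the names ArabicSoundexSketch and Encode are hypothetical, and edge cases (for example, a length below 2) may behave differently from ArComputeintial.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical table-driven restatement of the article's Arabic groups.
public static class ArabicSoundexSketch
{
    private static readonly Dictionary<char, int> Codes = BuildCodes();

    private static Dictionary<char, int> BuildCodes()
    {
        // The position in this array is the sound code (0..6)
        string[] groups =
        {
            "اأإآحخهعغشوي", "فب", "جزسصظقك", "تثدذضط", "ل", "من", "ر"
        };
        var map = new Dictionary<char, int>();
        for (int code = 0; code < groups.Length; code++)
            foreach (char c in groups[code])
                map[c] = code;
        return map;
    }

    public static string Encode(string word, int length)
    {
        if (string.IsNullOrEmpty(word))
            return "";

        // Drop a leading alef variant
        if ("اأإآ".IndexOf(word[0]) >= 0)
            word = word.Substring(1);
        if (word.Length < 2)
            return "";

        // The first remaining character is ignored and replaced by 'x'
        var buffer = new StringBuilder("x");
        int prev = 0;
        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            // Unlisted characters keep the previous code, so they are skipped
            if (!Codes.TryGetValue(word[i], out int curr))
                curr = prev;
            if (curr != prev && curr != 0)
                buffer.Append(curr);
            prev = curr;
        }
        return buffer.ToString().PadRight(length, '0');
    }
}
```

With a length of 4, the spellings أحمد and احمد both encode to x530, because the leading alef variant is dropped and the next character is replaced by the fixed x. The words صلاح and سلاح both encode to x400, which shows the cost of ignoring the first character: distinct words can share a code.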

Points of Interest

This is a simple version of the algorithm. Given the complex nature of the Arabic language, additional research could improve it.

I believe that my algorithm needs more testing to ensure that it works correctly. This is my first article on The Code Project. I hope it gives you some new ideas, and I apologize for any language errors.

More ...

A research paper on text similarity for the Arabic language, written by Moath Ibrahim Al-hadlaq at Al-Imam Muhammad Ibn Saud Islamic University in the Kingdom of Saudi Arabia, contains very good information and methods for Arabic phonetic algorithms. It also includes an improved version of my Arabic Soundex algorithm.

The research is available to download from here:

ftp://ftp3.ie.freebsd.org/pub/sourceforge/t/project/te/textsimilaritie/Text_Similarities.pdf

I have also attached a copy to the downloads.