Introduction
Soundex is a phonetic algorithm for indexing names by sound. Many applications use algorithms like this to power features such as Google's spelling corrections and MS Word's AutoCorrect.
Background
Most phonetic algorithms were developed for the English language; consequently, applying their rules to words in other languages might not give a meaningful result. Only a few programs support this feature for the Arabic language (Google's spelling corrections are one example). You can find many examples of English Soundex here on CodeProject or elsewhere on the Internet. But what about the Arabic language? Resources on using Soundex with Arabic are rare, and after a long search and study, I found some academic research. I believe this is the first article to illustrate an Arabic Soundex. The Soundex algorithm is based on grouping similar-sounding letters according to their phonetic features; the groups are listed in the steps below.
To encode a word, the algorithm holds the first letter of the word, then replaces each consonant after it with the digit of its group. Vowels and the characters (h, w, y) are ignored because they cause confusion or ambiguity when they accompany other characters.
Adjacent identical digits are then collapsed into a single digit of that value. This is the simplest description of Soundex; other versions of the algorithm improve the encoding (for example, some versions replace the “x” character with “ecs” before encoding). So, what we will do is:
- Hold the first letter.
- Replace the characters (a, e, i, o, u, h, w, y) with the value 0.
- Replace the characters (b, f, p, v) with the value 1.
- Replace the characters (c, g, j, k, q, s, x, z) with the value 2.
- Replace the characters (d, t) with the value 3.
- Replace the character (l) with the value 4.
- Replace the characters (m, n) with the value 5.
- Replace the character (r) with the value 6.
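The steps above can be sketched in C# roughly as follows. This is only a minimal illustration of the simplified description in this article, not a full standard Soundex: the class name EnglishSoundexSketch, the method name Encode, the default code length of 4, and the zero padding are assumptions made for the example (the padding follows the Arabic function shown later in this article). It also assumes the usual grouping in which m and n map to 5.

```csharp
using System;
using System.Text;

public static class EnglishSoundexSketch
{
    // Digit for each letter group; 0 marks vowels and h, w, y.
    private static int CodeOf(char c)
    {
        switch (char.ToLowerInvariant(c))
        {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            default: return 0; // vowels, h, w, y and anything unrecognized
        }
    }

    // length is the fixed width of the code (an assumption; 4 is common).
    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word)) return "";

        var buffer = new StringBuilder();
        buffer.Append(char.ToUpperInvariant(word[0])); // hold the first letter
        int prevCode = CodeOf(word[0]);

        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            // Skip zeros and collapse adjacent identical digits.
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad short codes with zeros up to the fixed width.
        if (buffer.Length < length)
            buffer.Append('0', length - buffer.Length);
        return buffer.ToString();
    }
}
```

With this sketch, "Robert" and "Rupert" both encode to R163, which is the kind of match the algorithm is meant to produce.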
After that, we save the code for each word, and when the user enters a word to search for, we look for the words that have the same sound code.
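One possible way to store and query the codes is a dictionary keyed by sound code. The sketch below is an assumption about how such an index could look, not part of the original code; the names SoundIndexSketch, Build, and Lookup are hypothetical, and the encoder is passed in as a delegate so any Soundex function can be plugged in.

```csharp
using System;
using System.Collections.Generic;

public static class SoundIndexSketch
{
    // Groups words by the code that the supplied encoder returns for them.
    public static Dictionary<string, List<string>> Build(
        IEnumerable<string> words, Func<string, string> encode)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string word in words)
        {
            string code = encode(word);
            if (!index.TryGetValue(code, out List<string> bucket))
            {
                bucket = new List<string>();
                index[code] = bucket;
            }
            bucket.Add(word);
        }
        return index;
    }

    // Returns every stored word whose code matches the code of the query.
    public static IReadOnlyList<string> Lookup(
        Dictionary<string, List<string>> index,
        string query,
        Func<string, string> encode)
    {
        return index.TryGetValue(encode(query), out List<string> bucket)
            ? bucket
            : new List<string>();
    }
}
```

Computing the codes once when the word list is loaded keeps each search down to a single dictionary lookup.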
For the Arabic language, I searched extensively for an Arabic version of Soundex until I found a research paper by a team of five professors at the Illinois Institute of Technology. That paper was the basis for this Arabic Soundex.
What they do is:
- Hold the first letter.
- Replace the characters (ا, أ, إ, آ, ح, ع, غ, ش, و, ي) with the value 0.
- Replace the characters (ف, ب) with the value 1.
- Replace the characters (خ, ج, ز, س, ص, ظ, ق, ك) with the value 2.
- Replace the characters (ت, ث, د, ذ, ض, ط) with the value 3.
- Replace the character (ل) with the value 4.
- Replace the characters (م, ن) with the value 5.
- Replace the character (ر) with the value 6.
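The groups listed above can be held in a small lookup table instead of a long switch statement. The sketch below only illustrates that idea; the names ArabicGroupsSketch and CodeOf are hypothetical, and returning -1 for letters that the list does not cover is an assumption of this example, since the list does not say how such letters are treated.

```csharp
using System;
using System.Collections.Generic;

public static class ArabicGroupsSketch
{
    // Letter groups as listed above; the array index is the digit (0 to 6).
    private static readonly string[] Groups =
    {
        "اأإآحعغشوي",
        "فب",
        "خجزسصظقك",
        "تثدذضط",
        "ل",
        "من",
        "ر"
    };

    private static readonly Dictionary<char, int> Codes = BuildCodes();

    // Flattens the groups into a letter-to-digit dictionary.
    private static Dictionary<char, int> BuildCodes()
    {
        var codes = new Dictionary<char, int>();
        for (int digit = 0; digit < Groups.Length; digit++)
            foreach (char letter in Groups[digit])
                codes[letter] = digit;
        return codes;
    }

    // Returns the digit for a letter, or -1 (an assumption of this
    // sketch) for letters that the list does not cover.
    public static int CodeOf(char letter)
    {
        return Codes.TryGetValue(letter, out int digit) ? digit : -1;
    }
}
```

Keeping the groups as data makes it easier to experiment with moving a letter from one group to another.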
But this strategy still does not give results as good as the English version, so I made some improvements to the research and got better results. My changes were:
- Remove the characters (ا, أ, آ, إ) from the beginning of the word if found, because I noticed that they add more confusion.
- Skip the special handling of the first character: in English it is important to keep the first character, but I noticed that in Arabic many words sound the same yet start with different characters, so the first letter is not kept in the code.
- Update the character sound categories by removing or adding some characters.
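The first change in the list above can be illustrated in isolation. This is a minimal sketch; the names AlefSketch and StripLeadingAlef are hypothetical and only serve the example.

```csharp
using System;

public static class AlefSketch
{
    // The four alef forms named in the first change above.
    private const string AlefForms = "اأإآ";

    // Drops a single leading alef form; other words are returned unchanged.
    public static string StripLeadingAlef(string word)
    {
        if (!string.IsNullOrEmpty(word) && AlefForms.IndexOf(word[0]) >= 0)
            return word.Substring(1);
        return word;
    }
}
```

For example, both أحمد and احمد are reduced to حمد, so the two spellings can no longer produce different codes because of their first character.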
Using the code
// Requires: using System.Text;
public static string ArComputeintial(string word, int length)
{
    string value = "";

    // Guard against null or empty input, which would make word[0] throw.
    if (string.IsNullOrEmpty(word))
        return value;

    // Drop a leading alef form (ا, أ, إ, آ).
    switch (word[0])
    {
        case 'ا':
        case 'أ':
        case 'إ':
        case 'آ':
            word = word.Substring(1);
            break;
    }

    int size = word.Length;
    if (size > 1)
    {
        char[] chars = word.ToCharArray();
        StringBuilder buffer = new StringBuilder();
        int prevCode = 0;
        int currCode = 0;

        // 'x' is a placeholder for the first letter, which is not encoded;
        // the loop therefore starts at index 1.
        buffer.Append('x');
        for (int i = 1; i < size; i++)
        {
            // Letters not listed in any case leave currCode unchanged,
            // so they are effectively skipped.
            switch (chars[i])
            {
                case 'ا':
                case 'أ':
                case 'إ':
                case 'آ':
                case 'ح':
                case 'خ':
                case 'ه':
                case 'ع':
                case 'غ':
                case 'ش':
                case 'و':
                case 'ي':
                    currCode = 0;
                    break;
                case 'ف':
                case 'ب':
                    currCode = 1;
                    break;
                case 'ج':
                case 'ز':
                case 'س':
                case 'ص':
                case 'ظ':
                case 'ق':
                case 'ك':
                    currCode = 2;
                    break;
                case 'ت':
                case 'ث':
                case 'د':
                case 'ذ':
                case 'ض':
                case 'ط':
                    currCode = 3;
                    break;
                case 'ل':
                    currCode = 4;
                    break;
                case 'م':
                case 'ن':
                    currCode = 5;
                    break;
                case 'ر':
                    currCode = 6;
                    break;
            }

            // Append only non-zero codes that differ from the previous code.
            if (currCode != prevCode && currCode != 0)
                buffer.Append(currCode);
            prevCode = currCode;

            // Stop once the code has reached the requested length.
            if (buffer.Length == length)
                break;
        }

        // Pad short codes with zeros up to the requested length.
        size = buffer.Length;
        if (size < length)
            buffer.Append('0', (length - size));
        value = buffer.ToString();
    }
    return value;
}
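To make the behaviour of the function above easier to try out, here is a compact, table-driven sketch that is meant to mirror it for typical inputs. It is not the original code: the names ArabicSoundexSketch, Encode, and CodeOf are hypothetical, and the groups are copied from the switch statement above.

```csharp
using System;
using System.Text;

public static class ArabicSoundexSketch
{
    // Digit groups as they appear in the switch statement above;
    // the array index is the digit (0 to 6).
    private static readonly string[] Groups =
    {
        "اأإآحخهعغشوي", "فب", "جزسصظقك", "تثدذضط", "ل", "من", "ر"
    };

    private static int CodeOf(char letter)
    {
        for (int digit = 0; digit < Groups.Length; digit++)
            if (Groups[digit].IndexOf(letter) >= 0) return digit;
        return -1; // letter not covered by any group
    }

    public static string Encode(string word, int length)
    {
        if (string.IsNullOrEmpty(word)) return "";

        // Drop a single leading alef form.
        if ("اأإآ".IndexOf(word[0]) >= 0) word = word.Substring(1);
        if (word.Length <= 1) return "";

        // 'x' stands in for the first letter, which is not encoded.
        var buffer = new StringBuilder("x");
        int prevCode = 0;
        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            if (currCode < 0) continue; // unlisted letters are skipped
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad short codes with zeros up to the requested length.
        if (buffer.Length < length)
            buffer.Append('0', length - buffer.Length);
        return buffer.ToString();
    }
}
```

With a length of 4, this sketch gives x530 for both أحمد and احمد, and x650 for كريم. Because the first letter is replaced by the placeholder, words that differ only in their first letter also share a code.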
Points of Interest
This is a simple version of the algorithm, and given the complex nature of the Arabic language, additional research could improve it.
I believe that my algorithm needs more testing to ensure it works correctly. This is my first article on The Code Project. I hope it gives you some new ideas, and I apologize for any language errors.
More ...
A research paper on text similarity for the Arabic language, written by Moath Ibrahim Al-hadlaq at Al-Imam Muhammad Ibn Saud Islamic University in the Kingdom of Saudi Arabia, contains very good information and methods for Arabic phonetic algorithms. It also includes an improved version of my Arabic Soundex algorithm.
The research is available to download from here:
ftp://ftp3.ie.freebsd.org/pub/sourceforge/t/project/te/textsimilaritie/Text_Similarities.pdf
I have also attached a copy to the downloads.