Introduction
In this article I would like to provide an easy to use, simple solution to convert spaceless strings into meaningful sentences. My solution could be further optimized,
nevertheless it works perfectly fine for relatively short strings.
Background
To illustrate the problem, let's look at the following spaceless string:
"ihateattention".
If we iterate through a dictionary, the first word to be found would probably be "ate" or "at" so if we simply process our sample this way,
the result would look something like this:
'I h ate attention"
To exclude this obviously flawed solution we can prioritize results
that use up all (or most of) the characters, like:
'I hate attention"
This is the 1. criteria.
Unfortunately there could be many results that meet this criteria:
"I hat eat tent ion"
"I hate at tent ion"
It's easy to see that short (2-3 character long) words can often come up by chance. The most simple approach is to prioritize solutions with the longest words.
This is the 2. criteria I used. As it turned out, these conditions provide the correct solution most of the times.
The next issue is to compute all the possible permutations. For this I used recursion in my code.
Following are the steps in the recursive method:
- Make a sorted dictionary of the possible substrings
- Sort the dictionary descending by length (Implementing the 2. criteria)
- If the sorted dictionary is empty then return the input string and mark it as nonsense
- Iterate through all the entries in the sorted dictionary
- For each entry divide the string into 3 parts:
- left (before the entry),
- entry,
- right (after the entry)
- Mark the entry as meaningful and add it to the possible resultlist
- Call our recursive method on the left string
- Call our recursive method on the right string
- Combine all the results and check if it's better than the combined results from the previous entries (by comparing the numbers of the
meaningful characters: implementing the 1. criteria). If it is, then we keep it as the best solution (so far).
- Return the best solution.
This method works as intended, but it could be very slow for long strings, because the
algorithm computes the same substrings many times. To avert this
problem I made an array
to store all the processed substrings. The recursive method then starts by checking the corresponding entry in the generated substring list and if it isn't empty, returns its value.
This optimization boosts the performance by magnitudes.
Using the code
The implementation consists of a main class and a nested one. I needed a class that makes strings trackable by storing their relative index to the initial input string, plus
a variable to mark if a string has sense. The nested IndexedString class implements these features.
public class WordSplitter
{
string[] Dictionary;
List<IndexedString>[,] Substringsbuffer;
public WordSplitter(string[] Dictionary)
{
this.Dictionary = Dictionary.OrderByDescending(x=>x.Length).ToArray();
}
public string SplitToWords(string Input)
{
Substringsbuffer = new List<IndexedString>[Input.Length+1,Input.Length+1];
IndexedString stringtosplit = new IndexedString(0, Input, false);
var sortedlist = recursiveWordSearch(stringtosplit, Dictionary).OrderBy(x => x.Index);
string result = "";
foreach (var word in sortedlist)
{
result = result + word.Text + " ";
}
return result;
}
private List<IndexedString> recursiveWordSearch(
IndexedString Stringtosplit,
string[] Dictionary)
{
int length = Stringtosplit.Text.Length;
if (Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index] != null)
return Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index];
List<IndexedString> result = new List<IndexedString>();
result.Add(Stringtosplit);
string[] newdictionary = Dictionary.Where(x => Stringtosplit.Text.Contains(x)).ToArray();
if (newdictionary.Length < 1)
{
return result;
}
foreach (string entry in newdictionary)
{
List<IndexedString> temporarylist = new List<IndexedString>();
IndexedString[] devidedBytheEntry = splitByEntry(Stringtosplit, entry);
IndexedString left = devidedBytheEntry[0];
IndexedString middle = devidedBytheEntry[1];
IndexedString right = devidedBytheEntry[2];
temporarylist.Add(middle);
temporarylist.AddRange(recursiveWordSearch(left, newdictionary));
temporarylist.AddRange(recursiveWordSearch(right, newdictionary));
var temporaryScore = temporarylist.Where(x => x.Word == true);
var currentScore = result.Where(x => x.Word == true);
if (temporaryScore.Select(
x => x.Text.Length).Sum() > currentScore.Select(
x => x.Text.Length).Sum())
{
result = temporarylist;
}
}
Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index] = result;
return result;
}
private IndexedString[] splitByEntry(IndexedString Source, string Entry)
{
int indexofentry = Source.Text.IndexOf(Entry);
IndexedString[] result = new IndexedString[3];
int leftindex = Source.Index;
int entryindex = Source.Index + indexofentry;
int rightindex = Source.Index + indexofentry + Entry.Length;
string leftstring = Source.Text.Substring(0, indexofentry);
string entrystring = Entry;
string rightstring = Source.Text.Substring(indexofentry + Entry.Length);
result[0] = new IndexedString(leftindex, leftstring, false);
result[1] = new IndexedString(entryindex, entrystring, true);
result[2] = new IndexedString(rightindex, rightstring, false);
return result;
}
private class IndexedString
{
public int Index { get; set; }
public string Text { get; set; }
public bool Word { get; set; }
public IndexedString(int Index, string Text, bool Word)
{
this.Index = Index;
this.Text = Text;
this.Word = Word;
}
}
}
To convert strings, you need to make an instance of the WordSplitter
class by passing a "dictionary" string array, then call the public
SplitToWords()
method.
dictionary = new string[5] { "hat", "bob", "a", "green","has" };
WordSplitter WS = new WordSplitter(dictionary);
string test = WS.SplitToWords("Bobhasagreenhat".ToLower());
History
- Update (1 Jan 2014): I excluded the order by descending query from the recursive method, it needs to
be called only once, so it is executed in the constructor now. I added more comments to clarify some functions.