Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / tools

How to add spaces between words in a spaceless strings

4.88/5 (17 votes)
2 Jan 2014CPOL3 min read 65.6K  
A simple solution for processing spaceless strings

Introduction

In this article I would like to provide an easy to use, simple solution to convert spaceless strings into meaningful sentences. My solution could be further optimized, nevertheless it works perfectly fine for relatively short strings.

Background

To illustrate the problem, let's look at the following spaceless string:

"ihateattention". 

If we iterate through a dictionary, the first word to be found would probably be "ate" or "at" so if we simply process our sample this way, the result would look something like this:

'I h ate attention" 

To exclude this obviously flawed solution we can prioritize results that use up all (or most of) the characters, like:

'I hate attention"

This is the 1. criteria.

Unfortunately there could be many results that meet this criteria:

"I hat eat tent ion"
"I hate at tent ion" 

It's easy to see that short (2-3 character long) words can often come up by chance. The most simple approach is to prioritize solutions with the longest words. This is the 2. criteria I used. As it turned out, these conditions provide the correct solution most of the times.

The next issue is to compute all the possible permutations.  For this I used recursion in my code.

Following are the steps in the recursive method:

  • Make a sorted dictionary of the possible substrings
  • Sort the dictionary descending by length (Implementing the 2. criteria)
  • If the sorted dictionary is empty then return the input string and mark it as nonsense
  • Iterate through all the entries in the sorted dictionary
    • For each entry divide the string into 3 parts:
      • left (before the entry),
      • entry,
      • right (after the entry)
    • Mark the entry as meaningful and add it to the possible resultlist
    • Call our recursive method on the left string
    • Call our recursive method on the right string
    • Combine all the results and check if it's better than the  combined results from the previous entries  (by comparing the numbers of the meaningful characters: implementing the 1. criteria). If it is, then we keep it as the best solution (so far).
  • Return the best solution.

This method works as intended, but it could be very slow for long strings, because the algorithm computes the same substrings many times.  To avert this problem I made an array to store all the processed substrings. The recursive method then starts by checking the corresponding entry in the generated substring list and if it isn't empty, returns its value. This optimization boosts the performance by magnitudes.

Using the code 

The implementation consists of a main class and a nested one. I needed a class that makes strings trackable by storing their relative index to the initial input string, plus a variable to mark if a string has sense. The nested IndexedString class implements these features.

C#
public class WordSplitter
{
    string[] Dictionary;

    //To boost performance
    List<IndexedString>[,] Substringsbuffer;

    public WordSplitter(string[] Dictionary)
    {
        this.Dictionary = Dictionary.OrderByDescending(x=>x.Length).ToArray();   
    }

    public string SplitToWords(string Input)
    {

        Substringsbuffer = new List<IndexedString>[Input.Length+1,Input.Length+1];
       
        IndexedString stringtosplit = new IndexedString(0, Input, false);

  
        var sortedlist = recursiveWordSearch(stringtosplit, Dictionary).OrderBy(x => x.Index);
        string result = "";
        foreach (var word in sortedlist)
        {
            result = result + word.Text + " ";
        }

        return result;
    }

    //
    // Private Methods
    //
    private List<IndexedString> recursiveWordSearch(
        IndexedString Stringtosplit, 
        string[] Dictionary)
    {
        //
        // Checking the buffer
        //
        int length = Stringtosplit.Text.Length;
        if (Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index] != null)
            return Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index];

        //
        // Initializing result list
        //
        List<IndexedString> result = new List<IndexedString>();
        result.Add(Stringtosplit);

        //
        // Narrowing the dictionary
        //
        string[] newdictionary = Dictionary.Where(x => Stringtosplit.Text.Contains(x)).ToArray();
       
        //
        // Trivial case
        //
        if (newdictionary.Length < 1)
        {
            return result;
        }
     
        //
        // Non trivial case
        //
        foreach (string entry in newdictionary)
        {
            List<IndexedString> temporarylist = new List<IndexedString>();

            //
            // Deviding the string to 3 parts: the entry, left and right 
            //
            IndexedString[] devidedBytheEntry = splitByEntry(Stringtosplit, entry);

            IndexedString left = devidedBytheEntry[0];
            IndexedString middle = devidedBytheEntry[1];
            IndexedString right = devidedBytheEntry[2];

            //
            // Calling the method on the other two parts recursively
            //
            temporarylist.Add(middle);
            temporarylist.AddRange(recursiveWordSearch(left, newdictionary));
            temporarylist.AddRange(recursiveWordSearch(right, newdictionary));

            //
            // Comparing current score and temporary score
            //
            var temporaryScore = temporarylist.Where(x => x.Word == true);
            var currentScore = result.Where(x => x.Word == true);

            if (temporaryScore.Select(
                x => x.Text.Length).Sum() > currentScore.Select(
                x => x.Text.Length).Sum())
            {
                result = temporarylist;
            }
        }
        Substringsbuffer[Stringtosplit.Index, length + Stringtosplit.Index] = result;
        return result;
    }

    private IndexedString[] splitByEntry(IndexedString Source, string Entry)
    {

        int indexofentry = Source.Text.IndexOf(Entry);
        IndexedString[] result = new IndexedString[3];

        //
        // Compute realitve indexes
        //
        int leftindex = Source.Index;
        int entryindex = Source.Index + indexofentry;
        int rightindex = Source.Index + indexofentry + Entry.Length;

        //
        // Generate substrings
        //
        string leftstring = Source.Text.Substring(0, indexofentry);
        string entrystring = Entry;
        string rightstring = Source.Text.Substring(indexofentry + Entry.Length);

        //
        // Generate results
        //
        result[0] = new IndexedString(leftindex, leftstring, false);
        result[1] = new IndexedString(entryindex, entrystring, true);
        result[2] = new IndexedString(rightindex, rightstring, false);
        return result;
    }

    private class IndexedString
    {
        public int Index { get; set; }
        public string Text { get; set; }
        public bool Word { get; set; }

        public IndexedString(int Index, string Text, bool Word)
            
        {
            this.Index = Index;
            this.Text = Text;
            this.Word = Word;
        }
    }
}

To convert strings, you need to make an instance of the WordSplitter class by passing a "dictionary" string  array, then call the public SplitToWords() method.

C#
dictionary = new string[5] { "hat", "bob", "a", "green","has" };
WordSplitter WS = new WordSplitter(dictionary);
string test = WS.SplitToWords("Bobhasagreenhat".ToLower()); 

History

  • Update (1 Jan 2014):  I excluded the order by descending query from the recursive method, it needs to be called only once, so it is executed in the constructor now. I added more comments to clarify some functions.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)