Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / Java

Anagrams - Fun with Words: Part 2

5.00/5 (3 votes)
28 Jun 2018CPOL7 min read 12.1K   145  
Anagram Generator

Image 1

Introduction

In my previous article, Fun With Words Part 1, I showed you an algorithm for generating palindromes, phrases that spell the same thing forward and backward. I attempted to develop an algorithm to generate anagrams, a word or phrase formed by rearranging the letters of another, for example, "Old West Action" is an anagram of "Clint Eastwood". However, my attempt resulted in a slow program that did not create all possible anagrams for a given input. Therefore, I found a very interesting algorithm by Parth Parekh, and in this Part 2, I converted Parth's Java code to C#. I will explain Parth's excellent algorithm and point out some of the differences between Java and C#.

Using the Code

Download and extract the project, and open AnagramGenerator.sln with Visual Studio. It consists of three projects:

  1. AnagramGenerator - the main program, a Console project for generating anagrams via the command line
  2. AnagramUnitTestProject - a Unit Test Project
  3. WebServiceLib - a Web Service Library to call an online grammar checker API to check the validity of a phrase or sentence

Build the solution by pressing F6.

Data Structures

The first data structure I want to discuss is the HashSet<String>. The HashSet is a collection that contains no duplicate elements, and whose elements are in no particular order. In this example, the set contains only strings. What then, would a HashSet<HashSet<String>> be? It is a collection that contains collections - that contain strings. The Java equivalent is also called a HashSet, which has a method to combine two collections called addAll() that .NET lacks. The second data structure is a .NET class called SortedDictionary which is a collection of key/value pairs that are sorted by key. I use this in the code where Parth uses the Java TreeMap.

Word List Files

In his gitub project, Parth supplies a file called wordlist.txt that contains over 100,000 words. This results, in my humble opinion, of far too many anagrams for a given word, so I have supplied a file called 3000CommonWords.txt with 3000 common English words that results in fewer anagrams, but ones that are more readable. I have also supplied a file called testingWords.txt for testing.

An enhancement I've added is the ability to use an online grammar checker to ascertain the grammatical correctness of the generated anagram. The code for this capability is in a library called WebServiceLib, described below.

Command Line Arguments

Three command line arguments are provided to the program:

  1. The name of the word list file, a text file with one word per line
  2. The minimum word length of words to use from the word list file
  3. The word or phrase to generate anagrams for. Parth's code appends arguments 3 and greater to create this phrase; for simplicity, this version just assumes the third argument is in quotes as shown in the example in the image above.

Code Detail

The class SortedWordDictionary contains a method called loadDictionaryWithSubsets(). It reads a word list file and build a data structure called SortedDictionary<String, HashSet<String>>sortedStringMap. As mentioned previously, in the Java code, a TreeMap is used instead of the .NET SortedDictionary. First, let's look at the test file testingWords.txt. It contains the words old, west, action, stew, meat, team, mate, cation, wets. (A cation, in case you were wondering, is a positively charged ion.) Using the words "Clint Eastwood" as the phrase we want to generate anagrams for, blanks and punctuation are removed using a method called Canonical(), and stored as an array of characters called charInventory. loadDictionaryWithSubsets first ignores any word that is not a subset (as determined by the isSubset() method). Thus, meat, team, mate are ignored as they are not subsets of our charInventory. For the remaining words, the key in sortedStringMap is the word sorted alphabetically by the sortWord() method. The value is a HashMap that contains all the anagrams of the word, so after processing the file, the sortedStringMap looks like this:

Key Value
acinot action, cation
dlo old
estw west, stew, wets

The getDictionaryKeyList gets a List<String> of the keys, namely acinot, dlo, estw. (Java has a method called keySet() to do this, so we will call this the "key set".) Next, in FindAllAnagrams() we iterate over these keys, calling the findAnagrams() method. findAnagrams() recursively finds anagrams in the charInventory array. findAnagrams() finds anagrams for entire charInventory word using the isEquivalent() method. It also finds anagrams with multiple words using the isSubset() method. So starting with our charInventory of the characters in "clinteastwood", we take our first key, acinot, note that it is a subset, and remove all those characters from the charInventory leaving "lestwod". We continue recursively, with the next key dlo, and remove those characters from the charInventory leaving just "estw". Finally, the characters in the last key estw are removed from charInventory and findAnagrams() returns. Note that for the last key estw, isEquivalent is true. When findAnagrams() returns to the caller, the dictWordAnagramsSet contains the three keys. The mergeAnagramKeyWords() method merges the actual word list file word – for example, the actual word file word "action" for the key "acinot" – for all the words in the key set. So in mergeAnagramKeyWords(), before the call to setMultiplication() anagramsSet contains 3 sets with just the actual words:

  1. west, stew, wets
  2. old
  3. action, cation

The setMultiplication() method returns the Cartesian product (generating the Cartesian product could be simplified using LINQ.) At last, we have the 6 anagrams when FindAllAnagrams() returns:

action old west
cation old west
action old stew
cation old stew
action old wets
cation old wets

The reason they differ somewhat from Parth's (in his output, the word "old" appears first) is that in the Java mergeWordToSets() method, the order of mergedSets is dlo estw and in this C# version, it's the opposite, estw dlo. This is because the Add method for a HashSet assumes no particular order. I've added a call to a randomizer in mergeAnagramKeyWords(), an extension method called Shuffle() so subsequent calls to the AnagramGenerator will yield slightly different results.

Web Service Library

The WebServiceLib class has a method public String CheckGrammar(String sentence) that returns a string from an online grammar checker for a given sentence. There are many online grammar checkers available, and I've selected one that although it is free, requires registration to obtain a key. See WebService.cs for the home page URL of the service I used. If you have a key, replace the line in WebService.cs that says String appKey = "Your Key Here"; with your key. As an example, the Web Service API return value for "Old west action." is:

JavaScript
{ "status":"1","message":"Successfully checked","error_grammar_count_total":0,
  "error_grammar_percent":"0%",
"check_grammar_feedback":[]}

the 0% error indicating it is grammatically correct. A less grammatically correct sentence will show a higher percentage error. In the Main program, I show how to call the Web Service and extract the percentage from the Web Service API return value and display it along with the generated anagrams (set grammarCheck to true in the Main program to enable grammar checking). The output from the program when including the results of calling the grammar check API is shown in the example in the image above.

Results

Here are the results of using "Clint Eastwood" and the two Word List Files:

Word List File Number of Anagrams found
3000CommonWords.txt 227
wordList.txt 59647

Conclusion

Parth's algorithm is fast and elegant, and converting from Java to C# was straightforward. I found C#'s lack of the Java addAll() method to be irksome, but the .NET SortedDictionary, ToList(), and ToCharArray() more than made up for it.

Visual Studio (VS) and Eclipse are the two most popular IDEs in use today. Eclipse has some interesting features. For example, to emulate the VS, "Go to definition" (F12) command, in Eclipse you simply hold the Ctrl key down and hover the mouse pointer over the item to be defined.

History

  • Version 1.0.0.0

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)