Introduction
This article discuses algorithm for constructing crossword puzzles. Several techniques and heuristics are described that can reduce search time. The algorithm is implemented in F#.
Background
Crossword construction is a constraint satisfaction problem which belong to NP complexity class. To reduce search space following techniques are employed by the algorithm:
- most constrained variable heuristics - for choosing next pattern to be filled
- least constrained value heuristics - for choosing a word with which selected pattern will be filled
- forward checking - for keeping the list of available words consistent with the current state of crossword
- backjumping - for going back to pattern that is causing inconsistency
- tree pruning - for removing words that are known to lead to the dead ends
All of these subjects are discussed in following sections.
Example
For illustration purposes, simple dictionary is constructed to demonstrate certain behaviors of the algorithm.
$\scriptsize Dict = \left \{ ABA, ABB, ABC, ABD, ABE, ABF, ABG, BDA, BDB, BDC, BDD, AAAA, DDDD, CCCC, AAAAA, BBBAA, DDDBB \right \}$
The following image shows a single run of algorithm with the simple dictionary. Also random behaviors are disabled to get reproducible example. This example will be used to show different aspects of the algorithm.
|
run of the algorithm with the simple dictionary |
Crossword Representation
Crossword is represented as two-dimensional array of fields. Each field is described by several properties:
- letter - letter currently occupying the field (special cases:
'_'
- empty field and '#'
block field) - fill count - stores the number of field's patterns that are currently filled
- conflict marking - indicates that it was not possible to fill one of it's patterns on the latest try due to inconsistency.
- pattern references - index of horizontal and vertical patterns to which the field belongs
Fields are grouped into patterns on which the algorithm operates. Patterns has two important states:
- options - pool of words that are valid for current state of crossword and can be used to fill the pattern.
- stamp - step number of the algorithm in which the pattern was filled.
Horizontal and vertical patterns are kept in separate lists, for faster lookup when searching for connected patterns. Pattern reference is actually an index in these lists. Patterns also have globally unique IDs, so the algorithm can avoid using/processing the same pattern more then once in certain steps.
Pattern Selection
The first thing in each step of the algorithm is to decide which pattern should be filled next. The algorithm selects pattern which has the least number of possible words that can fill it. This is done so the algorithm can hit imminent dead ends sooner rather then later and prune bad search paths early. The heuristic is called most constrained variable.
As it was already said, the algorithm maintains list of valid words for each pattern. This list is called options of the pattern. So to select most constrained pattern it is enough to find pattern with least number of options. Side effect of this approach is that successively filled patterns do not have to be connected, which unfortunately makes backtracking more complicated, but it is worth the hassle.
Word Selection
After the pattern has been selected, then next decision is to choose which word will be used to fill the pattern. This time opposite approach is used. Word which excludes the fewest options from connected pattern is chosen, which leaves greater flexibility in further search. Least constrained value is the name of this heuristic.
Weight of the word is calculated as:
$W_{w}=\prod_{i=1}^{n}\log(1+\left | P_{i} \right |)$
where \(\left | P_{i} \right |\) is number of remaining options for \(i^{th}\) unfilled connected pattern under assumption that word \(w\) is selected. Product instead of summation is used to rule out words which would cause one of the unfilled patterns to become inconsistent (\(\left | P_{i} \right | = 0\)).
Weighting all options of the selected pattern would be too expensive, especially in the early phase of search, so only limited number of words is considered. The algorithm takes only several words from the pool and selects the one with the greatest weight. Once the word is selected it is removed pattern's options.
Forward Checking
Once the word is selected, constraints imposed by the selection are propagated to connected unfilled patterns, so only valid words remain as options for those patterns. This is called forward checking.
This step is necessary, not only for inconsistency checks, but also to get the most constrained variable needed already described. So these two things go hand in hand. It is possible to implement more sophisticated consistency checks such as arc consistency (AC-3 algorithm), but these might be expensive operations.
New set of options for unfilled patterns is already produced by the word selection, needed in order to calculate words' weights, so it will be reused by this phase. The only thing left to do is to save new pattern options.
Backjumping
Once the algorithm finds a pattern that it cannot fill, it has to go back to one of the patterns that was previously filled and fill it with another word in hope the issue will resolved. This is the most complex part of the algorithm.
Naive backtracking approach would be to pick another word for the most recently filled pattern, but it's far from the optimal way. The reason for inefficiency of simple backtracking is order in which the patterns are filled. Since the algorithm fills the most constrained pattern first, patterns filled in two successive steps might not be connected at all. If that is the case, then changing the word of the first pattern will not resolve inconsistency in the second, it will only waste CPU cycles.
Better approach is to go back only to the patterns that are connected to the one that is inconsistent. This technique is called backjumping.
To solve this issue, the algorithm stamps each filled pattern with a step number in which it was filled. This addition makes it is possible to recover the order in which patterns were filled. Now it can be determined which pattern should be changed as well as the list of all patterns affected by that change.
Steps performed by backjump are following:
- Mark all fields of inconsistent pattern as conflicted
The algorithm simply set conflict marking for each field of the inconsistent pattern.
- Identify target pattern to which the algorithm needs to backujump
Target is the most recently filled pattern that has been filled earlier than and is connected to the one which is inconsistent. To determine which pattern satisfy this condition, the algorithms uses previously described stamps. If no such pattern is found, the most recently filled pattern globally is used as target.
|
backjump to pattern that is not connected |
Steps #8 and #9 demonstrate this situation. If that fails as well, it means the search is back at the beginning with no more options available. This indicates end of search and failure to construct crossword.
- Undo all patterns that depends on the target of backjump
Once the target is located, all patterns depending on it must be undone. To generate list of these patterns, the algorithm starts from the target and adds all connected patterns that are filled after the target. The process is repeated recursively for each new pattern added to the list. Once the list is completed, fields of patterns in the list are updated with new fill count and cleared if necessary.
- Restore options for all affected patterns
List of possible words for each undone pattern must be regenerated, based on the values of underlying fields which remained filled. Undo process, that was previously described, also affects unfilled patterns connected to those that are undone. So the final list of affected patterns is union of cleared patterns and unfilled patterns connected to them. For each pattern in the list, the algorithm forms the list of words by matching the words from dictionary against the state of pattern's fields.
- Prune options of target pattern
Pruning process is described in the following section.
Tree Pruning
Once the algorithm backtracks it is possible to determine which parts of target pattern caused the problem, so options that are going to repeat the same mistake can be removed from list of valid options.
To prune options, the algorithm will construct filter based on conflicting markings of pattern's fields. As it already said, each time an inconsistent pattern is found all of it's field are marked as conflicted. These markings are preserved and accumulated across backjumps and they are cleared only when the pattern is filled successfully.
| |
mark fields of inconsistent
pattern as conflicted | clear conflict mark after
filling the pattern |
When backjump reaches its target all it needs to do is to remove words that have matches the pattern of marked fields.
|
pruning in action |
Effects of pruning can be seen between step #9 and #10. After AAAAA word had failed to produce solution, the algorithm selected DDDBB even thought there was BBBAAA before it in dictionary. The approach is rather simple and it is possible do develop more sophisticated algorithm that would prune even more options by constructing less restrictive filters, but it was not not implemented.
|
non-optimal prune filter |
This potential sequence of tries would create ___AA filter for pruning, but since those two inconsistent patterns are not dependent, much better option would be to create two filters: ___A_ and ____A as this pair would remove greater number of options.
Code
Field of crossword is represented by CwField
record:
type CwField =
{ Letter : char
HorRef : int
VerRef : int
Count : int
Conflicted : bool }
Patterns are represented by CwPattern
class:
type CwPattern(grid : CwField [,], dictionary : string array,
direction : CwDirection, no : int, xs : int, ys : int, length : int) =
member m.PruneOptions()
member m.RestoreOptions()
member m.UpdateOptions(opts : string array)
member m.Fill(word : string, step : int)
member m.Undo()
member m.MarkConflicted()
member m.Letters
member m.Conflicted
member m.Word
member m.No
member m.X
member m.Y
member m.Length
member m.Direction
member m.Options
member m.Count
member m.Stamp
Crossword is represented by Crossword
class:
type Crossword(filePath : string, dictionary : CwDictionary) =
let minPatternLength
let maxWordPoolLength
let grid
let hPatterns
let vPatterns
let orthogonal (pattern : CwPattern)
let selectPattern queue
let selectWord (pattern : CwPattern)
let backjump (start : CwPattern)
member x.Solve(refreshRate : int)
member x.Check()
member x.Print()
Dictionary and Puzzle Files
Dictionary is stored in words.txt
file, while the puzzle structure is defined by puzzle.txt
. '.'
represents free field and '#'
is block field.
References
History
2017-03-14 - Initial version posted