I assume you know what cryptograms are. A cryptogram breaking program can be thought of as two programs working together.
- One that generates a sequence of possible solutions
- One that evaluates each solution
Generator:
do
generate a random initial permutation
do
for each pair of letters in the permutation
swap the two letters
evaluate the permutation
if worse, then swap them back
end for
while there was an improvement
if this was the best solution then save it
until you are bored
Evaluator:
for each subsequence of 4 letters (each 4-gram) in the solution
add the cost associated with the subsequence
end for
Where the cost associated with each 4-letter subsequence is proportional to the inverse of its frequency in the book text. If the frequency is zero then assign some minimal baseline frequency. This evaluation method is a limiting point of the chi-square cost after assuming that each 4-gram has appeared infinite times. This assumption is blatantly wrong, but it makes the evaluation linear and therefore fast.
Evaluation example:
if the book text is:
"to be or not to be"
and the cryptogram solution for evaluation is:
"...qwe rwzxcrb..." --> "...for nothing..."
then the cost would be:
...
+ cost (for ) = 1/epsilon
+ cost (or n) = 1/1
+ cost (r no) = 1/1
+...
because "for " never occurs in the book text and "or n" and "r no" each appear once.
For real, you would use a *whole book* for the book text.
Hypothetical question/answer:
Q: Why do you use 4-grams and not 3 or 5?
A: Because a precomputed table of 4-gram cost fits in 4 megs of RAM, which is a reasonable amount to expect people to have. Lower counts give worse results and higher counts require using a hash table which would make the code more complicated and would add the complication of deciding what the hash table size should be. I've found that you get improvements up to 6-grams with cryptograms with spaces and up to 5-grams with cryptograms without spaces. After that I believe the book text becomes the limiting factor and other methods of solving would be more appropriate.
Q: Can the program be modified to solve cryptograms without spaces between the words?
A: Yes, in fact the code is simpler than solving ones with spaces. Just tweak the function that filters the text to not bother with spacing. Similarly, you can solve cryptograms where the space is like a 27th letter simply by looping through 27 elements instead of 26 when shuffling and letter swapping.
Q: Have you applied these methods to the cryptanalysis of other classical ciphers?
A: No. If you want to then please go ahead - that's what this article and code is for.
Q: Your code looks like something my dog threw up! Can I suggest some modifications?
A: Yes please. This is actually the first time I've written code that someone else might look at and I would appreciate any suggestions. No optimisations though unless it also makes the code smaller and cleaner and gives me a warm and fuzzy feeling.
Q: I ran your sample program but it just gives an error message and then quits. What gives?
A: The program is useless without a sample text to read. If you want it to break shakespearean quotes then get some from the web and rename the file to "book.txt" and put it in the same directory as the exe. Or if you want it to do french then rename "cyrano.txt" to "book.txt" and put it in the same folder.
Q: How can I save the output of the sample program?
A: First make sure the program is stopped (as in not trying to break the cryptogram anymore) and then cut and paste the text from the window.