Breaking Vigenère cipher with C++

Daniel Ramnath

5.00/5 (6 votes)

8 Jul 2024CPOL8 min read

6.9K

Breaking Vigenère cipher with C++, believed to be unbreakable for two centuries.

Introduction

Breaking/Cracking the Vigenere-cipher using C++. Breaking messages which are encrypted with substitution algorithm.

For almost two centuries Vigenere Ciphers where thought to be unbreakable. One day Major Friedrich Wilhelm Kasiski (29 November 1805 – 22 May 1881) a German infantry officer, cryptographer and archeologist publish a general method of deciphering Vigenère ciphers. Once people believed spider has six legs until one day somebody found that the spider has eight legs.

“How many legs does a spider have? Eight? Do you really know? I mean have you counted them? In 300 B.C. Aristotle said that spiders had six legs and was classified as an insect. All the world believed him, until finally in the 1400s somebody actually counted and saw they had not six, but eight legs. Aristotle must have been widely respected for no one to question him for 1,700 years. I am sure he was right about a lot of things, but not this time. Finally, somebody counted the legs for themselves instead of just taking Aristotle’s word for it.” —William Earnhardt1

Background

What is substitution cipher?

A message is a plain text that is readable or understandable by humans. Encrypted message or cipher texts are the text that is not readable or unreadable for humans’ means the text is gibberish. A key is another random text or numbers chosen by users. The size of the key can vary from one to multiple lengths. In order to change a plain text to an encrypted message text we need a key.

How do we encrypt a message?

Message and key are combined together with some operation to form an encrypted message. For example,

message + key = cipher text.

Message and key are combined with an addition operation to form a cipher text. An operation can also be as simple operation like addition (+), subtraction (-), multiplication (*) or division (/), or something it can be little complex operation like MOD (modulus or remainder), XOR, matrix operation etc. The operation can also be a shift operation. A character is replaced or substituted or swapped with certain number of positions to get a cipher text is called substitution cipher. In this case Key will be the number of positions to be moved in the alphabetical series of any language. For eg.

If the key is 3 then every character is shifted to third position in the alphabetical series.

A becomes C

B becomes D

...

X becomes Z

Y becomes A

Z becomes B

Aa shown in the image below.

The message ‘GOD’ becomes ‘JQF’. This is called Caesar cipher, the method is named after Julius Caesar, who used it in his private correspondence.

How do we break this? Using brute force technique by choosing the key from 1 to 26 and also we have to visually look for a meaningful word or sentence. But computers don’t have visuals nor does it have any understanding of its own, unless we are in the world of AI or machine learning.

In the above example key length is one. What happens if the key length is more than one that is If the key length is greater than 3? For example, let the key length be 5 and the key be 3, 0, 6, 4, 9 shifts, that is, the position to shift the first character is shifted 3 positions in the alphabetical series, second character is shift to zero position, and third character is shift to 6 positions, so on and so forth. The key length is 5 and the key repeats itself. Let the message be M and key be K and the cipher text as given below be,

M := GODISGREATALLTHETIMEALLTHETIMEGODISGREAT...

K := 3064930649306493064930649306493064930649...

C := IOILAIRJDBCLQWPGTNPMCLQWPGTNPMIOILAIRJDB...

Note that the key with 0 is not shifted. Character ‘G’ is shifted three position to get ‘I’, ‘D’ Is shifted six position to get ‘I’, ‘I’ is shifted four position to get ‘L’ etc. This is called Vigenere Ciphers. This algorithm was devised during pre-computers era. What if the messages are in a different language other than English?

Welcome to the world of mathematics. Frequency analysis is a method, which almost address the above issues. Issue of visualization, issue of understanding, issue with multiple key lengths, and the issue of language barriers (even if the message is in different language other than English).

What is frequencies analysis?

The wiki definition is

"Letter frequency is the number of times letters of the alphabet appear on average in written language. Letter frequency analysis dates back to the Arab mathematician Al-Kindi (c. 801–873 AD), who formally developed the method to break ciphers. Later letter frequency analysis gained importance in Europe"

Let me give an example. In English sentence 'E or e' occurs most often followed by 'T or t', 'A or a' etc. The following chart shows frequencies count of the occurrence of each character in English.

Consider an English sentence containing 100 letters, From the above chart it shows ‘E’ occurs 12 or 13 times, ‘T’ occurs 9 or 10 times etc. Frequencies count for French is as below,

Similarly there are frequency table available for Arabic, Tamil, Sanskrit, Japanese, etc. etc.

How to crack? At the bottom of the article I have given the C++ code that cracks the Vigenere cipher. But the explanation in this article will be little different so that, to make it simple even novice can understand.

First find the key length using the given cipher text. With the statistical approach it is easy to find the key length. I will not get into the details which I may end up losing simplicity while giving explanation. Please refer the C++ code for understanding. In the code ‘FindKeyLength’ function finds the length of the key.

For every interval of key length calculate the occurrence of every character from the cipher or encrypted text. Do this for each position of key length. For example in the above example we have the key length 5 and the cipher text C is,

C := IOILAIRJDBCLQWPGTNPMCLQWPGTNPMIOILAIRJDB...

The first characters after every interval of 5 are as follows.

C := I****I****C****G****C****G****I****I****...+

Now find the count of occurrence of every character and build the first table.

Character Occurrence Table In the first position

'C', 2+ times

'G' 2+ times

'I' 4+ times

...

Similarly, the second characters after every interval of 5 are.

C : *O****R****L****T****L****T****O****R***...+

Count the occurrence of every character and build a second table

Character Occurrence Table in the Second position

'L' 2+ times

'O' 2+ times

'R' 2+ times

'T' 2+ times

...

The above cipher text is taken from a longer encrypted message. The original encrypted message is

Quote:

IOILAIRJDBCLQWPGTNPMCLQWPGTNPMIOILAIRJDBNOAHQUNYMCUTXRUGTMLVIGTGLQE

XUIVHJUTQVJRZKGNQIVEXIZQMMLUKTNVBJEKRCPDFWQQNTIPKSHKITAHWMTYTXKCNSR BUELPMPTXXJVRFFBQRFOBGRNWMXEWBBJISJOQDNVQPCQXLKNLKQUJZVBKCJDVFMJUK AIXUWQTJGQPTMLAWNFOBGRFETGFFFBIOILANOAHOQDNVTQVJWPGRJIWTEJYMTYIHKKS NRVJERDSGSNVKQNXLAVESWEKTMWPKSYUCVHFQLVHJHBGRSDTFEXWQPYKRZGVJUGRE WVWPWMRPCSJYMTLNYMFWNOTDEKDQTWJVWOEYLUGSLHBKNYRBTOZETGBDDXRLDLVI HZPIPLTJQETTVQVUFWQQNXWPCTHDTNFTULKVNQMYIXGWOIAHNQUSGOTEFWAQLFFMK NORPPISNVQWNQOVHFWMXEWBBJISJIDOZWOQDKOWYSKUWOASXVCLYHZCBQHZGAQL BAOKSMTFJFBNOAHLKVNQMNOAHUCKJVQVSRRAVTTXKJISJIRPJDTUTTWPGHJDZVWMHV KTHDTNSZSWPUXWWOASLNGSYWPGSFPMVESGMTCTPXCSXLWPTMDBEHWLAVMFQQHEX WMFTMDBOASRVNYBKWJAXXVUEQIQUHQRDGFTUPKSGUWVHJUPCSYUCGLTYMHOWJWF SZSZGMJOWXEKRZIOIDVFUSVMNFNVPNOAHNQRTQMCNTWPGRYKQUIXWPGBJVBIIKWBJ AYRCTHJDDGNQBNCTMHZEASEMUTTZBJIXOWXENVVQTFQQOPZOAGBZWIFIALVGPWLVEI UOMCPJUUCNJQBROBHZVHJXVEOSVMERFWMFHJDZVCFQVQTTUQIISDBGOWSZQDZFMKT TQTAISWPGHJDZVWMHZGJJVCURJLOPSNVQVFTXVFWJOWXEMLUDEHDCUEMHNKRXWTQ VJGCU

The total Count Table of the occurrence of the encrypted character of first, second, third, fourth and fifth positions after every 5 interval from the above encrypted/encoded text is as shown below in orange.

It also shows the frequencies count of the occurrence, in percentage, of original English character. For the sake of simplicity I will avoid little more statistical calculation that we should perform to find the key and decipher the message. But the concept remains the same. For more understanding look at the C++ code given below. In C++ code, it finds the Sum Of Squares of each column.

We notice that highest occurrence in the frequency table of an encrypted text and in every occurrence is marked in orange are 'G', 'E', 'J', 'W' 'M'. Comparing this percentage with the original frequency with the character in English language, which is highlighted in blue. Since the highest occurrence is 'E', so from the educated guess we can simply assume that the encrypted character 'G', 'E', 'J', 'W' 'M' is obtained from the plain text 'E'. So ‘E’ is encrypted/encode/shifted to certain position to get 'G', ‘E’, 'J', 'W' and 'M’, in the first, second, third, fourth and fifth intervals respectively. 'E' has not been shifted to any position in the second occurrence. Hence the number of characters between the plain text and cipher text will give the actual key. For example.

'E' is encrypted to 'G' in the first occurrence and the number of character between 'E' and 'G' is 3,

'E' to 'G' count is 3

'E' to 'E' count is 0

'E' to 'J' count is 6

'E' to 'W' count is 19 (actually it is wrong key it should be 4. The C++ code using the statistics it finds the correct value.)

'E' to 'M' count is 9

The key in the fifth position is wrong, but little tweaking and human intervention it is easy to find the correct key. But as I said earlier, to make it simple, I have not explained some of the extra statistical analysis but the concept remains the same. Even without human intervention it is possible to completely break the cipher text. Please look the C++ code given below. In the code it finds the key by calculating sum of the squares of the each column and comparing with the uniform distributed value which is 1/26 as there are 26 characters.

Using the above recovered key which is 3, 0, 6, 4 9, it is possible to decipher or decrypt or crack the encrypted message. In the above method I have user only the CAPITILIZED English character. It is also possible to decipher or crack the cipher text containing small letters, space, punctuations etc. and even with other language.

The C++ code to break Vigenere-cipher is given below. It builds the table and calculated the Sum Of Squares and compare with actual value of the original English language frequency table. I didn't explain much because I don't have time.

Using the code

It is up to you to execute the program and is self explanatory,; I haven't showed you the output.

C++

#include <Windows.h>
#include <winerror.h>

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>

#include <iostream>
#include <vector>
#include <map>
#include <numeric>
using namespace std;

#include <tchar.h>
#include <strsafe.h>

typedef map<uint8_t, double> UN8TOD;

typedef map<uint8_t, unsigned int> UN8TOUI;
typedef struct CHRTBLCNT
{
    UN8TOUI mapChrTable;
    int nChrTotal;
};
typedef vector<CHRTBLCNT> VECKEYLENTBLCNT;
typedef map<int, VECKEYLENTBLCNT> NTOTKEYLENTBLCNT;

const uint8_t unFirstEngChr = 'A';
const uint8_t unLastEngChr = 'Z';
UN8TOD g_mapundEngFreqTbl = {
    {'E', 12.702}, {'T', 9.056}, {'A', 8.167}, {'O', 7.507}, {'I', 6.966}, {'N', 6.749}, {'S', 6.327},
    {'H', 6.094}, {'R', 5.987}, {'D', 4.253}, {'L', 4.025}, {'C', 2.782}, {'U', 2.752}, {'M', 2.406},
    {'W', 2.360}, {'F', 2.228}, {'G', 2.015}, {'Y', 1.974}, {'P', 1.929}, {'B', 1.492}, {'V', 0.978},
    {'K', 0.772}, {'J', 0.153}, {'X', 0.150}, {'Q', 0.095}, {'Z', 0.074}
};

double MultipleAndSum(const UN8TOD& mapund1, const UN8TOD& mapund2)
{
    const UN8TOD* pmapund1 = &mapund1;
    const UN8TOD* pmapund2 = &mapund2;
    if (mapund1.size() > mapund2.size())
        std::swap(pmapund1, pmapund2);

    double dMultipleAndSum = 0;
    for (UN8TOD::const_iterator itr1 = pmapund1->begin(); itr1 != pmapund1->end(); itr1++)
    {
        UN8TOD::const_iterator itr2 = pmapund2->find(itr1->first);
        if (itr2 != pmapund2->end())
            dMultipleAndSum += (itr1->second * itr2->second);
    }
    return dMultipleAndSum;
}

double AvgSumOfSquresFreq(const VECKEYLENTBLCNT& vecKeyLenFreqTable)
{
    int nKeyPos = 0;
    double dChrSSFreqTot = 0;
    for (VECKEYLENTBLCNT::const_iterator itrFreqTblCount = vecKeyLenFreqTable.begin();
        itrFreqTblCount != vecKeyLenFreqTable.end(); itrFreqTblCount++)
    {
        double dSumOfSquers = 0;
        for (UN8TOUI::const_iterator itrCount = itrFreqTblCount->mapChrTable.begin();
            itrCount != itrFreqTblCount->mapChrTable.end(); itrCount++)
        {
            double dChrFreq = (double)itrCount->second / itrFreqTblCount->nChrTotal;
            dSumOfSquers += (dChrFreq * dChrFreq);
        }
        dChrSSFreqTot += dSumOfSquers;
        wprintf(L"Sig(Q%d) = %lf\n", nKeyPos++, dSumOfSquers);
    }
    double dAvgSumFreqTot = dChrSSFreqTot / vecKeyLenFreqTable.size(); // Average

    return dAvgSumFreqTot;
}

void BuildKeyPosFreqTbl(const vector<uint8_t>& vecEncryptedText, int nKeyLen, int nKeyPos,
    uint8_t unShiftVal, CHRTBLCNT& stFreqTableCount)
{
    int nTmpKeyLen = nKeyLen;
    uint8_t un8Char1;
    // Construct the frequency table for every key position
    for (vector<uint8_t>::const_iterator itrEncryptedText = vecEncryptedText.begin() + nKeyPos;
        itrEncryptedText != vecEncryptedText.end(); std::advance(itrEncryptedText, nTmpKeyLen))
    {
        uint8_t un8Char = *itrEncryptedText;
        if (unShiftVal != 0)
        {
            // Need to generalize to fit other language
            int nNoPostShift = unShiftVal - unFirstEngChr;
            un8Char = *itrEncryptedText - nNoPostShift;
            if (un8Char > unLastEngChr) un8Char = unFirstEngChr + (un8Char - unLastEngChr - 1);
        }

        UN8TOUI::iterator itrFreqCount = stFreqTableCount.mapChrTable.find(un8Char);
        if (itrFreqCount != stFreqTableCount.mapChrTable.end())
        {
            itrFreqCount->second++;
        }
        else
        {
            stFreqTableCount.mapChrTable.insert(make_pair(un8Char, 1));
        }
        stFreqTableCount.nChrTotal++;

        int nIndex = distance(vector<uint8_t>::const_iterator(vecEncryptedText.end()), itrEncryptedText);
        if (nTmpKeyLen > abs(nIndex)) nTmpKeyLen = abs(nIndex); // select the minimum so that it wont
                                                                // crash while using advance.
    }
}

void BuildKeyLenFreqTable(const vector<uint8_t>& vecEncryptedText, int nKeyLen, VECKEYLENTBLCNT& vecKeyLenFreqTable)
{
    for (int nKeyPos = 0; nKeyPos < nKeyLen; nKeyPos++)
    {
        CHRTBLCNT stFreqTableCount;
        stFreqTableCount.mapChrTable.clear();
        stFreqTableCount.nChrTotal = 0;

        BuildKeyPosFreqTbl(vecEncryptedText, nKeyLen, nKeyPos, 0, stFreqTableCount);

        vecKeyLenFreqTable.push_back(stFreqTableCount);
    }
}

int FindKeyLength(const vector<uint8_t>& vecEncryptedText)
{
    int nKeyLength = 0;

    //UN8TOUI mapFrequencyCount;
    NTOTKEYLENTBLCNT mapTotFereqTable;

    for (int nKeyLen = 1; nKeyLen <= 9; nKeyLen++)
    {
        VECKEYLENTBLCNT vecKeyLenFreqTable;
        vecKeyLenFreqTable.clear();

        BuildKeyLenFreqTable(vecEncryptedText, nKeyLen, vecKeyLenFreqTable);

        mapTotFereqTable.insert(make_pair(nKeyLen, vecKeyLenFreqTable));
    }

    double dEnglishSigmaSquare = MultipleAndSum(g_mapundEngFreqTbl, g_mapundEngFreqTbl) / 10000;

    // Calculate Sum of Squares for every key len
    for (NTOTKEYLENTBLCNT::const_iterator itrLenFreq = mapTotFereqTable.begin();
        itrLenFreq != mapTotFereqTable.end(); itrLenFreq++)
    {
        wprintf(L"Key Length %d\n", itrLenFreq->first);

        double dAvgSumFreqTot = AvgSumOfSquresFreq(itrLenFreq->second);

        wprintf(L"Tot Sig(Q) = %lf\n", dAvgSumFreqTot);
        if (dAvgSumFreqTot > dEnglishSigmaSquare) // 0.0655 for English letters
            nKeyLength = itrLenFreq->first;
    }
    return nKeyLength;
}

bool FindKey(const vector<uint8_t>& vecEncryptedText, int nKeyLen, vector<uint8_t>& vecKeyText)
{
    // Calculate the Sum of Square of the whihc is in percentage for the original english character.
    double dEngSumSquare = MultipleAndSum(g_mapundEngFreqTbl, g_mapundEngFreqTbl) / 10000;

    for (int i = 0; i < nKeyLen; i++)
    {
        int B = unFirstEngChr;
        for (B = unFirstEngChr; B <= unLastEngChr; B++)
        {
            CHRTBLCNT stChrTblCnt;
            stChrTblCnt.mapChrTable.clear();
            stChrTblCnt.nChrTotal = 0;

            BuildKeyPosFreqTbl(vecEncryptedText, nKeyLen, i, B, stChrTblCnt);

            UN8TOD mapundChrFreqTblCnt;
            for( UN8TOUI::iterator itrChrTbl = stChrTblCnt.mapChrTable.begin();
                itrChrTbl != stChrTblCnt.mapChrTable.end(); itrChrTbl++)
            {
                mapundChrFreqTblCnt.insert(
                    make_pair(itrChrTbl->first, ((double)itrChrTbl->second * 100/ stChrTblCnt.nChrTotal))
                );
            }

            double dEnchSumSquare = MultipleAndSum(g_mapundEngFreqTbl, mapundChrFreqTblCnt) / 10000;
            wprintf(L"Pos %d, Sig(Q%c) = %lf\n", i, B, dEnchSumSquare);

            //if (dEnchSumSquare >= dEngSumSquare) // Close to English frequency
            if (dEnchSumSquare > 0.05) // Above uniform or 1/g_mapundEngFreqTbl.size() or / 1/26
            {
                wprintf(L"-----\n");
                break;
            }
        }

        if (B <= unLastEngChr)
            vecKeyText[i] = B;
        //_getch();
    }
    return true;
}

void Decrypt(vector<uint8_t>& vecEncryptedText, vector<uint8_t> vecKey)
{
    for (int i = 0; i < vecEncryptedText.size(); i++)
    {
        vecEncryptedText[i] -= (vecKey[i % 5] - unFirstEngChr);
        if (vecEncryptedText[i] < unFirstEngChr) vecEncryptedText[i] = unLastEngChr - (unFirstEngChr - vecEncryptedText[i] - 1);
    }
}


void wmain()
{
    string szEncryptedText("IOILAIRJDBCLQWPGTNPMCLQWPGTNPMIOILAIRJDBNOAHQUNYMCUTXRUGTMLVIGTGLQE\
XUIVHJUTQVJRZKGNQIVEXIZQMMLUKTNVBJEKRCPDFWQQNTIPKSHKITAHWMTYTXKCNSRBUELPMPTXXJVRFFBQRFOBGRNWMXE\
WBBJISJOQDNVQPCQXLKNLKQUJZVBKCJDVFMJUKAIXUWQTJGQPTMLAWNFOBGRFETGFFFBIOILANOAHOQDNVTQVJWPGRJIWTE\
JYMTYIHKKSNRVJERDSGSNVKQNXLAVESWEKTMWPKSYUCVHFQLVHJHBGRSDTFEXWQPYKRZGVJUGREWVWPWMRPCSJYMTLNYMFW\
NOTDEKDQTWJVWOEYLUGSLHBKNYRBTOZETGBDDXRLDLVIHZPIPLTJQETTVQVUFWQQNXWPCTHDTNFTULKVNQMYIXGWOIAHNQU\
SGOTEFWAQLFFMKNORPPISNVQWNQOVHFWMXEWBBJISJIDOZWOQDKOWYSKUWOASXVCLYHZCBQHZGAQLBAOKSMTFJFBNOAHLKV\
NQMNOAHUCKJVQVSRRAVTTXKJISJIRPJDTUTTWPGHJDZVWMHVKTHDTNSZSWPUXWWOASLNGSYWPGSFPMVESGMTCTPXCSXLWPT\
MDBEHWLAVMFQQHEXWMFTMDBOASRVNYBKWJAXXVUEQIQUHQRDGFTUPKSGUWVHJUPCSYUCGLTYMHOWJWFSZSZGMJOWXEKRZIO\
IDVFUSVMNFNVPNOAHNQRTQMCNTWPGRYKQUIXWPGBJVBIIKWBJAYRCTHJDDGNQBNCTMHZEASEMUTTZBJIXOWXENVVQTFQQOP\
ZOAGBZWIFIALVGPWLVEIUOMCPJUUCNJQBROBHZVHJXVEOSVMERFWMFHJDZVCFQVQTTUQIISDBGOWSZQDZFMKTTQTAISWPGH\
JDZVWMHZGJJVCURJLOPSNVQVFTXVFWJOWXEMLUDEHDCUEMHNKRXWTQVJGCU");
    vector<uint8_t> vecun8EncryptedText(szEncryptedText.begin(), szEncryptedText.end());

    int nKeyLength = FindKeyLength(vecun8EncryptedText);

    vector<uint8_t> vecKey;
    vecKey.resize(nKeyLength);
    bool bRet = FindKey(vecun8EncryptedText, nKeyLength, vecKey);

    Decrypt(vecun8EncryptedText, vecKey);
	wprintf(L"Decrypted message\n");
    for (int i = 0; i < vecun8EncryptedText.size(); i++)
    {
        wprintf(L"%C", vecun8EncryptedText[i]);
    }
    wprintf(L"\n");
}

Points of Interest

Hence using some simple statistical frequency analysis, without having the need of AI or Machine Learning it is possible to crack the weaker form of encryption such as above.

The moral of the story is, use mathematics and think simple.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)