A Tribute to Perl
Today marks the 32nd (a not-insignificant number for computers) birthday of Perl. Brian Fox didn't start coding Bash until the following year, and it would be another four years before Tim Berners-Lee and the team at CERN published the first documented version of HTTP.
More powerful than Unix shells, Perl quickly became the de-facto scripting choice for system administrators. By the late 90's, Lincoln Stein's CGI.pm module (now maintained by Lee Johnson) helped turn Perl into the language of the web.
A lot has happened in the last two decades, with many articles written about the demise of Perl, its rebirth, and its future. Though it is no longer widely in use, it still has its place and it has undeniably helped shape many of the modern scripting languages of today.
In celebration of Perl, I checked RosettaCode for tasks needing implementation. RosettaCode is a programming chrestomathy site with many tasks coded in different languages. One of the missing tasks for Perl was code to perform cryptanalysis of the Vigenère cipher.
The Code
I wrote the code, which can be found here, using the Kasiski method for attacking polyalphabetic substitution ciphers. A fully usable Perl script, with options to encrypt, decrypt, and crack keys, is posted on github. Keep reading for a brief explanation of how it works.
The Vigenère
The Vigenère cipher is a form of polyalphabetic substitution, using a series of interwoven Ceasar ciphers. First developed in the 16th century, it took three centries to crack. The Kasiski method was the first published attack, with several distinct steps needed to obtain the encryption key.
Guessing the Keylength
Since the Vigenère repeats the key used for encryption, it is possible to guess the keylength by finding repeating sequences in the ciphertext and measuring how far apart they are. All factors of each repeating distance may be the correct keylength. By intersecting the union of all factors and finding the most frequent, it is possible to guess the most likely keylength. In order to increase the chances of a successful attack, multiple keylengths (e.g. the three or four most common) can be tested. NOTE: a factor of one can be safely discarded as it is too small a keylength and would essentially amount to a Caesar cipher.
Specific Ciphertext Strings
Once the keylength is guessed, form a string of ciphertext starting with the first character and skipping keylength characters. Repeat, starting with the second character and continue until the keylength has been exhausted (see example below).
Ciphertext: XPALYGGLJGTQSTQERPILRAPZWTRIDSFDAVTZIEMHLW
Keylength: 3
XLGGSEIAWIFVIH - every third letter starting with the first
PYLTTRLPTDDTEL - every third letter starting with the second
AGJQQPRZRSAZMW - every third letter starting with the third
These strings represent characters that were encoded with the first, second and third letters of the key, respectively.
NOTE: the example above is too short to crack with this method. Though other methods are readily available, none have been coded in Perl.
Frequency Analysis
Specific ciphertext, encoded by each individual character of the key, is now available. Starting with the first string, decode it 26 times using a different letter of the alphabet as the key each time. One of the resulting plaintext strings was decoded using the right key.
Frequency analysis can be used to determine which is most likely the correct key. Compare the frequency of the letters in each decoded string to the expected frequency of occurrence in the language of the plaintext.
To understand how this works, consider that in English text, the letter 'E' appears on average about 12% of the time. If the letter 'E' appears at about the same average in the decoded text, there is a higher likelihood that it was decoded by the correct key (assuming the plaintext was in English, of course).
In practice, each letter of the decoded string is analyzed and the entire string can be scored for accuracy. The string with the highest overall accuracy is most likely to have been decoded with the correct key. By repeating the process for each string, it is possible to recreate the entire key one character at a time.
Although the math is beyond the scope of this article, chi-square statistics are used to calculate frequency analysis scores. Note that it is sometimes also necessary to deal with multiple strings that calculate to similar frequencies.
Wrapping It Up
Once the key has been cracked, the decryption algorithm is straightforward. For each character in the ciphertext, subtract the value of the corresponding character in the key and use a modulus of 26 to ensure that the decoded characters wrap around if the result is negative (for example, a value of -1 would wrap back to a 'Z'). Once the characters in the key have exhausted, they are started again from the beginning until the entire ciphertext is decrypted.