CS70, Spring 2004

Homework Assignment 9: Encryption

This assignment is due at 9 P.M. on Wednesday, April 14th, 2004. As usual, the README file is due at midnight the same day (i.e., the moment that Thursday starts). Refer to the homework policies page for general homework guidelines.

The primary purpose of this assignment is to see how lists can be used to build more complex data structures.

Overview

Your assignment is to implement a simple encryption program. To make the assignment both interesting and challenging, you will be required to use certain data structures and new techniques.

The cryptosystem that your program will implement is called a Vigenère cipher. It is named after Blaise de Vigenère, although it is really a corruption of a much more secure cipher he invented in 1585. The Vigenère cipher is based on the earlier Caesar cipher, which rotates letters through the alphabet. For example, a Caesar rotation of 1 letter would replace "A" by "B", "B" by "C", and so forth, wrapping around to replace "Z" by "A". A rotation of 2 would replace "A" by "C", etc.

Caesar used a constant rotation for his encoding, which made decryption quite simple. The variation named after Vigenère modified the scheme by varying the rotation for each successive letter of the message. For example, the first letter might be rotated by 14 positions, the second by 5, and so forth. The pattern of rotation is controlled by a key, which is expressed as a simple word or phrase. The key is repeated when necessary to make it match up with the message.

An example may help explain this further. Suppose our key is "ALPHA" and we wish to encrypt the message "NOW IS THE TIME FOR CS FUN." Ignoring spaces, we can write the key and the message lined up as follows:

    ALPHAALPHAALPHAALPHA
    NOWISTHETIMEFORCSFUN
We will let "A" represent rotation by zero positions, "B" mean 1, and so forth. Then the encryption of the above message can be written directly below it:
    ALPHAALPHAALPHAALPHA
    NOWISTHETIMEFORCSFUN
    --------------------
    NZLPSTSTAIMPUVRCDUBN
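
The pencil-and-paper procedure above can be sketched in C++ roughly as follows. This is only an illustration of the classic 26-letter cipher: it assumes the message and key are already uppercase letters with no blanks, the name vigenereEncrypt is made up for this example, and it uses std::string rather than the data structures this assignment requires.

```cpp
#include <string>

// Hypothetical helper (not part of the assignment): classic 26-letter
// Vigenere encryption.  Assumes message and key contain only 'A'..'Z'
// and that key is non-empty.
std::string vigenereEncrypt(const std::string& message,
                            const std::string& key)
{
    const int ALPHABET_SIZE = 26;
    std::string result;
    for (std::string::size_type i = 0;  i < message.size();  ++i) {
        // The key repeats as often as needed; 'A' means rotate by 0.
        int rotation = key[i % key.size()] - 'A';
        int letter = message[i] - 'A';
        result += static_cast<char>((letter + rotation) % ALPHABET_SIZE + 'A');
    }
    return result;
}
```

Calling vigenereEncrypt("NOWISTHETIMEFORCSFUN", "ALPHA") should reproduce the ciphertext shown above.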

It turns out that cryptographers traditionally break messages into 5-character groups to make things a bit easier to work with, so the pencil-and-paper method of doing the above would generate:

    ALPHA ALPHA ALPHA ALPHA
    NOWIS THETI MEFOR CSFUN
    -----------------------
    NZLPS TSTAI MPUVR CDUBN

When the message is decrypted, it is up to the recipient to figure out where the blanks and punctuation should have been.

We will take advantage of the computer's flexibility by implementing a slight variation on the Vigenère cipher. Instead of encrypting a 26-letter alphabet, we will add support for blanks and encrypt 27 symbols: A through Z, plus blanks. However, we will keep the 5-character grouping feature, so we can't produce blanks in the encrypted output. Instead, our output alphabet will use a period to represent the 27th character. The message above, when encrypted through the sample solution with the key "ALPHA" and using the 27-symbol alphabet, then becomes:

    ALPHAALPHAALPHAALPHAALPHAAL
    NOW.IS.THE.TIME.FOR.CS.FUN.
    ---------------------------
    NZKGISKHOE.DXTE.QCY.CCOMUNK
where the blanks between words are represented by periods in the original message ("plaintext") and the trailing period (encrypted to "K") is generated by the newline that appears at the end of any well-formed Unix file. A pencil-and-paper cryptographer would write the final message as "NZKGI SKHOE .DXTE .QCY. CCOMU NK".
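
As a rough sketch of the 27-symbol variation (again using std::string, which is not what the assignment requires), the rotation can be done modulo 27, with the blank placed after "Z" as symbol number 26. The names SYMBOLS, toIndex, fromIndex, and rotate27 are invented for this illustration, and the choice to put the blank after "Z" is one of the two options the assignment permits.

```cpp
#include <string>

const int SYMBOLS = 27;                 // 'A'..'Z' plus one blank/period

// Map 'A'..'Z' to 0..25; anything else (blank or period) becomes 26.
int toIndex(char ch)
{
    return (ch >= 'A'  &&  ch <= 'Z') ? ch - 'A' : SYMBOLS - 1;
}

// Map 0..25 to 'A'..'Z' and 26 to the given filler (period or blank).
char fromIndex(int index, char filler)
{
    return index < SYMBOLS - 1 ? static_cast<char>(index + 'A') : filler;
}

// Rotate each symbol of text by the matching key symbol.  direction is
// +1 to encrypt or -1 to decrypt; filler is the character written for
// symbol 26.  Assumes key is non-empty and both strings are normalized.
std::string rotate27(const std::string& text, const std::string& key,
                     int direction, char filler)
{
    std::string result;
    for (std::string::size_type i = 0;  i < text.size();  ++i) {
        int shift = direction * toIndex(key[i % key.size()]);
        // Adding SYMBOLS keeps the value non-negative when decrypting.
        int index = (toIndex(text[i]) + shift + SYMBOLS) % SYMBOLS;
        result += fromIndex(index, filler);
    }
    return result;
}
```

With this sketch, rotate27("NOW IS THE TIME FOR CS FUN ", "ALPHA", +1, '.') yields the 27-character ciphertext shown above, and calling it on that ciphertext with direction -1 and filler ' ' recovers the plaintext, trailing blank included.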

Incidentally, the Vigenère cipher is not a secure method of encryption, except in one special case. Modern cryptographers can decode it with very little effort, so you should not try to use it to protect anything important. (The special case is that if the key is at least as long as the message, the key is a truly random string of characters rather than an English word or phrase, and the key is never ever used again anywhere in the world, then the Vigenère method reduces to something called a "one-time pad", which is the only provably secure encryption method.)

Your encryption program will prompt the user for a password, and then encode or decode a message contained in a file. There are some very specific requirements for how the program operates, designed so that your functions could conceivably be moved into a different program (e.g., a mail reader) someday if you wished.

High-Level Interface

Your program will follow a fairly standard Unix interface style. There will be two ways to invoke your program, depending on whether you are encrypting or decrypting information.

Encryption

To encrypt, one can use:

    ./assign_09 -e file
	or
    ./assign_09 -e -g 5 file
	or
    ./assign_09 -g 5 -e file
The -e indicates that you want to encrypt file and write the result to cout. The optional -g switch specifies a grouping factor (e.g., -g 5 generates code groups of 5 characters at a time). Your program should not assume that switches appear in any particular order. There is more information below on how to process command-line arguments.

It turns out (try running "ps auxw" or "ps -ef", depending on the machine you are using) that it is not a good idea to give passwords on the command line. Instead, your program should prompt for a password on cerr, using the string "Password: " (with a trailing blank, and no newline), and read a password from cin. The password should be a word or phrase, terminated by a newline character.

In a real encryption program, you would turn off character echoing on the terminal so that nobody could look over the user's shoulder and get the password, but that's a bit of a pain for a CS70 assignment, so your program should ignore that detail. The actual password will be somewhat modified from what the user types by converting lowercase to uppercase and changing non-alphabetic characters to blanks (the same rules will be applied to the message to be encrypted).

If the -g switch is not given, the output of the program should be a single line (followed by the usual newline). If -g appears, the output should be divided into groups of the specified number of characters, with each group separated by a blank. When groups are being generated, your program should never generate an output line longer than 70 characters (unless the argument to -g is itself greater than 70). There should be no blanks at the end of the output lines.
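
One way to picture the grouping rules is the sketch below. It works on a std::string purely for illustration (the assignment itself requires a different string class), and the name formatGroups and its parameters are assumptions of this example rather than anything the assignment prescribes.

```cpp
#include <string>

// Hypothetical helper: split text into groups of groupSize characters
// separated by single blanks, starting a new line whenever the next
// group would push the current line past maxLine characters.
// Assumes groupSize is at least 1.
std::string formatGroups(const std::string& text,
                         std::string::size_type groupSize,
                         std::string::size_type maxLine = 70)
{
    std::string output;
    std::string::size_type lineLength = 0;
    for (std::string::size_type start = 0;  start < text.size();
         start += groupSize) {
        std::string group = text.substr(start, groupSize);
        if (lineLength > 0) {
            if (lineLength + 1 + group.size() > maxLine) {
                output += '\n';             // group won't fit: new line
                lineLength = 0;
            }
            else {
                output += ' ';              // separator between groups
                ++lineLength;
            }
        }
        // The first group on a line is always written, even if it is
        // longer than maxLine (the "-g greater than 70" case).
        output += group;
        lineLength += group.size();
    }
    output += '\n';
    return output;
}
```

Because the separator is written only before a group that follows another on the same line, no line ends with a blank.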

Since we are working with a 27-character alphabet ("A" through "Z" plus a blank), your program will not be able to deal nicely with lowercase characters and punctuation. Therefore, lowercase characters should be converted to uppercase, and all non-alphabetic characters should be treated as blanks (the same rules will apply to the password). For example, the string "CS 70 is really, really FUN!" would be encrypted exactly the same as if it read as follows (the line of dots is there to help you see all the blanks):

    CS    IS REALLY  REALLY FUN 
    ............................

Because blanks are used for grouping, they cannot be part of your output alphabet. Instead, the meaningful part of the output of an encryption run should be selected from "A" through "Z" and the period. The period should be considered to either precede "A" or follow "Z" (the choice is yours, and the results should be equivalent in either case).

Decryption

Decryption is simpler than encryption. There is only one way to decrypt:

    ./assign_09 -d file

The specified file should be some previous output of your program. As with encryption, you should prompt the user for the password. The decrypted message should be written to cout. Since all formatting and punctuation have been lost, you should write it as a single long line, and let the user worry about figuring it out.

Processing Command-Line Arguments

When a C or C++ program is invoked under Unix, you can give it one or more arguments (parameters) on the command line. The system preprocesses these arguments for you and makes them available to your main program as function parameters. You should declare your main program like this:

    int main(int argc, char* argv[])
    {
        // ...
    }
The parameter argc gives the number of arguments that appeared on the command line, including the name of the program itself. So an invocation like "./assign_09" will produce an argc of 1, while "./assign_09 x y z" will set argc to 4.

The parameter argv is an array of pointers to character, i.e., an array of C-style strings. argv[0] is always the name of the program as you invoked it, e.g., "./assign_09". Similarly, argv[1] is the first argument, expressed as a C string; argv[2] is the second argument, and so forth. For convenience, argv[argc] is guaranteed to be a NULL pointer. It is illegal to refer to argv[i] when i is greater than argc.

In most cases, Unix programs handle their command-line arguments in two phases. In the first phase, the options are extracted and recorded by setting various variables (often Boolean values). In the second phase, remaining non-option arguments are processed in the manner specified by the options.

First Phase of Command-Line Processing

As mentioned above, the first phase of argument handling involves processing the options, which typically begin with a dash. Some options (but not all) also require a following parameter. Because options are usually allowed to appear in any order, the option processing is normally done in a loop similar to the following:

    while (there are more arguments left)
        if (the next argument begins with a dash)
            process that argument
        else
            break

The "process that argument" section is the interesting part of the code. There are two typical approaches: either use a switch statement based on the second character of the option, or use an if/else if sequence to detect which option has been specified and to handle it.

When an option takes a parameter, there are a couple of tricky aspects to processing it. Perhaps the sneakiest involves the way that the parameter is swallowed up. Since it is a separate argument, you need to consume it as part of processing the option itself. You also have to make sure that it's actually there. The common approach is to increment the loop index inside the option-processing code. For example:

    for (int argNo = 1;  argNo < argc;  argNo++) {
        // see above
        // ...
        // processing for option "-g":
            ++argNo;
            if (argNo >= argc)
                // Parameter is missing: issue usage error
            // parameter for "-g" is now in argv[argNo]
            // ..since we incremented argNo just now, and will
            // ..increment it again in the "for" statement, the
            // ..parameter for "-g" will not be examined to see if
            // ..it looks like an option.
    }
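
To make the loop above a little more concrete, here is one possible shape for an option-processing function, using the if/else if style. The Options structure, its field names, and the function name parseOptions are all invented for this sketch, and it deliberately performs only part of the validation the assignment asks for.

```cpp
#include <cstring>

// Hypothetical record of what the first phase discovered.
struct Options {
    bool encrypt;                // -e was given
    bool decrypt;                // -d was given
    const char* groupArgument;   // text following -g, or NULL if absent
    int firstPositional;         // index of the first non-option argument
    bool valid;                  // false if a usage error was detected
};

Options parseOptions(int argc, char* argv[])
{
    Options options = { false, false, NULL, argc, true };
    int argNo;
    for (argNo = 1;  argNo < argc;  argNo++) {
        if (argv[argNo][0] != '-')
            break;                          // first positional argument
        if (std::strcmp(argv[argNo], "-e") == 0)
            options.encrypt = true;
        else if (std::strcmp(argv[argNo], "-d") == 0)
            options.decrypt = true;
        else if (std::strcmp(argv[argNo], "-g") == 0) {
            ++argNo;                        // swallow the parameter
            if (argNo >= argc) {
                options.valid = false;      // parameter is missing
                break;
            }
            options.groupArgument = argv[argNo];
        }
        else
            options.valid = false;          // unknown switch
    }
    options.firstPositional = argNo;
    return options;
}
```

Checks such as "exactly one of -e and -d" and "no -g with -d" are left out of this sketch; they could be added after the loop by inspecting the recorded fields.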

The other tricky aspect involves converting the parameter into a usable form. When main begins, all command-line arguments are expressed as C-style strings (char*). This might not be the best way to deal with them internally. In particular, for this assignment you'll want to convert the grouping factor from a string to an integer. Fortunately, there's a handy library routine to do just that for you. To use it, you should first #include <cstdlib>. The function is named strtol (convert a C-style string to a long). You can use it like this:

    int usefulThing = 0;                   // Default value is zero
    char* firstInvalidCharacter;
    // ...
    if (some useful decision) {
        usefulThing = strtol(argv[argNo], &firstInvalidCharacter, 0);
        if (argv[argNo][0] == '\0'  ||  *firstInvalidCharacter != '\0')
            // error in argument, issue usage message
    }

The strtol function converts the C string given as its first argument into an integer and returns the value. The third argument gives the number base to use for conversion; if it's zero, the base is determined according to C++ syntax rules. The second argument is a bit weird: the function will fill it in with a pointer to the first non-numeric character in the string. If the string is all numeric, this will be the '\0' at the end. Thus, if firstInvalidCharacter points to anything other than '\0', you had an argument that wasn't an integer and you should issue an error message. (The first clause in the "if" statement handles the case where the argument is a completely empty string.)
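
The check described above can be wrapped in a small function, sketched here under the assumption that a Boolean result is a convenient way to report failure. The name parseInteger is hypothetical, and the sketch makes no attempt to detect values too large to fit in an int.

```cpp
#include <cstdlib>

// Hypothetical helper: convert text to an int.  Returns true and stores
// the result in value on success; returns false (leaving value alone)
// if text is empty or contains anything that is not part of a number.
bool parseInteger(const char* text, int& value)
{
    char* firstInvalidCharacter;
    long converted = std::strtol(text, &firstInvalidCharacter, 0);
    if (text[0] == '\0'  ||  *firstInvalidCharacter != '\0')
        return false;
    value = static_cast<int>(converted);
    return true;
}
```

Because the base argument is zero, a string such as "0x10" is accepted and read as hexadecimal (16), while "5x" and the empty string are rejected.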

This may all sound complicated, but it's really very easy to write. You have already seen examples in the processOptions functions of assignments 5, 7, and 8.

Second Phase of Command-Line Processing

The second phase of argument processing involves handling the so-called positional arguments, which are those whose purpose is identified by their position on the command line. For example, the cp (copy) command in Unix accepts two positional arguments: the file to copy from and the file to copy into. In the command:

    cp -p foo bar
you are asking the program to copy foo to bar using the -p (preserve attributes) option.

For this assignment, there is only one positional argument, the file to be encrypted or decrypted. If you choose to have a separate option-processing function, it probably makes more sense to process the positional arguments inside main, not inside the option function.

Your program should verify that its arguments are correct. This includes ensuring that exactly one of -d and -e is specified, making sure that -g is not given with -d, making sure that -g has an argument, ensuring that a file to be encrypted or decrypted is given on the command line, and making sure that no illegal switches are given. If any of these rules are violated, you should print a usage message similar to the following:

    Usage: ./assign_09 {-e [-g n]|-d} file
All of the argument validations except one should be done inside the option-processing function (if you have one). The exception is checking to be sure there is a filename; it makes more sense to verify that detail inside main.

Submission Mechanics

As usual, you must check out your assignment before beginning by using "cs70checkout hw09". This is true even though you will be writing 100% of the program yourself.

For homework #9, you must submit the following files:

Makefile
A "make file" containing instructions on how to compile your program with the make utility. For this assignment, you must produce your own Makefile. I suggest that you start with one from a previous assignment. You will be graded on your Makefile, so be sure you modify it and test it.

The makefile you provide must produce an executable named assign_09.

assign_09.cc
The C++ code for your main program.
*.hh, *.cc, *.icc
Header and source files containing the classes you implement. Some of these will be lifted directly from previous assignments, or will be extended versions of classes in previous assignments. It is up to you to choose the names for these files.
README
A documentation file, as specified in the homework policies page. Note that this file is not due until 3 hours after the other files in the assignment.

If you wish, you can create other files to help you develop this assignment, but it is not necessary.

When you have a working solution, you must submit your files with cs70submit. If you create any new files, you need to tell the submission system about them by mentioning them once on a cs70submit command line. For convenience, we have provided dummy versions of README, Makefile, and assign_09.cc so that they will be sure to get submitted.

Conversion to Standard Format

We have already discussed most of the high-level interface to your program. It should accept both the password and the message to be encrypted in mixed case. For both strings, it should convert lowercase characters to uppercase and convert non-alphabetic characters to blanks. The <cctype> header file defines a couple of functions that will be useful in this regard:

isalpha(ch)
returns true if the character ch is alphabetic ("A" to "Z" or "a" to "z") and false otherwise.
isspace(ch)
returns true if the character ch is whitespace (blank, TAB, newline, or one of a few other special characters).
toupper(ch)
returns the uppercase equivalent of ch if ch is a lowercase letter. Any other character is returned unchanged.
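
A minimal sketch of the conversion, assuming a std::string is acceptable for illustration (the assignment itself requires a different string class), might look like this. The function name normalize is made up, and the cast to unsigned char is a precaution for characters outside the basic ASCII range.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: uppercase every letter and turn every other
// character (digits, punctuation, newlines, ...) into a blank.
std::string normalize(const std::string& raw)
{
    std::string result;
    for (std::string::size_type i = 0;  i < raw.size();  ++i) {
        unsigned char ch = static_cast<unsigned char>(raw[i]);
        if (std::isalpha(ch))
            result += static_cast<char>(std::toupper(ch));
        else
            result += ' ';
    }
    return result;
}
```

Applied to "CS 70 is really, really FUN!", this produces the 28-character string shown in the Encryption section, including the trailing blank that replaces the exclamation point.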

Doing Arithmetic Operations on Characters

This assignment would be much easier if characters were encoded in a friendly fashion. For example, if the letter 'A' were represented by the number 0, 'B' by 1, and so forth up to 'Z' = 25, with 26 representing a blank, it would be relatively easy to write the encryption code. Unfortunately, 'A' is decimal 65. However, there is an easy way to solve this problem: do arithmetic on characters. As an example, you can convert a character ch to the 'A' = 0 scheme with the following code:

    char convertedCharacter = ch - 'A';
This trick will work only if ch is one of the uppercase letters 'A' through 'Z'. It will not work if ch is lowercase, a blank, or some other special character.

In the same way, you can convert a number between 0 and 25 back to an uppercase letter with:

    ch = convertedCharacter + 'A';
Again, this will only work if convertedCharacter is 0 through 25. If it is some other value, it will not generate a valid letter.

Some of you will notice that the above code will work only on computers that use ASCII or a similar encoding. No problem; it's OK if your program only works on those computers.

Required Techniques and Data

For this assignment, you will be required to use a number of data structures, data elements, and techniques that are not a direct consequence of the external interface requirements. The purpose of these extra restrictions is to force you to get practice with a number of important C++ data structures and techniques.

Required Techniques

There are two obvious ways in which a simple encryption program might work. Both ways assume that you already have the password stored internally. The first approach is to read a single character at a time from the input file, encrypt it, and write it to the output.

The second approach is to read the entire input file into a giant internal string. After reading the input, you can iterate through the string, encrypt each character, and store the encrypted version back into the string (modifying the character in-place). Finally, you can write the encrypted string to the output. You must use this second approach in this assignment. This is partly because it will give you more practice in using iterators, and partly because it will make your code cleaner.

An important design detail is that you should not insert grouping characters as you encrypt. Instead, you should implement the -g switch as part of your output routine.

Your program must store both the password and the string to be encrypted using the chunky string class described below.

The Templated List Class

For this assignment, you are required to use your templated list class from assignment 8 as an underlying data structure. If you did that assignment well, you should need to make no further changes in your list class. You may not add functions to the list class that are specialized to supporting the encryption assignment but are not useful for other purposes. All functions that your list class provides should be generic, in the sense that they would make sense in a wide variety of programs. For example, your list class should not have a function that converts its data member to lower case, because that is not generic, but it is OK to add a peekTail function because that might be handy in other applications.

The peekTail example was not chosen accidentally. You will almost certainly need to have peekTail for use by your chunky strings. For symmetry, peekHead is also a good idea, although you probably won't need it for this assignment.

Whereas the list-pop functions would return a data type by value (e.g., DATA popHead();), the peek function(s) should return by reference (DATA& peekTail();) so that the caller can directly manipulate the information stored in the list.

The same modification rules apply to your list iterator class. You may wish to add a general-purpose reset function to it, but there should be nothing specialized specifically to the encryption assignment that is not useful elsewhere.

Required Data Structures

Your solution to this assignment must make use of a rather interesting string class that stores a string as a linked list of fixed-size "chunks," each of which is four characters long. Such a class is a compromise between storing the string as an array (which makes inserting characters expensive) and storing it as a linked list of single characters (which wastes memory).

Since a string is represented as a linked list of chunks, a small string would fit in a single list element. If the string is too large to fit into a single piece, you create a second piece and then tie it together with the first, using the linked list as the underlying structure.

Your string class should be built on top of the list class; like the main program, it should have no direct knowledge of the structure of the list. In other words, you may not make the string a friend of the list class.

In case the above description is not clear, here is an example of the private data from my version of the structure that is used to store the individual pieces of the string:

    class Chunk
        {
        // ...
        private:
            unsigned int        length;
            char                value[CHUNKSIZE];
        };
where CHUNKSIZE is a constant giving the number of characters in each piece (i.e., 4). I can then declare the entire string as a list using something like List<Chunk> string in the private data section of my ChunkyString class.
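
To show how appending a character might work with such chunks, here is a deliberately incomplete sketch. It substitutes std::list for the templated List class from assignment 8 (so back() plays the role of peekTail), makes the Chunk members public to keep things short, and uses the invented names ChunkyStringSketch and chunkCount; none of these choices are requirements.

```cpp
#include <cstddef>
#include <list>

const unsigned int CHUNKSIZE = 4;

struct Chunk {
    unsigned int length;            // number of slots of value in use
    char value[CHUNKSIZE];
};

class ChunkyStringSketch {
    public:
        ChunkyStringSketch() : totalLength(0) { }

        // Append one character.  Only the last chunk is ever touched,
        // so the cost does not depend on the length of the string.
        ChunkyStringSketch& operator+=(char ch)
        {
            if (chunks.empty()  ||  chunks.back().length == CHUNKSIZE) {
                Chunk fresh = { 0, { 0 } };     // empty, zero-filled chunk
                chunks.push_back(fresh);
            }
            Chunk& tail = chunks.back();        // stand-in for peekTail
            tail.value[tail.length++] = ch;
            ++totalLength;
            return *this;
        }

        std::size_t length() const { return totalLength; }

        // Exposed only so the sketch can be inspected; a real string
        // class would keep this detail hidden.
        std::size_t chunkCount() const { return chunks.size(); }

    private:
        std::list<Chunk> chunks;
        std::size_t totalLength;
};
```

Appending the five characters of "HELLO" to an empty ChunkyStringSketch leaves it with a length of 5 spread over two chunks.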

Operations in the String Class

To the outside world, your string class should just look like something that stores strings. The user should not be able to tell, based on the interface, that the strings are stored in pieces instead of as a single array of characters. This has two implications:

  1. You should support most of the "standard" operations, and
  2. Your private data should be private and not visible to the outside world in any way.

The exact set of operations you choose to support is up to you, and to some extent it depends on what your program needs. I found the following operations to be minimally necessary in my own implementation:

With the exception of the get-length function, I chose to implement all of the above as overloaded operators. In addition to the above list, I implemented a number of functions, including general concatenation and all of the Boolean comparison operators, just in case I needed them. Your mileage may vary, of course.

Complexity

All of the above functions should be implemented with the minimum complexity possible. In the above list, the default constructor, "+=" operator (for single characters), and get-length operation should all be O(1). The copy constructor, assignment operator, "+" operator, and stream output are inherently O(N) and should be implemented that way.

Required Functions

The String Iterator

Your string-as-list class must allow the outside user to treat it just like any other string, hiding the internal representation. Your main program should have no knowledge of the fact that the string is internally represented in chunks. To make that possible, you will need a string iterator that allows the main program to walk through the individual characters of the string.

A string iterator is required for this assignment. Your string iterator should be built on top of the list iterator; like the main program, it should have no direct knowledge of the structure of the list. In other words, you may not make the string iterator a friend of the list iterator. I found it convenient to have a list iterator as a private data member in my string-iterator class; I called it subIterator because it is a subsidiary iterator that is hidden from the outside world.

When you are done building the string iterator, you should be able to do something like this (assuming your string class is called ChunkyString and the iterator is ChunkyStringIterator):

    ChunkyString stuff;
    // .. put characters into the string
    for (ChunkyStringIterator i(stuff);  i;  i++) {
        if (*i == 'a')
            cout << "I found an A in the string\n";
    }

Note that the user of the ChunkyString has no knowledge of the fact that there is a list hiding underneath. If your main program even mentions the List class or the ListIterator class, you have taken the wrong approach and you will lose many points.
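
For readers who want a picture of how a subsidiary iterator plus an offset within the current chunk can walk the characters, here is a simplified sketch. It uses std::list and its iterator as stand-ins for the List and ListIterator classes, the names MiniChunky, MiniChunkyIterator, append, and endOfList are invented, and only prefix ++ is provided; a real solution would differ in all of these respects.

```cpp
#include <list>

const unsigned int CHUNKSIZE = 4;

struct Chunk {
    unsigned int length;            // number of slots of value in use
    char value[CHUNKSIZE];
};

class MiniChunkyIterator;

// Stripped-down string: just enough to give the iterator something to walk.
class MiniChunky {
    public:
        void append(char ch)
        {
            if (chunks.empty()  ||  chunks.back().length == CHUNKSIZE) {
                Chunk fresh = { 0, { 0 } };
                chunks.push_back(fresh);
            }
            Chunk& tail = chunks.back();
            tail.value[tail.length++] = ch;
        }
    private:
        friend class MiniChunkyIterator;
        std::list<Chunk> chunks;
};

class MiniChunkyIterator {
    public:
        explicit MiniChunkyIterator(MiniChunky& target)
            : subIterator(target.chunks.begin()),
              endOfList(target.chunks.end()),
              offset(0)
        {
        }

        // True while the iterator still refers to a character.
        operator bool() const { return subIterator != endOfList; }

        // Return by reference so the caller can modify the character.
        char& operator*() const { return subIterator->value[offset]; }

        // Step within the chunk; move to the next chunk when it runs out.
        MiniChunkyIterator& operator++()
        {
            if (++offset >= subIterator->length) {
                ++subIterator;
                offset = 0;
            }
            return *this;
        }
    private:
        std::list<Chunk>::iterator subIterator;   // hidden list iterator
        std::list<Chunk>::iterator endOfList;
        unsigned int offset;                      // position inside chunk
};
```

A loop of the form "for (MiniChunkyIterator i(s);  i;  ++i)" then visits each character once, and assigning through *i changes the stored string in place, which is what in-place encryption needs.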

How to Read Input

Just as output is written to an ostream such as cout or cerr, input is read from an istream such as cin. Doing so for this assignment will require that you use several C++ I/O features. Most of these features are enabled when you #include <iostream>. More complete details on the functions discussed below can be found in the notes on C++ I/O.

To read the password, you will need to read one character at a time from cin. This can be done with code like the following:

    char nextCharacter;
    while (cin.get(nextCharacter))
        // ...do stuff with nextCharacter

In the above code, the loop will exit when there are no more characters available on cin (i.e., EOF was hit). Note that EOF is not the same as the end of a line. Since the password is only one line long, you must detect the end of the line yourself.
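
A sketch of prompting for and reading a one-line password is shown below. It takes the streams as parameters so it can be tried with in-memory streams as well as with cin and cerr; the name readPassword is hypothetical, and the result is a std::string only for illustration, since the assignment requires the password to be stored in a different string class.

```cpp
#include <iostream>
#include <sstream>      // only needed when experimenting with in-memory streams
#include <string>

// Hypothetical helper: write the prompt, then read characters up to
// (but not including) the next newline.  Reading also stops at EOF.
std::string readPassword(std::istream& input, std::ostream& promptStream)
{
    promptStream << "Password: ";           // trailing blank, no newline
    std::string password;
    char nextCharacter;
    while (input.get(nextCharacter)  &&  nextCharacter != '\n')
        password += nextCharacter;
    return password;
}
```

In the real program the call would presumably be readPassword(std::cin, std::cerr); the password would still need to be converted to uppercase letters and blanks afterward.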

The string to be encrypted must be read from a file whose name is given to you on the command line. Before you can read a named file, you must #include <fstream>. Then you must open the file, read it, and close it. In C++, this is easy: you open a file by creating an ifstream (for reading, or input) or ofstream (for writing, or output). The file is automatically closed when the associated ifstream or ofstream is destroyed.

Once you have created an ifstream, you can read from it just as if it were cin.

To make this explanation more concrete, suppose you want to read characters from a file named "myfile.txt". You could write something like this:

    ifstream inputStream("myfile.txt");
    if (!inputStream) {
        // ...Oops, myfile.txt doesn't seem to be available!
    }
    char nextCharacter;
    while (inputStream.get(nextCharacter)) {
        // ...do stuff with nextCharacter
    }

Of course, in most cases you won't want to hardwire the file name into your code. (Even if you did, it counts as a "magic number", so you should define it as a const string rather than sticking it into the middle of your program.) Here's a very similar example, only this time the file name is stored in a const string& variable named whichFile, which is passed as a function argument. This function returns true if the file was successfully read. (It also doesn't return the string read, which makes it somewhat useless. Fixing that deficiency is left to you as an exercise.) The c_str member function of the string class converts a C++ string into a C-style char*; this is necessary because the ifstream constructor stupidly won't accept strings.

    bool readFile(const string& whichFile)
    {
        ifstream inputStream(whichFile.c_str());
        if (!inputStream)
            return false;                  // Couldn't open the file
        char nextCharacter;
        while (inputStream.get(nextCharacter)) {
            // ...do stuff with nextCharacter
        }
        return true;
    }
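
One possible way to hand the file contents back to the caller is through a reference parameter, as sketched here. The name readWholeFile and the use of std::string for the contents are assumptions of this example; your program must store the message in the chunky string class instead.

```cpp
#include <fstream>
#include <string>

// Hypothetical variant of readFile: on success, contents holds every
// character of the file (including the final newline) and the function
// returns true.  On failure, contents is left untouched.
bool readWholeFile(const std::string& whichFile, std::string& contents)
{
    std::ifstream inputStream(whichFile.c_str());
    if (!inputStream)
        return false;                      // Couldn't open the file
    contents.clear();
    char nextCharacter;
    while (inputStream.get(nextCharacter))
        contents += nextCharacter;
    return true;                           // File closes automatically
}
```

Returning the status as a bool and the data through the parameter keeps the error check at the call site simple, at the cost of requiring the caller to declare the string first.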

Some Samples

Here is a trivial sample input file, and the output it generates when encrypted with -g 5 and the pass phrase "cs fun". If that output file is fed back into the decryption routine, it generates a slightly modified version of the original. Note that the decrypted version has a blank at the end (visible only if you download it and use an editor or "cat -vet" to examine it). The trailing blank did not appear in the original. Why is it there now?

Encrypting a different input file with the pass phrase "My roommate never studies, why should I?" but no -g switch produces a single very long output line. The decrypted version of the file demonstrates that a certain amount of (presumably) useful information has been lost.

Finally, to make the assignment interesting, here is an encrypted file for you to decode once your program is working correctly. The file was encrypted with the pass phrase "When I get my program working I am going to get some sleep". (Note that there is no trailing period in the pass phrase; it consists solely of alphabetic characters and blanks.)

All of these sample files will be placed in your directory when you cs70checkout the assignment.

Emacs and icc Files

By default, emacs doesn't know that icc files contain C++ code. There are three ways to tell emacs to use C++ mode:

  1. Execute "ESC x c++-mode RET" each time you visit the file. Obviously, this is a pain.
  2. Add the line "// ;-*-C++-*-" as the first line of the file. Emacs will recognize the line and automatically switch to C++ mode. This is less of a pain, but you still have to do it to every file.
  3. Add the following line to your ".emacs" file in your home directory. (If you don't have a ".emacs" file, create one containing this line):
    (setq auto-mode-alist (append '(("\\.icc$" . c++-mode)) auto-mode-alist))
    	
    The line must be inserted exactly as given above, including the double backslash, the parentheses, and the funny single quote.

Tricky Stuff

As usual, there are parts of this assignment that contain traps. Here are a few:


Other Resources

There is more information on using C++ on Turing available in the departmental quick-reference guide and the C++ quick reference guide.


© 2004, Geoff Kuenning

This page is maintained by Geoff Kuenning.