CS70, Fall 2002

Homework Assignment #10

This assignment is due at 10 P.M. on Wednesday, November 27th, 2002. Exception: the README file is due at 1 A.M. on the following day (Thursday). Refer to the homework policies page for submission instructions and general homework guidelines.

The primary purpose of this assignment is to gain experience with hash tables.

Overview

In this assignment, you will create a simple spell checker. The program will read a dictionary from a file that is given as the first argument, insert the words into a hash table, and report collision statistics.

After reading the dictionary, the spelling checker will read a list of words from the standard input. Each word will be looked up in the dictionary. If it is incorrect, it will be written to the standard output together with a list of suggested corrections. (This is similar to ispell's -a mode.) The algorithm for generating corrections is given below.

The Hash Table

The dictionary will consist of a list of words, separated by whitespace. For convenience, the words will be given in lower case, so you do not need to worry about capitalization issues. You will need to insert them into a hash table that grows dynamically as necessary to hold the dictionary while keeping the load factor low enough. It is up to you to decide how to handle collisions: separate chaining, linear probing, quadratic probing, or rehashing. Your hash table must be implemented as a general-purpose class, although it does not need to be templated.
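
As one possible shape for such a class, here is a minimal sketch of a string set that uses separate chaining and grows its bucket array when an insertion would push the load factor too high. Everything named here is an assumption made for illustration: the class name StringSet, the 0.75 load limit, the "double plus one" growth policy, and placeholderHash (which is not one of the provided hash functions). The sketch uses std::list so that it is self-contained; your own list class could take its place.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Placeholder hash, an assumption for this sketch (NOT a provided function).
inline unsigned long placeholderHash(const std::string& key)
{
    unsigned long h = 0;
    for (std::string::size_type i = 0; i < key.size(); ++i)
        h = h * 31 + static_cast<unsigned char>(key[i]);
    return h;
}

class StringSet {
    typedef std::list<std::string> Chain;   // one collision chain per bucket

public:
    explicit StringSet(std::size_t initialSize = 11)
        : buckets_(0), size_(initialSize), count_(0), expansions_(0)
    {
        assert(initialSize > 0);
        buckets_ = new Chain[size_];
    }
    ~StringSet() { delete[] buckets_; }

    bool contains(const std::string& key) const
    {
        const Chain& chain = buckets_[indexFor(key, size_)];
        for (Chain::const_iterator it = chain.begin(); it != chain.end(); ++it)
            if (*it == key)
                return true;
        return false;
    }

    void insert(const std::string& key)
    {
        if (contains(key))
            return;
        // Grow first if this insertion would exceed the load limit.
        if (static_cast<double>(count_ + 1) / size_ > maxLoad())
            expand();
        buckets_[indexFor(key, size_)].push_back(key);
        ++count_;
    }

    double loadFactor() const { return static_cast<double>(count_) / size_; }
    std::size_t expansions() const { return expansions_; }

private:
    StringSet(const StringSet&);             // copying is not supported
    StringSet& operator=(const StringSet&);

    static double maxLoad() { return 0.75; } // assumed limit

    // Reduce the raw hash value to a valid bucket index.
    static std::size_t indexFor(const std::string& key, std::size_t tableSize)
    {
        return static_cast<std::size_t>(placeholderHash(key) % tableSize);
    }

    // Allocate a larger bucket array and re-hash every stored key into it.
    void expand()
    {
        std::size_t newSize = size_ * 2 + 1;
        Chain* fresh = new Chain[newSize];
        for (std::size_t b = 0; b < size_; ++b)
            for (Chain::const_iterator it = buckets_[b].begin();
                 it != buckets_[b].end(); ++it)
                fresh[indexFor(*it, newSize)].push_back(*it);
        delete[] buckets_;
        buckets_ = fresh;
        size_ = newSize;
        ++expansions_;
    }

    Chain* buckets_;
    std::size_t size_;        // number of buckets
    std::size_t count_;       // number of stored keys
    std::size_t expansions_;  // how many times expand() has run
};
```

The bucket array is managed by hand with new[] and delete[] rather than with vector. A real solution would also need the collision statistics described later in this assignment.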

Hash Functions

Designing a good hash function is something of a black art. We have provided a separate Web page that briefly discusses some hash functions and how they work. However, you don't necessarily have to write a hash function as part of this assignment.

Because we don't have time to cover hash functions in lecture, and because of the limited amount of time you have to work on the program, we have provided a hash function for you. Actually, we have provided several for you to choose from, together with a header file that you can #include so that they are easy to use.

If you are short on time, we suggest that you use hashStringCRC (with a prime table size) or hashStringBUZ as your hash function. However, if you have more time, we suggest that you experiment with several different hash functions to find out which works best (in terms of the collision statistics).

All of the hash functions have descriptive comments in the source file. Before you choose a function, be sure to read the comments (for example, some functions work very badly with certain table sizes).

You are not required to use one of our hash functions. If you want to experiment with writing your own, please feel free. However, you should test your function thoroughly so that you can be sure that it gives you good collision statistics in a wide variety of conditions.

Hash Table Statistics

To help you understand how your hashing code works, you should track and report the following statistics:

The number of times you had to expand the table.
The load factor in the table.
The number of insertions that encountered a collision.
The length of the longest known collision chain (depending on your collision-handling method, this might be less than the length of the longest chain in the table: why?).

Note that all but the first of these statistics will need to be reinitialized whenever you expand the table. After you read the dictionary, you should report the above statistics to cerr. If you wind up with a collision chain longer than about 15, there is something seriously wrong with your hash function or your collision method, and points will be deducted. (This means that linear probing is probably inappropriate.)
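
The bookkeeping for these statistics might be sketched as follows. The struct name HashStats and its member names are hypothetical, and the sketch assumes separate chaining, where the "chain length" of an insertion is the number of keys already in the target bucket plus the new one. With a probing scheme you would instead pass the number of occupied slots examined before the key was placed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical statistics record for one hash table.
struct HashStats {
    std::size_t expansions;    // survives expansion
    std::size_t collisions;    // reset on expansion
    std::size_t longestChain;  // reset on expansion
    std::size_t entries;       // reset on expansion, recounted while re-hashing
    std::size_t tableSize;

    explicit HashStats(std::size_t initialTableSize)
        : expansions(0), collisions(0), longestChain(0),
          entries(0), tableSize(initialTableSize)
    {
        assert(initialTableSize > 0);
    }

    // Call once per insertion. chainLengthBefore is the number of keys
    // that were already in the bucket the new key landed in.
    void recordInsert(std::size_t chainLengthBefore)
    {
        if (chainLengthBefore > 0)
            ++collisions;
        if (chainLengthBefore + 1 > longestChain)
            longestChain = chainLengthBefore + 1;
        ++entries;
    }

    // Call when the table grows, BEFORE re-inserting the old keys, so the
    // re-insertions rebuild every statistic except the expansion count.
    void recordExpansion(std::size_t newTableSize)
    {
        assert(newTableSize > 0);
        ++expansions;
        collisions = 0;
        longestChain = 0;
        entries = 0;
        tableSize = newTableSize;
    }

    double loadFactor() const
    {
        return static_cast<double>(entries) / tableSize;
    }
};
```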

Spell Checking

Once the dictionary has been created, your program will read a list of words from standard input. If a word is found in the dictionary, your program should produce no output. Otherwise, you should generate suggested corrections and write them, together with the original word (converted to lowercase), as a single output line. For example, suppose the input word was "Wird". The output might be:

wird: bird gird ward word wild wind wire wiry

Unlike the dictionary, the words input to your program may be in any case. You can convert a string to lower case by including the ctype.h header file and using the isupper and tolower functions:

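#include <ctype.h>
#include <string>
...
    string mystring("ABcdEFg!@KLm");
    // Walk the string and lowercase each uppercase character in place.
    for (string::iterator nextChar = mystring.begin();
      nextChar != mystring.end();
      ++nextChar) {
        // The ctype functions are undefined for negative char values,
        // so convert to unsigned char first.
        unsigned char c = static_cast<unsigned char>(*nextChar);
        if (isupper(c))
            *nextChar = static_cast<char>(tolower(c));
    }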

Generating Corrections

The easiest way to generate corrections in a spell checker is a trial-and-error method. If we assume that the misspelled word contains only a single error, we can try all possible corrections and look each up in the dictionary.

Traditionally, spelling checkers have looked for four possible errors: a wrong letter ("wird"), an inserted letter ("woprd"), a deleted letter ("wrd"), or a pair of adjacent transposed letters ("wrod"). To simplify this assignment, you will only need to deal with the first possibility, a wrong letter. When a word isn't found in the dictionary, you will need to look up all variants that can be generated by changing one letter. For example, given "wird," you should look up "aird", "bird", "cird", etc. through "zird", then "ward", "wbrd", "wcrd" through "wzrd", and so forth. Whenever you find a match in the dictionary, you should add it to your output line.
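
The wrong-letter search described above can be sketched as follows. The function name suggestCorrections is hypothetical, and a std::set stands in for the dictionary only so that the sketch is self-contained; in your program the lookup would go to your own hash table instead. The sketch assumes the word has already been converted to lower case and that 'a' through 'z' are contiguous in the character set, as they are in ASCII.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Try every single-letter substitution of `word`, position by position,
// and collect the variants that appear in `dictionary`.
std::vector<std::string> suggestCorrections(const std::string& word,
                                            const std::set<std::string>& dictionary)
{
    std::vector<std::string> found;
    std::string candidate = word;
    for (std::string::size_type pos = 0; pos < word.size(); ++pos) {
        for (char letter = 'a'; letter <= 'z'; ++letter) {
            if (letter == word[pos])
                continue;                 // that would be the original word
            candidate[pos] = letter;
            if (dictionary.count(candidate) > 0)
                found.push_back(candidate);
        }
        candidate[pos] = word[pos];       // restore before moving on
    }
    assert(candidate == word);            // every position was restored
    return found;
}
```

Because the positions are tried from left to right and the letters from 'a' to 'z', the matches come out in the same order as in the "wird" example line.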

Input Format

Both the dictionary and the file to be spell-checked consist of arbitrary-length words separated by whitespace. The easy way to represent them is as C++ strings. You can then easily read them in and manipulate them using something like:

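    string word;
    // ...
    // Read whitespace-separated words until EOF.
    while (cin >> word) {
        // Example manipulation: capitalize the first letter.
        if (islower(static_cast<unsigned char>(word[0])))
            word[0] = toupper(static_cast<unsigned char>(word[0]));
    }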

When used with a string, the >> operator will skip over any whitespace and then grab the next string of non-whitespace characters -- exactly what you need.
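
Putting the >> loop together with the lower-case conversion, a word reader might look like the sketch below. The helper name readLowercaseWords is hypothetical. It accepts any input stream, so the same function could be tried on cin, on a file stream opened on the dictionary, or on a string stream holding a test literal.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the helper on a literal
#include <string>
#include <vector>

// Hypothetical helper: read every whitespace-separated word from `in`,
// converting each one to lower case.
std::vector<std::string> readLowercaseWords(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        assert(!word.empty());   // a successful >> always yields at least one character
        for (std::string::size_type i = 0; i < word.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(word[i]);
            word[i] = static_cast<char>(std::tolower(c));
        }
        words.push_back(word);
    }
    return words;
}
```

Collecting the words in a vector is a convenience for the sketch; a spell checker can just as well look up each word as soon as it is read.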

For convenience, neither the dictionary nor the input file will contain punctuation. If you would like to test your spelling checker on a "real" input file (such as your README), you can remove the punctuation with the tr program. The method for using tr varies depending on your system. On any sane operating system (e.g., Linux), you could do:

    tr -c 'A-Za-z \010-\015' ' ' < README | ./assign_10 my-dictionary

On Turing, however, you have to use a broken notation:

    tr -c '[A-Za-z \010-\015]' '[ *0]' < README | ./assign_10 my-dictionary

(You may find it instructive to study the tr manual page to learn how the above command works.)

If you have a file that has already been cleaned up (so it only contains alphabetics and whitespace), you could do:

    ./assign_10 my-dictionary < error-filled-file.txt

You can also just type directly to stdin:

    ./assign_10 my-dictionary

In that case, you'll need to type control-D at the end of your input to generate an EOF.

Sample Dictionaries

You may wish to create a very small sample dictionary of your own for initial testing. A slightly larger dictionary of 341 words should help you to get most of your bugs out. When you're fairly confident, you can try your luck with over 34,000 words in an all-lowercase version of the ispell dictionary.

Output Format

The spell-check program should produce one line of statistical output on cerr, and zero or more lines of correction output on cout.

The statistical output should be in the format:
n expansions, load factor f, n collisions, longest chain n
where n is an integer and f is a floating-point (double) number.
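
One way to build that line is sketched below. The function name formatStatsLine is hypothetical, and the sketch assumes that the default stream formatting of a double is acceptable for the load factor.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: build the one-line statistics report.
std::string formatStatsLine(std::size_t expansions, double loadFactor,
                            std::size_t collisions, std::size_t longestChain)
{
    assert(loadFactor >= 0.0);
    std::ostringstream out;
    out << expansions << " expansions, load factor " << loadFactor
        << ", " << collisions << " collisions, longest chain " << longestChain;
    return out.str();
}
```

The result would then be written with something like cerr << formatStatsLine(...) << endl, keeping the statistics off cout.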

Each line in the correction output should consist of the incorrect word, followed by a colon and zero or more corrections, separated by spaces. There should be exactly one space after the colon (unless there are no corrections), and there should be no space at the end of the line. The following are examples of valid output lines:

xyzzy:
foo: for
wird: bird gird ward word wild wind wire wiry

If every word in the input is found in the dictionary, the spell checker should produce no output on cout.
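
The spacing rules for a correction line are easy to get subtly wrong, so here is a small sketch of one way to satisfy them. The function name formatCorrectionLine is hypothetical. Writing the space before each correction, rather than after it, gives exactly one space after the colon and none at the end of the line.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: build "word: c1 c2 ..." with no trailing space.
std::string formatCorrectionLine(const std::string& word,
                                 const std::vector<std::string>& corrections)
{
    assert(!word.empty());
    std::string line = word + ":";
    for (std::vector<std::string>::size_type i = 0; i < corrections.size(); ++i) {
        line += ' ';              // separator goes BEFORE each correction
        line += corrections[i];
    }
    return line;
}
```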

Sample Files

As usual, we have collected the sample dictionaries and the source files into a tar archive and a zip archive so that they will be easy to download.

No Restrictions on C++ Libraries

With one exception, there are no restrictions on your use of the standard C++ libraries for this assignment. The single exception is that you are not allowed to use a hash-table library! In particular, you are allowed to use the string type, which will simplify your life on input. You can read a word from a stream (either for the dictionary or to spell-check it) with code like this:

    string nextWord;
    stream >> nextWord;
    if (!stream) {
        // The read failed: EOF was hit (or the stream is in an error
        // state), so nextWord does not hold a new word.
    }

Although you are allowed to use the library's list class, we would prefer that you use your own templated list class from previous assignments. After all, that's why you put so much work into it, right?

Also, a small word of warning: you may be tempted to use the vector class from the library to manage your hash buckets. Although it is possible to do so effectively, it is trickier than it might first appear. In particular, the resize function is not appropriate for resizing hash tables. I recommend that rather than using vector, you simply manage the array of hash buckets yourself.

Compilation

The code you submit will be compiled with the g++ options -Wall and -pedantic. Your program should produce no errors or warning messages when compiled with these options on Turing. If you absolutely cannot get rid of a warning, even with the help of the professor or the graders, document it in the README file along with the names of anyone who helped you try to understand the problem.

Submitting

Your submission should consist of a number of files:

Makefile: A "make file" containing instructions on how to compile your program with the make utility. The makefile you provide must produce an executable named assign_10.
assign_10.cc: The C++ code for your main program for the assignment.
*.hh, *.cc, *.icc: Header and source files containing the classes you implement. Some of these can be lifted directly from previous assignments, or can be extended versions of classes in previous assignments. It is up to you to choose the names for these files.
README: A documentation file, as specified in the homework policies page. Note that this file is not due until 3 hours after the other files in the assignment.

If you wish, you can create other files to help you develop this assignment, but it is not necessary.

When you are ready to submit your program, cd into the proper directory (e.g., cs70/hw10) and run the cs70submitall command. This command will prompt you for the assignment number (see the top of this Web page) and will then capture all of the source files, Makefiles, and README files in your directory, so BE CERTAIN that you don't have anything in your directory besides your assignment.

If you discover a mistake in your program, you can resubmit it using the same command. You can submit as many times as you like; only the last version will be used.

Since the README file is due later than the rest of the assignment, you may choose to submit it separately. You can do this with the cs70submit command:

    cs70submit README

If you already submitted your code separately, DO NOT use cs70submitall to submit the README file, or it will appear that you missed the deadline even though you were really on time.

Tricky Stuff

As usual, there are parts of this assignment that contain traps. Here are a few:

When you are developing a hash table, it is wise to start your debugging with a dummy hash function that always returns zero. Once you are sure your collision handling works correctly, you can switch to a real hash function, either one of the provided ones or your own.
If you get excessive collisions, be sure your hash function is returning values that are well spread out. Test it with "nearby" values such as "aaa", "aab", "baa", etc.
Remember that the hash functions can return a number larger than the table size (except for hashStringBase256). You must reduce the hash value yourself to make sure you don't go beyond your array bounds.
If you use separate chaining, you can use your list class (with a few extensions) to manage the chains. Alternatively, you could re-implement the list functions inside your hash-table code, but that approach isn't as "C++-ish".
Remember that when you expand your hash table, you must re-hash everything in the current table, since the wrapping due to the modulo function will change. If you are using separate chaining, don't forget to follow your collision chains appropriately.
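
Several of the points above (the dummy hash function, testing with "nearby" keys, and reducing the hash value to the table size) can be combined into a quick spread check, sketched below. The names HashFunction, dummyHash, sampleHash, and distinctBuckets are all hypothetical, and the signature assumed for a hash function may not match the provided header, so adapt it as needed.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Assumed shape of a hash function; the provided header may differ.
typedef unsigned long (*HashFunction)(const std::string&);

// Debugging aid: every key collides, which exercises collision handling.
unsigned long dummyHash(const std::string&) { return 0; }

// Simple illustrative hash (NOT one of the provided functions).
unsigned long sampleHash(const std::string& key)
{
    unsigned long h = 0;
    for (std::string::size_type i = 0; i < key.size(); ++i)
        h = h * 31 + static_cast<unsigned char>(key[i]);
    return h;
}

// Count how many different buckets the keys land in once each hash value
// has been reduced modulo the table size.
std::size_t distinctBuckets(HashFunction hash,
                            const std::vector<std::string>& keys,
                            std::size_t tableSize)
{
    assert(tableSize > 0);
    std::set<std::size_t> used;
    for (std::vector<std::string>::size_type i = 0; i < keys.size(); ++i)
        used.insert(static_cast<std::size_t>(hash(keys[i]) % tableSize));
    return used.size();
}
```

With the dummy function the count is always 1, no matter how many keys there are. A well-behaved function should spread nearby keys such as "aaa", "aab", and "baa" across several buckets.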

There is more information on using C++ on Turing available in the departmental quick reference guide and the C++ quick reference guide. You can find information about debugging in the gdb quick reference guide.

This page is maintained by Geoff Kuenning.