This assignment is due at 9 P.M. on Wednesday, April 21st, 2004. As usual, the README file is due at midnight the same day (i.e., the moment that Thursday starts). Refer to the homework policies page for submission instructions and general homework guidelines.
The primary purpose of this assignment is to gain experience with hash tables.
In this assignment, you will create a simple spell checker. The program will read a dictionary from a file that is given as the first argument, insert the words into a hash table, and report collision statistics.
After reading the dictionary, the spelling checker will read a list of words from the standard input. Each word will be looked up in the dictionary. If it is incorrect, it will be written to the standard output together with a list of suggested corrections. (This is similar to ispell's -a mode.) The algorithm for generating corrections is given below.
The dictionary will consist of a list of words, separated by whitespace. For convenience, the words will be given in lower case, so you do not need to worry about capitalization issues. You will need to insert them into a hash table that grows dynamically as necessary to hold the dictionary while keeping the load factor low enough. It is up to you to decide how to handle collisions: separate chaining, linear probing, quadratic probing, or rehashing. Your hash table must be implemented as a general-purpose class, although it does not need to be templated.
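As a rough illustration of the requirements above, here is a minimal sketch of a general-purpose string hash table that uses separate chaining and grows when the load factor would exceed 0.5. Everything in it is an assumption made for illustration: the class name StringHashTable, the stand-in hash toyHash (the provided hash functions would normally take its place), the 0.5 threshold, and the "double plus one" growth rule. It is one possible design, not the required one.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Hypothetical stand-in hash; not one of the provided hash functions.
inline unsigned long toyHash(const std::string& s)
{
    unsigned long h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        h = h * 31 + static_cast<unsigned char>(s[i]);
    return h;
}

// Illustrative separate-chaining table of strings (all names are made up).
class StringHashTable {
public:
    explicit StringHashTable(std::size_t initialSize = 11)
        : buckets_(new std::list<std::string>[initialSize]),
          size_(initialSize), count_(0) {}
    ~StringHashTable() { delete[] buckets_; }

    bool contains(const std::string& key) const
    {
        const std::list<std::string>& chain = buckets_[toyHash(key) % size_];
        for (std::list<std::string>::const_iterator it = chain.begin();
             it != chain.end(); ++it)
            if (*it == key)
                return true;
        return false;
    }

    void insert(const std::string& key)
    {
        if (contains(key))
            return;                      // ignore duplicates
        if (count_ + 1 > size_ / 2)      // keep load factor at or below 0.5
            grow();
        buckets_[toyHash(key) % size_].push_back(key);
        ++count_;
    }

    std::size_t count() const { return count_; }
    std::size_t tableSize() const { return size_; }

private:
    StringHashTable(const StringHashTable&);             // copying disabled
    StringHashTable& operator=(const StringHashTable&);

    // Allocate a larger bucket array and rehash every stored key into it.
    void grow()
    {
        std::size_t newSize = size_ * 2 + 1;
        std::list<std::string>* fresh = new std::list<std::string>[newSize];
        for (std::size_t i = 0; i < size_; ++i)
            for (std::list<std::string>::const_iterator it = buckets_[i].begin();
                 it != buckets_[i].end(); ++it)
                fresh[toyHash(*it) % newSize].push_back(*it);
        delete[] buckets_;
        buckets_ = fresh;
        size_ = newSize;
    }

    std::list<std::string>* buckets_;
    std::size_t size_;
    std::size_t count_;
};
```

Note that growing the table means rehashing every key, because a key's bucket depends on the table size. The sketch uses the library list for brevity; your own list class would slot in the same way.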
Designing a good hash function is something of a black art. We have provided a separate Web page that briefly discusses some hash functions and how they work. However, you don't necessarily have to write a hash function as part of this assignment.
Because we don't have time to cover hash functions in lecture, and because of the limited amount of time you have to work on the program, we have provided a hash function for you. Actually, we have provided several for you to choose from, together with a header file that you can #include so that they are easy to use. If you are short on time, we suggest that you use hashStringCRC (with a prime table size) or hashStringBUZ as your hash function. However, if you have more time, we suggest that you experiment with several different hash functions to find out which works best (in terms of the collision statistics).
All of the hash functions have descriptive comments in the source file. Before you choose a function, be sure to read the comments (for example, some functions work very badly with certain table sizes).
You are not required to use one of our hash functions. If you want to experiment with writing your own, please feel free. However, you should test your function thoroughly so that you can be sure that it gives you good collision statistics in a wide variety of conditions.
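To help you understand how your hashing code works, you should track and report the following statistics: the number of times the table was expanded, the load factor, the number of collisions, and the length of the longest collision chain.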
Note that all but the first of these statistics will need to be reinitialized whenever you expand the table. After you read the dictionary, you should report the above statistics to cerr. If you wind up with a collision chain longer than about 15, there is something seriously wrong with your hash function or your collision method, and points will be deducted. (This means that linear probing is probably inappropriate.)
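One way to keep this bookkeeping out of the way of the table logic is a small counter struct, sketched below. The struct name HashStats and its member names are hypothetical, and the sketch assumes two definitions that you should check against your own design: a collision is an insertion into a bucket that already holds at least one entry, and chain length is the number of entries in a bucket.

```cpp
#include <cstddef>

// Illustrative statistics holder (names and definitions are assumptions).
struct HashStats {
    std::size_t expansions;    // survives table growth
    std::size_t collisions;    // restarted on every growth
    std::size_t longestChain;  // restarted on every growth

    HashStats() : expansions(0), collisions(0), longestChain(0) {}

    // Call once per insertion, passing the chain length before the insert.
    void recordInsert(std::size_t chainLengthBefore)
    {
        if (chainLengthBefore > 0)
            ++collisions;
        if (chainLengthBefore + 1 > longestChain)
            longestChain = chainLengthBefore + 1;
    }

    // Call when the table grows, before the old entries are reinserted.
    void recordExpansion()
    {
        ++expansions;
        collisions = 0;
        longestChain = 0;
    }
};
```

If recordInsert is also called for each entry that is reinserted during growth, the collision count and longest chain always describe the current table rather than its history.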
Once the dictionary has been created, your program will read a list of words from standard input. If a word is found in the dictionary, your program should produce no output. Otherwise, you should generate suggested corrections and write them, together with the original word (converted to lowercase), as a single output line. For example, suppose the input word was "Wird". The output might be:
wird: bird gird ward word wild wind wire wiry
Unlike the dictionary, the words input to your program may be in any case. You can convert a string to lower case by including the cctype header file and using the isupper and tolower functions:
#include <cctype>
...
string mystring("ABcdEFg!@KLm");
// Walk the string and replace each upper-case letter in place.
for (string::iterator nextChar = mystring.begin();
     nextChar != mystring.end();
     nextChar++) {
    if (isupper(*nextChar))
        *nextChar = tolower(*nextChar);
}
The easiest way to generate corrections in a spell checker is a trial-and-error method. If we assume that the misspelled word contains only a single error, we can try all possible corrections and look each up in the dictionary.
Traditionally, spelling checkers have looked for four possible errors: a wrong letter ("wird"), an inserted letter ("woprd"), a deleted letter ("wrd"), or a pair of adjacent transposed letters ("wrod"). To simplify this assignment, you will only need to deal with the first possibility, a wrong letter. When a word isn't found in the dictionary, you will need to look up all variants that can be generated by changing one letter. For example, given "wird," you should look up "aird", "bird", "cird", etc. through "zird", then "ward", "wbrd", "wcrd" through "wzrd", and so forth. Whenever you find a match in the dictionary, you should add it to your output line.
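The wrong-letter search described above can be sketched as two nested loops: one over positions in the word, one over the letters a through z. In this sketch the function name wrongLetterCorrections is hypothetical, and a std::set is used purely as a stand-in for your own hash table so that the example is self-contained. It also assumes the letters a to z are contiguous in the character set, as they are in ASCII.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns every dictionary word that differs from `word` in exactly one
// letter. The std::set is only a placeholder for your hash table.
std::vector<std::string> wrongLetterCorrections(
    const std::string& word, const std::set<std::string>& dictionary)
{
    std::vector<std::string> found;
    for (std::string::size_type pos = 0; pos < word.size(); ++pos) {
        std::string candidate = word;   // fresh copy for each position
        for (char letter = 'a'; letter <= 'z'; ++letter) {
            if (letter == word[pos])
                continue;               // that is the misspelled word itself
            candidate[pos] = letter;
            if (dictionary.count(candidate) > 0)
                found.push_back(candidate);
        }
    }
    return found;
}
```

Because positions are tried left to right and letters in alphabetical order, the matches come out in the order shown in the "wird" example. A word of length k costs 25k lookups, which is why fast hash-table lookups matter here.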
Both the dictionary and the file to be spell-checked consist of arbitrary-length words separated by whitespace. The easy way to represent them is as C++ strings. You can then easily read them in and manipulate them using something like:
string word;
// ...
while (cin >> word) {
    if (islower(word[0]))
        word[0] = toupper(word[0]);
}
When used with a string, the >> operator will skip over any whitespace and then grab the next string of non-whitespace characters -- exactly what you need.
For convenience, neither the dictionary nor the input file will contain punctuation. If you would like to test your spelling checker on a "real" input file (such as your README), you can remove the punctuation with the tr program. The method for using tr varies depending on your system. On any sane operating system (e.g., Linux), you could do:
tr -c 'A-Za-z \010-\015' ' ' < README | ./assign_10 my-dictionary
On Turing, however, you have to use a broken notation:
tr -c '[A-Za-z \010-\015]' '[ *0]' < README | ./assign_10 my-dictionary
(You may find it instructive to study the tr manual page to learn how the above command works.)
If you have a file that has already been cleaned up (so it only contains alphabetics and whitespace), you could do:
./assign_10 my-dictionary < error-filled-file.txt
You can also just type directly to stdin:
./assign_10 my-dictionary
In that case, you'll need to type control-D at the end of your input to generate an EOF.
You may wish to create a very small sample dictionary of your own for initial testing. A slightly larger dictionary of 341 words should help you to get most of your bugs out. When you're fairly confident, you can try your luck with over 34,000 words in an all-lowercase version of the ispell dictionary. The latter file can be found on Turing in "~cs70grad/ispell.words".
The spell-check program should produce one line of statistical output on cerr, and zero or more lines of correction output on cout.
The statistical output should be in the format:
n expansions, load factor f, n collisions, longest chain n
where n is an integer and f is a floating-point (double) number.
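A small helper can keep the statistics line in exactly this shape. The sketch below builds the line as a string so it can be checked before it is written; the function name formatStatsLine and its parameter names are hypothetical, and it assumes the default stream formatting for the load factor is acceptable, since no precision is specified.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Builds "n expansions, load factor f, n collisions, longest chain n".
std::string formatStatsLine(std::size_t expansions, double loadFactor,
                            std::size_t collisions, std::size_t longestChain)
{
    std::ostringstream out;
    out << expansions << " expansions, load factor " << loadFactor
        << ", " << collisions << " collisions, longest chain "
        << longestChain;
    return out.str();
}
```

The result would then be written with something like cerr << formatStatsLine(...) << endl; after the dictionary has been read.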
Each line in the correction output should consist of the incorrect word, followed by a colon and zero or more corrections, separated by spaces. There should be exactly one space after the colon (unless there are no corrections), and there should be no space at the end of the line. The following are examples of valid output lines:
xyzzy:
foo: for
wird: bird gird ward word wild wind wire wiry
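The spacing rules are easy to get wrong, so here is a minimal sketch of one way to build a correction line. The function name formatCorrectionLine is hypothetical. Putting the space before each correction, rather than after it, avoids both a trailing space and a stray space after the colon when there are no corrections.

```cpp
#include <string>
#include <vector>

// Builds "word: c1 c2 ..." with no trailing space; with no corrections
// the result is just "word:".
std::string formatCorrectionLine(const std::string& word,
                                 const std::vector<std::string>& corrections)
{
    std::string line = word + ":";
    for (std::vector<std::string>::size_type i = 0;
         i < corrections.size(); ++i) {
        line += ' ';
        line += corrections[i];
    }
    return line;
}
```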
If every word in the input is found in the dictionary, the spell checker should produce no output on cout.
When you check out your copy of the assignment, you will get a copy of "simple.dict", the small dictionary. Because the ispell dictionary is moderately large, it is not included in the checkout. Instead, you can use it directly from the CS70 grader account. For example:
./assign_10 ~cs70grad/ispell.words < error-filled-file.txt
With one exception, there are no restrictions on your use of the standard C++ libraries for this assignment. The single exception is that you are not allowed to use a hash-table library! In particular, though, you are allowed to use the string type.
The string type will greatly simplify your life on input. You can read a word from a stream (either for the dictionary or to spell-check it) with code like this:
string nextWord;
stream >> nextWord;
if (!stream) {
    // EOF was hit.
}
Although you are allowed to use the library's list class, we would prefer that you use your own templated list class from previous assignments. After all, that's why you put so much work into it, right?
Also, a small word of warning: you may be tempted to use the vector class from the library to manage your hash buckets. Although it is possible to do so effectively, it is trickier than might first appear. In particular, the resize function is not appropriate for resizing hash tables. I recommend that rather than using vector, you simply manage the array of hash buckets yourself.
The code you submit will be compiled with the g++ options -Wall and -pedantic. Your program should produce no errors or warning messages when compiled with these options on Turing. If you absolutely cannot get rid of a warning, even with the help of the professor or the graders, document it in the README file along with the names of anyone who helped you try to understand the problem.
As usual, you must check out your assignment before beginning by using "cs70checkout hw10". This is true even though you will be writing 100% of the program yourself. The checkout will provide you with two C++ source files, hashfuncs.hh and hashfuncs.cc, which implement a variety of hash functions. However, you are not required to use these hash functions.
Your submission should consist of a number of files:
Makefile, for the make utility. The makefile you provide must produce an executable named assign_10.
assign_10.cc
README
If you wish, you can create other files to help you develop this assignment, but it is not necessary.
When you have a working solution, you must submit your files with cs70submit. If you create any new files, you need to tell the submission system about them by mentioning them once on a cs70submit command line.
For convenience, we have provided dummy versions of README, Makefile, and assign_10.cc so that they will be sure to get submitted.
As usual, there are parts of this assignment that contain traps. Here are a few:
hashStringBase256). You must reduce the hash value yourself to make sure you don't go beyond your array bounds.
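Reducing a hash value to a valid array index is usually done with the remainder operator, as in the sketch below. The helper name bucketIndex is hypothetical, and the sketch assumes the hash value is an unsigned integer. Check the actual return type of the function you choose: with a signed value, % can produce a negative result and therefore an out-of-bounds index.

```cpp
#include <cstddef>

// Maps an arbitrary unsigned hash value into the range [0, tableSize).
// Assumes tableSize is greater than zero.
inline std::size_t bucketIndex(unsigned long hashValue, std::size_t tableSize)
{
    return static_cast<std::size_t>(hashValue % tableSize);
}
```

The index must be recomputed with the new size whenever the table grows, since the same hash value generally lands in a different bucket.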
There is more information on using C++ on Turing available in the departmental quick-reference guide and the C++ quick reference guide. You can find information about debugging in the gdb quick reference guide.
© 2004, Geoff Kuenning
This page is maintained by Geoff Kuenning.