Phase 2: Evaluating Hash Functions
A very important part of using a hash table is choosing a hash function!
In this assignment, users of your HashSet<T>
will need to provide a myhash
function for their type.
In this part, you'll be exploring possible hash functions for strings, and develop a myhash()
function for std::string
s that users can use if they don't want to come up with one themselves.
Let's Go!
Find Hash Functions
Find (at least) three distinct hash functions for strings and add their implementations to stringhash.cpp
(replacing the three placeholder functions currently there). Be sure to update the contents of the hashInfo
list at the bottom of stringhash.cpp
as described in the comment above hashInfo
.
You may copy code from any source for the hash functions except
- The work of current or prior CS 70 students; and,
- Any implementation of the C++ standard library's
std::hash
function.
Our lessons have described one straightforward hash function for strings and you can find many more examples in textbooks, Wikipedia, and other online sources. You must adhere to the following rules:
- Ensure that your version of the hash funciton takes only a
std::string
and returns asize_t
value.- Some example code you find may use C-style strings. Adapting that code to use C++ strings will require some changes to the code; see the hints below.
- Similarly, some examples include the table size as an additional parameter. In most cases, you can easily remove that part of the code as it will just have
% tableSize
at the end of the function.
- Avoid hash functions that require additional libraries or significant support infrastructure. The hash function should be a self-contained function that can easily be copied into your code.
- Provide attribution
- Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'s
std::string
.- Failing to provide attribution is plagiarism and is an Honor Code violation.
- Also, if the code has an open-source license, you should note which one.
- Note that it is our understanding that submitting a homework assignment as part of an educational experience does not count as “distribution” under software-licensing rules.
- Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'s
- Only obtain code for hash functions, nothing else!
- You may not copy code for other parts of the assignment, including hash tables themselves.
Unlike any other parts of assignments in the class, you may use ChatGPT to help you find hash functions (but not any other parts of the assignment). If you do, you must provide attribution to ChatGPT in your code.
Thus, you may use ChatGPT to
- Suggest sources to find C++ hash functions,
- Adapt code written for other languages to work with a
std::string
parameter and return asize_t
value, or, - Create innovative new hash functions based on its knowledge of the subject area.
A sample conversation can be found here which includes several example hash functions that you can use, but you can also just use the chat as inspiration to guide your own conversation with ChatGPT.
Determine Hash-Function Properties
If you run make stringhash-test
, a program named stringhash-test
will be generated. (The source code for this program is not part of the starter code for the assignment, but it is included on the CS 70 server). Use this program to determine the properties of your selected hash functions.
The stringhash-test
program takes two command-line arguments:
- The number of hash-table buckets to simulate; and,
- The file name of a dictionary of words.
In the data
in your home directory, we have provided three dictionary files for you to use:
smalldict.words
(about 350 words)ispell.words
(about 35,000 words)web2
(about 250,000 words)
Although you can try a small dictionary and a small size, if you've got a decent hash function, you can just go ahead and try the web2
word list with a large number of buckets, like this:
./stringhash-test 150000 ../../data/web2
As an example of good performance, here are the result of the running against the web2
file for the C++ standard library's provided hash function:
== Testing std::hash<std::string>
Inserted 234937 items in 47 ms
For 150000 buckets:
- Empty bucket percentage: 20.9027%
- Min items per bucket: 0
- Max items per bucket: 10
- Average items per bucket: 1.56625
- Standard deviation: 1.00081
- Average items per non-empty bucket: 1.98015
For 'infinite' buckets (actually 2^32):
- Expected: 6.42533 hash collisions, stddev 2.53482
- Actual: 234930 unique hashes, 7 hash collisions:
- Specifically:
4243118819: subrhomboid soulcake
687986930: Pleurotomaria neogenesis
1985490478: predynamite blasphemy
3339483196: doxologize bemoaningly
418121517: phlebitic bebannered
1406104780: supertranscendent Koreishite
1751510580: Tat Occamist
Average avalanche:
32.0234/64 bits for one bit change in 1 character key
31.9893/64 bits for one bit change in 2 character key
31.9989/64 bits for one bit change in 4 character key
31.9997/64 bits for one bit change in 8 character key
32.0005/64 bits for one bit change in 16 character key
31.9943/64 bits for one bit change in 32 character key
31.9986/64 bits for one bit change in 64 character key
32.0072/64 bits for one bit change in 128 character key
32.0046/64 bits for one bit change in 256 character key
Interpreting the Results
The first part of the test shows the kind of behavior you might see in a hash table in practice. For example, in the results above, theory tells us to expect 20.8826% of the buckets to be empty, and the results are almost exactly that value.
The second part imagines a high hash table with 232 buckets. The number of collisions should be very small, and there should be no obvious similarity between the few words that collide.
The final test provides “avalanche” information. One of the properties we'd like hash functions to have is that very similar keys don't produce similar hash values. We want a nice uniform distribution, not hash values clustering together.
The avalanche test tells you the result of the question, “If we hash a string of a specific length and then change just one bit at random, how different is the new hash value from the original hash value (in terms of the average number of bits in the 64-bit number that change)?” Ideally, half the bits should change, and you can see that std::hash
does a good job of meeting that goal.
The test also measures speed. Both hashing speed and the spread of hash values impact the performance of a hash table. Because hashing is so quick overall, speed results will probably only be meaningful for a large dictionary like web2
. Also, it's only meaningful to compare the speed of good hash functions (for technical reasons, because a bad hash function maps to fewer unique buckets, it may get an “unfair” speed advantage in this particular test).
Choose and Explain
Imagine that you were going to use your hash table to implement a spell checker. It would store English-language words and then be used to look up strings to see if they are in the dictionary. Using these results, decide which one of your three hash functions you would select in that scenario.
Briefly document the rationale for your choice of hash function as a block comment in stringhash.cpp
. There is a TODO
placeholder in the spot where your explanation should go.
Note that there is no “best” hash function! Hash functions all exist in a space of trade-offs. So there isn't a “correct” answer to this question, but your rationale should discuss the trade-offs that you are making with your selection. We are looking for sensible reasoning and thoughtful comparison, not for some objectively correct response.
Helpful Hints
If you find a hash function written in C, it will probably look something like
unsigned int simpleAddHash(const char* str) {
size_t hash = 0;
// C strings end with a zero byte (`\0`), so we loop until we see it.
for (char* p = str; *p != '\0'; ++p) {
hash = hash + *p;
}
return hash;
}
It might even look like this code, which is equivalent to the above, but “cleverer”:
unsigned int simpleAddHash(const char* str) {
size_t hash = 0;
char c;
while ((c = *str++)) {
hash = hash + c;
}
return hash;
}
That second version seems hard to understand.
I get it, the
=
is assignment, putting*str
intoc
, and the loop exits whenc
is zero. Does it run faster that way?No. Quite a lot of C programmers think this kind of code will somehow run faster. Maybe that was true in 1972, but it isn't true with a modern compiler. Some C programmers also think that this kind of cryptic code looks impressive.
Maybe they think it's is cool because when you finally figure out what it's doing, you have a “aha!” moment?
Meh. I'm not impressed by gibberish code.
You would need to convert functions like the ones above to use a C++ std::string
, like
size_t simpleAddHash(const string& str) {
size_t hash = 0;
for (unsigned char c : str) {
hash = hash + c;
}
return hash;
}
Similarly, some C programmers might write
x = (x << 2) + x;
instead of
x = x * 5;
thinking that a binary shift and an addition is faster than a multiplication.
But they're mistaken. The compiler already understands that
5*x
and4*x + x
are the exact same thing. Writing code this way just makes the code needlessly cryptic.Since this class isn't about converting weird cryptic falsely optimized C code to C++, you can absolutely get ChatGPT's help to do this conversion.
(When logged in, completion status appears here.)