CS 70

Phase 2: Evaluating Hash Functions

A very important part of using a hash table is choosing a hash function!

In this assignment, users of your HashSet<T> will need to provide a myhash function for their type.

In this part, you'll be exploring possible hash functions for strings, and develop a myhash() function for std::strings that users can use if they don't want to come up with one themselves.

Let's Go!

Find Hash Functions

Find (at least) three distinct hash functions for strings and add their implementations to stringhash.cpp (replacing the three placeholder functions currently there). Be sure to update the contents of the hashInfo list at the bottom of stringhash.cpp as described in the comment above hashInfo.

You may copy code from any source for the hash functions except

  1. The work of current or prior CS 70 students; and,
  2. Any implementation of the C++ standard library's std::hash function.

Our lessons have described one straightforward hash function for strings and you can find many more examples in textbooks, Wikipedia, and other online sources. You must adhere to the following rules:

  • Ensure that your version of the hash funciton takes only a std::string and returns a size_t value.
    • Some example code you find may use C-style strings. Adapting that code to use C++ strings will require some changes to the code; see the hints below.
    • Similarly, some examples include the table size as an additional parameter. In most cases, you can easily remove that part of the code as it will just have % tableSize at the end of the function.
  • Avoid hash functions that require additional libraries or significant support infrastructure. The hash function should be a self-contained function that can easily be copied into your code.
  • Provide attribution
    • Whenever your code is based on someone else's, you must be clear about its provenance, even if you make substantial modifications (e.g., converting code that worked on C strings to work with C++'sstd::string.
      • Failing to provide attribution is plagiarism and is an Honor Code violation.
    • Also, if the code has an open-source license, you should note which one.
      • Note that it is our understanding that submitting a homework assignment as part of an educational experience does not count as “distribution” under software-licensing rules.
  • Only obtain code for hash functions, nothing else!
    • You may not copy code for other parts of the assignment, including hash tables themselves.

Unlike any other parts of assignments in the class, you may use ChatGPT to help you find hash functions (but not any other parts of the assignment). If you do, you must provide attribution to ChatGPT in your code.

Thus, you may use ChatGPT to

  • Suggest sources to find C++ hash functions,
  • Adapt code written for other languages to work with a std::string parameter and return a size_t value, or,
  • Create innovative new hash functions based on its knowledge of the subject area.

A sample conversation can be found here which includes several example hash functions that you can use, but you can also just use the chat as inspiration to guide your own conversation with ChatGPT.

Determine Hash-Function Properties

If you run make stringhash-test, a program named stringhash-test will be generated. (The source code for this program is not part of the starter code for the assignment, but it is included on the CS 70 server). Use this program to determine the properties of your selected hash functions.

The stringhash-test program takes two command-line arguments:

  1. The number of hash-table buckets to simulate; and,
  2. The file name of a dictionary of words.

In the data in your home directory, we have provided three dictionary files for you to use:

  • smalldict.words (about 350 words)
  • ispell.words (about 35,000 words)
  • web2 (about 250,000 words)

Although you can try a small dictionary and a small size, if you've got a decent hash function, you can just go ahead and try the web2 word list with a large number of buckets, like this:

./stringhash-test 150000 ../../data/web2

As an example of good performance, here are the result of the running against the web2 file for the C++ standard library's provided hash function:

== Testing std::hash<std::string>

Inserted 234937 items in 47 ms

For 150000 buckets:
  - Empty bucket percentage:            20.9027%
  - Min items per bucket:               0
  - Max items per bucket:               10
  - Average items per bucket:           1.56625
  - Standard deviation:                 1.00081
  - Average items per non-empty bucket: 1.98015

For 'infinite' buckets (actually 2^32):
  - Expected: 6.42533 hash collisions, stddev 2.53482
  - Actual:   234930 unique hashes, 7 hash collisions:
  - Specifically:
        4243118819: subrhomboid soulcake
        687986930: Pleurotomaria neogenesis
        1985490478: predynamite blasphemy
        3339483196: doxologize bemoaningly
        418121517: phlebitic bebannered
        1406104780: supertranscendent Koreishite
        1751510580: Tat Occamist

Average avalanche:
        32.0234/64 bits for one bit change in 1 character key
        31.9893/64 bits for one bit change in 2 character key
        31.9989/64 bits for one bit change in 4 character key
        31.9997/64 bits for one bit change in 8 character key
        32.0005/64 bits for one bit change in 16 character key
        31.9943/64 bits for one bit change in 32 character key
        31.9986/64 bits for one bit change in 64 character key
        32.0072/64 bits for one bit change in 128 character key
        32.0046/64 bits for one bit change in 256 character key

Interpreting the Results

The first part of the test shows the kind of behavior you might see in a hash table in practice. For example, in the results above, theory tells us to expect 20.8826% of the buckets to be empty, and the results are almost exactly that value.

The second part imagines a high hash table with 232 buckets. The number of collisions should be very small, and there should be no obvious similarity between the few words that collide.

The final test provides “avalanche” information. One of the properties we'd like hash functions to have is that very similar keys don't produce similar hash values. We want a nice uniform distribution, not hash values clustering together.

The avalanche test tells you the result of the question, “If we hash a string of a specific length and then change just one bit at random, how different is the new hash value from the original hash value (in terms of the average number of bits in the 64-bit number that change)?” Ideally, half the bits should change, and you can see that std::hash does a good job of meeting that goal.

The test also measures speed. Both hashing speed and the spread of hash values impact the performance of a hash table. Because hashing is so quick overall, speed results will probably only be meaningful for a large dictionary like web2. Also, it's only meaningful to compare the speed of good hash functions (for technical reasons, because a bad hash function maps to fewer unique buckets, it may get an “unfair” speed advantage in this particular test).

Choose and Explain

Imagine that you were going to use your hash table to implement a spell checker. It would store English-language words and then be used to look up strings to see if they are in the dictionary. Using these results, decide which one of your three hash functions you would select in that scenario.

Briefly document the rationale for your choice of hash function as a block comment in stringhash.cpp. There is a TODO placeholder in the spot where your explanation should go.

Note that there is no “best” hash function! Hash functions all exist in a space of trade-offs. So there isn't a “correct” answer to this question, but your rationale should discuss the trade-offs that you are making with your selection. We are looking for sensible reasoning and thoughtful comparison, not for some objectively correct response.

Helpful Hints

If you find a hash function written in C, it will probably look something like

unsigned int simpleAddHash(const char* str) {
    size_t hash = 0;

    // C strings end with a zero byte (`\0`), so we loop until we see it.
    for (char* p = str; *p != '\0'; ++p) {
        hash = hash + *p;
    }

    return hash;
}

It might even look like this code, which is equivalent to the above, but “cleverer”:

unsigned int simpleAddHash(const char* str) {
    size_t hash = 0;
    char c;

    while ((c = *str++)) {
        hash = hash + c;
    }

    return hash;
}
  • Hedgehog speaking

    That second version seems hard to understand.

  • Duck speaking

    I get it, the = is assignment, putting *str into c, and the loop exits when c is zero. Does it run faster that way?

  • LHS Cow speaking

    No. Quite a lot of C programmers think this kind of code will somehow run faster. Maybe that was true in 1972, but it isn't true with a modern compiler. Some C programmers also think that this kind of cryptic code looks impressive.

  • Duck speaking

    Maybe they think it's is cool because when you finally figure out what it's doing, you have a “aha!” moment?

  • Goat speaking

    Meh. I'm not impressed by gibberish code.

You would need to convert functions like the ones above to use a C++ std::string, like

size_t simpleAddHash(const string& str) {
    size_t hash = 0;

    for (unsigned char c : str) {
        hash = hash + c;
    }

    return hash;
}
  • LHS Cow speaking

    Similarly, some C programmers might write

    x = (x << 2) + x;
    

    instead of

    x = x * 5;
    

    thinking that a binary shift and an addition is faster than a multiplication.

    But they're mistaken. The compiler already understands that 5*x and 4*x + x are the exact same thing. Writing code this way just makes the code needlessly cryptic.

  • RHS Cow speaking

    Since this class isn't about converting weird cryptic falsely optimized C code to C++, you can absolutely get ChatGPT's help to do this conversion.

To Complete This Part of the Assignment…

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)