CS 70

Phase 6: HashSet<T> Performance Statistics

We're almost done! In this final coding step, you will implement some additional HashSet member functions that provide information about the performance of a particular hash table. All the functions implemented in this step must execute in \( \Theta(1) \) time. Most likely, implementing these functions will also involve making changes to other functions that you have already implemented.

Let's Go!

  • RHS Cow speaking

    The Helpful Hints below contain some information that isn't just helpful, it's pretty essential.

  • Goat speaking

    Uh, maybe you should call them Essential Information then!

  • LHS Cow speaking

    And don't forget to have the Specifications for HashSet<T> nearby!

For this phase you will plan, write tests for, implement, and test the following functions:

  • collisions()
  • maximal()

Once these functions are written, you will also be able to use the provided function

  • showStatistics()

collisions()

This function returns the number of times that, when inserting a key, the bucket it was placed in already had at least one other thing in it (i.e., wasn't empty).

For a newly created HashSet, collisions() should return 0.

This function should include collisions encountered during reallocation! So, in an extreme example, if all \( n \) keys hashed to bucket 0, then there should be \( n - 1 \) collisions. If we insert the \( ( n + 1 )^{\mathrm{th}} \) key (which also hashes to 0); and if, during reallocation, the \( n + 1 \) keys still all hash to bucket 0; then there should be \( n \) collisions. Of course, in other cases the number of collisions might decrease after a reallocation.

maximal()

This function returns the maximum number of steps taken to find a key since the last time the table was reallocated.

This value is a lower bound on the length of the longest chain. The reason we don't just keep track of the longest chain is that std::forward_list doesn't offer a size function. So the only way to find the size of a chain is to loop through the whole list (expensive) or to store the size separately for each bucket (takes up space).

For a newly created HashSet, maximal() should return 0.

Whenever you perform a lookup, you will take some number of steps in a chain to find the key (or not). If that number of steps is greater than the known maximum number of steps, you should update the maximum number of steps. In an extreme case where all \( n \) keys hash to bucket 0, if you perform a failed lookup on bucket 0, then afterwards maximal() should return \( n \).

If you perform an insertion, you will perform a lookup to prevent duplicate keys. If you do end up inserting the key, you should consider the length of that chain increased by one (since you are adding a key to it). In an extreme case, where all \( n \) keys hash to bucket 0 and you successfully insert a key into bucket 0, then afterwards maximal() should return \( n + 1 \).

During a rehash(), if the table is reallocated, the keys are rearranged and you don't perform lookups, so we don't get a good estimate of the maximum number of steps. In that case, rehash() should set maximal() to the smallest value it could possibly be, which is the ceiling of the load factor.

Planning Image

Add your planning image named Phase_6.jpg to the Planning directory.

Helpful Hints

How to Update Statistics Inside a const Function

You might wonder how a const method for searching your hash table can update the hash-table statistics for maximal(), as const methods aren't supposed to change their object.

C++ has an escape hatch for exactly this purpose. If you declare a data member as mutable in your header file; for example, as

mutable size_t maximal_;

C++ will let you update the data member even within a const method. This feature is very useful for cases like we have here, where calling exists() shouldn't change the observable behavior of the set (in terms of what is or is not in the set), but we do want it to update private bookkeeping information.

Carefully Choose Your Hash Function(s) for Testing

As we mentioned before, a good hash function produces random-looking values, which isn't very good for testing.

Instead, think about simple hash functions that will cause predictable behavior. You don't know when the hash table will reallocate, but you can decide how many buckets it starts with. You can also pick a hash function that allows you to arrange keys in particular ways. For example, if two keys have the same hash function, you know they will end up in the same bucket (even if a reallocation occurs).

Try to arrange test cases by picking the number of buckets, the hash function, and the keys you insert so that you know how a correct HashSet must behave when one or more of these functions is called. In some cases you can arrange things so that it does matter whether the reallocation has occurred or not. Other times, you might need to consider both possibilities in your test case.

To Complete This Part of the Assignment…

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)