Phase 6: HashSet<T>
Performance Statistics
We're almost done! In this final coding step, you will implement some additional HashSet
member functions that provide information about the performance of a particular hash table. All the functions implemented in this step must execute in \( \Theta(1) \) time. Most likely, implementing these functions will also involve making changes to other functions that you have already implemented.
Let's Go!
The Helpful Hints below contain some information that isn't just helpful, it's pretty essential.
Uh, maybe you should call them Essential Information then!
And don't forget to have the Specifications for
HashSet<T>
nearby!
For this phase you will plan, write tests for, implement, and test the following functions:
collisions()
maximal()
Once these functions are written, you will also be able to use the provided function
showStatistics()
collisions()
This function returns the number of times that, when inserting a key, the bucket it was placed in already had at least one other thing in it (i.e., wasn't empty).
For a newly created HashSet
, collisions()
should return 0.
This function should include collisions encountered during reallocation! So, in an extreme example, if all \( n \) keys hashed to bucket 0, then there should be \( n - 1 \) collisions. If we insert the \( ( n + 1 )^{\mathrm{th}} \) key (which also hashes to 0); and if, during reallocation, the \( n + 1 \) keys still all hash to bucket 0; then there should be \( n \) collisions. Of course, in other cases the number of collisions might decrease after a reallocation.
maximal()
This function returns the maximum number of steps taken to find a key since the last time the table was reallocated.
This value is a lower bound on the length of the longest chain. The reason we don't just keep track of the longest chain is that std::forward_list
doesn't offer a size function. So the only way to find the size of a chain is to loop through the whole list (expensive) or to store the size separately for each bucket (takes up space).
For a newly created HashSet
, maximal()
should return 0.
Whenever you perform a lookup, you will take some number of steps in a chain to find the key (or not). If that number of steps is greater than the known maximum number of steps, you should update the maximum number of steps. In an extreme case where all \( n \) keys hash to bucket 0, if you perform a failed lookup on bucket 0, then afterwards maximal()
should return \( n \).
If you perform an insertion, you will perform a lookup to prevent duplicate keys. If you do end up inserting the key, you should consider the length of that chain increased by one (since you are adding a key to it). In an extreme case, where all \( n \) keys hash to bucket 0 and you successfully insert a key into bucket 0, then afterwards maximal()
should return \( n + 1 \).
During a rehash()
, if the table is reallocated, the keys are rearranged and you don't perform lookups, so we don't get a good estimate of the maximum number of steps. In that case, rehash()
should set maximal()
to the smallest value it could possibly be, which is the ceil
ing of the load factor.
Planning Image
Add your planning image named Phase_6.jpg
to the Planning
directory.
Helpful Hints
How to Update Statistics Inside a const
Function
You might wonder how a const
method for searching your hash table can update the hash-table statistics for maximal()
, as const
methods aren't supposed to change their object.
C++ has an escape hatch for exactly this purpose. If you declare a data member as mutable
in your header file; for example, as
mutable size_t maximal_;
C++ will let you update the data member even within a const
method. This feature is very useful for cases like we have here, where calling exists()
shouldn't change the observable behavior of the set (in terms of what is or is not in the set), but we do want it to update private bookkeeping information.
Carefully Choose Your Hash Function(s) for Testing
As we mentioned before, a good hash function produces random-looking values, which isn't very good for testing.
Instead, think about simple hash functions that will cause predictable behavior. You don't know when the hash table will reallocate, but you can decide how many buckets it starts with. You can also pick a hash function that allows you to arrange keys in particular ways. For example, if two keys have the same hash function, you know they will end up in the same bucket (even if a reallocation occurs).
Try to arrange test cases by picking the number of buckets, the hash function, and the keys you insert so that you know how a correct HashSet
must behave when one or more of these functions is called. In some cases you can arrange things so that it does matter whether the reallocation has occurred or not. Other times, you might need to consider both possibilities in your test case.
(When logged in, completion status appears here.)