Dynamic Table Size
So what have we learned?
We've learned that hash tables are terrible!
Yes, we– Wait, what? Is that what we learned?
I think so… Expected lookup time is \( \Theta(\lambda) \) either way.
So…?
So, \( \lambda = \frac{N}{M} \), right? And that means that \( \Theta(\lambda) = \Theta(N) \)! Terrible!
Ah, I see. Good observation!
But \( \Theta(\lambda) = \Theta(N) \) is only true if \( M \) stays constant….
What if we're allowed to resize our table?
We've seen that the main contributor to the expected complexity of lookup is the load factor, \( \lambda \).
So if we want to keep lookup times low, we need to keep \( \lambda \) low!
Thus, as keys are inserted into the table (increasing \( N \)), we also need to add buckets to the table (increasing \( M \)) to keep the ratio \( \frac{N}{M} \) small.
How to Resize a Hash Table
- Allocate a new array with the required number of buckets.
- For each key in the original table, use the hash function to find its new bucket in the new table and INSERT it there.
  - The new bucket is probably not the same as the old one!
- Delete the old array.
With separate chaining, we can do the resize in \( \Theta(n) \) time.
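As a concrete illustration, here is a minimal sketch of one way to write that procedure in Python, assuming the table is represented as a list of bucket lists and that Python's built-in hash plays the role of the hash function; the names (`resize`, `old_buckets`, `new_m`) are illustrative, not from the text above:

```python
def resize(old_buckets, new_m):
    """Rebuild a separate-chaining table so that it has new_m buckets."""
    # 1. Allocate a new array with the required number of buckets.
    new_buckets = [[] for _ in range(new_m)]
    # 2. Re-insert every key from the original table into its new bucket
    #    (which is probably not the same as its old one!).
    for bucket in old_buckets:
        for key in bucket:
            new_buckets[hash(key) % new_m].append(key)
    # 3. The old array is discarded once the caller replaces it.
    return new_buckets
```

Every key is touched once and every bucket is visited once, which is where the \( \Theta(n) \) bound comes from (assuming the number of buckets stays proportional to the number of keys).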
If we keep resizing the table to keep \( \lambda \) below some constant value, then \( \Theta(\lambda) = \Theta(1) \)!
We've been down this road before with array-backed lists!
We can use the same capacity doubling strategy here.
Proposal
- Set a maximum load factor \( \lambda_{\mathrm{max}} \).
- Start with a table with one bucket.
- Whenever \( \lambda \) reaches \( \lambda_{\mathrm{max}} \), double the size of the table (see the sketch below).
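Here is a minimal sketch of that proposal in Python, reusing the `resize` helper sketched earlier; the class name, the `max_load` parameter, and the set-like interface are illustrative assumptions rather than anything prescribed above:

```python
class ChainedHashTable:
    """Separate-chaining table that doubles when the load factor hits λ_max."""

    def __init__(self, max_load=1.0):
        self.max_load = max_load      # λ_max, a tunable constant
        self.buckets = [[]]           # start with a single bucket (M = 1)
        self.size = 0                 # N, the number of keys stored

    def insert(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if key in bucket:             # already present; nothing to do
            return
        bucket.append(key)
        self.size += 1
        # λ = N / M; once it reaches λ_max, double the number of buckets.
        if self.size / len(self.buckets) >= self.max_load:
            self.buckets = resize(self.buckets, 2 * len(self.buckets))
```

For example, inserting 100 keys into `ChainedHashTable(max_load=1.0)` triggers resizes when the table has 1, 2, 4, …, 64 buckets: a handful of expensive steps scattered among many cheap ones.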
We'll have occasional expensive resize operations, but this plan will give INSERT good amortized complexity. Of course, we were already dealing with expected complexities.
So INSERT has amortized expected \( \Theta(1) \) complexity.
Here, \( \Theta(1) \) is an amortization of the total expected complexity. It's not a statement about the worst-case complexity of INSERT (whether an individual INSERT or a sequence of them).
Specifically, we have:
- The expected complexity of a single insert is \( \mathrm{O}(n) \).
  - Sometimes the expected complexity is \( \Theta(1) \) (no resize).
  - Sometimes the expected complexity is \( \Theta(n) \) (resize).
- But the expected complexity of \( m \) inserts is \( m \times \Theta(1) = \Theta(m) \).
  - That's because the amortized expected complexity is \( \Theta(1) \).
- The worst case for a single insert is still \( \Theta(n) \).
  - And the worst case for \( m \) inserts is still \( \Theta(m^2) \).
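To see where that \( \Theta(m) \) total comes from, note that each resize copies every key currently in the table, and (treating \( \lambda_{\mathrm{max}} \) as a constant) the number of keys roughly doubles between resizes. So over a sequence of \( m \) inserts, the copying work is a geometric series, and the total expected work is

\[
\underbrace{\Theta(m)}_{\text{ordinary inserts}} \;+\; \underbrace{\Theta\bigl(1 + 2 + 4 + \cdots + m\bigr)}_{\text{copying during resizes}} \;=\; \Theta(m) + \Theta(2m) \;=\; \Theta(m).
\]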
If you feel okay about the idea of amortized expected complexity, then you can skip the following video.
If it still feels strange or confusing, this video might help to unpack things some more!
Hash Table Complexities
After all this work, we have the final complexities for a dynamically sized hash table with separate chaining and an ideal hash function:
- LOOKUP: expected \( \Theta(1) \) (worst-case \( \Theta(n) \)).
- INSERT: amortized expected \( \Theta(1) \) (worst-case \( \Theta(n) \)).
Wooooooo! We did it! Constant time INSERT and LOOKUP! Take that, binary-search trees!
No, no. No. That is not the right takeaway.
Those complexities do not give us constant time INSERT and LOOKUP! The fine print matters!
First, we're talking about expected complexities.
If the assumptions we made hold (i.e., our hash function uniformly distributes keys across buckets and our keys are randomly distributed), the theoretical worst-case performance is vanishingly unlikely…
But if those assumptions don't hold (e.g., we have a bad hash function or a carefully contrived set of keys that all hash to the same bucket), the worst case could occur in practice.
Second, when we have a hash table that uses dynamic resizing to bound the load factor, INSERT has amortized expected complexity.
Sometimes an insert will require \( \Theta(n) \) time. Our amortization argument shows that these expensive operations occur rarely enough that, averaged over the whole sequence, the cost stays within a constant budget per operation.
So when we consider total time, it's as if each operation requires expected constant time.
But I might care about more than just the total execution time.
Exactly!
So if I wrote a game that kept growing a hash table, it might stutter each time the hash table resized?
And if I wrote my own hash function, and it wasn't very good, my hash-table performance might be terrible?
Exactly. And these are the kinds of things it's good to be aware of.
Hash tables are cool even with those caveats, and as a result they're incredibly widely used.
We don't need to oversell them by glossing over the realities of their performance. Being aware of the pros and cons helps us make an informed choice. Try again, Dog!
Wooooooo! Hash tables make a different set of trade-offs than binary-search trees and might be a better choice depending on the needs of your problem!
Much better.