Comparisons: Fewest Probes
So now we have all these collision-resolution strategies! We have separate chaining and three different flavors of open addressing.
Which one is best?
Arr! Be you not payin' attention?! There's ne'er a best one!
An' there's no point waitin', 'cause she'll never come, and you'll wake up an old pirate, sad and alone, with nowt but grog t'drown yer regrets.
They each have their own strengths and weaknesses. Let's explore!
Separate Chaining vs. Linear Probing
Let's start by comparing the expected unsuccessful-search complexities of separate chaining and linear probing.
We plot the complexity against the load factor \( \lambda \). But remember that we usually dynamically resize our table to keep \( \lambda \) constant, so it's not the case that as you use the hash table you move along this curve. Instead, the curve indicates how much lookup will cost if you let the hash table fill up to different extents.
- All else being equal, we prefer a higher \( \lambda \).
- More of the table is filled, which means less “wasted” space!
- Although, that said, empty buckets do make unsuccessful search faster.
- All else being equal, we prefer fewer expected comparisons.
- That (roughly speaking) means lower typical runtime!
- So we usually have to find a sweet spot that gives us a high-enough \( \lambda \) and low-enough runtime.
- The sweet spot will be different for different collision-resolution strategies.
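If it helps to see how that plays out in code, here's a minimal sketch (the class name, the starting bucket count, and the 0.75 threshold are all made up for illustration) of a separate-chaining table that keeps \( \lambda \) below a chosen maximum by doubling its bucket count whenever an insertion would push it past the sweet spot:

```cpp
#include <cstddef>
#include <forward_list>
#include <utility>
#include <vector>

// Sketch only: a toy separate-chaining table of ints. The 0.75 threshold is
// illustrative, not taken from any particular implementation, and duplicate
// keys aren't checked; the point is just the resizing policy.
class ChainingTable {
public:
    double loadFactor() const {
        return static_cast<double>(size_) / buckets_.size();
    }

    void insert(int key) {
        // Grow *before* this insertion would push us past the sweet spot.
        if ((size_ + 1.0) / buckets_.size() > kMaxLoadFactor) {
            rehash(2 * buckets_.size());
        }
        buckets_[bucketFor(key)].push_front(key);
        ++size_;
    }

private:
    static constexpr double kMaxLoadFactor = 0.75;  // the "sweet spot" knob

    std::size_t bucketFor(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }

    void rehash(std::size_t newBucketCount) {
        std::vector<std::forward_list<int>> old = std::move(buckets_);
        buckets_ = std::vector<std::forward_list<int>>(newBucketCount);
        for (const auto& chain : old) {
            for (int key : chain) {
                buckets_[bucketFor(key)].push_front(key);  // redistribute keys
            }
        }
    }

    std::vector<std::forward_list<int>> buckets_ =
        std::vector<std::forward_list<int>>(8);  // start with 8 buckets
    std::size_t size_ = 0;
};
```

Doubling the bucket count keeps rehashes rare, so their cost amortizes away and lookups stay near whichever point on the curve the threshold picks.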
Whoah! Separate chaining is awesome! I know you said that there isn't a best one, but that's obviously the best.
Now, hold on. Something looks fishy here.
Hey, nothing wrong with fish!! But now that you mention it, it does seem like someone is trying to pull the wool over our eyes. For near 0 load factor, we have near 0 “probes” for separate chaining but near 1 for linear probing.
Nothing wrong with wool. But yes, very fishy indeed.
The problem is that we were actually counting different things for these different strategies.
- For separate chaining we counted comparisons. When the table is empty, we do 0 comparisons.
- For linear probing we counted probes. When the table is empty, we do 1 probe.
So our curves aren't plotted on the same y-axis!
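To make the difference concrete, here's a sketch of an unsuccessful linear-probing lookup (assuming a hypothetical table of optional ints) that tallies both quantities. On an empty table it reports 1 probe but 0 comparisons, because the first bucket it inspects is empty, so there is no key to compare against:

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <vector>

int main() {
    std::vector<std::optional<int>> table(8);  // an empty eight-bucket table
    int key = 42;
    std::size_t probes = 0;
    std::size_t comparisons = 0;

    std::size_t i = static_cast<std::size_t>(key) % table.size();
    while (true) {
        ++probes;                         // inspecting a bucket is a probe...
        if (!table[i].has_value()) {
            break;                        // ...but an empty bucket costs no comparison
        }
        ++comparisons;                    // occupied bucket: compare keys
        if (*table[i] == key) {
            break;                        // found the key
        }
        i = (i + 1) % table.size();       // linear probing: try the next bucket
    }

    // Prints "1 probe(s), 0 comparison(s)" for this empty table.
    std::cout << probes << " probe(s), " << comparisons << " comparison(s)\n";
}
```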
Let's fix that. Here is the same plot, but measuring comparisons for both.
Okay, but still! Look at how flat that separate-chaining curve is! It's… so… beautiful…
I don't know. Something still doesn't feel right about this.
Say we have a five-bucket table of ints. Now imagine that we insert three keys that all hash to bucket 0. Let's consider the memory usage of our different strategies.
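Here's a rough sketch of that bookkeeping, assuming each chain entry is a node holding a key and a next pointer, and each open-addressing bucket holds an int plus an occupancy flag; the exact byte counts depend on the platform, but the shape of the comparison doesn't:

```cpp
#include <cstddef>
#include <iostream>
#include <optional>

// Sketch only: assumed layouts, so the byte counts are illustrative.
struct ChainNode {
    int key;
    ChainNode* next;
};

int main() {
    constexpr std::size_t M = 5;  // buckets in each table
    constexpr std::size_t N = 3;  // keys inserted, all hashing to bucket 0

    // Separate chaining: M head pointers, plus one heap-allocated node per key.
    std::size_t chainingBytes = M * sizeof(ChainNode*) + N * sizeof(ChainNode);

    // Open addressing: just M buckets, each an int plus an occupancy flag.
    std::size_t openAddressingBytes = M * sizeof(std::optional<int>);

    std::cout << "separate chaining: " << chainingBytes << " bytes\n";
    std::cout << "open addressing:   " << openAddressingBytes << " bytes\n";
}
```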
See, the separate-chaining table uses more memory!
It's almost as if we are giving it more buckets as we add more keys.
So really, although we've been labeling them both \( \lambda \), the “load factor” has different meanings for the two different kinds of table! I guess we should have been tipped off before, because \( \lambda \) can be anything for separate chaining, but for open addressing, \( \lambda \) can never be greater than one.
So if we want our plot to give us a meaningful comparison, we need to get both curves on the same \( x \)-axis.
To make things fair, we should calculate the load factor as if the separate-chaining table had extra buckets, one for each chain node it allocates. Specifically, let
\[ M_{\mathrm{oa}} = M_{\mathrm{sc}} + N, \]
where \( M_{\mathrm{sc}} \) is the number of buckets in the separate-chaining table and \( M_{\mathrm{oa}} \) is the adjusted number of buckets to match an open-addressing table.
Now just a little algebra will give us a formula to convert the separate chaining load factor \( \lambda_{\mathrm{sc}} \) into an adjusted load factor \( \lambda_{\mathrm{oa}} \) to match open addressing.
\[ \lambda_{\mathrm{oa}} = \frac{N}{ M_{\mathrm{oa}} } = \frac{N}{ M_{\mathrm{sc}} + N } = \left( \frac{N}{ M_{\mathrm{sc}} + N } \cdot{} \frac{ \frac{1}{ M_{\mathrm{sc}} } }{ \frac{1}{ M_{\mathrm{sc}} } } \right) = \frac{ \lambda_{\mathrm{sc}} }{ 1 + \lambda_{\mathrm{sc}} } \]
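For instance, the five-bucket example from earlier has \( N = 3 \) keys and \( M_{\mathrm{sc}} = 5 \) buckets, so \( \lambda_{\mathrm{sc}} = 3/5 = 0.6 \) and
\[ \lambda_{\mathrm{oa}} = \frac{0.6}{1 + 0.6} = \frac{3}{8} = 0.375, \]
which is exactly the load factor of an open-addressing table with \( 5 + 3 = 8 \) buckets holding those same three keys.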
And here is a plot of number of comparisons vs. \( \lambda_{\mathrm{oa}} \). Now both curves have the same units for \( x \) and \( y \)!
Oh. Well, okay. I guess separate chaining isn't so amazing. But it is better!
Yeah, the curve is a little bit lower. Remember, that doesn't really mean that separate chaining is faster in some general sense.
It means that to achieve the same expected number of comparisons, linear probing needs to run at a lower load factor.
Usually a higher load factor means less wasted space, so that's an advantage of separate chaining!
Here we see the same two compared for successful search.
Whoah! They're the same??
Yup! After we make our adjustments, we see that the expected number of comparisons for successful search is the same for both linear probing and separate chaining.
In a way it kind of makes sense. In linear probing, the clusters are kind of like chains.
All Approaches
Here are plots for unsuccessful and successful search with all of our open addressing methods.
Hey! The line for quadratic probing continues on all the way to 1, but that method could get stuck in a loop if \( \lambda \) goes beyond 0.5!
True.
And below 0.5, it's only a little bit better than linear probing! Meh, I think that quadratic probing is hot garbage.
So double hashing is the best then!
Remember, we're just telling one piece of the story here.
Double hashing is more complex to implement, and it needs two (independent) pieces of random-seeming data: a starting bucket and an increment.
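To see what those two pieces buy us, here's a sketch of the probe sequence (the hash functions and their constants are placeholders, not anything standard). The increment must be nonzero and share no common factor with the table size, or the sequence won't visit every bucket, which is one reason implementations often keep the table size prime.

```cpp
#include <cstddef>

// Sketch only: hypothetical hash functions; any two independent hashes will do.
std::size_t hash1(int key) { return static_cast<std::size_t>(key) * 2654435761u; }
std::size_t hash2(int key) { return static_cast<std::size_t>(key) * 40503u + 97u; }

// The i-th bucket that double hashing examines (requires tableSize >= 2).
std::size_t probe(int key, std::size_t i, std::size_t tableSize) {
    std::size_t start = hash1(key) % tableSize;            // starting bucket
    std::size_t step = 1 + hash2(key) % (tableSize - 1);   // increment in [1, tableSize - 1]
    return (start + i * step) % tableSize;
}
```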
I like separate chaining best. Nice and easy to write, and the number of comparisons seems good enough.
I do, too.
std::unordered_set uses separate chaining.
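It also lets you watch (and tune) the load-factor machinery directly. Here's a quick peek; the exact bucket counts printed depend on the library implementation:

```cpp
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    std::cout << "max_load_factor: " << s.max_load_factor() << "\n";  // defaults to 1.0

    for (int k = 0; k < 1000; ++k) {
        s.insert(k);
    }

    // The container rehashes as it grows so that load_factor() stays at or
    // below max_load_factor().
    std::cout << "size:         " << s.size() << "\n";
    std::cout << "bucket_count: " << s.bucket_count() << "\n";
    std::cout << "load_factor:  " << s.load_factor() << "\n";
}
```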
But maybe there are even MORE trade-offs to consider…?
Indeed… Let's take a look…
Oh, no…