CS 70

Comparisons: Fewest Probes

  • LHS Cow speaking

    So now we have all these collision-resolution strategies! We have separate chaining and three different flavors of open addressing.

  • Duck speaking

    Which one is best?

  • Octopus speaking

    Arr! Be you not payin' attention?! There's ne'er a best one!

    An' there's no point waitin', 'cause she'll never come, and you'll wake up an old pirate, sad and alone, with nowt but grog t'drown yer regrets.

  • LHS Cow speaking

    They each have their own strengths and weaknesses. Let's explore!

Separate Chaining vs. Linear Probing

Let's start by comparing the expected unsuccessful-search complexities of separate chaining and linear probing.

We plot the complexity against the load factor \( \lambda \). But remember that we usually dynamically resize our table to keep \( \lambda \) constant, so it's not the case that as you use the hash table you move along this curve. Instead, the curve indicates how much a lookup will cost if you let the hash table fill up to different extents.
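
For reference, the standard approximations for an unsuccessful search (assuming simple uniform hashing, and assuming these are the formulas behind curves like the ones plotted here) are \( \lambda \) comparisons for separate chaining and, using Knuth's estimate,

\[ \frac{1}{2}\left( 1 + \frac{1}{(1 - \lambda)^{2}} \right) \]

probes for linear probing.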

  • All else being equal, we prefer a higher \( \lambda \).
    • More of the table is filled, which means less “wasted” space!
    • Although, that said, empty buckets do make unsuccessful search faster.
  • All else being equal, we prefer fewer expected comparisons.
    • That (roughly speaking) means lower typical runtime!
  • So we usually have to find a sweet spot that gives us a high-enough \( \lambda \) and low-enough runtime.
    • The sweet spot will be different for different collision-resolution strategies.

  • Dog speaking

    Whoah! Separate chaining is awesome! I know you said that there isn't a best one, but that's obviously the best.

  • LHS Cow speaking

    Now, hold on. Something looks fishy here.

  • Fish speaking

Hey, nothing wrong with fish!! But now that you mention it, it does seem like someone is trying to pull the wool over our eyes. At a load factor near 0, we get nearly 0 “probes” for separate chaining but nearly 1 for linear probing.

  • Sheep speaking

    Nothing wrong with wool. But yes, very fishy indeed.

The problem is that we were actually counting different things for these different strategies.

  • For separate chaining we counted comparisons. When the table is empty, we do 0 comparisons.
  • For linear probing we counted probes. When the table is empty, we do 1 probe.

So our curves aren't plotted on the same y-axis!

Let's fix that. Here is the same plot, but measuring comparisons for both.
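
(The conversion is straightforward: in an unsuccessful search, every probe of an occupied bucket costs one key comparison, while the final probe of an empty bucket costs none, so \( \text{comparisons} = \text{probes} - 1 \).)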

  • Dog speaking

    Okay, but still! Look at how flat that separate-chaining curve is! It's… so… beautiful…

  • LHS Cow speaking

    I don't know. Something still doesn't feel right about this.

Say we have a five-bucket table of ints. Now imagine that we insert three keys that all hash to bucket 0. Let's consider the memory usage of our different strategies.

If our table uses an open addressing technique (e.g., linear probing), how many spots in memory does it use?

(By a “spot” in memory, we mean a place where we can store something, like a key.)

If our table uses separate chaining, how many spots in memory does it use?

(By a “spot” in memory, we mean a place where we can store something, like a pointer, or a linked-list node.)

See, the separate-chaining table uses more memory!
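
One way to tally it up (counting each bucket pointer and each linked-list node as one “spot”): the open addressing table uses just its 5 buckets, while the separate-chaining table uses its 5 bucket pointers plus one node per key, \( 5 + 3 = 8 \) spots in all.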

It's almost as if we are giving it more buckets as we add more keys.

So really, although we've been labeling them both \( \lambda \), the “load factor” has different meanings for the two different kinds of table! I guess we should have been tipped off before, because \( \lambda \) can be anything for separate chaining, but for open addressing, \( \lambda \) can never be greater than one.

So if we want our plot to give us a meaningful comparison, we need to get both curves on the same \( x \)-axis.

To make things fair, we should calculate the load factor as if the separate-chaining table had those extra per-key spots as buckets. Specifically, let

\[ M_{\mathrm{oa}} = M_{\mathrm{sc}} + N, \]

where \( M_{\mathrm{sc}} \) is the number of buckets in the separate chaining table and \( M_{\mathrm{oa}} \) is the adjusted number of buckets to match an open addressing table.

Now just a little algebra gives us a formula to convert the separate-chaining load factor \( \lambda_{\mathrm{sc}} \) into an adjusted load factor \( \lambda_{\mathrm{oa}} \) that matches open addressing.

\[ \lambda_{\mathrm{oa}} = \frac{N}{ M_{\mathrm{oa}} } = \frac{N}{ M_{\mathrm{sc}} + N } = \frac{ N / M_{\mathrm{sc}} }{ ( M_{\mathrm{sc}} + N ) / M_{\mathrm{sc}} } = \frac{ \lambda_{\mathrm{sc}} }{ 1 + \lambda_{\mathrm{sc}} } \]
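
As a quick sanity check, a “full” separate-chaining table with \( \lambda_{\mathrm{sc}} = 1 \) corresponds to \( \lambda_{\mathrm{oa}} = \frac{1}{1 + 1} = 0.5 \), and as \( \lambda_{\mathrm{sc}} \to \infty \), \( \lambda_{\mathrm{oa}} \to 1 \), the open addressing limit.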

And here is a plot of the number of comparisons vs. \( \lambda_{\mathrm{oa}} \). Now both curves have the same units for \( x \) and \( y \)!
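
If you want to play with the numbers yourself, here is a small sketch that tabulates both curves at matched values of \( \lambda_{\mathrm{oa}} \). It assumes the classic textbook formulas mentioned earlier, which may differ slightly from whatever produced the course's plots.

    #include <cstdio>

    // Expected comparisons for an unsuccessful search with separate
    // chaining: on average we scan a whole chain of length lambda_sc.
    double chainComparisons(double lambdaSC) {
        return lambdaSC;
    }

    // Expected comparisons for an unsuccessful search with linear
    // probing: Knuth's probe estimate, minus the final
    // (comparison-free) probe of an empty bucket.
    double linearProbeComparisons(double lambdaOA) {
        double probes = 0.5 * (1.0 + 1.0 / ((1.0 - lambdaOA) * (1.0 - lambdaOA)));
        return probes - 1.0;
    }

    // Convert a separate-chaining load factor into the adjusted
    // (open addressing equivalent) load factor derived above.
    double adjustedLoad(double lambdaSC) {
        return lambdaSC / (1.0 + lambdaSC);
    }

    int main() {
        std::printf("lambda_oa   chaining   linear probing\n");
        for (double lambdaSC = 0.25; lambdaSC <= 4.0; lambdaSC *= 2.0) {
            double lambdaOA = adjustedLoad(lambdaSC);
            std::printf("%9.3f   %8.3f   %14.3f\n", lambdaOA,
                        chainComparisons(lambdaSC),
                        linearProbeComparisons(lambdaOA));
        }
    }

For example, at \( \lambda_{\mathrm{sc}} = 1 \) (so \( \lambda_{\mathrm{oa}} = 0.5 \)) this gives 1 expected comparison for chaining versus 1.5 for linear probing, which is why the chaining curve sits a little lower.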

  • Dog speaking

    Oh. Well, okay. I guess separate chaining isn't so amazing. But it is better!

  • LHS Cow speaking

    Yeah, the curve is a little bit lower. Remember, that doesn't really mean that separate chaining is faster in some general sense.

  • RHS Cow speaking

It means that to achieve the same expected number of comparisons, linear probing needs to have a lower load factor.

  • LHS Cow speaking

    Usually a higher load factor means less wasted space, so that's an advantage of separate chaining!

Here we see the same two compared for successful search.

  • Duck speaking

    Whoah! They're the same??

  • LHS Cow speaking

    Yup! After we make our adjustments, we see that the expected number of comparisons for successful search is the same for both linear probing and separate chaining.

  • RHS Cow speaking

    In a way it kind of makes sense. In linear probing, the clusters are kind of like chains.
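
In fact, a little algebra confirms it (again assuming the classic approximations). For a successful search, every probe lands on an occupied bucket, so linear probing's probes and comparisons coincide. Separate chaining needs about \( 1 + \lambda_{\mathrm{sc}}/2 \) comparisons, linear probing about \( \frac{1}{2}\left(1 + \frac{1}{1 - \lambda_{\mathrm{oa}}}\right) \), and since \( \lambda_{\mathrm{oa}} = \lambda_{\mathrm{sc}}/(1 + \lambda_{\mathrm{sc}}) \) implies \( \frac{1}{1 - \lambda_{\mathrm{oa}}} = 1 + \lambda_{\mathrm{sc}} \), we get

\[ \frac{1}{2}\left( 1 + \frac{1}{1 - \lambda_{\mathrm{oa}}} \right) = \frac{1}{2}\bigl( 1 + (1 + \lambda_{\mathrm{sc}}) \bigr) = 1 + \frac{\lambda_{\mathrm{sc}}}{2}. \]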

All Approaches

Here are plots for unsuccessful and successful search with all of our open addressing methods.

  • Horse speaking

Hay! The line for quadratic probing continues all the way to 1, but that method could get stuck in a loop if \( \lambda \) goes beyond 0.5!

  • LHS Cow speaking

    True.
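
To see why it gets stuck, take a five-bucket table. The quadratic offsets repeat with period 5:

\[ i^{2} \bmod 5 = 0, 1, 4, 4, 1, 0, 1, 4, \ldots \qquad (i = 0, 1, 2, 3, 4, 5, \ldots) \]

so a probe sequence only ever visits three of the five buckets. Once those three are full (that's \( \lambda = 0.6 \)), the search loops forever even though two buckets are still empty.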

  • Goat speaking

    And below 0.5, it's only a little bit better than linear probing! Meh, I think that quadratic probing is hot garbage.

  • Duck speaking

    So double hashing is the best then!

  • LHS Cow speaking

    Remember, we're just telling one piece of the story here.

  • RHS Cow speaking

Double hashing is more complex to implement, and it needs two (independent) pieces of random-seeming data: a starting bucket and an increment.
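
As a rough sketch of what that looks like in code (the two hash computations here are stand-ins, not anything from the course):

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <string>

    int main() {
        const std::size_t numBuckets = 11;  // a prime, so every step size works
        const std::string key = "cs70";

        // Two independent-ish pieces of random-seeming data: one hash
        // picks the starting bucket, a second picks the increment
        // (which must never be zero).
        std::size_t start = std::hash<std::string>{}(key) % numBuckets;
        std::size_t step = 1 + std::hash<std::string>{}("salt:" + key) % (numBuckets - 1);

        // The first few buckets a search for this key would probe.
        for (std::size_t i = 0; i < 5; ++i) {
            std::printf("probe %zu: bucket %zu\n", i, (start + i * step) % numBuckets);
        }
    }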

  • Dog speaking

    I like separate chaining best. Nice and easy to write, and the number of comparisons seems good enough.

  • Bjarne speaking

    I do, too. std::unordered_set uses separate chaining.
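
Conveniently, the standard library also lets you inspect and cap the load factor yourself. A small example:

    #include <cstdio>
    #include <unordered_set>

    int main() {
        std::unordered_set<int> s;
        s.max_load_factor(1.0f);  // rehash whenever lambda would exceed 1

        for (int i = 0; i < 1000; ++i) {
            s.insert(i);
        }

        // The table resizes itself to keep load_factor() <= max_load_factor().
        std::printf("buckets: %zu, load factor: %f\n",
                    s.bucket_count(), s.load_factor());
    }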

  • Pig speaking

    But maybe there are even MORE trade-offs to consider…?

  • LHS Cow speaking

    Indeed… Let's take a look…

  • Hedgehog speaking

    Oh, no…
