CS 70

Collision Resolution: Double Hashing

  • Goat speaking

    Meh. Quadratic probing still has some danger of clustering.

  • Dog speaking

    And it can totally fail if the load factor gets too high!

  • Pig speaking

    Maybe there is one MORE technique we could learn to fix all that?

  • LHS Cow speaking

    Okay…

In quadratic probing, we saw that if multiple keys hash to the same bucket they will cluster along the probe sequence. To avoid that issue, we'll give each key its own probe sequence!

Specifically, in double hashing we use two hash functions. Let's name their outputs as follows:

  • \( H \) is the value from the first hash function, \( \mathrm{H}_1(\mathit{key}) \), and provides the bucket (as usual).
  • \( I \) is the value from the second hash function, \( \mathrm{H}_2(\mathit{key}) \), and provides the step (or increment) for the probe sequence.

So our probing sequence will be

\[ H, ( H + I ), ( H + 2I ), ( H + 3I ), \ldots{} . \]

Obviously, it's very important that \( I \) never be \( 0 \)!

If both of our hash functions are good, this approach combats primary clustering (because different keys jump different distances along their probe sequences) and secondary clustering (because keys that hash to the same initial bucket still get different probe sequences).

  • Dog speaking

    Problem solved!

  • Horse speaking

    Woah, woah… What would happen if we had 100 buckets and the increment was 50?

  • LHS Cow speaking

    Right… We would only need to go two steps before we're back where we started.

If the increment, \( I \), and the number of buckets, \( M \), have a factor in common, the number of buckets we can examine is divided by that factor. In this case, \( \mathrm{GCD}(M,I) = \mathrm{GCD}(100,50) = 50 \), so instead of being able to probe all 100 buckets, we return to where we started after just \( 100/50 = 2 \) probes. (\( \mathrm{GCD} \) is the greatest common divisor function.)

Can you think of a way we could avoid this problem?

(Hint: Impose a constraint on a number like \( M \) or \( I \) that reduces or prevents their having a divisor in common.)

If \( M \) is the number of buckets, and \( I \) is our increment (from \( \mathrm{H}_2 \)), then we want it to be the case that \( \mathrm{GCD}(M,I) = 1 \). But how…?

Here are two options:

  • Make \( M \) prime, and ensure that \( 1 \le I < M \).
  • Make \( M \) a power of two, and ensure that \( 1 \le I < M \) and \( I \) is odd.

The second approach avoids the hassle of coming up with a prime hash-table size, but it cuts the number of possible probe patterns in half. Nevertheless, there will still be a lot of possible patterns when \(M\) is large.

  • Hedgehog speaking

    I don't really want to write two hash functions though.

  • LHS Cow speaking

    Actually, although much of the literature on double hashing talks about two hash functions, you really only need one—it just needs to produce both \(H\) and \(I\).

  • Duck speaking

    What?

Suppose someone tells you that you need to have two dice and roll them. You could do that, for sure, but if you had a 36-sided die, you could just roll that one and get the answer for both rolls.

  • Horse speaking

    Woah, what? How does a 36-sided die substitute for two 6-sided dice? You can roll at most twelve with two dice.

  • LHS Cow speaking

    We're not adding the dice rolls; they're separate rolls. Imagine a 6 × 6 square; it has 36 spots. Your 36-sided die can be used to select a specific spot in the square, and that spot's x and y coordinates give you the two distinct rolls.

Similarly, if our hash function produces numbers in a large enough range, it can provide both the bucket and the increment.

Expected Complexity

Again, we'll just report the expected complexity without deriving it. (Technically, this analysis is based on the idea that each key has an entirely random (but nonrepeating) probe sequence; double hashing is claimed to be a good-enough approximation of that behavior.)

Expected Number of Probes for Successful Lookup (Key in Table)

\[ \mathrm{E}[C_{\rm SL}] = \frac{1}{\lambda} \ln{ \left( \frac{1}{1 - \lambda } \right) } \]

Expected Number of Probes for Unsuccessful Lookup (Key Not in Table)

\[ \mathrm{E}[C_{\rm UL}] \approx \frac{\lambda}{ 1 - \lambda } + 1 \]
