CS 70

Collision Resolution: Linear Probing

  • LHS Cow speaking

    Remember we talked about collisions last time?

  • Cat speaking

    Sure! We solved that by letting multiple keys be in one bucket.

  • LHS Cow speaking

    Yep! Now we are going to consider a different strategy.

  • RHS Cow speaking

    With open addressing, each bucket still holds at most one key, but a key can go into a different bucket if its assigned bucket is already occupied.

Thinking it Through

We'll hash our data using the djb2 hash function we introduced at the end of last lesson. Here are some foods we might store and their hash values:

key djb2 hash value
avocado 1239543842
eggs 2090218923
pancakes 186607115
sausage 2196978158
steak 274815133
waffles 3049638669

Let's suppose we've inserted everything except steak into our hash table. So far, so good. (Recall that our hash table is size 10, so when we take the hash value mod 10, it's (conveniently) just the last digit of the numbers shown above).

bucket key
0  
1  
2 avocado
3 eggs
4  
5 pancakes
6  
7  
8 sausage
9 waffles

Now let's insert steak.

  • The djb2 hash value is 274815133.

  • So the bucket would be: ( 274815133 % 10 ) = 3.

  • LHS Cow speaking

    You can see that eggs is already in that bucket. So where should steak go?

  • Cat speaking

    How about… the next bucket?

  • LHS Cow speaking

    Sure, let's do that!

Here's the new table:

bucket key
0  
1  
2 avocado
3 eggs
4 steak
5 pancakes
6  
7  
8 sausage
9 waffles
  • LHS Cow speaking

    So if I wanted to look up steak, what should happen?

  • Cat speaking

    First we hash steak and get bucket 3, but steak isn't there.

  • LHS Cow speaking

    Right.

  • Cat speaking

    So we look at the next bucket, and it is there!

  • LHS Cow speaking

    Sounds reasonable to me!

Now let's insert tofu.

  • The djb2 hash value is 2090766659.

  • So the bucket would be: ( 2090766659 % 10 ) = 9.

  • LHS Cow speaking

    Once again, that bucket is occupied. Where should tofu go?

  • Cat speaking

    Maybe bucket 0?

  • LHS Cow speaking

    That's right! Since bucket 9 is the last bucket, we wrap around to the start of the table, and tofu goes in bucket 0.

Linear Probing

The strategy we've started to invent above is called linear probing.

The Big Idea: Collisions go to the next free bucket.

In practice that means we have a sequence of “probes” to find the right bucket.

If \( H \) is the hash value for our key, we check buckets \( H, H + 1, H + 2, \ldots{} \) (each taken mod the table size, so the sequence wraps around).

Details

  • For insert, follow the probing sequence until either
    1. The key is found, in which case do not insert; or,
    2. An empty bucket is found, in which case the key will go in that bucket.
  • For lookup, follow the probing sequence until either
    1. The key is found, in which case we can return true; or,
    2. An empty bucket is found, in which case the key is not in the table.
  • If the probing sequence goes past the last bucket, wrap around to the first bucket. (These rules are sketched in code just below.)
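
To make these rules concrete, here's a minimal C++ sketch of linear-probing insert and lookup. It's a sketch under simple assumptions, not any official CS 70 interface: the names are made up, keys are ints, and the hash function is the identity (as in the example below).

    #include <cstddef>
    #include <optional>
    #include <vector>

    // A bucket either holds a key or is empty, so std::optional<int> works nicely.
    class LinearProbingTable {
     public:
        explicit LinearProbingTable(size_t numBuckets) : buckets_(numBuckets) {
        }

        // Insert: follow the probe sequence until we find the key (do nothing)
        // or an empty bucket (put the key there).
        // Assumes the table is never completely full, so an empty bucket exists.
        void insert(int key) {
            size_t b = hash(key) % buckets_.size();
            while (buckets_[b].has_value()) {
                if (*buckets_[b] == key) {
                    return;                         // already present: don't insert twice
                }
                b = (b + 1) % buckets_.size();      // next bucket, wrapping around
            }
            buckets_[b] = key;                      // first empty bucket gets the key
        }

        // Lookup: follow the probe sequence until we find the key (true)
        // or an empty bucket (false).
        bool lookup(int key) const {
            size_t b = hash(key) % buckets_.size();
            while (buckets_[b].has_value()) {
                if (*buckets_[b] == key) {
                    return true;
                }
                b = (b + 1) % buckets_.size();
            }
            return false;                           // empty bucket: key isn't in the table
        }

     private:
        static size_t hash(int key) {
            return static_cast<size_t>(key);        // identity hash, for illustration
        }

        std::vector<std::optional<int>> buckets_;   // at most one key per bucket
    };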

Example

Start with a table with 20 buckets. Our keys are integers and our hash function is just the identity function, so the mapping from keys to table buckets is really simple: key % 20.

Draw the table on a piece of paper, then insert the following keys in this order:

23, 60, 77, 34, 0, 39, 3, 9, 87, 25, 24, 1, 21

List the contents of the table; put a single space in between each key and use an underscore (_) for empty buckets.
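
If you want to check your paper answer afterwards, here's a short, self-contained sketch that performs exactly these inserts with linear probing and prints the table in the requested format. (It skips the duplicate-key check, since these keys are all distinct.)

    #include <cstddef>
    #include <iostream>
    #include <optional>
    #include <vector>

    int main() {
        // 20 buckets, integer keys, identity hash: the home bucket is key % 20.
        std::vector<std::optional<int>> table(20);

        for (int key : {23, 60, 77, 34, 0, 39, 3, 9, 87, 25, 24, 1, 21}) {
            size_t b = key % table.size();
            while (table[b].has_value()) {      // bucket taken: keep probing
                b = (b + 1) % table.size();     // wrap around past the last bucket
            }
            table[b] = key;
        }

        // Print keys separated by single spaces, with _ for empty buckets.
        for (size_t b = 0; b < table.size(); ++b) {
            if (b > 0) {
                std::cout << ' ';
            }
            if (table[b].has_value()) {
                std::cout << *table[b];
            } else {
                std::cout << '_';
            }
        }
        std::cout << std::endl;
        return 0;
    }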

The video below explains the solution…

So we noticed a problem in the last example: our keys all started to pile up in one part of the table!

What is the worst-case complexity of lookup in a hash table using linear probing?

In the worst case, lookup takes \( \Theta(n) \) time!

Imagine a terrible scenario where the table is almost full. It might be that we are searching for a key that hashes to bucket 0, but that key has been placed in the last bucket! Or, for an unsuccessful search, we might be searching for a key that hashes to bucket 0, but the first blank bucket might be the last bucket.

In either case, we would perform \( n \) probes before we could conclude that the key was or wasn't in the table: \( \Theta(n) \) time in the worst case.

Primary Clustering

We see that when we have lots of keys all in a row, inserts and lookups take longer.

Having lots of keys in a row is called primary clustering.

Primary Clustering Slows Down Operations

For example, consider two hash tables with \( \lambda = 0.5 \) (in the figures, blue boxes represent filled buckets):

Table 1. Unclustered: filled and empty buckets alternate.

Table 2. Clustered: all of the filled buckets sit in one contiguous run.

Now let's figure out the expected complexity of an unsuccessful search in tables with these configurations.

We'll assume that our hash function uniformly randomly assigns keys to buckets, so we are equally likely to start our search in any of the buckets.

In general, we can think of the expected complexity as

\[ \sum^{M}_{b=1} \mathrm{Pr}(b)C(b) \]

That is, we sum over the buckets. For each bucket \( b \), we calculate the probability of our key hashing to that bucket, \( \mathrm{Pr}(b) \), and also the cost of an unsuccessful search starting in that bucket, \( C(b) \).

We assume that every bucket is equally likely, so \( \mathrm{Pr}(b) = \frac{1}{M} = \frac{1}{2N}\) for every \( b \). (With \( N \) keys and \( \lambda = 0.5 \), each table has \( M = 2N \) buckets.)

Let's calculate the expected number of probes for each of these specific tables.

What is the expected number of probes for an unsuccessful search in Table 1? (Note that a “probe” is “looking in a bucket”, not “looking at a key”—looking at an empty bucket is still a probe.)

Remember that an unsuccessful search always ends at an empty bucket (which indicates that the key is not present).

So \( C(b) \) is either 1 or 2, depending on which bucket we start at:

  • If the key is hashed to an empty bucket, it takes only one probe to determine that it is empty.
    • There are \( N \) of these
  • If the key is hashed to a full bucket, we must probe that bucket and the next one.
    • There are \( N \) of these

So, all together, we've got

\[ \mathrm{E}[C_{\rm UL}] = \frac{1}{2N}(1) + \frac{1}{2N}(2) + \frac{1}{2N}(1) + \frac{1}{2N}(2) + \cdots. \]

Collecting the terms, we see that

\[ \mathrm{E}[C_{\rm UL}] = \frac{N}{2N}(1) + \frac{N}{2N}(2) = \frac{3N}{2N} = 1.5. \]

What is the expected number of probes for an unsuccessful search in Table 2?

(You might have been able to “just know” the answer to the last question, but this one is a bit trickier. You might want to get out a piece of paper and work through it using similar reasoning to the kind we used for the last question.)

For the empty buckets, an unsuccessful search takes only 1 probe. There are \( N \) of those, each with a probability of \( \frac{1}{2N} \).

For the full buckets, it depends on which bucket we start at. The first bucket takes \( N + 1 \) probes (\( + 1 \) for the empty bucket at the end), the next takes \( N \) probes, the next \( N - 1 \), and so on.

So, all together, the expected number of probes is

\[ \mathrm{E}[C_{\rm UL}] = \underbrace{\left(\frac{1}{2N} + \frac{1}{2N} + \cdots{} + \frac{1}{2N}\right)}_{N \; \text{terms}} + \left( \frac{N + 1}{2N} + \frac{N}{2N} + \frac{N - 1}{2N} + \cdots{} + \frac{2}{2N} \right). \]

Collecting all those terms together, we get

\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] &= N\left(\frac{1}{2N}\right) + \sum^{N}_{b=1} \left( \frac{ b + 1 }{2N} \right)\\ &= \frac{1}{2} + \frac{1}{2N} \sum^{N}_{b=1} ( b + 1 ). \end{aligned} \]

Splitting the terms of the summation up, we get

\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] &= \frac{1}{2} + \left(\frac{1}{2N} \cdot{} \sum^{N}_{b=1} b\right) + \frac{N}{2N}\\ &= 1 + \frac{1}{2N} \cdot{} \sum^{N}_{b=1} b. \end{aligned} \]

Finally, we can apply Gauss' formula and simplify:

\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] &= 1 + \frac{1}{2N} \cdot{} \frac{ ( N + 1 ) N }{2}\\ &= 1 + \frac{ N + 1 }{4}\\ &= 1 + \frac{N}{4} + 0.25\\ &= 1.25 + \frac{N}{4}\\&\approx \frac{N}{4}. \end{aligned} \]
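
If you'd like to sanity-check both results without redoing the algebra, here's a small sketch that averages the cost of an unsuccessful search over every possible starting bucket. It assumes the layouts implied by the derivations above: Table 1 alternates full and empty buckets, and Table 2 packs its \( N \) full buckets into one contiguous run.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Expected probes for an unsuccessful search: average, over all starting
    // buckets, of the number of probes up to and including the first empty bucket.
    double expectedUnsuccessfulProbes(const std::vector<bool>& full) {
        size_t M = full.size();
        size_t totalProbes = 0;
        for (size_t start = 0; start < M; ++start) {
            size_t b = start;
            size_t probes = 1;                 // probing the starting bucket
            while (full[b]) {                  // keep probing while buckets are full
                b = (b + 1) % M;               // wrap around
                ++probes;
            }
            totalProbes += probes;
        }
        return static_cast<double>(totalProbes) / M;
    }

    int main() {
        const size_t N = 8;                    // N keys in M = 2N buckets, so lambda = 0.5

        std::vector<bool> unclustered(2 * N);  // Table 1: full, empty, full, empty, ...
        for (size_t b = 0; b < 2 * N; b += 2) {
            unclustered[b] = true;
        }

        std::vector<bool> clustered(2 * N);    // Table 2: N full buckets in a row
        for (size_t b = 0; b < N; ++b) {
            clustered[b] = true;
        }

        std::cout << "Table 1: " << expectedUnsuccessfulProbes(unclustered) << '\n'
                  << "Table 2: " << expectedUnsuccessfulProbes(clustered) << '\n';
        return 0;
    }

For \( N = 8 \) this prints 1.5 for Table 1 and 3.25 for Table 2, matching \( 1.25 + \frac{N}{4} \).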

So even though they have the same load factor, the expected number of probes for an unsuccessful search is

  • \( \Theta(1) \) in Table 1; but,
  • \( \Theta(n) \) in Table 2.

Key Idea: The complexity of search is impacted by clustering, not just by \( \lambda \) itself.

Primary Clustering is Self-Perpetuating

Worse still, clusters beget clusters!

  1. The bigger a cluster, the more likely it is that a key will be hashed into it.
  2. The more inserted keys that hash into the cluster, the bigger the cluster gets!
  3. GOTO 1

So primary clustering is the main downside to linear probing.

Expected Complexity

Because of the effects of clustering, the expected complexity analysis for linear probing is complicated.

We won't go through it in this course, but here's the upshot.

Expected number of probes for successful lookup (key in the table):

\[ \mathrm{E}[C_{\rm SL}] \approx \frac{1}{2} \left( 1 + \frac{1}{ 1 - \lambda } \right) . \]

Expected number of probes for unsuccessful lookup (key not in the table): \[ \mathrm{E}[C_{\rm UL}] \approx \frac{1}{2} \left( 1 + \frac{1}{ ( 1 - \lambda )^{2} } \right) . \]

  • Hedgehog speaking

    Gah!

  • LHS Cow speaking

    Sorry to startle you! Don't worry; you don't need to memorize these formulae.

  • RHS Cow speaking

    However, there are some practical implications that are important to know.

Complexity Observations

  • As \( \lambda \) approaches 1, the expected number of probes approaches infinity!
    • Saying it's infinity is partly an artifact of using \( \lambda \); in reality, the number of probes approaches \( N \).
    • But it does capture the idea that as \( \lambda \) gets closer to 1, clustering gets worse and worse. On the other hand, when there aren't many items in the hash table, there won't be much clustering at all.
    • So, based on these insights, keep \( \lambda \) a fair bit below 1 (e.g., ensure \( \lambda \lt 0.5 \)) when using linear probing. (The sketch after this list puts some concrete numbers to this.)
  • We can still dynamically resize our table to keep \( \lambda \) roughly constant. In that case…
    • Lookup has expected \( \Theta(1) \) complexity.
    • Insert has amortized expected \( \Theta(1) \) complexity.
  • However, the worst-case complexity of resizing is worse than it is for separate chaining.
    • Each time we insert a key into the new table, it could take \( \Theta(k) \) time, where \( k \) is the number of keys moved so far.
    • That means the worst case time of rehashing all \( n \) keys is \( \Theta( 1 + 2 + 3 + \cdots{} + n ) = \Theta(n^2) \).
      • The worst case time for insertion without resize is \( \Theta(n) \)
    • As a result, for a dynamically resized table, the following are true of insert:
      • \( \Theta( n^2 ) \) worst case
      • \( \Theta(n) \) amortized worst case (because the resize happens so rarely)
    • Note:
      • Remember that, with a decent hash function, the worst case is super rare anyway, so increasing the worst case time is not a huge deal.
      • There are clever things to do that can improve the complexity of rehashing but we won't cover them here.
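
To put some concrete numbers to these observations, here's a quick sketch that simply evaluates the two approximation formulas from above at a few sample load factors:

    #include <cstdio>

    int main() {
        // Approximate expected probes for linear probing (formulas from above):
        //   successful:   0.5 * (1 + 1/(1 - lambda))
        //   unsuccessful: 0.5 * (1 + 1/(1 - lambda)^2)
        const double lambdas[] = {0.25, 0.5, 0.75, 0.9, 0.99};
        std::printf("lambda  successful  unsuccessful\n");
        for (double lambda : lambdas) {
            double succ = 0.5 * (1.0 + 1.0 / (1.0 - lambda));
            double unsucc = 0.5 * (1.0 + 1.0 / ((1.0 - lambda) * (1.0 - lambda)));
            std::printf("%6.2f  %10.2f  %12.2f\n", lambda, succ, unsucc);
        }
        return 0;
    }

At \( \lambda = 0.5 \) an unsuccessful lookup expects about 2.5 probes, but at \( \lambda = 0.9 \) it expects about 50.5; that blow-up is exactly why keeping \( \lambda \) below about 0.5 is a common rule of thumb for linear probing.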
