Collision Resolution: Linear Probing
Remember we talked about collisions last time?
Sure! We solved that by letting multiple keys be in one bucket.
Yep! Now we are going to consider a different strategy.
With open addressing, the table has one key per bucket, but a key can go into a different bucket if its assigned bucket is full.
Thinking it Through
We'll hash our data using the djb2 hash function we introduced at the end of last lesson. Here are some foods we might store and their hash values:
key | djb2 hash value |
---|---|
avocado | 1239543842 |
eggs | 2090218923 |
pancakes | 186607115 |
sausage | 2196978158 |
steak | 274815133 |
tofu | 2090766659 |
waffles | 3049638669 |
Let's suppose we've inserted everything except steak and tofu into our hash table. So far, so good. (Recall that our hash table has 10 buckets, so when we take the hash value mod 10, it's conveniently just the last digit of the numbers shown above.)
bucket | key |
---|---|
0 | |
1 | |
2 | avocado |
3 | eggs |
4 | |
5 | pancakes |
6 | |
7 | |
8 | sausage |
9 | waffles |
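If you want to check these bucket assignments yourself, here's a quick Python sketch (not from the lesson) that recomputes them from the hash values listed above, assuming the 10-bucket table we're using:

```python
# Bucket assignment is just: hash value mod table size.
# The hash values are the djb2 values listed above; M = 10 buckets.
hash_values = {
    "avocado": 1239543842,
    "eggs": 2090218923,
    "pancakes": 186607115,
    "sausage": 2196978158,
    "waffles": 3049638669,
}

M = 10
for key, h in hash_values.items():
    print(f"{key:8s} -> bucket {h % M}")   # e.g. avocado -> bucket 2
```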
Now let's insert steak.
- The djb2 hash value is 274815133.
- So the bucket would be: ( 274815133 % 10 ) = 3.
You can see that eggs is already in that bucket. So where should steak go?
How about… the next bucket?
Sure, let's do that!
Here's the new table:
bucket | key |
---|---|
0 | |
1 | |
2 | avocado |
3 | eggs |
4 | steak |
5 | pancakes |
6 | |
7 | |
8 | sausage |
9 | waffles |
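In code, the probe that produced this table might look like the sketch below (a hypothetical helper, not part of the lesson): starting from the previous table, it scans forward from steak's home bucket until it finds an empty one.

```python
# Table before inserting steak; index = bucket number, None = empty.
table = [None, None, "avocado", "eggs", None,
         "pancakes", None, None, "sausage", "waffles"]

def probe_insert(table, key, hash_value):
    """Scan forward from the key's home bucket until an empty bucket is found.

    No wrap-around yet -- we'll need that shortly.
    """
    b = hash_value % len(table)
    while table[b] is not None:   # bucket occupied: try the next one
        b += 1
    table[b] = key
    return b

print(probe_insert(table, "steak", 274815133))   # bucket 3 holds "eggs", so steak lands in 4
```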
So if I wanted to look up steak, what should happen?
First we hash steak and get bucket 3, but steak isn't there.
Right. So we look at the next bucket, and it is there!
Sounds reasonable to me!
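A lookup follows the same probe sequence, stopping at either the key or an empty bucket. A minimal sketch (hypothetical helper, continuing the table from the previous snippet):

```python
def probe_lookup(table, key, hash_value):
    """Follow the probe sequence until the key or an empty bucket is found."""
    b = hash_value % len(table)
    while table[b] is not None:
        if table[b] == key:
            return True           # found the key
        b += 1                    # not here: try the next bucket
    return False                  # empty bucket: the key is not in the table

print(probe_lookup(table, "steak", 274815133))   # bucket 3 is "eggs", bucket 4 is "steak" -> True
```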
Now let's insert tofu.
- The djb2 hash value is 2090766659.
- So the bucket would be: ( 2090766659 % 10 ) = 9.
Once again, that bucket is occupied. Where should tofu go?
Maybe bucket 0?
That's right, we need to wrap around.
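To wrap around in code, the probe just advances modulo the table size. Continuing the sketch from above (again, hypothetical, not from the lesson):

```python
def probe_insert_wrap(table, key, hash_value):
    """Like probe_insert, but wrap from the last bucket back to bucket 0."""
    b = hash_value % len(table)
    while table[b] is not None:       # bucket occupied: try the next one...
        b = (b + 1) % len(table)      # ...wrapping past the end of the table
    table[b] = key
    return b

print(probe_insert_wrap(table, "tofu", 2090766659))   # bucket 9 holds "waffles", so tofu lands in 0
```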
Linear Probing
The strategy we've started to invent above is called linear probing.
The Big Idea: Collisions go to the next free bucket.
In practice that means we have a sequence of “probes” to find the right bucket.
If \( H \) is the hash value for our key, we check buckets \( H, H + 1, H + 2, \ldots{} \).
Details
- For insert, follow the probing sequence until either:
  - the key is found, in which case do not insert; or,
  - an empty bucket is found, in which case the key goes in that bucket.
- For lookup, follow the probing sequence until either:
  - the key is found, in which case we can return true; or,
  - an empty bucket is found, in which case the key is not in the table.
- If the probing sequence goes past the last bucket, wrap around to the first bucket.
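Putting those rules together, here's a small sketch of a linear-probing hash set in Python. It's illustrative only: Python's built-in hash() stands in for djb2, and the table size is fixed (a real implementation would also resize as the table fills up).

```python
class LinearProbingSet:
    """A minimal hash set using open addressing with linear probing."""

    def __init__(self, num_buckets=10):
        self.table = [None] * num_buckets   # None marks an empty bucket

    def _probe(self, key):
        """Yield bucket indices H, H + 1, H + 2, ... (mod the table size)."""
        b = hash(key) % len(self.table)
        for _ in range(len(self.table)):
            yield b
            b = (b + 1) % len(self.table)   # wrap around past the last bucket

    def insert(self, key):
        for b in self._probe(key):
            if self.table[b] == key:        # key already present: do not insert
                return
            if self.table[b] is None:       # empty bucket: the key goes here
                self.table[b] = key
                return
        raise RuntimeError("table is full")

    def contains(self, key):
        for b in self._probe(key):
            if self.table[b] == key:        # found the key
                return True
            if self.table[b] is None:       # empty bucket: the key is not in the table
                return False
        return False                        # probed every bucket without finding it


foods = LinearProbingSet()
for food in ["avocado", "eggs", "pancakes", "sausage", "waffles", "steak", "tofu"]:
    foods.insert(food)
print(foods.contains("steak"))   # True
print(foods.contains("bacon"))   # False
```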
Example
Start with a table with 20 buckets. Our keys are integers and our hash function is just the identity function, so the mapping from keys to table buckets is really simple: key % 20.
Draw the table on a piece of paper, then insert the following keys in this order:
The video below explains the solution…
So we noticed a problem in the last example: our keys all started to pile up in one part of the table!
In the worst case, lookup takes \( \Theta(n) \) time!
Imagine a terrible scenario where the table is almost full. It might be that we are searching for a key that hashes to bucket 0, but that key has been placed in the last bucket! Or, for an unsuccessful search, we might be searching for a key that hashes to bucket 0, but the first blank bucket might be the last bucket.
In either case, we would perform \( n \) probes before we could conclude that the key was or wasn't in the table: \( \Theta(n) \) time in the worst case.
Primary Clustering
We see that when we have lots of keys all in a row, inserts and lookups take longer.
Having lots of keys in a row is called primary clustering.
Primary Clustering Slows Down Operations
For example, consider two hash tables with \( \lambda = 0.5 \): in Table 1 the filled buckets are spread out, alternating with empty buckets, while in Table 2 all of the filled buckets sit in one contiguous run.
Now let's figure out the expected complexity of an unsuccessful search in tables with these configurations.
We'll assume that our hash function uniformly randomly assigns keys to buckets, so we are equally likely to start our search in any of the buckets.
In general, we can think of the expected complexity as
\[ \sum^{M}_{b=1} \mathrm{Pr}(b)C(b) \]
That is, we sum over the buckets. For each bucket \( b \), we calculate the probability of our key hashing to that bucket, \( \mathrm{Pr}(b) \), and also the cost of an unsuccessful search starting in that bucket, \( C(b) \).
We assume that every bucket is equally likely, so \( \mathrm{Pr}(b) = \frac{1}{M} = \frac{1}{2N}\) for every \( b \).
Let's calculate the expected number of probes for each of these specific tables.
Remember that an unsuccessful search always ends at an empty bucket (which indicates that the key is not present).
So, for Table 1, where filled and empty buckets alternate, \( C(b) \) is either 1 or 2, depending on which bucket we start at:
- If the key is hashed to an empty bucket, it takes only one probe to determine that it is empty.
  - There are \( N \) of these.
- If the key is hashed to a full bucket, we must probe that bucket and the next one.
  - There are \( N \) of these.
So, altogether, we've got
\[ \mathrm{E}[C_{\rm UL}] = \frac{1}{2N}(1) + \frac{1}{2N}(2) + \frac{1}{2N}(1) + \frac{1}{2N}(2) + \cdots. \]
Collecting the terms, we see that
\[ \mathrm{E}[C_{\rm UL}] = \frac{N}{2N}(1) + \frac{N}{2N}(2) = \frac{3N}{2N} = 1.5. \]
Now consider Table 2, where all \( N \) filled buckets form one contiguous cluster. For the empty buckets, an unsuccessful search still takes only 1 probe; there are \( N \) of those, each with a probability of \( \frac{1}{2N} \).
For the full buckets, it depends on which bucket we start at. The first bucket in the cluster takes \( N + 1 \) probes (\( + 1 \) for the empty bucket at the end), the next takes \( N \) probes, the next \( N - 1 \), and so on, down to 2.
So, altogether, the expected number of probes is
\[ \mathrm{E}[C_{\rm UL}] = \underbrace{\left(\frac{1}{2N} + \frac{1}{2N} + \cdots{} + \frac{1}{2N}\right)}_{N \; \text{terms}} + \left( \frac{N + 1}{2N} + \frac{N}{2N} + \frac{N - 1}{2N} + \cdots{} + \frac{2}{2N} \right). \]
Collecting all those terms together, we get
\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] =& N\left(\frac{1}{2N}\right) + \sum^{N}_{b=1} \left( \frac{ b + 1 }{2N} \right)\\ =& \frac{1}{2} + \frac{1}{2N} \sum^{N}_{b=1} ( b + 1 ). \end{aligned} \]
Splitting the terms of the summation up, we get
\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] &= \frac{1}{2} + \left(\frac{1}{2N} \cdot{} \sum^{N}_{b=1} b\right) + \frac{N}{2N}\\ &= 1 + \frac{1}{2N} \cdot{} \sum^{N}_{b=1} b. \end{aligned} \]
Finally, we can apply Gauss' formula and simplify:
\[ \begin{aligned} \mathrm{E}[C_{\rm UL}] &= 1 + \frac{1}{2N} \cdot{} \frac{ ( N + 1 ) N }{2}\\ &= 1 + \frac{ N + 1 }{4}\\ &= 1 + \frac{N}{4} + 0.25\\ &= 1.25 + \frac{N}{4}\\&\approx \frac{N}{4}. \end{aligned} \]
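If you'd like to check both results numerically, here's a small brute-force sketch (not part of the derivation): it averages the cost of an unsuccessful search over every possible starting bucket in each configuration.

```python
def expected_unsuccessful_probes(table):
    """Average, over all start buckets, the probes needed to reach an empty bucket."""
    M = len(table)
    total = 0
    for start in range(M):
        b, probes = start, 1
        while table[b] is not None:   # keep probing until we hit an empty bucket
            b = (b + 1) % M
            probes += 1
        total += probes
    return total / M                  # each start bucket has probability 1/M

N = 100
table1 = ["x" if i % 2 == 0 else None for i in range(2 * N)]   # filled buckets spread out
table2 = ["x"] * N + [None] * N                                # one big cluster

print(expected_unsuccessful_probes(table1))   # 1.5
print(expected_unsuccessful_probes(table2))   # 26.25, i.e. 1.25 + N/4
```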
So even though they have the same load factor, the expected number of probes for an unsuccessful search is
- \( \Theta(1) \) in Table 1; but,
- \( \Theta(n) \) in Table 2.
Key Idea: The complexity of search is impacted by clustering, not just by \( \lambda \) itself.
Primary Clustering is Self-Perpetuating
Worse still, clusters beget clusters!
- The bigger a cluster, the more likely it is that a key will be hashed into it.
- The more inserted keys that hash into the cluster, the bigger the cluster gets!
GOTO 1
So primary clustering is the main downside to linear probing.
Expected Complexity
Because of the effects of clustering, the expected complexity analysis for linear probing is complicated.
We won't go through it in this course, but here's the upshot.
Expected number of probes for successful lookup (key in the table):
\[ \mathrm{E}[C_{\rm SL}] \approx \frac{1}{2} \left( 1 + \frac{1}{ 1 - \lambda } \right) . \]
Expected number of probes for unsuccessful lookup (key not in the table): \[ \mathrm{E}[C_{\rm UL}] \approx \frac{1}{2} \left( 1 + \frac{1}{ ( 1 - \lambda )^{2} } \right) . \]
Gah!
Sorry to startle you! Don't worry; you don't need to memorize these formulae.
However, there are some practical implications that are important to know.
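To get a feel for what these formulae predict, here are the numbers for two load factors (just plugging in):
\[ \lambda = 0.5: \quad \mathrm{E}[C_{\rm SL}] \approx \tfrac{1}{2}(1 + 2) = 1.5, \qquad \mathrm{E}[C_{\rm UL}] \approx \tfrac{1}{2}(1 + 4) = 2.5. \]
\[ \lambda = 0.9: \quad \mathrm{E}[C_{\rm SL}] \approx \tfrac{1}{2}(1 + 10) = 5.5, \qquad \mathrm{E}[C_{\rm UL}] \approx \tfrac{1}{2}(1 + 100) = 50.5. \]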
Complexity Observations
- As \( \lambda \) approaches 1, the expected number of probes approaches infinity!
- Saying it's infinity is partly an artifact of using \( \lambda \); in reality, the number of probes approaches the number of buckets in the table.
- But it does capture the idea that as lambda gets closer to 1, clustering gets worse and worse. On the other hand, when there aren't many items in the hash table, there won't be much clustering at all.
- So, based on these insights, keep \( \lambda \) a fair bit below 1 (e.g., ensure \( \lambda \lt 0.5 \)) when using linear probing.
- We can still dynamically resize our table to keep \( \lambda \) roughly constant. In that case…
- Lookup has expected \( \Theta(1) \) complexity.
- Insert has amortized expected \( \Theta(1) \) complexity.
- However, the worst-case complexity of resizing is worse than it is for separate chaining.
- Each time we insert a key into the new table, it could take \( \Theta(k) \) time, where \( k \) is the number of keys moved so far.
- That means the worst case time of rehashing all \( n \) keys is \( \Theta( 1 + 2 + 3 + \cdots{} + n ) = \Theta(n^2) \).
- The worst case time for insertion without a resize is \( \Theta(n) \).
- As a result, for a dynamically resized table, the following hold for insert:
- \( \Theta( n^2 ) \) worst case
- \( \Theta(n) \) amortized worst case (because the resize happens so rarely)
- Note:
  - Remember that, with a decent hash function, the worst case is super rare anyway, so increasing the worst-case time is not a huge deal.
  - There are clever techniques that can improve the complexity of rehashing, but we won't cover them here.