Expected Complexity of Separate Chaining
Remember that you can't calculate an expected complexity unless you have a probability distribution over cases! It's very common that we ourselves make a probabilistic assumption about the distribution of inputs or other factors as a way to model the behavior of the structure.
Assumption: We have a good hash function for our key distribution; keys are distributed uniformly across all the buckets. (Statistically, it looks like a uniform random distribution.)
What if we don't have such a good hash function?
Then this analysis doesn't apply! We are finding the expected complexity for a good hash function. We're not making any promises about a bad hash function.
We'll talk more about what makes a good hash function at the end of this lesson.
How Likely Is the Worst Case?
Under our assumption, how likely are we to put all the items in the same bucket?
We'll answer this question in stages…
Chance of Putting an Item in First Bucket
First, let's imagine that we have one item we're adding to our table. What's the chance it'll end up in the very first bucket?
We'll specify that there are a total of \( M \) buckets in the table.
There are \( M \) buckets, and they're all equally likely, so \[ \mathrm{P}(\text{bucket 0}) = \frac{1}{M} \]
Chance of Putting All Items in First Bucket
Now, let's suppose we've put \( N \) items into our hash table. What is the probability that all the items ended up in the first bucket?
There are \( N \) items, and they all need to end up in the first bucket, so that's \[ \begin{align} \mathrm{P}(N \; \text{items in bucket 0}) &= \underbrace{\mathrm{P}(\text{bucket 0}) \times \mathrm{P}(\text{bucket 0}) \times \cdots \times \mathrm{P}(\text{bucket 0})}_{N \; \text{terms}} \\ &= \mathrm{P}(\text{bucket 0})^N \\ &= \left(\frac{1}{M}\right)^N \end{align} \]
So if we had 20 buckets and 10 items, the chance that they would all end up in the first bucket is 1 in 10,240,000,000,000—more than ten trillion to one.
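If you'd like to check that arithmetic yourself, here's a quick sketch in Python (our own illustration; the numbers \( M = 20 \) and \( N = 10 \) come from the example above):

```python
from fractions import Fraction

M = 20  # buckets
N = 10  # items

# Probability that all N items hash to bucket 0: (1/M)^N
p_first_bucket = Fraction(1, M) ** N
print(p_first_bucket)         # 1/10240000000000
print(float(p_first_bucket))  # about 9.77e-14
```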
Chance of Putting All Items in the Same Bucket
It's just as bad if they all end up in bucket 1 or bucket 2, right?
Indeed. It's the same probability that each of those happens, and they're all mutually exclusive possibilities.
So we have
\[ \begin{align} \mathrm{P}(N \; \text{items in same bucket}) &= \mathrm{P}(N \; \text{items in bucket 0}) + \cdots + \mathrm{P}(N \; \text{items in bucket} \; M-1) \\ &= \underbrace{\mathrm{P}(N \; \text{items in bucket 0}) + \cdots + \mathrm{P}(N \; \text{items in bucket 0})}_{M \; \text{terms}} \\ &= M \, \mathrm{P}(N \; \text{items in bucket 0}) \\ &= M \left(\frac{1}{M}\right)^N \\ &= \left(\frac{1}{M}\right)^{N-1} \end{align}. \]
That's a slightly less tiny chance, but still very tiny!
Thus we can say that there is a vanishingly small chance of putting all the items in one bucket.
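Again as a quick sanity check (a sketch, using the same example numbers as before):

```python
from fractions import Fraction

M, N = 20, 10

# Probability that all N items land together in *some* bucket:
# M * (1/M)^N, which simplifies to (1/M)^(N-1)
p_same_bucket = M * Fraction(1, M) ** N
assert p_same_bucket == Fraction(1, M) ** (N - 1)
print(p_same_bucket)  # 1/512000000000 -- about 1 in 512 billion
```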
Of course, putting all but one item in the same bucket would be nearly as bad. And all but two wouldn't be great, either. So it is really important that we work out the expected complexity to see what is typical.
Load Factor
The expected cost is going to depend on how full the table is. If the number of buckets is large in comparison to the number of items, we generally expect fewer collisions to occur.
Load Factor: We define the load factor, \( \lambda \), as follows:
\[ \lambda = \frac{N}{M} \]
where \( N \) is the number of items we have in the table and \( M \) is the number of buckets.
The load factor measures how full the table is. It is also the average number of items per bucket.
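As a quick worked example (with numbers of our own choosing): a table holding \( N = 50 \) items in \( M = 20 \) buckets has \[ \lambda = \frac{50}{20} = 2.5, \] that is, two and a half items per bucket on average. We'll reuse these numbers below.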
The load factor will show up over and over in our analyses.
Unsuccessful LOOKUP
Let's first examine the case where we look for a key that isn't in the table.
Intuition: If we spread the keys evenly, we expect each bucket to have \( \lambda = \frac{N}{M} \) keys. In an unsuccessful search, we pick one bucket and look at every key. So the expected number of keys to look at is \( \lambda \). It's large if the table is overfull, small if the table has lots of room.
Meh. I believe it. Let's move on.
We'd better actually do the math to make sure!
Oh, no…
Math: First we need to use the hash function to find the bucket where the key ought to be. We'll assume that our hash function takes \( \Theta(1) \) time to compute, so that's no big deal. The expected number of comparisons is
\[ \mathrm{E}[C_\mathrm{UL}] = \sum_{b = 0}^{M-1} \mathrm{P}(b) \, C(b), \]
where \( \mathrm{P}(b) \) is the probability of the hash function telling us to look in bucket \( b \), and \( C(b) \) is the number of comparisons needed to search the keys in bucket \( b \).
We can fill in some details:
- Since our hash function spreads keys uniformly, we know that \( \mathrm{P}(b) = \frac{1}{M} \).
- Not finding a key in a chain means visiting every item in the bucket. So \( C(b) = K(b) \), where \( K(b) \) is the number of items in bucket \( b \).
So really we have
\[ \mathrm{E}[ C_\mathrm{UL} ] = \sum_{b = 0}^{M-1} \frac{1}{M} K(b) \ = \ \frac{1}{M} \sum_{b = 0}^{M-1} K(b). \]
In the sum we are adding up the number of items in each of the buckets. All together, that's… all of the items in the table! So it finally comes out to
\[ \mathrm{E}[ C_\mathrm{UL} ] \ = \ \frac{1}{M} \sum_{b = 0}^{M-1} K(b) \ = \ \frac{N}{M} = \lambda \]
Meh, did you really need all that dancing around with summations to tell us that the cost will average out to the average number of items per bucket?
…
So yes, the expected number of comparisons is \( \lambda \), the average number of items per bucket (\( N/M \)). Let's highlight that:
- The expected number of comparisons in an unsuccessful search is \(\lambda\).
Would you rather go over this argument in a video? We've got you covered!
If you think you've got it, then you can skip the video.
Wow! Our intuition was right on the money!
Yup! It's nice when that works out.
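If you'd like to see this result empirically as well, here's a small Monte Carlo sketch in Python (our own illustration, not part of the formal argument): it throws \( N \) keys into \( M \) buckets uniformly at random and measures how many comparisons an unsuccessful LOOKUP in a random bucket would make.

```python
import random

def avg_unsuccessful_comparisons(N, M, trials=10_000):
    """Estimate E[C_UL]: build a table of N uniformly hashed items,
    then average the chain length of a uniformly random bucket."""
    total = 0
    for _ in range(trials):
        counts = [0] * M
        for _ in range(N):
            counts[random.randrange(M)] += 1  # uniform "hash" of one key
        # An unsuccessful search scans every key in its bucket's chain.
        total += counts[random.randrange(M)]
    return total / trials

N, M = 50, 20
print(avg_unsuccessful_comparisons(N, M))  # close to lambda = N/M = 2.5
```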
Successful LOOKUP
Now let's examine the case where our search is successful; that is, where we find the item we are looking for.
Additional Assumption: We are equally likely to search for any item in the table.
Intuition: Again, we expect each bucket to have \( \lambda = \frac{N}{M} \) items. So where do we expect the item to be? When we analyzed linear search through a list, we found that the item's expected position is halfway through the list. So we expect the location of our item to come out as \( \frac{\lambda}{2} \).
Okay, let's do this formally and see if our intuition holds up! (Spoiler Alert: Almost, but not quite.)
Math: The expected number of comparisons can be written as—
Excuse me, I think you're going to do a lot of complicated summations. Can I suggest an easier way?
Hi there, Anak Yodpinyanee!
Several years ago when we were teaching this material, you pointed out a much easier way to figure out the answer than the one we were teaching, or even than what you'll find in most discussions of hash tables.
Oh, thank heaven!
Okay, let's explain things Anak's way!
Here's what Anak observed:
\[ \mathrm{E}[ C_{\mathrm{SL}} ] = \mathrm{E}[ T_{\mathrm{SL}} ] + \mathrm{E}[ F_{\mathrm{SL}} ] \]
where \( C_{\mathrm{SL}} \) is the number of comparisons in a successful LOOKUP; \( T_{\mathrm{SL}} \) is the number of those comparisons that evaluate to true; and \( F_{\mathrm{SL}} \) is the number of comparisons that evaluate to false.
Hay! Wait a minute, there will always be exactly one true comparison, when we find the item we're looking for!
Indeed, so \( T_{\mathrm{SL}} = 1 \). It never changes, because by definition, in a successful LOOKUP, we will find what we're looking for, and then stop.
So, now we just have to work out \( \mathrm{E}[ F_{\mathrm{SL}} ] \).
To help us get started, let's think about the two extremes, the best case and the worst case. How many false comparisons could we have before we find the item we're looking for in the best and worst cases?
The best case is when it's right there and we don't do any.
And the worst case is when all the other items are in our way; that's \( N - 1 \).
Exactly.
And thinking about those other items that might be in our way helps us figure out the expected number of false comparisons.
We know there are \( N - 1 \) other (unwanted) items besides the one we're going to find, each of which might be “in our way” before we get to the item we're looking for.
To be one of our false comparisons, an unwanted item needs to be
- In our bucket (rather than in one of the other buckets)
- 1 in \( M \) chance it's in our bucket; and,
- Before our item in the list, so we need to pass it to get to our item
- 50% chance it's before our item rather than after.
And with that, we know \( \mathrm{E} [ F_{\mathrm{SL}} ] \), because to be one of our false comparisons, the unwanted item needs to both be in our bucket and be in front of the one we're looking for.
\[ \begin{align} \mathrm{E}[F_{\rm SL}] &= \sum_{k = 1}^{N-1} \mathrm{P}(\text{in our bucket}) \times \mathrm{P}(\text{in the way}) \\ &= (N-1) \times \mathrm{P}(\text{in our bucket}) \times \mathrm{P}(\text{in the way}) \\ & = (N-1) \left(\frac{1}{M}\right)\left(\frac{1}{2}\right) \\ & = \frac{N-1}{2M} \\ & = \frac{\lambda}{2}-\frac{1}{2M} \\ & \approx \frac{\lambda}{2} \end{align} \]
Adding in our final true comparison when we find the item, our total number of expected comparisons is
\[ \mathrm{E}[C_\mathrm{SL}] \approx 1 + \frac{\lambda}{2}. \]
Why did you drop the \( \frac{1}{2M} \) term?
Because as \( M \) gets larger, it quickly becomes so small as to be irrelevant.
Thus
- The expected number of comparisons in a successful search is (approximately) \( 1 + \frac{\lambda}{2} \).
Actually, that wasn't so bad…
I want MORE explanations!! I want the complicated way with all the summations!
Well, if you really want the traditional explanation with more summations, Prof. Talvitie made a video… (It's optional!)
She made it before she heard about Anak's method!
Thank you, Anak!
You're welcome!
This optional video provides the “traditional” explanation:
I don't understand where our intuition that it would be \( \frac{\lambda}{2} \) went wrong, though.
Well, remember that we are analyzing a successful LOOKUP. So we know the key we are looking for is in the table.
Imagine a table with 100 buckets and one item. How many comparisons does a successful LOOKUP take? One, right? It's one, every time.
But if we just said \( \frac{\lambda}{2} \), then that would be \( 1/200 \). That can't be right! We have to have at least one comparison to find our item.
In this case the full expression we derived is exactly right: \( 1 + 1/200 - 1/200 = 1 \) comparison. Nice!
Because we know the key is in the table, the analysis is a little more wiggly than our intuitive argument suggested, but we got the big idea right.
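And here's a companion sketch to the earlier simulation (again, our own illustration): it builds the same kind of table, searches for a uniformly random item, and counts the false comparisons along the chain plus the one true comparison at the end.

```python
import random

def avg_successful_comparisons(N, M, trials=10_000):
    """Estimate E[C_SL]: build a table of N uniformly hashed items,
    then count the comparisons needed to find a random item."""
    total = 0
    for _ in range(trials):
        buckets = [[] for _ in range(M)]
        for item in range(N):
            buckets[random.randrange(M)].append(item)  # uniform "hash"
        target = random.randrange(N)  # equally likely to want any item
        for chain in buckets:
            if target in chain:
                # Items ahead of ours are false comparisons; +1 for the true one.
                total += chain.index(target) + 1
                break
    return total / trials

N, M = 50, 20
print(avg_successful_comparisons(N, M))  # exact: 1 + (N-1)/(2*M) = 2.225
print(1 + (N / M) / 2)                   # approximation: 1 + lambda/2 = 2.25
```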
Summary
For a separate chaining table with \( N \) items and \( M \) buckets, let \( \lambda = \frac{N}{M} \). Then,
- The expected number of comparisons during an unsuccessful LOOKUP is \( \lambda \); and,
- The expected number of comparisons during a successful LOOKUP is approximately \( 1 + \frac{\lambda}{2} \).
Roughly speaking, the more full the table gets (more items per bucket), the longer we expect LOOKUP to take.