CS 70

Expected Complexity of Separate Chaining

Remember that you can't calculate an expected complexity unless you have a probability distribution over cases! It's very common that we ourselves make a probabilistic assumption about the distribution of inputs or other factors as a way to model the behavior of the structure.

Assumption: We have a good hash function for our key distribution; keys are distributed uniformly across all the buckets. (Statistically, it looks like a uniform random distribution.)

  • Duck speaking

    What if we don't have such a good hash function?

  • LHS Cow speaking

    Then this analysis doesn't apply! We are finding the expected complexity for a good hash function. We're not making any promises about a bad hash function.

  • RHS Cow speaking

    We'll talk more about what makes a good hash function at the end of this lesson.

How Likely Is the Worst Case?

Under our assumption, how likely are we to put all the items in the same bucket?

We'll answer this question in stages…

Chance of Putting an Item in First Bucket

First, let's imagine that we have one item we're adding to our table, what's the chance it'll end up in the very first bucket?

We'll specify that there are a total of \( M \) buckets in the table.

Under our assumption, what is the probability that an item is placed in bucket 0? (Write your answer as a fraction.)

There are \( M \) buckets, and they're all equally likely, so \[ \mathrm{P}(\text{bucket 0}) = \frac{1}{M} \]

Chance of Putting All Items in First Bucket

Now, let's suppose we've put \( N \) items into our hash table. What is the probability that they all the items ended up in the first bucket?

Under our assumption, what is the probability of all \( N \) items being placed in bucket 0? (You can use ^ to indicate exponentiation.)

There are \( N \) items, and they all need to end up in the first bucket, so that's \[ \begin{align} \mathrm{P}(N \; \text{items in bucket 0}) &= \underbrace{\mathrm{P}(\text{bucket 0}) \times \mathrm{P}(\text{bucket 0}) \times \cdots \times \mathrm{P}(\text{bucket 0})}_{N \; \text{terms}} \\ &= \mathrm{P}(\text{bucket 0})^N \\ &= \left(\frac{1}{M}\right)^N \end{align} \]

So if we had 20 buckets and 10 items, the chance that they would all end up in the first bucket is 1 in 10,240,000,000,000—more than ten trillion to one.

Chance of Putting All Items in the Same Bucket

  • Duck speaking

    It's just as bad if they all end up in bucket 1 or bucket 2, right?

  • LHS Cow speaking

    Indeed. It's the same probability that each of those happens, and they're all mutually exclusive possibilities.

So we have

\[ \begin{align} \mathrm{P}(N \; \text{items in same bucket}) &= \mathrm{P}(N \; \text{items in bucket 0}) + \cdots + \mathrm{P}(N \; \text{items in bucket} \; M-1) \\ &= \underbrace{\mathrm{P}(N \; \text{items in bucket 0}) + \cdots + \mathrm{P}(N \; \text{items in bucket 0})}_{M \; \text{terms}} \\ &= M \, \mathrm{P}(N \; \text{items in bucket 0}) \\ &= M \left(\frac{1}{M}\right)^N \\ &= \left(\frac{1}{M}\right)^{N-1} \end{align}. \]

That's a slightly less tiny chance, but still very tiny!

Thus we can say that there is a vanishingly small chance of putting all the items in one bucket.

  • RHS Cow speaking

    Of course, putting all but one item in the same bucket would be nearly as bad. And all but two wouldn't be great, either. So it is really important that we figure out what the expected complexity is to figure out what is typical.

Load Factor

The expected cost is going to depend on how full the table is. If the number of buckets is large in comparison to the number of items, we generally expect fewer collisions to occur.

Load Factor: We defined the load factor, \( \lambda \), as follows:

\[ \lambda = \frac{N}{M} \]

where \( N \) is the number of items we have in the table and \( M \) is the number of buckets.

The load factor measures how full the table is. It is also the average number of items per bucket.

  • LHS Cow speaking

    The load factor will show up over and over in our analyses.

Unsuccessful LOOKUP

Let's first examine the case where we look for a key that isn't in the table.

Intuition: If we spread the keys evenly, we expect each bucket to have \( \lambda = \frac{N}{M} \) keys. In an unsuccessful search, we pick one bucket and look at every key. So the expected number of keys to look at is \( \lambda \). It's large if the table is overfull, small if the table has lots of room.

  • Goat speaking

    Meh. I believe it. Let's move on.

  • LHS Cow speaking

    We'd better actually do the math to make sure!

  • Hedgehog speaking

    Oh, no…

Math: First we need to use the hash function to find the bucket where the key ought to be. We'll assume that our hash function takes \( \Theta(1) \) time to compute, so that's no big deal. The expected number of comparisons is

\[ \mathrm{E}[C_\mathrm{UL}] = \sum_{b = 1}^{M} \mathrm{Pr}(b) C(b), \]

where \( \mathrm{Pr}(b) \) is the probability of the hash function telling us to look in bucket \( b \), and \( C(b) \) is the number of comparisons needed to search the keys in bucket \( b \).

We can fill in some details:

  • Since our hash function spreads keys uniformly, we know that \( \mathrm{Pr}(b) = \frac{1}{M} \).
  • Not finding a key in a chain means visiting every item in the bucket. So \( C(b) = K(b) \), where \( K(b) \) is the number of items in bucket \( b \).

So really we have

\[ \mathrm{E}[ C_\mathrm{UL} ] = \sum_{b = 1}^{M} \frac{1}{M} K(b) \ = \ \frac{1}{M} \sum_{b = 1}^{M} K(b). \]

In the sum we are adding up the number of items in each of the buckets. All together, that's… all of the items in the table! So it finally comes out to

\[ \mathrm{E}[ C_\mathrm{UL} ] \ = \ \frac{1}{M} \sum_{b = 1}^{M} K(b) \ = \ \frac{1}{M}N = \lambda \]

  • Goat speaking

    Meh, did you really need you really need all that dancing around with summations to tell us that the cost will average out to the average number of items per bucket?

  • LHS Cow speaking

So yes, the expected number of comparisons is \( \lambda \), the average number of items per bucket (\( N/M \)). Let's highlight that:

  • The expected number of comparisons in an unsuccessful search is \(\lambda\).
  • RHS Cow speaking

    Would you rather go over this argument in a video? We've got you covered!

  • LHS Cow speaking

    If you think you've got it, then you can skip the video.

  • Dog speaking

    Wow! Our intuition was right on the money!

  • LHS Cow speaking

    Yup! It's nice when that works out.

Successful LOOKUP

Now let's examine the case where our search is successful; that is, where we find the item we are looking for.

Additional Assumption: We are equally likely to search for any item in the table.

Intuition: Again, we expect each bucket to have \( \lambda = \frac{N}{M} \) items. So where do we expect the item to be? When we analyzed linear search through a list, we found that the item's expected position is halfway through the list. So we expect the location of our item to come out as \( \frac{\lambda}{2} \).

Okay, let's do this formally and see if our intuition holds up! (Spoiler Alert: Almost, but not quite.)

Math: The expected number of comparisons can be written as—

  • Sherlock speaking

    Excuse me, I think you're going to do a lot of complicated summations. Can I suggest an easier way?

  • LHS Cow speaking

    Hi, there, Anak Yodpinyanee!

  • RHS Cow speaking

    Several years ago when we were teaching this material, you pointed out a much easier way to figure out the answer than the one we were teaching, or even than what you'll find in most discussions of hash tables.

  • Hedgehog speaking

    Oh, thank heaven!

  • LHS Cow speaking

    Okay, let's explain things Anak's way!

Here's what Anak observed:

\[ \mathrm{E}[ C_{\mathrm{SL}} ] = \mathrm{E}[ T_{\mathrm{SL}} ] + \mathrm{E}[ F_{\mathrm{SL}} ] \]

where \( C_{\mathrm{SL}} \) is the number of comparisons in a successful LOOKUP; \( T_{\mathrm{SL}} \) is the number of those comparisons that evaluate to true; and \( F_{\mathrm{SL}} \) is the number of comparisons that evaluate to false.

  • Horse speaking

    Hay! Wait a minute, there will always be exactly one true comparison, when we find the item we're looking for!

Indeed, so \( T_{\mathrm{SL}} = 1 \). It never changes, because by definition, in a successful LOOKUP, we will find what we're looking for, and then stop.

So, now we just have to work out \( \mathrm{E}[ F_{\mathrm{SL}} ] \).

To help us get started, let's think about the two extremes, the best case and the worst case. How many false comparisons could we have before we find the item we're looking for in the best and worst cases?

  • Dog speaking

    The best case is when it's right there and we don't do any.

  • Cat speaking

    And the worst case is when all the other items are in our way; that's \( N - 1 \).

  • LHS Cow speaking

    Exactly.

And thinking about those other items that might be in our way helps us figure out the expected number of false comparisons.

We know there are \( N - 1 \) other (unwanted) items besides the one we're going to find, each of which might be “in our way” before we get to the item we're looking for.

To be one of our false comparisons, an unwanted item needs to be

  • In our bucket (rather than in one of the other buckets)
    • 1 in \( M \) chance it's in our bucket; and,
  • Before our item in the list, so we need to pass it to get to our item
    • 50% chance it's before our item rather than after.

And with that, we know \( \mathrm{E} [ F_{\mathrm{SL}} ] \), because to be one of our false comparisons, the unwanted item needs to both be in our bucket and be in front of the one we're looking for.

\[ \begin{align} \mathrm{E}[F_{\rm SL}] &= \sum_{k = 1}^{N-1} \mathrm{P}(\text{in our bucket}) \times \mathrm{P}(\text{in the way}) \\ &= (N-1) \times \mathrm{P}(\text{in our bucket}) \times \mathrm{P}(\text{in the way}) \\ & = (N-1) \left(\frac{1}{M}\right)\left(\frac{1}{2}\right) \\ & = \frac{N-1}{2M} \\ & = \frac{\lambda}{2}-\frac{1}{2M} \\ & \approx \frac{\lambda}{2} \end{align} \]

Adding in our final true comparison when we find the item, our total number of expected comparisons is

\[ \mathrm{E}[C_\mathrm{SL}] \approx 1 + \frac{\lambda}{2}. \]

  • Duck speaking

    Why did you drop the \( \frac{1}{2M} \) term?

  • LHS Cow speaking

    Because as \( M \) gets larger, it quickly becomes so small as to be irrelevant.

Thus

  • The expected number of comparisons in a successful search is (approximately) \( 1 + \lambda/{2} \).
  • Hedgehog speaking

    Actually, that wasn't so bad…

  • Pig speaking

    I want MORE explanations!! I want the complicated way with all the summations!

  • LHS Cow speaking

    Well, if you really want the traditional explanation with more summations, Prof. Talvitie made a video… (It's optional!)

  • RHS Cow speaking

    She made it before she heard about Anak's method!

  • Dog speaking

    Thank you, Anak!

  • Sherlock speaking

    You're welcome!

This optional video provides the “traditional” explanation:

  • Duck speaking

    I don't understand where our intuition that it would be \( \lambda/{2} \) went wrong, though.

  • LHS Cow speaking

    Well, remember that we are analyzing a successful LOOKUP. So we know the key we are looking for is in the table.

  • RHS Cow speaking

    Imagine a table with 100 buckets and one item. How many comparisons does a successful LOOKUP take? One, right? It's one, every time.

  • LHS Cow speaking

    But if we just said \( \frac{\lambda}{2} \), then that would be \( 1/200 \). That can't be right! We have to have at least one comparison to find our item.

  • RHS Cow speaking

    In this case the full expression we derived is exactly right: \( 1 + 1/200 - 1/200 = 1 \) comparison. Nice!

  • LHS Cow speaking

    Because we know the key is in the table it's a little more wiggly than our intuitive argument, but we got the big idea right.

Summary

For a separate chaining table with \( N \) items and \( M \) buckets, let \( \lambda = \frac{N}{M} \). Then,

  • The expected number of comparisons during an unsuccessful LOOKUP is \( \lambda \); and,
  • The expected number of comparisons during a successful LOOKUP is approximately \( 1 + \frac{\lambda}{2} \).

Roughly speaking, the more full the table gets (more items per bucket), the longer we expect LOOKUP to take.

(When logged in, completion status appears here.)