CS 70

Fixing LOOKUP with Hash Functions

  • Duck speaking

    Wait a moment, we just saw that LOOKUP is \( \mathrm{O}(n) \) for an array-backed list. That's terrible! But I think I have an easy fix. Why don't we just sort the array and then use binary search? That'd be \( \mathrm{O}(\log n) \).

  • LHS Cow speaking

    We mentioned the reason when we introduced trees, but let's recap…

When we first started talking about structures to support search, we thought about doing it with an array. Arrays have a lot to offer, particularly really fast random access. That makes insert (at the back) really efficient!

The problem was LOOKUP. As we saw back then, and recapped on the previous page, if the keys are in no particular order, LOOKUP is worst-case \( \Theta(n) \). We also figured out that even the expected complexity of LOOKUP is \( \Theta(n) \)! So that's bad.

We then turned to the idea of sorting the keys. That would make LOOKUP \( \Theta(\log n) \) with binary search. But the question then becomes one of keeping our list sorted. If we do that in an array, then INSERT becomes \( \Theta(n) \) in the worst case, because we have to shift all the elements over to make room for the new one.

So what we did then was abandoned the idea of arrays and developed binary-search trees, which have \( \Theta(\log n) \) INSERT and LOOKUP in the worst case if we keep the tree balanced.

Today, we're going to follow a different path, instead of abandoning arrays, we're going to abandon the idea of keeping the keys sorted.

Here's where we're at for how to store a set in an array:

  • Putting the items in the array in the order they're inserted gives us \( \Theta(1) \) INSERT and \( \Theta(n) \) LOOKUP. So that order is no good.
  • Putting the items in the array in sorted order gives us \( \Theta(n) \) INSERT and \( \Theta(\log n) \) LOOKUP. So that order is no good either.

We need another way. We need to somehow “just know” where in the array the items so we can avoid blindly searching through the array in LOOKUP, or putting them in one place and moving them later.

  • Hedgehog speaking

    Is an “item” the same as a “key”?

  • LHS Cow speaking

    Yes, we're thinking of a set, so we're using the terms interchangeably here. If we were thinking of a map, we'd be talking about keys and values (and the items the array would store would be each be a pair of a key and a value).

Hash Functions

  • Dog speaking

    Is there a cool idea to find the right place to look in the array quickly?

  • LHS Cow speaking

    Yes! It's called a hash function. We'll sketch out the fundamental idea here, but and then refine it to fix a few flaws.

Big Idea: What if we had a function that could take a key and return a number we could use to find the right place in the array for that key?

We'll call that function \( \mathrm{H} \) and our array \( \mathit{arr} \).

  • If I want to insert a key, \( k \), I find the array index to store it at, \( i \), by saying \( i = \mathrm{H}(k) \), and then set the array element with \( {\mathit{arr}}[i] = k \).
  • If I want to check whether a key is present, I find the array index for the location we would have put it, \( i \), by saying \( i = \mathrm{H}(k) \), and then just look in \( \mathit{arr}[i] \) to see if it's there.

Assuming computing \( \mathrm{H}(k) \) is constant time, it takes \( \Theta(1) \) time for both INSERT and LOOKUP!

Terminology

  • This function is called a hash function.
  • We call the array that stores the keys a hash table.
  • A slot in the table is often called a bucket.
  • The function's output is called a hash value.
    • We use the hash value to get an index into the table to find a bucket.
  • Duck speaking

    Does the hash function know how big the array is so it can give values in the right range?

  • LHS Cow speaking

    Good point. Let's refine the idea so that the hash function gives an arbitrary integer and we'll transform the value afterwards to get an index value in the right range.

  • Rabbit speaking

    This last step is sometimes called the mappping function, since it's mapping the hash value to an index in the array.

Suppose the hash function gives values in the range 0..4294967295 (the range of an unsigned 32-bit integer). If we have an array of size 100, what should the mapping function do to the hash value to get an index in the array?

So, the most straightforward way to implement a hash table is to have an array of buckets, and a hash function use \( \mathrm{H}(k) \mod M \), where \( M \) is the number of buckets, to get an index into the array. In C++ code, it'd be something like arr[ hash(k) % array_size ].

  • Sheep speaking

    There's one thing I still don't get… Why is it called a hash function?

  • LHS Cow speaking

    Good question! The word “hash” is used in a lot of different contexts, but it's often used to mean “mix up” or “make a mess of”. In this case, we're taking a key and jumbling it up to get a hash value. We'll see an example of a hash function in a moment.

  • Alien speaking

    In another universe, they call it a scramble function.

  • Pig speaking

    These words all remind me of breakfast!

Example

Say we have the following hash function that takes a string and returns an int:


int hash(const string& key) {
    int sum = addASCIIValues(key);  // Adds ASCII values of characters
    int hash = (sum % 13) + (sum % 7) + (sum % 5); // Jumble more
    return hash;
}

Now imagine that we already have the following keys:

Key sum hashval
avocado 733 13
bacon 515 12
cheese 621 16
pancakes 838 14
sausage 745 7
steak 536 8
toast 555 11

If our table has 10 buckets, then we get the bucket for each key with hashval % 10:

bucket key
0  
1 toast
2 bacon
3 avocado
4 pancakes
5  
6 cheese
7 sausage
8 steak
9  

Here's a handy ASCII table. Use it to insert eggs.

What are the ASCII codes for 'e', 'g', 'g', and 's'? (Enter the values separated by a comma and a space, e.g., 111, 111, 112, 115.)

Okay, now our sumOfASCIIValues function gives us 101 + 103 + 103 + 115 = 422. But our hash function then scrambles that up a bit more, giving us 422 % 13 + 422 % 7 + 422 % 5.

What's 422 % 13

What's 422 % 7

What's 422 % 5

Okay, so the final hash value is the sum of those. So…

Which bucket should eggs go in? (Just enter the number of the bucket.)

In summary, inside our hash function

  • The sum of the ASCII values in the string eggs is 101 + 103 + 103 + 115 = 422.
  • The hash value is (422 % 13) + (422 % 7) + (422 % 5) = 10.

and then to map the hash value to a bucket, we use modulo 10

  • So the bucket is 10 % 10 = 0.
bucket key
0 eggs
1 toast
2 bacon
3 avocado
4 pancakes
5  
6 cheese
7 sausage
8 steak
9  

Now, whenever we want to check if a key is in the table, we can just figure out which bucket it belongs in and see if it's there!

If we look up something that's already in the table, it'll have the same hash value and we'll find it right there, in one step. So let's try looking up a couple of things that aren't in the table…

key sum hash
beans 521 5
waffles 744 9

We can see we'd put them in buckets 5 and 9 respectively, but they're not there, so they aren't in the set. No searching through the whole array needed!

  • Dog speaking

    Holy cow, we did it! \( \Theta(1) \) INSERT and LOOKUP! Goodbye BSTs! Lesson over, everyone!

  • Horse speaking

    Woah, woah! I have a natural sense for the odds of things, and it feels like this race has been fixed somehow. We've been too lucky.

  • Sheep speaking

    Yes, I get the feeling someone is trying to pull the wool over my eyes.

What do you think? Does something about how all our items ended up in buckets feel implausible in some way? What have we overlooked?

If keys mapped to buckets randomly, the chance of everything mapping to a distinct bucket would be

\[ 1 \times \frac{9}{10} \times \frac{8}{10} \times \frac{7}{10} \times \frac{6}{10} \times \frac{5}{10} \times \frac{4}{10} \times \frac{3}{10} \times \frac{2}{10} \times \frac{1}{10} = \frac{567}{1562500} \approx 0.00036288 \]

  • LHS Cow speaking

    So, it does seem incredibly lucky that all the foods we put in our set (or looked up) happened to be in distinct buckets.

  • RHS Cow speaking

    But it wasn't luck at all. Behind the scenes, we rigged it, by carefully choosing the foods on the list and the constants in hash function.

  • Horse speaking

    Told you so. That bit where the claimed they were jumbling everything up more was just a smokescreen for fixing the race!

  • Hedgehog speaking

    Oh…

Perfect Hash Functions

A hash function that maps all the data that'll be stored in the hash table to a unique bucket is known as a perfect hash function. It's only possible if you know all the keys you'll be storing in your table in advance (so you can cook up a suitable function).

  • LHS Cow speaking

    Normally, of course, we don't know the keys in advance, so we can't make a perfect hash function. That means we need to deal with two keys wanting to be in the same bucket, which is known as a collision.

  • RHS Cow speaking

    And that's our next topic…

(When logged in, completion status appears here.)