Fixing LOOKUP with Hash Functions
Wait a moment, we just saw that LOOKUP is \( \mathrm{O}(n) \) for an array-backed list. That's terrible! But I think I have an easy fix. Why don't we just sort the array and then use binary search? That'd be \( \mathrm{O}(\log n) \).
We mentioned the reason when we introduced trees, but let's recap…
When we first started talking about structures to support search, we thought about doing it with an array. Arrays have a lot to offer, particularly really fast random access. That makes INSERT (at the back) really efficient!
The problem was LOOKUP. As we saw back then, and recapped on the previous page, if the keys are in no particular order, LOOKUP is worst-case \( \Theta(n) \). We also figured out that even the expected complexity of LOOKUP is \( \Theta(n) \)! So that's bad.
We then turned to the idea of sorting the keys. That would make LOOKUP \( \Theta(\log n) \) with binary search. But the question then becomes one of keeping our list sorted. If we do that in an array, then INSERT becomes \( \Theta(n) \) in the worst case, because we have to shift all the elements over to make room for the new one.
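To see why, here's a quick C++ sketch (the function name is ours, just for illustration) of inserting into a sorted array-backed list: finding the spot is fast, but making room isn't.

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Insert a key into a sorted vector, keeping it sorted.
// Finding the insertion point is Theta(log n) via binary search, but
// vector::insert must shift every later element over: Theta(n).
void sortedInsert(vector<int>& arr, int key) {
    auto pos = lower_bound(arr.begin(), arr.end(), key); // binary search
    arr.insert(pos, key); // shifts everything after pos one slot right
}
```

The shift on the last line is the whole problem: even though we found the right spot quickly, inserting near the front still touches nearly every element.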
So what we did then was to abandon arrays and develop binary-search trees, which have \( \Theta(\log n) \) INSERT and LOOKUP in the worst case if we keep the tree balanced.
Today, we're going to follow a different path: instead of abandoning arrays, we're going to abandon the idea of keeping the keys sorted.
Here's where we're at for how to store a set in an array:
- Putting the items in the array in the order they're inserted gives us \( \Theta(1) \) INSERT and \( \Theta(n) \) LOOKUP. So that order is no good.
- Putting the items in the array in sorted order gives us \( \Theta(n) \) INSERT and \( \Theta(\log n) \) LOOKUP. So that order is no good either.
We need another way. We need to somehow “just know” where in the array the items belong, so we can avoid blindly searching through the array in LOOKUP, or putting them in one place and moving them later.
Is an “item” the same as a “key”?
Yes, we're thinking of a set, so we're using the terms interchangeably here. If we were thinking of a map, we'd be talking about keys and values (and the items the array stores would each be a pair of a key and a value).
Hash Functions
Is there a cool idea to find the right place to look in the array quickly?
Yes! It's called a hash function. We'll sketch out the fundamental idea here, and then refine it to fix a few flaws.
Big Idea: What if we had a function that could take a key and return a number we could use to find the right place in the array for that key?
We'll call that function \( \mathrm{H} \) and our array \( \mathit{arr} \).
- If I want to insert a key, \( k \), I find the array index to store it at, \( i \), by saying \( i = \mathrm{H}(k) \), and then set the array element with \( {\mathit{arr}}[i] = k \).
- If I want to check whether a key is present, I find the array index for the location we would have put it, \( i \), by saying \( i = \mathrm{H}(k) \), and then just look in \( \mathit{arr}[i] \) to see if it's there.
Assuming computing \( \mathrm{H}(k) \) is constant time, it takes \( \Theta(1) \) time for both INSERT and LOOKUP!
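As a minimal sketch of the big idea (with names we made up, and temporarily assuming \( \mathrm{H} \) always returns a valid in-range index and that no two keys want the same slot):

```cpp
#include <string>
#include <vector>
using namespace std;

const int SIZE = 10;

// Placeholder hash function for illustration: we assume it returns
// the "right" in-range index for each key. (Big assumption; we'll
// refine it shortly.)
int H(const string& key) {
    int sum = 0;
    for (char c : key) sum += c;  // add up ASCII values
    return sum % SIZE;
}

vector<string> arr(SIZE);  // "" marks an empty slot

// INSERT: one hash computation, one array write. Theta(1).
void insertKey(const string& k) {
    arr[H(k)] = k;
}

// LOOKUP: one hash computation, one array read. Theta(1).
bool lookupKey(const string& k) {
    return arr[H(k)] == k;
}
```

Notice that neither operation contains a loop over the array: each one computes an index and touches exactly one slot.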
Terminology
- This function is called a hash function.
- We call the array that stores the keys a hash table.
- A slot in the table is often called a bucket.
- The function's output is called a hash value.
- We use the hash value to get an index into the table to find a bucket.
Does the hash function know how big the array is so it can give values in the right range?
Good point. Let's refine the idea so that the hash function gives an arbitrary integer and we'll transform the value afterwards to get an index value in the right range.
This last step is sometimes called the mapping function, since it's mapping the hash value to an index in the array.
So, the most straightforward way to implement a hash table is to have an array of buckets, and to use \( \mathrm{H}(k) \bmod M \), where \( M \) is the number of buckets, to get an index into the array. In C++ code, it'd be something like arr[hash(k) % array_size].
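Here's that mapping step on its own (a tiny sketch with our own names; note that real C++ code would also have to guard against negative hash values, since % can yield a negative result for them):

```cpp
// The mapping function: turn an arbitrary non-negative hash value
// into a bucket index in the range [0, tableSize).
int bucketFor(int hashValue, int tableSize) {
    return hashValue % tableSize;
}
```

For example, a hash value of 13 maps to bucket 3 in a 10-bucket table, but to bucket 5 in an 8-bucket table: the same hash value, mapped differently depending on the table size.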
There's one thing I still don't get… Why is it called a hash function?
Good question! The word “hash” is used in a lot of different contexts, but it's often used to mean “mix up” or “make a mess of”. In this case, we're taking a key and jumbling it up to get a hash value. We'll see an example of a hash function in a moment.
In another universe, they call it a scramble function.
These words all remind me of breakfast!
Example
Say we have the following hash function that takes a string and returns an int:
int hash(const string& key) {
    int sum = sumOfASCIIValues(key); // Adds up the ASCII values of the characters
    int hashVal = (sum % 13) + (sum % 7) + (sum % 5); // Jumble more
    return hashVal;
}
Now imagine that we already have the following keys:
key | sum | hashval |
---|---|---|
avocado | 733 | 13 |
bacon | 515 | 12 |
cheese | 621 | 16 |
pancakes | 838 | 14 |
sausage | 745 | 7 |
steak | 536 | 8 |
toast | 555 | 11 |
If our table has 10 buckets, then we get the bucket for each key with hashval % 10:
bucket | key |
---|---|
0 | |
1 | toast |
2 | bacon |
3 | avocado |
4 | pancakes |
5 | |
6 | cheese |
7 | sausage |
8 | steak |
9 | |
Here's a handy ASCII table. Use it to insert eggs.
Okay, now our sumOfASCIIValues function gives us 101 + 103 + 103 + 115 = 422. But our hash function then scrambles that up a bit more, giving us 422 % 13 + 422 % 7 + 422 % 5.
Okay, so the final hash value is the sum of those. So…
In summary, inside our hash function:
- The sum of the ASCII values in the string eggs is 101 + 103 + 103 + 115 = 422.
- The hash value is (422 % 13) + (422 % 7) + (422 % 5) = 10.

and then, to map the hash value to a bucket, we use modulo 10:
- So the bucket is 10 % 10 = 0.
bucket | key |
---|---|
0 | eggs |
1 | toast |
2 | bacon |
3 | avocado |
4 | pancakes |
5 | |
6 | cheese |
7 | sausage |
8 | steak |
9 | |
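We can double-check that arithmetic in a few lines of C++, writing out the chapter's hash function with the sumOfASCIIValues helper inlined:

```cpp
#include <string>
using namespace std;

// The chapter's hash function, with sumOfASCIIValues written inline.
int hashKey(const string& key) {
    int sum = 0;
    for (char c : key) sum += c;               // sum of ASCII values
    return (sum % 13) + (sum % 7) + (sum % 5); // jumble more
}
```

hashKey("eggs") gives 10, and 10 % 10 lands eggs in bucket 0, matching the table.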
Now, whenever we want to check if a key is in the table, we can just figure out which bucket it belongs in and see if it's there!
If we look up something that's already in the table, it'll have the same hash value and we'll find it right there, in one step. So let's try looking up a couple of things that aren't in the table…
key | sum | hash |
---|---|---|
beans | 521 | 5 |
waffles | 744 | 9 |
We can see we'd put them in buckets 5 and 9 respectively, but they're not there, so they aren't in the set. No searching through the whole array needed!
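The whole lookup flow, hits and misses alike, fits in a few lines of C++ (a sketch using the chapter's hash function and the 10-bucket table pictured above):

```cpp
#include <string>
#include <vector>
using namespace std;

// The chapter's hash function plus the mod-10 mapping, in one step.
int bucketOf(const string& key) {
    int sum = 0;
    for (char c : key) sum += c;
    int hashVal = (sum % 13) + (sum % 7) + (sum % 5);
    return hashVal % 10;
}

// The table from above; "" marks an empty bucket.
vector<string> table = {"eggs",     "toast", "bacon",   "avocado",
                        "pancakes", "",      "cheese",  "sausage",
                        "steak",    ""};

// LOOKUP: check exactly one bucket; no scanning through the array.
bool contains(const string& key) {
    return table[bucketOf(key)] == key;
}
```

Looking up beans checks only bucket 5 and looking up waffles checks only bucket 9; finding those buckets empty is enough to answer "not in the set".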
Holy cow, we did it! \( \Theta(1) \) INSERT and LOOKUP! Goodbye BSTs! Lesson over, everyone!
Woah, woah! I have a natural sense for the odds of things, and it feels like this race has been fixed somehow. We've been too lucky.
Yes, I get the feeling someone is trying to pull the wool over my eyes.
If keys mapped to buckets randomly, the chance of everything mapping to a distinct bucket would be
\[ 1 \times \frac{9}{10} \times \frac{8}{10} \times \frac{7}{10} \times \frac{6}{10} \times \frac{5}{10} \times \frac{4}{10} \times \frac{3}{10} \times \frac{2}{10} \times \frac{1}{10} = \frac{567}{1562500} \approx 0.00036288 \]
So, it does seem incredibly lucky that all the foods we put in our set (or looked up) happened to be in distinct buckets.
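That probability is easy to recompute: the first key can land anywhere, and each later key must avoid every bucket already taken.

```cpp
// Probability that 10 uniformly random keys land in 10 distinct
// buckets: 10/10 * 9/10 * 8/10 * ... * 1/10 = 10!/10^10.
double probAllDistinct() {
    double p = 1.0;
    for (int freeBuckets = 10; freeBuckets >= 1; --freeBuckets) {
        p *= freeBuckets / 10.0;
    }
    return p;  // about 0.00036288
}
```

That's roughly 1 chance in 2,750, so our breakfast foods were not just lucky.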
But it wasn't luck at all. Behind the scenes, we rigged it, by carefully choosing the foods on the list and the constants in the hash function.
Told you so. That bit where they claimed they were jumbling everything up more was just a smokescreen for fixing the race!
Oh…
Perfect Hash Functions
A hash function that maps all the data that'll be stored in the hash table to a unique bucket is known as a perfect hash function. It's only possible if you know all the keys you'll be storing in your table in advance (so you can cook up a suitable function).
Normally, of course, we don't know the keys in advance, so we can't make a perfect hash function. That means we need to deal with two keys wanting to be in the same bucket, which is known as a collision.
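Collisions are easy to manufacture for our example hash function, since it depends only on the sum of the ASCII values: any two anagrams, like steak and skate, necessarily get the same hash value and fight over the same bucket.

```cpp
#include <string>
using namespace std;

// The chapter's hash function: anagrams have equal ASCII sums,
// so they always collide under it.
int hashKey(const string& key) {
    int sum = 0;
    for (char c : key) sum += c;
    return (sum % 13) + (sum % 7) + (sum % 5);
}
```

Both steak and skate hash to 8, and 8 % 10 sends them both to bucket 8, where only one of them can live.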
And that's our next topic…