Hash Function Properties
Earlier, when we were doing the expected-time analysis, we made some assumptions about our hash functions. Can we go over that?
Yes, and learn MORE about what makes a good hash function?
Okay!
Say we have a table with 100 buckets that stores English words as strings. Here are some candidate hash functions. What do you think?
return key[0]; // First character
return key.size(); // Length of the string
return sumOfLetters(key); // Ignore punctuation and case, sum the letter values (a=1, b=2, etc.)
return rand(); // Random number
None of these are very good choices.
First Character: This option is very susceptible to patterns in our input. If the strings in our table are English-language words, there are some letters that are far more likely to occur at the beginning of words than others. The result could be that some chains are really long and others are really short.
Length of the String: We have 100 buckets, but if our strings are English-language words they are going to clump up at the small end of the table and we won't use all the buckets! Even pneumonoultramicroscopicsilicovolcanoconiosis only has 45 letters.
Sum of Letters: This option (which is a lot like the sumOfASCIIvalues function we used earlier) is not quite as obviously a bad idea, but it still has problems. For instance, the order of characters doesn't matter at all, so cause and sauce would both be hashed to the same place. Furthermore, you can add one to one character and subtract one from another and keep the same hash value. For example, starting from cause, if we add one to c (to get d) and subtract one from u (to get t), rearranging the letters gives us dates and sated hashing to that same position. You can keep doing substitutions, and even merge or split letters, to get additional collisions with words like fry and abdicated. It's also the case that shorter (more common) words have small letter sums, so English-language words will tend to cluster in the lower-numbered buckets and around particular hash values.
Random Number: The appealing thing about this approach would be that we would know that our strings would be spread evenly throughout the whole table! So what's wrong? If the hash function gives out random numbers, then we'll never be able to find anything!
Some Required and Desirable Properties of Hash Functions
- Deterministic: For a given key, the hash function must always give the same hash value—otherwise we won't be able to find keys we've inserted!
- Uniform: It should spread keys evenly in the table (patterns in input don't result in patterns in output)—avoids long chains.
- Avalanche Effect: Ideally, similar keys should have very different hash values.
- It's called the “avalanche effect” because we want a single-bit change in the input data to cause an “avalanche” that flips about 50% of the bits in the hash.
- To increase the odds of that happening, it's best if the hash function uses all input data.
- Efficient to Compute: The hash function should be efficient so that insert and lookup are efficient.
- But other important properties shouldn't be sacrificed in the name of efficiency.
Is the avalanche effect the same thing as uniform?
They are very related. Uniformity is what we ultimately want, but having good avalanche effect is part of how we get there.
In lots of problems we expect keys to have similarities/regularities, and maybe keys in one application (like student records) will be more similar to each other than they are to keys for a different application (like serial numbers).
So the avalanche effect is a generally useful principle that helps make the distribution more uniform across lots of applications.
An Example “Good” Hash Function for Strings
There are lots of better hash functions out there. For instance, here is a widely used one, djb2 (so called because of its author, Dan Bernstein):
size_t djb2(const std::string& key) {
    constexpr size_t PRIME_INIT = 5381;
    constexpr size_t PRIME_MULT = 33;
    size_t hash = PRIME_INIT;
    for (char c : key) {
        hash = hash * PRIME_MULT + c;
    }
    return hash;
}
Uh… why is this a good hash function?
This process of repeatedly multiplying and adding to the hash value is similar to a type of pseudorandom number generator (PRNG) called a Linear Congruential Generator. Those operations help to spread the hash values out and give similar keys very different hash values.
And why does it have those magic numbers?
Magic numbers are bad, right?
Yes, in general, but magic numbers are pretty common in hash functions. Both 33 and 5381 were carefully chosen.
In general, the design of hash functions mixes mathematical ideas with large doses of empiricism.
That said, we showed you djb2 in part because it is quite simple. In C++, you'd probably just use std::hash<std::string> to hash strings.
Curiously perhaps, std::hash<std::string> is a class, not a function. You create an object of this class, and then call it as if it were a function.
Here's a comparison of djb2 against std::hash<std::string> (using the standard library from our Docker image—other standard libraries will use different implementations):
                   djb2                std::hash<std::string>
"CS 70 is fun!"    cff011b09c8ee528    f1dbef56ef1d76f9
"CS 60 is fun!"    cfefe77a98845407    582ea8887b7608ba
"CS 5 is fun!"     b8b9afde2e1448b6    eaa596f1a888de31
"apple"            000000310f199e37    bf304ea8f09bf7b0
"Apple"            000000310cd68e17    ca599938d987eb95
"100"              000000000b8789b6    57fd0d0c6011a336
"101"              000000000b8789b7    14310e04694ccefb
"?"                000000000002b5e4    4a4bc9e3b003a81d
"!"                000000000002b5c6    746009e5abda5534
djb2 doesn't look quite as random.
Yes, it's not really getting that “avalanche” property, is it?
True. But for most hash-table implementations, it's good enough.
Notice that the results in both cases are 64-bit integers (shown in hex).
When we use the value from a hash function, we wrap it modulo the number of buckets. Here's the results of doing that for two possible table sizes, 512 (a power of two) and 509 (a prime).
                   djb2              std::hash<std::string>
                   % 512    % 509    % 512    % 509
"CS 70 is fun!"    296      229      249      390
"CS 60 is fun!"    7        400      186      19
"CS 5 is fun!"     182      408      49       107
"apple"            55       297      432      210
"Apple"            23       338      405      338
"100"              438      283      310      357
"101"              439      284      251      211
"?"                484      504      29       236
"!"                454      474      308      333
Now both hash functions seem like they're choosing random-seeming buckets. (Mostly, anyway.)
Although there are good (and bad!) choices for a hash function, the truth is that there is no single best hash function, because hash functions exist in a space of trade-offs (including execution time, code complexity, and quality).
In general, if we have \( K \) bits of arbitrary key data to hash, and a hash function that produces an \( H \)-bit number, where \( H \lt K \), there will be some collisions. A key question is whether those collisions will be randomly distributed or not—in a good hash function they will be, even if the input has a pattern of some kind to it.
Bonus—Hashing in the Real World
In this bonus-material section, we'll take a look at hash functions in the real world.
Hashing in Python
Python's dictionaries are implemented as hash tables! Python has a built-in hash function for strings.
Take a look at Python's hash function (as run in our Docker image):
cs70 DOCKER > python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>>
As you can see, it produces a 64-bit signed number, and in the three calls above, the same string always produces the same number. For example, Hashing hashes to 6139113701375271244.
But let's run Python again:
cs70 DOCKER > python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>>
In this run, Hashing always hashes to 1710372791478860608.
Every time you run Python, it's using a different hash function!
Have a think about why that might be…
Whoa, a different hash function every time??
Yes. But it's actually just changing some magic numbers, not the entire algorithm.
The reason Python uses a different hash function every time you run it is so no one can know ahead of time what strings might generate a collision! So it's really hard to trick Python's dictionaries into performing badly.
Cryptographic Hashes
Lots of data in the world needs to be checked to make sure it's the real thing, and hashes can be used to provide a checksum value.
The SHA-256 message-digest algorithm produces a 256-bit hash out of an arbitrary amount of data.
cs70 DOCKER > ls -l /bin/ls
-rwxr-xr-x 1 root root 133792 Jan 18 2018 /bin/ls
cs70 DOCKER > sha256sum /bin/ls
0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c /bin/ls
Here we've computed the hash for the ls program in our Docker image. You can check yours; it should be the same!
But most importantly, it is infeasible for someone to produce another program that would have the same 256-bit hash value.
The SHA-256 algorithm is a good choice for verifying a file against a provided checksum, but it would be too slow to use in a typical hash-table implementation.