CS 70

Hash Function Properties

  • Cat speaking

    Earlier, when we were doing the expected-time analysis, we made some assumptions about our hash functions. Can we go over that?

  • Pig speaking

    Yes, and learn MORE about what makes a good hash function?

  • LHS Cow speaking

    Okay!

Say we have a table with 100 buckets that stores English words as strings. Here are some candidate hash functions. What do you think?

  • return key[0]; // First character
  • return key.size(); // Length of the string
  • return sumOfLetters(key); // Ignore punctuation and case, sum the letter values (a=1, b=2, etc.)
  • return rand(); // Random number

Pick at least one of the above hash functions and explain why it's not a good choice.

None of these are very good choices.

First Character: This option is very susceptible to patterns in our input. If the strings in our table are English-language words, there are some letters that are far more likely to occur at the beginning of words than others. The result could be that some chains are really long and others are really short.

Length of the String: We have 100 buckets, but if our strings are English-language words they are going to clump up at the small end of the table and we won't use all the buckets! Even pneumonoultramicroscopicsilicovolcanoconiosis only has 45 letters.

Sum of Letters: This option (which is a lot like the sumOfASCIIvalues function we used earlier) is not quite as obviously a bad idea, but it still has problems. For instance, the order of characters doesn't matter at all. So cause and sauce would both be hashed to the same place. Furthermore, you can add one to one character and subtract one from another and keep the same hash value. For example, if we add one to c (to get d) and subtract one from u (to get t), we now have dates and sated hashing to that same position. You can keep doing substitutions, and even merge or split letters to get additional collisions with words like fry and abdicated. It's also the case that shorter (more common) words will tend to cluster near the smaller buckets— English-language words tend to cluster around particular hash values.

Random Number: The appealing thing about this approach would be that we would know that our strings would be spread evenly throughout the whole table! So what's wrong? If the hash function gives out random numbers, then we'll never be able to find anything!

Some Required and Desirable Properties of Hash Functions

  • Deterministic: For a given key, the hash function must always give the same hash value—otherwise we won't be able to find keys we've inserted!
  • Uniform: It should spread keys evenly in the table (patterns in input don't result in patterns in output)—avoids long chains.
  • Avalanche Effect: Ideally, similar keys should have very different hash values.
    • It's called the “avalanche effect” because we want a single-bit change in the input data to cause an “avalanche” that flips about 50% of the bits in the hash.
    • To increase the odds of that happening, it's best if the hash function uses all input data.
  • Efficient to Compute: The hash function should be efficient so that insert and lookup are efficient.
    • But other important properties shouldn't be sacrificed in the name of efficiency.
  • Duck speaking

    Is the avalanche effect the same thing as uniform?

  • LHS Cow speaking

    They are very related. Uniformity is what we ultimately want, but having good avalanche effect is part of how we get there.

  • RHS Cow speaking

    In lots of problems we expect keys to have similarities/regularities, and maybe keys in one application (like student records) will be more similar to each other than they are to keys for a different application (like serial numbers).

  • LHS Cow speaking

    So the avalanche effect is a generally useful principle that helps make the distribution more uniform across lots of applications.

An Example “Good” Hash Function for Strings

There are lots of better hash functions out there. For instance, here is a widely used one, djb2 (so called because of its author, Dan Bernstein):

size_t djb2(const std::string& key) {
    constexpr size_t PRIME_INIT = 5381;
    constexpr size_t PRIME_MULT = 33;

    size_t hash = PRIME_INIT;

    for (char c : key)
        hash = hash * PRIME_MULT + c;
    }

    return hash;
}
  • Duck speaking

    Uh… why is this a good hash function?

  • LHS Cow speaking

    This process of repeatedly multiplying and shifting the hash value is similar to a type of pseudorandom number generator (PRNG) called a Linear Congruential Generator. Those operations help to spread the hash values out and give similar keys very different hash values.

  • Duck speaking

    And why does it have those magic numbers?

  • Hedgehog speaking

    Magic numbers are bad, right?

  • LHS Cow speaking

    Yes, in general, but magic numbers are pretty common in hash functions. Both 33 and 5381 were carefully chosen.

  • RHS Cow speaking

    In general, the design of hash functions mixes mathematical ideas with large doses of empiricism.

That said, we showed you djb2 in part because it is quite simple. In C++, you'd probably just use std::hash<std::string> to hash strings.

  • LHS Cow speaking

    Curiously perhaps, std::hash<std::string> is a class, not a function. You create an object of this class, and then call it like it was a function.

Here's a comparison of djb2 against std::hash<std::string> (using the standard library from our Docker image—other standard libraries will use different implementations):

                               djb2             std::hash<std::string>
"CS 70 is fun!"                cff011b09c8ee528 f1dbef56ef1d76f9
"CS 60 is fun!"                cfefe77a98845407 582ea8887b7608ba
"CS 5 is fun!"                 b8b9afde2e1448b6 eaa596f1a888de31
"apple"                        000000310f199e37 bf304ea8f09bf7b0
"Apple"                        000000310cd68e17 ca599938d987eb95
"100"                          000000000b8789b6 57fd0d0c6011a336
"101"                          000000000b8789b7 14310e04694ccefb
"?"                            000000000002b5e4 4a4bc9e3b003a81d
"!"                            000000000002b5c6 746009e5abda5534
  • Goat speaking

    djb2 doesn't look quite as random.

  • Cat speaking

    Yes, it's not really getting that “avalanche” property, is it?

  • LHS Cow speaking

    True. But for most hash-table implementations, it's good enough.

  • RHS Cow speaking

    Notice that the results in both cases are 64-bit integers (shown in hex).

When we use the value from a hash function, we wrap it modulo the number of buckets. Here's the results of doing that for two possible table sizes, 512 (a power of two) and 509 (a prime).

                               djb2                     std::hash<std::string>
                               % 512    % 509           % 512   % 509
"CS 70 is fun!"                296      229             249     390
"CS 60 is fun!"                7        400             186     19
"CS 5 is fun!"                 182      408             49      107
"apple"                        55       297             432     210
"Apple"                        23       338             405     338
"100"                          438      283             310     357
"101"                          439      284             251     211
"?"                            484      504             29      236
"!"                            454      474             308     333
  • LHS Cow speaking

    Now both hash functions seem like they're choosing random-seeming buckets. (Mostly, anyway.)

Although there are good (and bad!) choices for a hash function, the truth is that there is no single best hash function, because hash functions exist in a space of trade-offs (including execution time, code complexity, and quality).

In general, if we have \( K \) bits of arbitary key data to hash, and a hash function that produces an \( H \)-bit number, where \( H \lt K \), there will be some collisions. A key question is whether those collisions will be randomly distributed or not—in a good hash function they will be, even if the input has a pattern of some kind to it.

Bonus—Hashing in the Real World

In this bonus-material section, we'll take a look at hash functions in the real world.

Hashing in Python

Python's dictionaries are implemented as hash tables! Python has a built-in hash function for strings.

Take a look at Python's hash function (as run in our Docker image):

cs70 DOCKER > python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[6139113701375271244, 7746277032553241810, 541651195540198260]
>>>

As you can see, it produces a 64-bit signed number, and in the three runs above, the same string always produces the same number. For example, Hashing hashes to 6139113701375271244.

But let's run Python again:

cs70 DOCKER > python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>> [hash(x) for x in ["Hashing", "is", "fun!"]]
[1710372791478860608, -8869751227348875821, -8034928388742224104]
>>>

In this run, Hashing always hashes to 1710372791478860608.

  • RHS Cow speaking

    Every time you run Python, it's using a different hash function!

  • LHS Cow speaking

    Have a think about why that might be…

  • Dog speaking

    Whoa, a different hash function every time??

  • LHS Cow speaking

    Yes. But it's actually just changing some magic numbers, not the entire algorithm.

The reason Python uses a different hash function every time you run it is so no one can know ahead of time what strings might generate a collision! So it's really hard to trick Python's dictionaries into performing badly.

Cryptographic Hashes

Lots of data in the world needs to be checked to make sure it's the real thing, and hashes can be used to provide a checksum value.

The SHA-256 message-digest algorithm produces a 256-bit hash out of an arbitrary amount of data.

cs70 DOCKER > ls -l /bin/ls
-rwxr-xr-x 1 root root 133792 Jan 18  2018 /bin/ls
cs70 DOCKER > sha256sum /bin/ls
0d06f9724af41b13cdacea133530b9129a48450230feef9632d53d5bbb837c8c  /bin/ls

Here we've computed the hash for the ls program in our Docker image. You can check yours, it should be the same!

But most importantly, it is infeasible for someone to produce another program that would have the same 256-bit hash value.

The SHA-256 algorithm is a good to use for verifying a file against a provided checksum, but it would be too slow to use in a typical hash-table implementation.

(When logged in, completion status appears here.)