CS 70

Data Structure Scenario: Spell Checker

  • Duck speaking

    So, of all the data structures we've seen, which one is the best one?

  • LHS Cow speaking

    Each one has a different set of strengths and weaknesses.

  • RHS Cow speaking

    The “best” data structure is the one that's best for your particular application.

  • LHS Cow speaking

    Let's look at some scenarios and see if we can figure out some answers.

A Simple Spell Checker

In this scenario, you've been tasked with writing a simple spell checker.

Your spell-checker should read in text from the command line, and report all of the misspelled words.

You'll know a word is “misspelled” if it is not in the set of words you have read from a file of words (such as web2).

Which Interface?

The data structure interfaces that we've discussed fall into some broad categories. Though different specific encodings vary a bit in what they can offer, at a high level we have:

  • Linear data structures (e.g., list, stack, queue)
    • Good at keeping things in a particular order.
    • Not always good at searching the data for a particular item.
  • Associative data structures (e.g., set, map/dictionary)
    • Good at searching the data for a particular item.
    • Not always good at keeping track of the order of things.
  • Priority queue (kind of a category of its own)
    • Good at finding the minimum item (or the maximum item if we set it up that way instead)
    • Not good at searching the data for a particular item.
    • Not good at keeping things in a particular order (although we can keep removing the minimum).

Notice that we've said “not always good at X” in the list above, because the answer can depend on the specific variant we choose and the exact details of our scenario.

Which interface given above would be the most natural choice for this scenario?

List all the kinds that could work, even if they're not the most natural choice.

A spell checker's functionality is primarily about search—for each string we need to check whether it is in a collection of words or not.

So that would suggest using an associative container. Here we just need to keep track of which strings are in the collection—not associate them with values—so a set seems more appropriate than a map/dictionary.
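
To make that concrete, here's a minimal sketch of the core idea in C++ (not the minispell code itself; the word-at-a-time reading and the command-line handling are just illustrative assumptions):

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "usage: spellcheck DICTIONARY TEXTFILE\n";
        return 1;
    }

    // Load the word list into a set of correctly spelled words.
    std::unordered_set<std::string> dictionary;
    std::ifstream dictFile{argv[1]};
    std::string word;
    while (dictFile >> word) {
        dictionary.insert(word);
    }

    // Read the text one word at a time and report anything not in the set.
    std::ifstream textFile{argv[2]};
    while (textFile >> word) {
        if (dictionary.count(word) == 0) {
            std::cout << word << " appears to be misspelled\n";
        }
    }
    return 0;
}

A real checker would also need to strip punctuation and normalize case, but the data-structure question stays the same.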

More Details

  • Cat speaking

    But we learned about several different data structures that could support a set interface.

  • LHS Cow speaking

    Let's dive deeper to figure out which one to use!

What questions might you ask yourself when deciding what data structure to use to hold your set of words? List at least three.

There are lots of things we might need to think about. Here are a few:

Do we expect more successful lookups or unsuccessful lookups?

It's reasonable to expect that most words will be correctly spelled and misspellings will be relatively rare.

Our program will probably spend most of its time looking up strings that are in the collection, so we probably care more about the complexity of successful lookups than unsuccessful ones.

  • Rabbit speaking

    Actually…

  • LHS Cow speaking

    Let's leave this assumption for now. We'll come back to it at the end.

Do we care more about insertion or lookup? Do we care about removal?

While most modern spell checkers allow you to add or remove words from their dictionary, it's reasonable to expect that doing so will be an infrequent operation.

So, while we do want insertion and removal to be reasonably quick, we probably care most about the complexity of lookups.

How much variation in complexity can we tolerate?

Although this hasn't always been the case, these days people generally expect spell checkers to work in real time, checking words as they write. A little bit of variability probably won't be noticeable, but no one wants their email client to randomly hang while it does some really slow operation.

To avoid an uneven user experience, we may need to pay attention to the worst-case cost of a lookup and how likely we are to encounter that worst case.

Are lookups repeated?

One neat thing about human languages is that a small number of words tend to be used very frequently, while most words are used very infrequently.

Some English-language words happen to appear very frequently in any text; for example, “the” and “and”.

But word frequency is also related to the subject of a text! Once you see the word “complexity”, for example, there's a decent chance you'll see it again in the same document.

How much space overhead can we tolerate?

Our collection of words is probably pretty large and will take up a lot of memory on its own.

Thus a small amount of constant overhead is no big deal. However, if there is significant memory overhead per word, that usage can really add up.

So we probably need to worry about how much extra information is stored for each key.

Getting Some Data

We might not always be able to do so, but to help answer some questions about overheads and time performance of spell checkers in practice, we've obtained some real-world data from an experiment with the minispell program we saw in various homeworks, using three different data structures from the C++ standard library:

We load 272,962 words, requiring 272,962 std::string objects at 32 bytes each, for 8530 KB total, with an additional 216 KB of string data that overflows out of the std::string objects onto the heap (thanks to the small-string optimization, short strings fit inside the std::string object itself). Thus the string data itself takes 8746 KB, or an average of 32.8 bytes per word.
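
As a quick sanity check on those numbers, here's a small sketch (assuming a 64-bit build with libstdc++, which is what gives 32-byte std::string objects and a roughly 15-character small-string buffer):

#include <iostream>
#include <string>

int main() {
    // 32 bytes per std::string object on this implementation.
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << "\n";

    // 272,962 words * 32 bytes = 8,734,784 bytes, about 8530 KB of objects.
    // Words too long for the small-string buffer also allocate their characters
    // on the heap, which is the extra 216 KB, for 8746 KB (~32.8 bytes/word).
    std::string shortWord = "cromulent";                      // fits in the object
    std::string longWord  = "antidisestablishmentarianism";   // spills to the heap
    std::cout << shortWord.capacity() << " vs " << longWord.capacity() << "\n";
    return 0;
}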

Now let's see how much additional space the different options consume beyond the data itself, and how long they take to run:

DOCKER> ./minispell-array -d large.dict alice30.words
Reading words from large.dict... done!
Sorting dictionary... done!
 - sorting took 0.0480655 seconds
 - 272962 words in the std::vector
 - estimated memory overhead: 0 KB
Reading words from alice30.words... done!
Looking up these words in the dictionary... done!
 - looking up took 0.00538796 seconds
 - 27324 words read, 24636 in dictionary

DOCKER> ./minispell-stdset -d large.dict alice30.words
Reading words from large.dict... done!
Inserting into dictionary (in order read)... done!
 - insertion took 0.0732073 seconds
 - 272962 words in the std::set
 - estimated memory overhead: 6397 KB
Reading words from alice30.words... done!
Looking up these words in the dictionary... done!
 - looking up took 0.00795221 seconds
 - 27324 words read, 24636 in dictionary

DOCKER> ./minispell-unorderedset -d large.dict alice30.words
Reading words from large.dict... done!
Inserting into dictionary (in order read)... done!
 - insertion took 0.0453838 seconds
 - 272962 words in 324503 buckets of std::unordered_set
 - estimated memory overhead: 4667 KB
Reading words from alice30.words... done!
Looking up these words in the dictionary... done!
 - looking up took 0.00130192 seconds
 - 27324 words read, 24636 in dictionary

In this example we can see that the exactly sized std::vector has 0% overhead, std::set has 73% overhead (6397 KB on top of the 8746 KB of string data), and std::unordered_set has 53% overhead (4667 KB). But we know that using std::vector is impractical if we might need to insert additional words later.

Brainstorming

  • Dog speaking

    So??? What should we use?!?

  • LHS Cow speaking

    Let's brainstorm some pros and cons for different structures.

Randomized BST?

  • Pro: Fast (for a tree) dictionary loading if the dictionary is stored in sorted order due to the shape of random trees.
  • Meh: Logarithmic expected complexity for operations.
  • Meh: We can probably tolerate the randomness in operation times.
  • Con: Significant memory overhead per key (two pointers and a subtree size).
  • Con: Repeated lookups can be repeatedly expensive.

Splay Tree?

  • Pro: Repeated lookups are extra efficient.
  • Pro: Extremely fast dictionary loading if the dictionary is stored in sorted order.
  • Con: Significant memory overhead per key (two pointers).
  • Con: Some operations can be expensive (linear time).
    • But occasional expensive operations are always balanced by many preceding cheap ones.

Red–Black Tree?

  • Pro: Logarithmic worst case.
  • Meh: Logarithmic best case for insertion.
  • Con: Significant memory overhead per key (two pointers and a boolean color).
  • Con: Repeated lookups can be repeatedly expensive.

Hash Table (separate chaining)?

  • Pro: Expected constant-time lookup.
  • Meh: Medium–low memory overhead per key (one pointer).
  • Meh: We can probably tolerate the randomness in lookup times.
  • Meh: Insert is amortized expected constant time, so it will occasionally take linear time.
    • But for loading the dictionary, we care about total execution time, not time to load a particular word, so amortized time is fine.
  • Con: Repeated lookups can be repeatedly expensive.

Hash Table (open addressing)?

  • Pro: Expected constant-time lookup.
  • Pro: Minimal memory overhead per key.
  • Meh: With double hashing, we can keep the load factor relatively high, but there's still some overhead from empty buckets (see the probe-sequence sketch after this list).
  • Meh: We can probably tolerate the randomness in lookup times.
  • Meh: Insert is amortized expected constant time, so it will occasionally take linear time.
    • But for loading the dictionary, we care about total execution time, not time to load a particular word, so amortized time is fine.
  • Con: Repeated lookups can be repeatedly expensive.
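
For concreteness, here's a minimal sketch of a double-hashing probe sequence (the second hash here is just an improvised stand-in, not our homework's hash functions):

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// In double hashing, probe i for a key lands at (h1(key) + i * h2(key)) % capacity,
// so keys that collide on h1 still take different paths through the table.
std::size_t probeSlot(const std::string& key, std::size_t i, std::size_t capacity) {
    std::size_t h1 = std::hash<std::string>{}(key);
    // A second, independent-ish hash; forcing it odd keeps it nonzero (and, with
    // a power-of-two capacity, guarantees the sequence visits every slot).
    std::size_t h2 = std::hash<std::string>{}(key + "*") | 1;
    return (h1 + i * h2) % capacity;
}

int main() {
    // Probe the first few slots for "sausage" in a table with 1024 buckets.
    for (std::size_t i = 0; i < 3; ++i) {
        std::cout << "probe " << i << ": slot " << probeSlot("sausage", i, 1024) << "\n";
    }
    return 0;
}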

Sorted Array-Backed List?

  • Pro: If sized correctly, has minimal overhead.
    • Con: If dynamically sized, could have a lot of overhead if it's only half full.
  • Pro: Logarithmic worst case for search.
  • Meh: Not really an obvious choice when you want an associative container.
  • Con: All the data must be loaded up front; inserting a new word later means shifting elements over (linear time per insertion) or re-sorting the whole array (linearithmic time).

  • Dog speaking

    No, but… what should we use?

  • LHS Cow speaking

    The trade-offs are finely balanced. If we just tip our preferences slightly, we get different answers.

  • RHS Cow speaking

    Let's explore!

Given the constraints we've laid out, BSTs probably aren't a great choice. The memory cost of storing two pointers per node is significant compared to the key size and doesn't offer much benefit in exchange. In our benchmark test, although std::set was very fast, it was still about 6× slower than std::unordered_set.

The one exception might be splay trees, if the benefit of temporal locality is significant enough to outweigh some of these other downsides. Probably the only way to judge this trade-off would be to measure it empirically. It seems unlikely that they would be 6× faster.

A hash table seems like a good choice. Our linear-probing hash table performed very well in our tests. Double hashing would be even more space efficient and faster, so could be a pretty good choice. We can set the load factor pretty high for really low space overhead and still expect cheap lookups. However, we would have to be careful to pick a good hash function for our data, otherwise our expected-complexity estimates would be way off. Also, hash tables don't exploit temporal locality like splay trees do (although maybe they could with a few modifications…).
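
For example, std::unordered_set (which uses separate chaining) exposes knobs for exactly this space/speed trade-off. A minimal sketch, with the specific numbers chosen only for illustration:

#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> dictionary;

    // Allowing more keys per bucket before rehashing means fewer empty buckets
    // and lower memory overhead, at the cost of slightly longer chains.
    dictionary.max_load_factor(2.0f);

    // Reserving room for the expected number of words up front avoids the
    // occasional linear-time rehashes while the dictionary is loading.
    dictionary.reserve(272962);

    dictionary.insert("sausage");
    return 0;
}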

Even though it doesn't support insert well, the sorted std::vector isn't out of the running. In our benchmark, its lookups were a bit faster than std::set's. And, if we're careful about avoiding wasted empty space in the vector, it's the most space-efficient option.
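
A minimal sketch of that option (the tiny word list is just for illustration): load everything, sort once, then answer each lookup with a binary search.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Load every word up front, then sort the whole vector once.
    std::vector<std::string> dictionary{"duck", "cow", "aardvark", "rabbit"};
    std::sort(dictionary.begin(), dictionary.end());
    dictionary.shrink_to_fit();   // avoid paying for unused capacity

    // Each lookup is a binary search: logarithmic worst case, and no per-word
    // pointers or buckets beyond the strings themselves.
    std::string query = "cow";
    bool found = std::binary_search(dictionary.begin(), dictionary.end(), query);
    std::cout << query << (found ? " is" : " is not") << " in the dictionary\n";
    return 0;
}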

  • Duck speaking

    So do real spell checkers usually use hash tables? Or splay trees?

  • LHS Cow speaking

    Mudd's Prof. Kuenning wrote a pretty well-known spell checker for Unix systems called “International Ispell”. It uses a hash table. But hash tables aren't the only choice.

  • Rabbit speaking

    There are other structures, like tries, that are really good for storing strings by exploiting similarities in word pieces. (Although the memory overhead for tries can be problematic.)

  • LHS Cow speaking

    Also, modern spell checkers typically recommend some possible replacement words. Being able to efficiently find words that are similar to the misspelled string requires even more cleverness.

Bonus

  • Pig speaking

    I want to know MORE about that last bit.

  • Dog speaking

    Yeah, what would a spelling checker that found similar words look like?

  • LHS Cow speaking

    Again, there are lots of ways to solve it. Prof. Kuenning's ispell tries looking up variations on the word you typed when it can't find it as-is. That approach causes a massive number of unsuccessful searches, as most variations aren't words either.

  • RHS Cow speaking

    But since you asked, here's a quick attempt using a different approach that doesn't increase unsuccessful searches.

Here's a session with that code:

Enter word to check: sossige
Soundex code for sossige is s220
Found 11 similar words
Word is not spelled correctly, did you mean:
  sausage (distance 3)
(type 'quit!' to quit, or press control-C)
Enter word to check: baycun
Soundex code for baycun is b250
Found 15 similar words
Word is not spelled correctly, did you mean:
  bacon (distance 2)
  basin (distance 3)
  beacon (distance 3)
  begun (distance 3)
  bosun (distance 3)
(type 'quit!' to quit, or press control-C)
Enter word to check: leagle
Soundex code for leagle is l240
Found 7 similar words
Word is not spelled correctly, did you mean:
  legal (distance 3)
  legally (distance 3)
(type 'quit!' to quit, or press control-C)
Enter word to check: cool
Soundex code for cool is c400
Found 16 similar words
Word is spelled correctly
(type 'quit!' to quit, or press control-C)
Enter word to check: quit!

(Note that the OnlineGDB version has a pretty small dictionary due to file-size limits.)
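
The program behind that session isn't shown here, but a sketch of its key trick gives the flavor. Our guess at the general approach, which matches the codes printed above: give every word a Soundex code, group the dictionary words by code, and when a word isn't found, fetch the words sharing its code and rank them by edit distance (the “distance” numbers in the session).

#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Classic Soundex digit for a letter; '0' means the letter is dropped
// (vowels, y, h, w).
char soundexDigit(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'b': case 'f': case 'p': case 'v': return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z': return '2';
        case 'd': case 't': return '3';
        case 'l': return '4';
        case 'm': case 'n': return '5';
        case 'r': return '6';
        default:  return '0';
    }
}

// Four-character Soundex code, e.g., soundex("sossige") == "s220".
std::string soundex(const std::string& word) {
    if (word.empty()) { return "0000"; }
    std::string code(1, std::tolower(static_cast<unsigned char>(word[0])));
    char prev = soundexDigit(word[0]);
    for (std::size_t i = 1; i < word.size() && code.size() < 4; ++i) {
        char c = std::tolower(static_cast<unsigned char>(word[i]));
        char d = soundexDigit(word[i]);
        if (d != '0' && d != prev) { code += d; }
        // Vowels reset the previous digit (so a repeated sound separated by a
        // vowel is coded twice); 'h' and 'w' do not.
        if (c != 'h' && c != 'w') { prev = d; }
    }
    code.resize(4, '0');   // pad with zeros to exactly four characters
    return code;
}

int main() {
    // Group the dictionary by Soundex code, so the candidates for a misspelled
    // word are just the bucket matching its code.
    std::unordered_map<std::string, std::vector<std::string>> byCode;
    std::vector<std::string> sampleWords{"sausage", "bacon", "basin", "legal"};
    for (const std::string& w : sampleWords) {
        byCode[soundex(w)].push_back(w);
    }
    // soundex("sossige") is "s220", the same code as "sausage"; a real program
    // would now rank that bucket's words by edit distance from "sossige".
    return 0;
}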
