Search
For my detective work, I need to keep track of students, and I want to be able to say who is on campus and who isn't.
That kind of problem is pretty common—let's make it an example!
We often want to store a bunch of data in a way that makes it easy to search through.
Example: Suppose that Watson wants to store the student IDs of all the students who are on campus. This is a large set of student IDs. We want to be able to quickly answer the question, “Is this student on campus?”, given their ID number.
To solve this problem, there are two fundamental operations that we'll need:
INSERT: Add a new item to the collection.
LOOKUP: Find an item in the collection (or determine that it is not there).
This seems a lot like dictionaries (and maybe sets?) in Python.
Yes, it's the same idea.
- A map maps keys to values. In the container, we store each key together with its associated value. LOOKUP takes a key and gives access to the stored value for that key. Python dictionaries are an example of a map.
- A set is just a set of keys. We only store keys in the container, and LOOKUP takes a key and determines whether it is there. Python sets are an example of a set.
So in the dining hall, they might use a
map
from student ID to how many MORE meals that student has left to eat this semester?Sure. And a set can solve Watson's problem from earlier.
We'll focus on sets, but in fact everything we do for sets can be easily modified to solve the map problem instead. (In many ways, a map is just like a set where we're storing key/value pairs in the set.)
So what ideas do we have for Watson's problem?
A Bad Idea
Let's just make a big array of all the ID numbers!
Okay, let's consider something like
std::vector
, which is an array-backed list.
Performance of an Array-Backed List
INSERT: The good news is that adding new numbers to an array-backed list is efficient. We just need to push them onto the back of the array, which takes \( \Theta(1) \) time.
(For now we'll ignore the cost of “resizing” the array when necessary. We'll get there eventually!)
LOOKUP: The bad news comes when we try to find a particular item in the array. Because the numbers are not arranged in any particular way, we have no choice but to iterate through the array until we find the item (or don't).
First, let's think about best and worst cases:
- In the worst case (the item is last or not present) it takes \( n \) operations, \( \Theta(n) \) time.
- In the best case it only takes one operation, \( \Theta(1) \) time.
So the overal complexity is no worse than linear (but it could be better). Hence \( \mathrm{O}(n) \) is the most precise thing we can safely say in general.
Thus, looking up an item in an arbitrary array takes \( \mathrm{O}(n) \) time.
I want MORE speed than that!
A More Clever Bad Idea
Okay, if we want to make lookup easier, let's keep the array sorted!
Performance of a Sorted Array-Backed List
LOOKUP: It is true that lookup is easier in a sorted array! It's easier for the same reason that it's easier to find a word in a dictionary than it is to find that word in a novel.
Just like looking something up in a dictionary, we start in the middle of the list and then, depending on whether our item should come before or after the item at the midpoint, search the half that contains the item we are looking for. You may have seen this algorithm before; it's called binary search. (If you haven't, that's okay!).
Since binary search rules out half of the list at every iteration, LOOKUP will never take more than \( \log{n} \) steps to find an item (or determine that it is not present). It could, of course, find the item right away. So the complexity is no worse than logarithmic, but it could be better. So the lookup complexity would be \( \mathrm{O}( \log{n} ) \). Pretty nice!
INSERT: This time it's insertion that's the bad news. When we have a new item to insert, we have to put it in the right place to keep the list sorted. We can use binary search to find the place, so that's easy, but once we find it, we have to shift items over to make room.
In the worst case we would have to shift all \( n \) items over. But in the best case, we might not have to shift any items over. So the complexity is no worse than linear, but it could be better. So \( \mathrm{O}(n) \) is the most accurate and precise characterization.
A sorted array makes lookup faster. We can use binary search, which takes \( \mathrm{O}( \log{n} ) \) time!
But now we have to keep the array sorted, which means that insertion takes \( \mathrm{O}(n) \) time.
An Even More Clever Bad Idea
Okay, okay. So it's hard to insert things into an array-backed list. But it's easy to insert things into a linked list! So let's use a linked list!
Peformance of a Linked List
INSERT: It's true that we can insert things into a linked list very easily! If we know where to perform the insert, we can do it in \( \Theta(1) \) time! So the complexity of insert here depends entirely on the complexity of lookup.
LOOKUP: Now the problem is lookup again! Even if we have a sorted linked list, we can't perform binary search!
Why not?
Looking up a word in an actual paper dictionary is easy because you can flip directly to pages partway through the book. A linked list does not allow that—linked lists are terrible at random access!
So we'd have to go back to linear search (just going through the list one item at a time), which takes \( \mathrm{O}(n) \) time.
BTW, I don't think my generation uses paper dictionaries, or telephone directories, or the Yellow Pages™. We just use a search engine on the internet.
Or we text them!!
Yeah. Talking is hard.
Meh. No website, no business.
Sorry, those things were good examples of how alphabetization (sorting) helped, but they've kinda fallen out of use. But I'm sure you can think of some other physical-world examples where alphabetization is still useful.
Sure, I keep my cans of food alphabetized so I can get MORE things in the cupboard. So anyway, am I out of luck in my quest for MORE speed…?
A Better Idea
There is a data structure that combines the strengths of both a sorted array and a linked list.
Yeah! What is it?!
It needs to organize the data according to an order, like a sorted list. But it needs to be made of nodes, like a linked list.
It's called a binary search tree!
Oh! I think I saw that in a previous class, but I've forgotten all the details now!
(When logged in, completion status appears here.)