Randomized Trees
An Idea: Shuffling Keys
Hay, I have an idea!
Go for it!
What if we randomly shuffle the order of all the keys before we insert them?
Oh, interesting!
One of the limitations of our expected-case analysis is that we relied on an assumption that all insertion orders are equally likely. That might not be true in every problem!
One way to remove that limitation is to randomly shuffle the keys before we insert them.
In that case we know that all orderings are equally likely! We essentially force our probabilistic assumption to be true and guarantee that the expected complexity is logarithmic.
Meh. That seems pretty impractical.
Oh?
Yeah! In order to shuffle the keys, you need to know all of the keys beforehand.
Huh. That's true.
What if the keys come in one at a time? Then we can't shuffle them before we insert them!
This is a good point. While sometimes we might be able to know the keys ahead of time and shuffle them before we insert them, that's not always the case. If only we had some other way… If only we could have a tree that “shuffled itself” as we inserted the keys…
I'm not sure I like where this is going… Is it going to be complicated?
Actually, it's not too bad. It's a pretty cool idea, and it's not too hard to implement! Let's dive in!
Randomized BSTs
When you have a new key, it follows a single path from the root to a leaf in the tree (greater than this to go right, less than this so go left, and so on…).
Randomized trees keep themselves balanced by randomly selecting where along that path the new node should end up.
The basic idea is that at each node along the path, we randomly choose whether to insert the node deeper in the tree or to insert the node such that it becomes the root of this subtree (i.e., right in this spot). For now, don't worry about how we would cause the new node to become root of a subtree; we'll get to that later.
Randomized trees simulate the process of shuffling the keys by moving the randomness into insert, instead of having all of the randomness happen by shuffling ahead of time! Rather than shuffling the data before inserting, we shuffle the arrangement of the nodes within the tree as we insert them. (Without breaking the BST property, of course!)
If we remember that we want our trees to have the same characteristics as trees generated from shuffled keys, we can use what we know about random shuffling to help us figure out the details of how we want randomized insert to work.
Randomized Insertion Example
Suppose we have this tree, where the "20" node is the one we want to insert:
If we were shuffling the keys beforehand, then we would have had all of the values—including the 20—before we started building the tree!
Since BST insertion always adds values to the bottom of the tree, we know that the root will be whatever value was added first. That means that for "20" to have been the root, it would need to have been first in the list of values after they were shuffled.
There are already 11 values in the tree, and one value we're trying to add.
If we had all of the values ahead of time, we would have had 12 values to shuffle.
Every value has an equal likelihood of being first when we shuffle the list, so each value (including the 20) has a \( \frac{1}{12} \) probability of being the first one in the list.
In general, if we have \( N \) values already in our tree, then we should have a \( \frac{1}{ N + 1 } \) probability that the new value becomes the root!
Inserting in the 11-Node Subtree
Let's roll a 12-sided die to represent our \( \frac{1}{12} \) probability that we insert at the root.
If we roll a 1, we'll insert at the root of this tree!
We didn't get a 1.
That means we don't insert at the root. So we insert at the leaf, right?
No, it just means we don't insert at the root. Like other things in trees, our randomized insert is going to be recursive.
Recursive randomness?!! My tummy hurts.
If we had to do all the insertions all at once, it could be pretty complicated. But remember the beauty of recursion!
We just have to handle one case at a time, and we push all of the complicated stuff into the recursive call.
Our die roll told us that we didn't want to insert at the root of our tree. That means we're going to insert it somewhere in the left subtree, since 20 < 37.
Where in the left subtree will it end up? That's up to recursion!
There are \( N = 3 \) values currently in the subtree whose root is 19 and we have one more to add.
If we were shuffling the keys beforehand, whichever one of those values came first in the shuffled order would get to be the root of the tree.
Each key has an equal chance of being first, so 20 would be the root of the subtree with a probability of \( \frac{1}{ N + 1 } = \frac{1}{4} \).
Inserting in the 3-Node Subtree
We can use a four-sided die to decide whether we should insert at the root of this subtree. Again, we'll say that a 1 means "insert at root", and anything else means "recurse down into a subtree."
The simulated tetrahedron has generated the numeral 3.
What does that mean?
Recurse into a subtree!
You got it! Since 20 > 19, we'll go into the right subtree this time:
Remember that we're trying to simulate what would happen if we were able to shuffle all of our values up and generate random trees.
Of all the trees that had 20 and 29 in their own subtree, we would expect 20 to be the root half the time and 29 to be the root the other half of the time.
Inserting in the 1-Node Subtree
Oh boy! It's time for a coin flip!
Heads means we insert at root, tails means we recurse into the subtree.
Heads! That means insert at root!
Yes, and remember that the "root" is the root of this subtree. so we'll insert our new value above the 29.
And now our tree is
(When logged in, completion status appears here.)