Expected Complexity of Randomly Generated BSTs
Expected Average Depth
Let's suppose that we have a typical random tree. How much worse is it going to be to use that tree compared to a perfectly balanced tree? To explore this question, we'll look at how deep a typical node is in each kind of tree.
We define the depth of a node as the path length from the root of the tree (thus the height of the tree is the depth of the deepest node).
Average Depth in a Perfect Tree
In a perfectly balanced tree, (roughly) half the nodes are leaves, half of the remaining nodes are on the level above, half of those on the level above that, and so on. So the average node depth for a perfect tree of height \( h \) with \( n = 2^{h+1} - 1 \) nodes is
\[ \begin{align} D_{ \mathrm{perfect}}(n) &= \frac{1}{n} \sum_{i=0}^h i 2^i \\ &= \tfrac{1}{n} (2^{h+1} (h-1)+2) \\ &= \tfrac{n+1}{n} \log_2 (n+1)-2 \\ &\in \Theta( \log n) \end{align} \]
*gulp* Are you expecting me to derive things like that myself?!
No. We think your math courses probably have given you the core skills to do so, but we don't expect it. We'll also just quote the next one, although it's very possible to do the math to derive it.
Talk to us about it if you're interested!
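If you'd rather check the closed form than derive it, here's a quick sketch (the helper names are ours, not from the text) that compares the direct sum against the formula above:

```python
import math

def avg_depth_by_sum(h):
    """Average node depth via the direct sum (1/n) * sum of i * 2^i, i = 0..h."""
    n = 2 ** (h + 1) - 1
    return sum(i * 2 ** i for i in range(h + 1)) / n

def avg_depth_closed_form(h):
    """Average node depth via the closed form ((n+1)/n) * log2(n+1) - 2."""
    n = 2 ** (h + 1) - 1
    return (n + 1) / n * math.log2(n + 1) - 2

# The two should agree (up to floating-point error) for every height.
for h in range(1, 16):
    assert math.isclose(avg_depth_by_sum(h), avg_depth_closed_form(h))
```

Running this for a range of heights is a cheap way to catch an algebra slip in a derivation like the one above.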
Expected Average Depth in a Random Tree
In the literature for random trees (or if you derive it yourself), you can find a similar formula for the expected average depth of a node in a random tree:
\[ \begin{align} D_{ \mathrm{random}}(n) &= 2 (1+1/n) \mathrm{H}_n - 4 \\ & \approx 2 ( \ln n + \gamma -2) \\ &\in \Theta( \log n) \end{align} \] where \( \mathrm{H}_n\) is the \(n^{ \mathrm{th}}\) harmonic number, and \( \gamma \approx 0.5772\) is the Euler–Mascheroni constant.
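We can also sanity-check this formula empirically. The sketch below (our own code, assuming keys arrive in a uniformly random order) builds many random BSTs, measures the average node depth, and compares it with the formula:

```python
import random

def total_depth(keys):
    """Insert keys one by one into an (unbalanced) BST; return the sum of all node depths."""
    root = None
    total = 0
    children = {}                    # key -> [left child key, right child key]
    for k in keys:
        if root is None:
            root = k
            children[k] = [None, None]
            continue
        cur, depth = root, 0
        while True:
            depth += 1
            side = 0 if k < cur else 1
            if children[cur][side] is None:
                children[cur][side] = k
                children[k] = [None, None]
                total += depth
                break
            cur = children[cur][side]
    return total

def d_random_formula(n):
    """Exact expected average depth: 2 * (1 + 1/n) * H_n - 4."""
    h_n = sum(1 / i for i in range(1, n + 1))   # n-th harmonic number
    return 2 * (1 + 1 / n) * h_n - 4

n, trials = 255, 200
avg = sum(total_depth(random.sample(range(n), n)) for _ in range(trials)) / (trials * n)
print(avg, d_random_formula(n))     # the two values should be close
```

With a couple of hundred trials the simulated average typically lands within a few hundredths of the formula's prediction.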
Average Depth Compared
So they're both \( \Theta( \log n)\), that's great!
Meh. I want to know how much worse random trees are, and the \( \Theta( \log n)\) form magics all that away as if we shouldn't care.
True, but we don't just have the asymptotic forms; we have exact formulas too, so we can answer your question!
\[ \lim_{n \to \infty} \frac{D_{ \mathrm{random}}(n)}{D_{ \mathrm{perfect}}(n)} = \lim_{n \to \infty} \frac{2 \ln n}{ \log_2 n} = 2 \ln 2 \approx 1.38629 \]
So we expect the average depth of nodes in a random BST to be about 38.6% worse than it would be for a perfect tree.
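If you want to watch that convergence happen, a short sketch using the two exact formulas (function names are ours) shows the ratio creeping toward \( 2 \ln 2 \):

```python
import math

def d_perfect(n):
    """Exact average depth in a perfect tree of n = 2^(h+1) - 1 nodes."""
    return (n + 1) / n * math.log2(n + 1) - 2

def d_random(n):
    """Exact expected average depth in a random BST of n nodes."""
    h_n = sum(1 / i for i in range(1, n + 1))   # n-th harmonic number
    return 2 * (1 + 1 / n) * h_n - 4

for h in (7, 13, 20):                 # perfect-tree sizes: n = 2^(h+1) - 1
    n = 2 ** (h + 1) - 1
    print(n, d_random(n) / d_perfect(n))
# The printed ratios creep toward 2 * ln(2), about 1.38629, as n grows.
```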
With this information in hand, we can think about the complexity of BST operations.
Expected Complexity (Uniform Random Lookup)
The derivation above told us exactly how deep an “average node” is in a typical random tree. The lookup cost is very similar: it's actually \(D_{ \mathrm{random}}(n) + 1\) (the +1 because a lookup counts the nodes it compares against, not the edges between them).
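As a tiny illustration of that node-versus-edge counting (a hand-built three-node tree; the names are ours), we can count the comparisons a lookup makes:

```python
def lookup_comparisons(root, key):
    """Search the tree; return (found, number of node comparisons made)."""
    comparisons = 0
    node = root
    while node is not None:
        comparisons += 1
        if key == node["key"]:
            return True, comparisons
        node = node["left"] if key < node["key"] else node["right"]
    return False, comparisons

def leaf(k):
    return {"key": k, "left": None, "right": None}

#        4
#       / \
#      2   6        a tiny hand-built BST
root = {"key": 4, "left": leaf(2), "right": leaf(6)}

print(lookup_comparisons(root, 4))   # root is at depth 0 -> 1 comparison
print(lookup_comparisons(root, 6))   # depth 1 -> 2 comparisons
```

A node at depth \( d \) costs \( d + 1 \) comparisons, which is exactly the off-by-one in the lookup cost above.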
So we know that if our tree is a random tree, the expected lookup cost for a randomly chosen key is \( \Theta( \log n)\).
What if the key isn't chosen randomly? What if we picked the worst one, or the best one?
Good point. Let's think about that…
Expected Complexity (Arbitrary Lookup)
The best key to look up is the root, which is \( \Theta(1)\).
The worst key to look up is the one deepest in the tree, so that cost depends on the height of the tree. Random trees are untidy, but we can expect them to be acceptably balanced, so the expected height of the tree is also \( \Theta( \log n)\).
If we look for something at the root of the tree, it's \( \Theta(1) \) time, and if we look for something at the bottom of the tree, it's \( \Theta( \log n) \) expected time.
So if we make no assumption about how likely any particular key is, it's \( \mathrm{O}(\log n)\) expected time.
Similarly, insert and remove are \( \mathrm{O}(\log n)\) expected time with no assumptions about the key (and \( \Theta(\log n)\) if we make a uniformity assumption).
Limitations
Oh! That's great news! So BSTs are saved!
Yes, it's very cool. But let's not get too excited yet.
First, this complexity bound is for expected time. The formula for \(D_{ \mathrm{random}}(n)\) was for a random node in a random tree. In reality, there is a distribution of possible depths, and that distribution has a very long tail. The theoretical worst case remains \( \Theta(n)\), even though that case is vanishingly unlikely if the tree is random.
Second, our probability model may or may not be reasonable. We've assumed a random tree. But inserting data in sorted order (or nearly sorted order) won't make a random tree, it'll make a very unbalanced tree that isn't random looking at all.
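To see that degeneration concretely, here's a sketch (our own unbalanced-insert code, not from the text) measuring the height a BST reaches when keys arrive in sorted order:

```python
def bst_height(keys):
    """Insert keys one by one into an (unbalanced) BST; return the tree height."""
    children = {}                 # key -> [left child key, right child key]
    root = None
    height = 0
    for k in keys:
        if root is None:
            root, children[k] = k, [None, None]
            continue
        cur, depth = root, 0
        while True:
            depth += 1
            side = 0 if k < cur else 1
            if children[cur][side] is None:
                children[cur][side] = k
                children[k] = [None, None]
                height = max(height, depth)
                break
            cur = children[cur][side]
    return height

print(bst_height(range(100)))     # sorted input: every key goes right -> height 99
```

Sorted insertion produces a tree that is effectively a linked list of height \( n - 1 \), so lookups on it cost \( \Theta(n) \), not \( \Theta( \log n) \).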