Phase 6: Randomized Insert
In this step, you will extend your insert
functionality to allow for inserting randomly, simulating the behavior of a random tree.
You'll follow the usual development stages (planning, writing tests, implementing, verifying).
You will need to submit a planning image for this phase, Planning/Phase_6.jpg
.
Steps
Add a New Data Member
Recall from the specifications that your tree should have a member variable of type RandUInt32
, which it will use to generate random numbers.
Add One More Constructor
Add one more constructor to the TreeSet
class:
- The
TreeSet
parameterized constructor that takes two parameters: abst_style
and asize_t
. The second parameter will be the seed for the random number generator.
See the Generate Random Numbers help page for information about how to initialize a RandUInt32
object with a seed (and/or the Week 10, Lesson 2 page on random-number generation).
Update insert
Now add a third conditional to your insert
function to call a different private helper member function if the tree’s type is RANDOMIZED
.
In that private helper function, you will need to randomly decide whether to insert at root for the current subtree or recursively call the randomized insert function on a lower subtree. You must use the random-number generator (RNG) in a specific way so that we can set the seed for your tree and reproducibly check the outcome. Do not use the tree's random-number generator at any point other than in these lines:
int randomInt = rand_.get(nodeSize(node) + 1);
bool doRootInsert = (randomInt == 0); // Root insert with(1/new_subtree_size probability)
These lines first use the RNG to uniformly randomly pick an integer from 0 to nodeSize(node)
, which is equivalent to rolling a nodeSize(node) + 1
-sided die! We arbitrarily pick one possible “side of the die”—0—to mean that we should insert and move to the root of this subtree; any other value means we should keep moving down the tree.
Your naming might differ, but please stick to this method for making the random decision of whether to insert or recurse. If you don't, your code will fail tests in the autograder!
(See the hints below for some tips on testing!)
Helpful Hints
Testing Randomized Insert
Testing randomized algorithms can be tricky! In particular, you might have been writing tests that do some operations and then compare the string representation of the tree to what you expected the tree structure to be. In most cases you can't do that here, because the outcome will be random!
Here are some ideas:
- Insertion of items in sorted order builds a stick with both the
LEAF
andROOT
insertion styles, butRANDOMIZED
trees will roughly balance the tree. Although you can't say for sure what theaverageDepth
orheight
functions will return, we know some options are incredibly unlikely. See the probabilities hint below. - Two randomly generated trees with the same elements probably won't print the same because the shape of the tree will be different (in fact, for a large tree, the probability of them being identical is vanishingly small), but they should compare as equal.
- Two randomly generated trees with the same seed and the same elements but inserted in a different order can be guaranteed to not print the same, but should compare as equal.
- If you know the seed, you can actually guarantee that every CS 70 randomized-BST implementation will print the exact same tree, because they are required to follow the exact same insertion algorithm, using the same random-number generator in the same way. We actually test your trees this way.
There are also some things that must be true, no matter how the randomness works out. For example, after inserting a key, the key must be in the set!
Expected Average Depth
Week 10, Lesson 2 gives some probabilities for the average node depth in a random tree. We'll provide the same kind of information here, but turned into easily computable approximations for the expected average depth of a random BST:
\[ \begin{align} \text{expected value} &= 2 \left(1+\frac{1}{n}\right) H_n-4 \\ & \approx 2 \left(\frac{1}{n}+1\right) (\log n+\gamma+ \frac{1}{2n} )-4 \\ & \in \Theta(\log n) \end{align} \]
\[ \begin{align} \text{variance} &= 7 - 4 \mathrm{H}_n^{(2)} - \frac{2 \left( \mathrm{H}_n+2 \mathrm{H}_n^{(2)} \right)}{n^2} + \frac{-2 \mathrm{H}_n - 8 \mathrm{H}_n^{(2)} + 13}{n} \\ &\approx 7 - \frac{2 \pi^2}{3} + \frac{17 - 2 \gamma - \frac{4}{3} \pi^2 - 2 \log n}{n} + \frac{5 - 2 \gamma - \frac{2}{3} \pi^2 - 2 \log n}{n^2} \\ & \approx_{n \to \infty} 7 - \frac{2 \pi^2}{3} \approx 0.420264 \\ & \in \Theta(1), \end{align} \]
where \( \gamma \approx 0.577216 \) is Euler's constant and \( \mathrm{H}_n \) are the harmonic numbers.
These approximations are accurate even for quite small \( n \), being within 1% of the true value when \( N \geq 5 \) and becoming more accurate as \( n \) increases.
We can turn the above into C++ code (which requires #include <cmath>
):
// Tree formulas provided by Prof. Melissa, based on
// https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimPDF/qsort.pdf
constexpr double EGAMMA = 0.57721566490153286; // Euler's constant
constexpr double PI_SD3 = 3.2898681336964529; // (Pi^2)/3
double avgDepthExpected =
2. * (1. + 1. / N) * (log(N) + EGAMMA + 1/(2. * N)) - 4.;
double avgDepthStdDev =
sqrt(7. - 2.*PI_SD3 + (17. - 2.*EGAMMA - 4.*PI_SD3 - 2.*log(N)) / N
+ (5 - 2*EGAMMA - 2*PI_SD3 - 2. * log(N)) / (N * N));
For example, when n = 100
, these formulas calculate avgDepthExpected
as 6.47852 and avgDepthStdDev
as 0.59483. The true values to the same precision are 6.47850 and 0.59483, respectively.
Statistically, it is unlikely that we'll get a result four standard deviations away from the expected value, or we'd repeatedly get results two standard deviations away from the expected value. You can use these kinds of statistical properties in your tests.
(When logged in, completion status appears here.)