Written Questions
There are multiple ways to represent sequences of items (a.k.a. lists). In this assignment, we've focused on making a linked list, but previously we've looked at array-like structures—both C++'s built-in (primitive) fixed-size arrays, and std::vector
, which behaves like a flexibly sized array.
std::vector
provides its “flexible array” by using a primitive array behind the scenes, but only using a portion of the array to store data, with the remaining space unused. When there is too much or too little empty space, it reallocates the array (following a doubling strategy like the one we saw for Quinn's “smart train” in the last assignment).
Every approach has different trade-offs, which we can analyze theoretically and explore empirically.
Your task is to explore the performance of sequence data structures from both a theoretical and practical perspective, and answer some questions on Gradescope along the way.
- Open HW05 Written Questions on Gradescope — But keep this page open and keep reading!
Applying Theory: Two Designs for an IntVector
Class
Let's consider two designs for a class like C++'s std::vector
, which we'll call IntVector
.
Design 1
In this first design, the class contains the following code:
Defining the encoding of the vector:
private:
// Data members
int* arr_; // Contains the elements of the vector
size_t capacity_; // Actual size of underlying array
size_t size_; // Number of elements actually stored
The indexing operation (operator[]
):
int& IntVector::operator[](size_t index) {
assert(index < size_); // Detect improper use
return arr_[index];
}
pop_front()
is written as
int IntVector::pop_front() {
assert(size_ > 0); // Detect improper use
int firstVal = arr_[0]; // Save value to return
// From front to back, shift items over
for (size_t i = 1; i < size_; ++i) {
arr_[i - 1] = arr_[i];
}
--size_;
downsizeIfNeeded();
return firstVal;
}
(push_front()
is analogous to pop_front()
.)
Also, notice that the iterator is encoded as a pointer into the underlying primitive array.
Design 2
The second design is similar, but uses a technique known as a circular array. Here, the encoding adds one more data member:
private:
// Data members
int* arr_; // Contains the elements of the vector
size_t capacity_; // Actual size of underlying array
size_t size_; // Number of elements actually stored
size_t start_; // Index of the first item in the vector
The first element of the vector (index 0), is not necessarily at the start of the underlying array, so we use the start_
data member to keep track of its location.
Here is the code for the indexing operation:
int IntVector::operator[](size_t index) const {
assert(index < size_); // Detect improper use
// The array is circular, so we wrap around modulo the capacity
return arr_[(start_ + index) % capacity_];
}
Although this approach's operator[]
has slightly more code than in
our first example, the pop_front()
operation has less code here.
int IntVector::pop_front() {
assert(size_ > 0); // Detect improper use
int firstVal = arr_[start_]; // Save value to return
start_ = (start_ + 1) % capacity_; // Drop first element
--size_;
downsizeIfNeeded();
return firstVal;
}
Questions
On Gradescope, in Question 1, you'll find some questions asking you to analyze these two designs, and contrast them with your IntList
.
Development
To go beyond theory, we need actual running code. Sometimes, getting running code requires debugging!
Questions
Using either your understanding from the description of Design 2 above, or from looking at the actual code in intvector.cpp
in the provided homework code, produce a memory diagram for the code below at the point shown, and submit a photo of it for Question 2 on Gradescope.
int main() {
IntVector vec{4}; // Create a vector with capacity 4, so it won't
// need to resize in this example.
cout << "Pushing 3 items to the back: 54, 42, 7" << endl;
vec.push_back(54);
vec.push_back(42);
vec.push_back(7);
// Remove front item and print it
cout << "Popping first item: " << vec.pop_front() << endl;
// Create an iterator to the first item
IntVector::iterator it = vec.begin();
// <-- Draw a memory diagram at this point.
// (there is a little more code after this in intvector-example.cpp)
return 0;
}
Empirical Exploration
The scientific process relies on experimentation to test hypotheses and theories. An experiment can't prove that a theory is true, but it can refute a theory.
Asymptotics answer the question “how does performance scale?”. We can look at this idea empirically. as well.
Empirical exploration can help us understand how different data structures perform in practice. In this part, you will run some experiments to compare the performance of different list implementations.
Running listperf
We've provided you with a program, listperf.cpp
, which you'll use to run some performance experiments and gather data. You can compile this program by running make listperf
. (The source file has a comment at the top describing the program which you can read, but you do not have to closely examine the code.)
What's a “deque”, and why is it called that?
It's pronounced “deck”, and it's short for double-ended queue, because we can add and remove items from both the front and the back.
listperf
measures the performance of several sequence (a.k.a. list) data structures:
IntList
, a singly linked list, which you wrote.IntVector
, a vector following Design 2 above, which we provided.std::list
, a doubly linked list from the C++ standard library.std::forward_list
, a singly linked list from the C++ standard library.std::vector
, from the C++ standard library.std::deque
, from the C++ standard library.
Run with no options, the program produces the following help output:
> ./listperf
Usage: ./listperf [options]
Options (at least one must be selected):
-h, --help Print this message and exit.
-b, --push-back Measure push_back.
-f, --push-front Measure push_front.
-e, --equal Measure equality (equal lists).
-u, --unequal Measure equality (unequal lists).
Pick one of these options and run it as a test. Notice that because we care about how performance scales as the problem size increases, we use an exponentially increasing set of sizes (100, 300, 1000, 3000, 10000, 30000, 100000, 300000, 100000), increasing by half an order of magnitude (a factor of \( \sqrt{10} \)) each time.
Wow, it's cool that I can see how well my code does against the industrial-strength code in the standard library!
Note that std::forward_list
does not have a way to add to the back of the list, so it is not included in the -b
/--push-back
test.
Saving Your Results: Exporting in a Machine-Readable Format
Although you can use I/O redirection to save the output to a file (e.g., with ./listperf --push-back > Experiments/listperf-pb.out
), doing so will leave you waiting while the program runs, with no indication of how long it will take to finish or what is happening. Instead, you can use the tee
command, which will send the output to both the terminal and a file.
./listperf --push-back | tee Experiments/listperf-pb.out
Although listperf
's output is quite human-readable, if we wanted to put it into a spreadsheet or graph-drawing program it helps to have the data available in a standard format. Two common formats are TSV (tab-separated values) and CSV (comma-separated values). We've provided a program, perf2tsv.py
that can convert the output into one of these formats. You can run the script as
./listperf --push-back | tee Experiments/listperf-pb.out
./perf2tsv.py --time --csv < Experiments/listperf-pb.out > Experiments/listperf-pb.csv
In this example, the --time
option tells the script to extract the time data (whereas
--space
would extract memory-usage data), and the --csv
option tells it to
produce a CSV file. (Omitting the --csv
option will produce
a TSV file instead). Finally, because different programs (and people) might want the data arranged in different ways, there is also a --transpose
option, which swaps the rows and columns.
Drawing Graphs
The graph below shows a plot of space usage. Recall that to test how the algorithm scales we use exponentially increasing sizes, so we've used a log–log plot. (The graph omits IntVector
, which has the exact same line as std::vector
, and std::forward_list
, which has the same line as IntList
.)
In a log–log plot, the function \( f(n) = k n^p \) will be a straight line, with slope \( p \), at a distance from the x-axis proportional to \( \log k \). Thus \( \Theta(n) \) and \( \Theta(n^2) \) functions can both have straight lines, just with different slopes. So we can see that the graph in Figure 1 is consistent with all the functions having the same big-\( \Theta \) performance, but we can't just look at the graph and immediately see what the slope is.
But theory can help us. For these experiments, theory indicates that these algorithms use linear space, thus storing \( n \) items should require \( \Theta(n) \) space. So, we can see if our experimental results are consistent with theory by dividing by \( n \). In other words, if we think \( f(n) \in \Theta(n) \), plot \( f(n)/n \), because a horizontal slope is easy to see.
The graph in Figure 2 adopts this approach and plots space-usage-per-element, and thus divides by \( n \).
The
perf2tsv.py
script has a handy option,--divide-by
, to do this kind of division automatically—just say-divide-by n
or-divide-by n^2
or even-divide-by 'n log n'
.You like writing tools, don't you?
Writing code is fun, especially when you can solve a problem with it.
I'd just put the numbers in a spreadsheet and do it that way.
Hay! Your graphs have lots more points than we get when we run
listperf
.Yes, there is a constant you can change in
listperf
,EXP_STEP
. It's set to 0.5, but if you change it to 0.25, there will be twice as many points. You may also want to increaseSIG_DIGITS
to 2.0 or 3.0 if you changeEXP_STEP
.Meh. I've got enough points. Shoulda been a command-line option anyway.
Questions
On Gradescope, we ask you some questions about the space-usage graphs.
We also ask you to draw a graph yourself using the data you've collected. We expect that you have enough experience drawing graphs in other classes that you can quickly generate a graph. Drawing nice graphs can be a time sink, but try to rein in how deeply you go down that rabbit hole. If you just want to look at the data and hand draw a graph on a whiteboard, that's okay, too, as long as the graph is actually correct.
Drawing Graphs
It looks like the graphs on this page were drawn using Mathematica, which all students have a license to use. It can do all kinds of things! Mudd students can get access to download it (or use it on-line) here. (Others should check with their home college's IT department.)
But there are other options, too… Like Google Sheets.
Command-Line Graph Drawing with Python and Matplotlib
I hate Google Sheets. It's clumsy. Can't you just give me something that'll draw the graph for me?
Okay… You don't have to use it, but Prof. Melissa did write a handy tool to really easily draw log–log graphs like the ones on this page.
The tool is called plotdata.py
, and it's provided with the assignment files. Here's a command sequence that recreates one of the graphs above.
./listperf --push-back | tee Experiments/listperf-pb.out
./perf2tsv.py --space < Experiments/listperf-pb.out > Experiments/listperf-pb-space.tsv
unset DISPLAY
./plotdata.py --auto-transpose --markers --divide-by 'n' --xlabel Size --title 'push_back performance' --ylabel 'Space per Element (bytes/elem)' --output Experiments/listperf-pb-space-div-n.png < Experiments/listperf-pb-space.tsv
(The first time the script runs, it'll install some Python packages for matplotlib
, which may take a few moments two as they're downloaded and installed.)
This is just one possible graph you could draw. If you change what you're measuring, be sure to change the arguments to the lot program to give it the right labels.
You can also run
python3 plotdata.py --help
to learn more about the plotdata.py
program.
If you run it on your own computer, rather than on the server, it can show you the graph interactively on your screen.
Yes, but that's not required.
Meh. None of that's required. I might just draw it out on paper!
Further Exploration
If you have the time and you're curious, you can also explore some of the differences in performance that are created by the compiler. You can recompile your code using GCC's g++
compiler (GCC stands for the “GNU Compiler Collection”) by running
make clean
make CXX=g++ listperf
Now run the same tests you ran before with this new executable!
Now you can use the data to run all sorts of statistical analyses between the two sets of results!
*rolls eyes*
(When logged in, completion status appears here.)