CS 70

Written Questions

There are multiple ways to represent sequences of items (a.k.a. lists). In this assignment, we've focused on making a linked list, but previously we've looked at array-like structures—both C++'s built-in (primitive) fixed-size arrays, and std::vector, which behaves like a flexibly sized array.

std::vector provides its “flexible array” by using a primitive array behind the scenes, but only using a portion of the array to store data, with the remaining space unused. When there is too much or too little empty space, it reallocates the array (following a doubling strategy like the one we saw for Quinn's “smart train” in the last assignment).

Every approach has different trade-offs, which we can analyze theoretically and explore empirically.


Your task is to explore the performance of sequence data structures from both a theoretical and practical perspective, and answer some questions on Gradescope along the way.

Applying Theory: Two Designs for an IntVector Class

Let's consider two designs for a class like C++'s std::vector, which we'll call IntVector.

Design 1

In this first design, the class contains the following code:

Defining the encoding of the vector:

 private:
    // Data members
    int* arr_;              // Contains the elements of the vector
    size_t capacity_;       // Actual size of underlying array
    size_t size_;           // Number of elements actually stored

The indexing operation (operator[]):

int& IntVector::operator[](size_t index) {
    assert(index < size_);              // Detect improper use

    return arr_[index];
}

pop_front() is written as

int IntVector::pop_front() {
    assert(size_ > 0);                  // Detect improper use

    int firstVal = arr_[0];             // Save value to return

    // From front to back, shift items over
    for (size_t i = 1; i < size_; ++i) {
        arr_[i - 1] = arr_[i];
    }

    --size_;
    downsizeIfNeeded();

    return firstVal;
}

(push_front() is analogous to pop_front().)

Also, notice that the iterator is encoded as a pointer into the underlying primitive array.

Design 2

The second design is similar, but uses a technique known as a circular array. Here, the encoding adds one more data member:

private:
    // Data members
    int* arr_;         // Contains the elements of the vector
    size_t capacity_;  // Actual size of underlying array
    size_t size_;      // Number of elements actually stored
    size_t start_;     // Index of the first item in the vector

The first element of the vector (index 0), is not necessarily at the start of the underlying array, so we use the start_ data member to keep track of its location.

Here is the code for the indexing operation:

int IntVector::operator[](size_t index) const {
    assert(index < size_);              // Detect improper use

    // The array is circular, so we wrap around modulo the capacity
    return arr_[(start_ + index) % capacity_];
}

Although this approach's operator[] has slightly more code than in our first example, the pop_front() operation has less code here.

int IntVector::pop_front() {
    assert(size_ > 0);                  // Detect improper use

    int firstVal = arr_[start_];        // Save value to return

    start_ = (start_ + 1) % capacity_;  // Drop first element

    --size_;
    downsizeIfNeeded();
    return firstVal;
}

Questions

On Gradescope, in Question 1, you'll find some questions asking you to analyze these two designs, and contrast them with your IntList.

Development

To go beyond theory, we need actual running code. Sometimes, getting running code requires debugging!

Questions

Using either your understanding from the description of Design 2 above, or from looking at the actual code in intvector.cpp in the provided homework code, produce a memory diagram for the code below at the point shown, and submit a photo of it for Question 2 on Gradescope.

int main() {
    IntVector vec{4};   // Create a vector with capacity 4, so it won't
                        // need to resize in this example.

    cout << "Pushing 3 items to the back: 54, 42, 7" << endl;
    vec.push_back(54);
    vec.push_back(42);
    vec.push_back(7);

    // Remove front item and print it
    cout << "Popping first item: " << vec.pop_front() << endl;

    // Create an iterator to the first item
    IntVector::iterator it = vec.begin();

    // <-- Draw a memory diagram at this point.

    // (there is a little more code after this in intvector-example.cpp)

    return 0;
}

Empirical Exploration

The scientific process relies on experimentation to test hypotheses and theories. An experiment can't prove that a theory is true, but it can refute a theory.

Asymptotics answer the question “how does performance scale?”. We can look at this idea empirically. as well.

Empirical exploration can help us understand how different data structures perform in practice. In this part, you will run some experiments to compare the performance of different list implementations.

Running listperf

We've provided you with a program, listperf.cpp, which you'll use to run some performance experiments and gather data. You can compile this program by running make listperf. (The source file has a comment at the top describing the program which you can read, but you do not have to closely examine the code.)

  • Hedgehog speaking

    What's a “deque”, and why is it called that?

  • LHS Cow speaking

    It's pronounced “deck”, and it's short for double-ended queue, because we can add and remove items from both the front and the back.

listperf measures the performance of several sequence (a.k.a. list) data structures:

  • IntList, a singly linked list, which you wrote.
  • IntVector, a vector following Design 2 above, which we provided.
  • std::list, a doubly linked list from the C++ standard library.
  • std::forward_list, a singly linked list from the C++ standard library.
  • std::vector, from the C++ standard library.
  • std::deque, from the C++ standard library.

Run with no options, the program produces the following help output:

DOCKER > ./listperf
Usage: ./listperf [options]
Options (at least one must be selected):
  -h, --help       Print this message and exit.
  -b, --push-back  Measure push_back.
  -f, --push-front Measure push_front.
  -e, --equal      Measure equality (equal lists).
  -u, --unequal    Measure equality (unequal lists).

Pick one of these options and run it as a test. Notice that because we care about how performance scales as the problem size increases, we use an exponentially increasing set of sizes (100, 300, 1000, 3000, 10000, 30000, 100000, 300000, 100000), increasing by half an order of magnitude (a factor of \( \sqrt{10} \)) each time.

  • Duck speaking

    Wow, it's cool that I can see how well my code does against the industrial-strength code in the standard library!

Note that std::forward_list does not have a way to add to the back of the list, so it is not included in the -b/--push-back test.

Saving Your Results: Exporting in a Machine-Readable Format

Although you can use I/O redirection to save the output to a file (e.g., with ./listperf --push-back > Experiments/listperf-pb.out), doing so will leave you waiting while the program runs, with no indication of how long it will take to finish or what is happening. Instead, you can use the tee command, which will send the output to both the terminal and a file.

./listperf --push-back | tee Experiments/listperf-pb.out

Although listperf's output is quite human-readable, if we wanted to put it into a spreadsheet or graph-drawing program it helps to have the data available in a standard format. Two common formats are TSV (tab-separated values) and CSV (comma-separated values). We've provided a program, perf2tsv.py that can convert the output into one of these formats. You can run the script as

./listperf --push-back | tee Experiments/listperf-pb.out
./perf2tsv.py --time --csv < Experiments/listperf-pb.out > Experiments/listperf-pb.csv

In this example, the --time option tells the script to extract the time data (whereas --space would extract memory-usage data), and the --csv option tells it to produce a CSV file. (Omitting the --csv option will produce a TSV file instead). Finally, because different programs (and people) might want the data arranged in different ways, there is also a --transpose option, which swaps the rows and columns.

Drawing Graphs

The graph below shows a plot of space usage. Recall that to test how the algorithm scales we use exponentially increasing sizes, so we've used a log–log plot. (The graph omits IntVector, which has the exact same line as std::vector, and std::forward_list, which has the same line as IntList.)

A Log-Log Plot of Space Usage
1. A Log–Log Plot of Space Usage.

In a log–log plot, the function \( f(n) = k n^p \) will be a straight line, with slope \( p \), at a distance from the x-axis proportional to \( \log k \). Thus \( \Theta(n) \) and \( \Theta(n^2) \) functions can both have straight lines, just with different slopes. So we can see that the graph in Figure 1 is consistent with all the functions having the same big-\( \Theta \) performance, but we can't just look at the graph and immediately see what the slope is.

But theory can help us. For these experiments, theory indicates that these algorithms use linear space, thus storing \( n \) items should require \( \Theta(n) \) space. So, we can see if our experimental results are consistent with theory by dividing by \( n \). In other words, if we think \( f(n) \in \Theta(n) \), plot \( f(n)/n \), because a horizontal slope is easy to see.

The graph in Figure 2 adopts this approach and plots space-usage-per-element, and thus divides by \( n \).

A Log-Log Plot of Space Usage per Element
2. A Log–Log Plot of Space Usage Per Element.
  • LHS Cow speaking

    The perf2tsv.py script has a handy option, --divide-by, to do this kind of division automatically—just say -divide-by n or -divide-by n^2 or even -divide-by 'n log n'.

  • Goat speaking

    You like writing tools, don't you?

  • LHS Cow speaking

    Writing code is fun, especially when you can solve a problem with it.

  • Goat speaking

    I'd just put the numbers in a spreadsheet and do it that way.

  • Horse speaking

    Hay! Your graphs have lots more points than we get when we run listperf.

  • LHS Cow speaking

    Yes, there is a constant you can change in listperf, EXP_STEP. It's set to 0.5, but if you change it to 0.25, there will be twice as many points. You may also want to increase SIG_DIGITS to 2.0 or 3.0 if you change EXP_STEP.

  • Goat speaking

    Meh. I've got enough points. Shoulda been a command-line option anyway.

Questions

On Gradescope, we ask you some questions about the space-usage graphs.

We also ask you to draw a graph yourself using the data you've collected. We expect that you have enough experience drawing graphs in other classes that you can quickly generate a graph. Drawing nice graphs can be a time sink, but try to rein in how deeply you go down that rabbit hole. If you just want to look at the data and hand draw a graph on a whiteboard, that's okay, too, as long as the graph is actually correct.

Drawing Graphs

  • Rabbit speaking

    It looks like the graphs on this page were drawn using Mathematica, which all students have a license to use. It can do all kinds of things! Mudd students can get access to download it (or use it on-line) here. (Others should check with their home college's IT department.)

  • LHS Cow speaking

    But there are other options, too… Like Google Sheets.

Command-Line Graph Drawing with Python and Matplotlib
  • Goat speaking

    I hate Google Sheets. It's clumsy. Can't you just give me something that'll draw the graph for me?

  • LHS Cow speaking

    Okay… You don't have to use it, but Prof. Melissa did write a handy tool to really easily draw log–log graphs like the ones on this page.

The tool is called plotdata.py, and it's provided with the assignment files. Here's a command sequence that recreates one of the graphs above.

./listperf --push-back | tee Experiments/listperf-pb.out

./perf2tsv.py --space < Experiments/listperf-pb.out > Experiments/listperf-pb-space.tsv

unset DISPLAY

./plotdata.py --auto-transpose --markers --divide-by 'n' --xlabel Size  --title 'push_back performance' --ylabel 'Space per Element (bytes/elem)' --output Experiments/listperf-pb-space-div-n.png < Experiments/listperf-pb-space.tsv

(The first time the script runs, it'll install some Python packages for matplotlib, which may take a few moments two as they're downloaded and installed.)

  • RHS Cow speaking

    This is just one possible graph you could draw. If you change what you're measuring, be sure to change the arguments to the lot program to give it the right labels.

You can also run

python3 plotdata.py --help

to learn more about the plotdata.py program.

  • Rabbit speaking

    If you run it on your own computer, rather than in the Docker image, it can show you the graph interactively on your screen.

  • LHS Cow speaking

    Yes, but that's not required.

  • Goat speaking

    Meh. None of that's required. I might just draw it out on paper!

Further Exploration

If you have the time and you're curious, you can also explore some of the differences in performance that are created by the compiler. You can recompile your code using GCC's g++ compiler (GCC stands for the “GNU Compiler Collection”) by running

make clean
make CXX=g++ listperf

Now run the same tests you ran before with this new executable!

  • Rabbit speaking

    Now you can use the data to run all sorts of statistical analyses between the two sets of results!

  • Goat speaking

    *rolls eyes*

To Complete This Part of the Assignment…

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)