CS 70, Fall 2002
Assignment 7: DNA Recombination

The program for this assignment, and everything else except the README file, is due at 10 P.M. on Wednesday, November 6th, 2002. As usual, the README file is due at 1 A.M. on the following day (Thursday). Refer to the homework policies page for general homework guidelines.

The primary purpose of this assignment is to get you used to writing C++ iterators. You will also be developing a preliminary list class. Both the list class and the iterator for it will be useful to you in later assignments.

NOTE: the list class you develop in this assignment will be central to assignments 8 and 9. MAKE SURE you develop it well and debug it thoroughly. If you blow this assignment off, you will do poorly on the following two assignments as well. A correct solution to this assignment will NOT be distributed to the class. It is your responsibility to be sure you have a properly working list class and iterator.

Overview

One of the more creative approaches to artificial intelligence is the genetic algorithm, invented by Prof. John Holland of the University of Michigan.

In brief, a genetic algorithm simulates the process of evolution by applying the usual rules of genetics to simulate natural selection. In real life, natural selection's primary goal is the continuation of the species, and organisms that achieve that goal tend to be propagated. In a genetic algorithm, on the other hand, the primary goal is to satisfy a "fitness function" chosen by the programmer. For example, a simple fitness function might interpret the genes of an organism as the value of x in a complicated equation. The natural-selection process could then be tuned to prefer organisms that generate an output near zero, so that the survivors would eventually produce a solution to the equation.

Genetic algorithms were the first step in the current research area called "artificial life," and they have been used to successfully solve many problems that were otherwise intractable.

In this assignment, we will create a program that uses a genetic algorithm to find approximate square roots of integers. Although it is simplified compared to a production implementation, the program demonstrates the basic outline and capabilities of a genetic algorithm.

There are three basic processes in evolution: mutation, crossover, and selection. Mutation involves selecting a gene site and modifying it in some fashion, usually by replacing it with another gene. Mutation is very rare both in real life and in genetic algorithms. Crossover is the most important process in generating new organisms. It involves taking two gene strings (usually from two parent organisms), cutting them both at the same point, and re-splicing them so that the head of the result comes from one parent and the tail from the other. Real genetic algorithms usually generate two children in this process, and may splice at more than one point, but we'll simplify things in our implementation.
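To make the cut-and-splice idea concrete, here is a minimal sketch of single-point crossover and of mutation. It is only an illustration: it uses std::vector<int> rather than the linked list you will write, the names crossoverAt and mutateAt are hypothetical (they are not the functions in the provided code), and the cut point and mutation site are passed in explicitly, whereas a real genetic algorithm would choose them at random.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the provided code): single-point crossover.
// Both parents are cut at index `cut`; the child takes genes [0, cut) from
// headParent and genes [cut, size) from tailParent.
std::vector<int> crossoverAt(const std::vector<int>& headParent,
                             const std::vector<int>& tailParent,
                             std::size_t cut)
{
    assert(headParent.size() == tailParent.size());
    assert(cut <= headParent.size());
    const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(cut);
    std::vector<int> child(headParent.begin(), headParent.begin() + offset);
    child.insert(child.end(), tailParent.begin() + offset, tailParent.end());
    return child;
}

// Hypothetical helper: mutation replaces the gene at one site with another.
void mutateAt(std::vector<int>& genes, std::size_t site, int newGene)
{
    assert(site < genes.size());
    genes[site] = newGene;
}
```

With parents 1234 and 5678 and a cut at index 1, the child is 1678: the head comes from the first parent and the tail from the second.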

The final step, selection, involves evaluating the organisms according to some criterion (the "fitness function") and choosing the ones that are most successful. In real life, selection is the harsh process of "survival of the fittest." In a genetic algorithm, the same method is used: the least fit organisms are discarded (i.e., killed) without being allowed to reproduce. As in real life, there is some randomness, so that a somewhat unfit organism has a chance of surviving even when a more fit one is discarded. This randomness turns out to be important to the success of the method, since any two slightly unfit parents might (through crossover) generate an extremely fit child.

Because we will not have time to implement an entire genetic algorithm from scratch, much of the code has been provided for you, although you will have to clean it up. You must supply the underlying data structure (a linked list).

Scenario

To pay some bills over the summer, you've taken a job at Lamarkian Enterprises, whose eventual goal is to use powerful computers to evolve master mathematicians. Their lead programmer, Ginny F. O'Fenugreek, began exploring the problem space by writing a genetic algorithm to calculate square roots. Unfortunately, she left the company part way through the project, claiming that she needed to spend more time building boats. Management took a look at Ginny's code and discovered that it had no Makefile and that almost all of it resided in a single file. Worse, the code defines very few C++ classes, choosing instead to implement most of its functionality via top-level functions, even though there are some obvious ways in which the code could be broken into classes. In addition, although Ginny obviously intended to create an integer-list class to use in the program, no trace of it exists in her home directory. Your manager has handed you Ginny's code to clean up. You will have to:

Data Structures

This section gives you some hints and requirements regarding the data structures that your final program will use.

Organism

Currently, the program represents an organism as a list of integers (see below). You should maintain that representation, but should wrap it in an Organism class that also provides the crossover, mutate, fitness, and intListToDouble functions. The class should also provide a comparison operator that can be used in compareFitness (the latter function must still exist so that qsort will work, although you may wish to convert it into a class static function if you know how to do so).
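As a rough illustration of what a digit-list-to-number conversion can look like, here is a sketch. It is not the provided intListToDouble: the name digitsToDouble is hypothetical, it reads from a std::vector<int> instead of an IntList, and it assumes that the digits are stored most-significant first with no decimal point. Check the provided code for the actual conventions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a digit-list conversion. Assumes each entry is
// a digit from 0 to 9 and that the most significant digit comes first.
double digitsToDouble(const std::vector<int>& digits)
{
    double value = 0.0;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        assert(digits[i] >= 0 && digits[i] <= 9);
        value = value * 10.0 + digits[i];   // shift left one place, add digit
    }
    return value;
}
```

Under those assumptions, the digit sequence 0001415 converts to 1415.0.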

Colony

The program currently represents the entire colony as a simple array. You should wrap that array in a Colony class that provides the naturalSelection and findBest functions.

IntList

An organism is represented entirely by its gene sequence, which in turn is represented using a singly linked list. Each element in the list will contain only a single integer from 0 to 9 (represented by the C++ type int), plus a link to the next element. The list must have a separate header that is not a plain element, which means that you must implement two classes (the header and the element). The cleanest approach is to make the element a nested private class of the header, so that only the header (IntList) is visible from outside.
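The header-plus-nested-element layout might be sketched as follows. This is a skeleton under stated assumptions, not a solution: it shows only pushHead, popHead, and isEmpty, the member names head_, value_, and next_ are my own choices, and copying is simply disabled rather than implemented.

```cpp
#include <cassert>

class IntList {
public:
    IntList() : head_(nullptr) {}
    ~IntList() { while (!isEmpty()) popHead(); }

    // Copying is disabled in this sketch; a full class would implement it.
    IntList(const IntList&) = delete;
    IntList& operator=(const IntList&) = delete;

    bool isEmpty() const { return head_ == nullptr; }

    // Insert a new element in front of the current first element.
    void pushHead(int value) { head_ = new Element(value, head_); }

    // Remove the first element and return the integer it held.
    int popHead()
    {
        assert(!isEmpty());
        Element* old = head_;
        const int value = old->value_;
        head_ = old->next_;
        delete old;
        return value;
    }

private:
    // The element is a nested private class, so code outside IntList
    // cannot even name it.
    struct Element {
        Element(int value, Element* next) : value_(value), next_(next) {}
        int value_;
        Element* next_;
    };

    Element* head_;   // the header holds only a link to the first element
};
```

Because the list is singly linked, operations at the head are cheap, while anything that touches the tail either needs an extra pointer in the header or a walk down the whole list.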

You are not allowed to use a doubly linked list in this assignment.

Your linked list must be named IntList (so that it can be used by the main driver program) and must support the following operations. Note that, since the main driver program is supplied, the function names cannot be changed.

In addition, you must implement an output operator (operator<<) for IntList. I suggest that you use the technique described in Weiss: provide a public print function, and have operator<< call print. The output operator should write all the integers in the list concatenated together, with no blanks or newlines. (This design is a very poor approach in general, and will be changed next term. The right way to do it would be to separate the integers with blanks or commas.)
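The print-plus-operator<< technique can be sketched like this. To keep the focus on the output operator, the example uses a stand-in class named SketchList (hypothetical) that stores its digits in a std::vector; pushTail and toText are also my own names. Your IntList would walk its linked elements inside print instead.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for IntList, used only to show the output technique.
class SketchList {
public:
    void pushTail(int digit) { digits_.push_back(digit); }

    // Public print function: writes the integers with no separators.
    void print(std::ostream& out) const
    {
        for (std::size_t i = 0; i < digits_.size(); ++i)
            out << digits_[i];
    }

private:
    std::vector<int> digits_;
};

// The output operator is a non-member that simply forwards to print,
// so it needs no access to the private data.
std::ostream& operator<<(std::ostream& out, const SketchList& list)
{
    list.print(out);
    return out;   // returning the stream allows chained << calls
}

// Convenience helper for checking the output in a string.
std::string toText(const SketchList& list)
{
    std::ostringstream buffer;
    buffer << list;
    return buffer.str();
}
```

The advantage of this arrangement is that operator<< does not have to be a friend of the class.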

Finally, you may find it helpful to implement a few other standard list functions: pushHead, popHead, isEmpty, and possibly popTail. Several of these functions will be useful in future assignments, and you will find it much easier to do those assignments if you implement the functions now, while your list class is simple, rather than waiting until later when you have converted it into a templated class. However, only the list above is absolutely required.

IntListIterator

You must also implement an iterator for IntList, which must be named IntListIterator. The iterator must support the following functions at a minimum:

In addition, you may wish to support a copy constructor, assignment operator, and postincrement operator. It would not be appropriate to implement operator->, since int is not a class.
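For a feel of how such an iterator is put together, here is a minimal sketch that walks a bare chain of nodes. Everything about its interface is an assumption made for illustration: the names Node, NodeIterator, and sumDigits are hypothetical, and the choice of operator*, the two increment operators, and a conversion to bool for the end-of-list test is mine. Follow the required function list for the real IntListIterator.

```cpp
#include <cassert>

// Hypothetical element type; in IntList the element would be private.
struct Node {
    int value;
    Node* next;
};

class NodeIterator {
public:
    explicit NodeIterator(Node* start) : current_(start) {}

    // True while the iterator still refers to an element.
    explicit operator bool() const { return current_ != nullptr; }

    // Dereference yields the int itself; there is no operator->.
    int& operator*() const
    {
        assert(current_ != nullptr);
        return current_->value;
    }

    // Preincrement: advance, then return this iterator.
    NodeIterator& operator++()
    {
        assert(current_ != nullptr);
        current_ = current_->next;
        return *this;
    }

    // Postincrement: advance, but return a copy of the old position.
    NodeIterator operator++(int)
    {
        NodeIterator old(*this);
        ++*this;
        return old;
    }

private:
    Node* current_;
};

// Example of client code: add up every value reachable from head.
int sumDigits(Node* head)
{
    int total = 0;
    for (NodeIterator it(head); it; ++it)
        total += *it;
    return total;
}
```

Note that the compiler-generated copy constructor and assignment operator are adequate here, because the iterator does not own the node it points to.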

What You Need to Do

You are provided with a single file, assign_07.cc, which is the main driver program. As mentioned, the program is not particularly object-oriented. Examine the code to discover the logical relationships between the functions and then break the code up into separate files and classes. You should create at least two new classes reflecting the logical structure of the program. (A solution involving four new classes is quite possible.)

In your final code, you should find that the overall code looks simpler and is easier to follow. If you have done things properly, many of the functions will have fewer arguments.

Note that your code must perform exactly the same as the existing code. This requirement means that you must take care when making changes involving the random-number generator so that you can be sure the same numbers are generated. (The random-number functions used by the code are described in the Unix manual page drand48(3).)

You must create or modify the following files:

assign_07.cc
This must be the file that you downloaded from this Web page, modified to make it object-oriented.
Makefile
For this assignment, the Makefile will not be provided. You must write your own, and it must be correct. If you do not provide a Makefile, your program will not compile and you will receive a zero for functionality. Be sure your dependencies are correct; you may wish to use g++ -M to help.
intlist.hh
This file will contain the interface definition for the IntList and IntListIterator classes. Note that both classes must be defined by this file, either by placing both definitions in the file, or by having it #include whatever file(s) contain the remaining definitions.
*.hh
Any other header files that you feel are necessary to implement your code. (There is no requirement that there be any other header files, but you might find it useful.)
*.cc
Any other source files that you feel are necessary to implement your code.

Since assign_07.cc is provided to you, you must maintain stylistic consistency in that file. However, you are not required to use any specific coding style in the other files that you create. Since you are creating them from scratch, any good style is acceptable. In particular, you do not have to match the style of assign_07.cc in those files.

Emacs users may find it helpful to invoke C-c . stroustrup RET to choose the Stroustrup indentation style for assign_07.cc.

As usual, you can download the provided file as a bundle, either as a gzipped tar file or as a ZIP archive.

A Note on the static Keyword

In the provided code, there is a file-global variable (squaredValue) and a function (compareFitness) that really should be part of either the Colony or Organism class. To make that possible, you need to use the static C++ keyword.

Putting static in front of a class variable says "there is only one copy of this variable, and it is shared between all instances of the class." In other words, the variable becomes class-global. That's exactly what you need for squaredValue. You can declare it as a static double inside one of your classes, and it then becomes available only inside that class -- in particular, it becomes available to the fitness function.

There is one minor glitch, which is that C++ requires you to add an extra declaration in one of your .cc files (wherever you implement Organism or Colony):

    double Colony::squaredValue = 0.0;
or
    double Organism::squaredValue = 0.0;
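
Putting the declaration and the extra definition together, a static data member might look like the sketch below. It is an illustration only: the real Organism wraps an IntList, whereas this one stores its guess directly as a double (guess_ is my own name), and the fitness formula shown, the distance between the guess squared and squaredValue, is an assumption rather than the formula in the provided code.

```cpp
#include <cassert>
#include <cmath>

class Organism {
public:
    explicit Organism(double guess) : guess_(guess) {}

    // Every Organism reads the same shared squaredValue.
    // Assumed formula: how far guess * guess is from the target.
    double fitness() const
    {
        return std::fabs(guess_ * guess_ - squaredValue);
    }

    static double squaredValue;   // declaration only: one copy per class

private:
    double guess_;                // simplification: not an IntList
};

// The extra definition, which belongs in exactly one .cc file.
double Organism::squaredValue = 0.0;
```

Code outside the class sets the shared value with Organism::squaredValue = 2000000.0; (which works here because the member is public), after which every organism's fitness is measured against the same target.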

You can also put static in front of a function declaration. In this case, the keyword means "this function will not be called on a particular object." In other words, instead of writing:

    Foo x;
    x.bar(3);
you would write just:
    Foo::bar(3);
Usually, you need the Foo:: prefix to specify which function you are calling. A static function has no this pseudo-variable, and for that reason it can't reference any class member variables (unless they, too, are static).

The static function feature is perfect for compareFitness. You can declare it inside Colony or Organism as:

    static int compareFitness(const void* first, const void* second);
and then pass it to qsort as before (you may need to add a scoping operator before the name).
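
As a sketch of how the pieces fit, here is a static comparison function that is built on a class comparison operator and handed to qsort. Several things are assumptions made for illustration: this Organism stores a precomputed fitness_ value, a smaller fitness is treated as better, the array holds Organism objects rather than pointers, and sortByFitness is a hypothetical helper. The provided code may differ on each point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

class Organism {
public:
    explicit Organism(double fitness = 0.0) : fitness_(fitness) {}

    double fitness() const { return fitness_; }

    // Comparison operator; assumes a smaller fitness value is better.
    bool operator<(const Organism& other) const
    {
        return fitness_ < other.fitness_;
    }

    // Static, so it has no this pointer and matches what qsort expects.
    static int compareFitness(const void* first, const void* second)
    {
        const Organism& a = *static_cast<const Organism*>(first);
        const Organism& b = *static_cast<const Organism*>(second);
        if (a < b)
            return -1;
        if (b < a)
            return 1;
        return 0;
    }

private:
    double fitness_;
};

// Hypothetical helper: sort an array of organisms, best (smallest) first.
void sortByFitness(Organism* colony, std::size_t count)
{
    // The scoping operator tells the compiler which compareFitness we mean.
    std::qsort(colony, count, sizeof(Organism), Organism::compareFitness);
}
```

Writing the comparison in terms of operator< keeps the ordering rule in one place.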

Submission Mechanics

For assignment 7, you must submit the following files:

Testing

Testing is your responsibility. We will not provide exact test cases for you. You should test your program a number of times, under different conditions.

In its default condition, the program is nondeterministic (i.e., two successive runs may produce different results). To make testing easier, the program accepts a switch that makes it deterministic. If you use "-S n", where n is an integer, the random seed will be set to that value. Specifying the random seed will allow you to control the program's behavior so that you can reproduce bugs.
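
The reason a fixed seed makes runs reproducible can be seen in a small sketch. It assumes that the seed is ultimately handed to srand48, which is the seeding function documented alongside drand48 in drand48(3); how the provided program actually processes -S is something you should confirm in its source. The function drawSequence is hypothetical.

```cpp
#include <cassert>
#include <stdlib.h>   // srand48 and drand48 are POSIX, not standard C++
#include <vector>

// Hypothetical helper: seed the generator, then collect `count` draws.
std::vector<double> drawSequence(long seed, int count)
{
    assert(count >= 0);
    srand48(seed);                      // same seed, same sequence
    std::vector<double> draws;
    for (int i = 0; i < count; ++i)
        draws.push_back(drand48());     // each draw lies in [0.0, 1.0)
    return draws;
}
```

Two calls with the same seed return identical sequences. This is also why the order and number of calls to the random-number functions matter when you restructure the code: an extra or missing call shifts every later draw.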

You will also find it instructive to run the program with the -d switch, and to run it for many different values of the -g, -m, -p, -r, and -s switches. Judicious reading of the comments, together with experimentation, will reveal the purpose of these switches and how they interact.

We will not limit ourselves to running only simple test cases. You can expect that we will run stress tests in an attempt to break your program. I strongly suggest that you attempt to break it yourself, so that we won't be able to do so. In particular, make sure you ask it to find the roots of a lot of numbers, all on one command line.

Sample Runs

To make it clearer how the program is used, here are some sample runs. First, we can approximate the square root of 2000000 (which is just 1000 times the square root of 2). The "%" represents the command prompt.

% ./assign_07 -S 12345 2000000
0001415 * 0001415 = 2002225
If we start with a different random seed, we get a different result:
% ./assign_07 -S 54321 2000000
0001413 * 0001413 = 1996569
A third attempt gives a really bad answer:
% ./assign_07 -S 95 2000000
0000589 * 0000589 = 346921
Finally, we can change the number of generations (-g), the mutation rate (-m), the population size (-p), the selection pool size (-s, which should be smaller than the population size), and the number of randomly chosen survivors (-r, which should usually be pretty small), and run with debugging (-d):
% ./assign_07 -S 1 -g 100 -m 0.1 -p 100 -s 50 -r 3 -d 2000000
Generation 0: 0003616
Generation 1: 0001993
Generation 5: 0001912
Generation 6: 0001608
Generation 7: 0001508
Generation 8: 0001501
Generation 11: 0001412
Generation 22: 0001414
0001414 * 0001414 = 1999396

Here are several more sample runs to help you ensure that your program still runs correctly after you've finished upgrading it to object-oriented style. Your output should match exactly, including the intermediate results:

% ./assign_07 -S 1 -d 1000000
Generation 0: 0003616
Generation 3: 0002035
Generation 4: 0002032
Generation 7: 0000759
Generation 9: 0000956
Generation 12: 0000971
Generation 13: 0001006
Generation 15: 0001001
Generation 20: 0000999
Generation 21: 0001000
0001000 * 0001000 = 1000000
% ./assign_07 -S 2 -d 2000000
Generation 0: 0001959
Generation 4: 0001524
Generation 7: 0001464
Generation 8: 0001452
Generation 9: 0001412
Generation 13: 0001414
0001414 * 0001414 = 1999396
% ./assign_07 -S 3 -d 3000000
Generation 0: 0000901
Generation 5: 0001637
Generation 8: 0001781
Generation 11: 0001694
Generation 13: 0001731
Generation 34: 0001732
0001732 * 0001732 = 2999824
% ./assign_07 -S 4 -d 4000000
Generation 0: 0022736
Generation 1: 0006776
Generation 2: 0001693
Generation 5: 0001913
Generation 7: 0001999
0001999 * 0001999 = 3996001
% ./assign_07 -S 5 -d 5000000
Generation 0: 0040409
Generation 2: 0006204
Generation 5: 0001799
Generation 10: 0002515
Generation 11: 0001972
Generation 12: 0002305
Generation 13: 0002172
Generation 15: 0002176
Generation 17: 0002182
Generation 19: 0002196
Generation 20: 0002199
0002199 * 0002199 = 4835601
% ./assign_07 -S 6 -d 6000000
Generation 0: 0006410
Generation 1: 0001708
Generation 5: 0002368
Generation 9: 0002427
Generation 12: 0002471
Generation 13: 0002458
Generation 14: 0002455
Generation 16: 0002447
Generation 17: 0002450
0002450 * 0002450 = 6002500
% ./assign_07 -S 7 -d 7000000
Generation 0: 0023819
Generation 1: 0019859
Generation 2: 0019771
Generation 3: 0010017
Generation 6: 0007115
Generation 7: 0002118
Generation 12: 0002299
Generation 13: 0002761
Generation 16: 0002699
Generation 17: 0002667
Generation 18: 0002644
Generation 19: 0002645
Generation 28: 0002646
0002646 * 0002646 = 7001316
% ./assign_07 -S 8 -d 8000000
Generation 0: 0021235
Generation 1: 0021223
Generation 2: 0000630
Generation 4: 0000653
Generation 5: 0002059
Generation 6: 0003167
Generation 8: 0003132
Generation 9: 0002983
Generation 11: 0002767
Generation 13: 0002883
Generation 15: 0002783
Generation 17: 0002797
Generation 19: 0002799
0002799 * 0002799 = 7834401
% ./assign_07 -S 9 -d 9000000
Generation 0: 0006835
Generation 2: 0006814
Generation 3: 0006319
Generation 4: 0003811
Generation 5: 0001902
Generation 6: 0003052
Generation 8: 0003023
Generation 9: 0003018
Generation 11: 0003000
0003000 * 0003000 = 9000000
% ./assign_07 -S 10 -d 10000000
Generation 0: 0010722
Generation 1: 0001641
Generation 2: 0001648
Generation 5: 0003097
Generation 16: 0003099
0003099 * 0003099 = 9603801

Note 1: You can think of the running time of the program as O(population size * number of generations). Don't use huge numbers or you'll wait all day! (You may want to try to analyze the complexity of the program yourself to determine whether the correct bound is different.)

Note 2: If you don't specify the -S switch, you will get different results every time you run the program. That's a feature, not a bug.

Note 3: The defaults are:

Tricky Stuff

As usual there are some tricky parts to this assignment. Some of them are:


© 2002, Geoff Kuenning

This page is maintained by Geoff Kuenning.