CS 70, Spring 2004
Assignment 7: DNA Recombination

The program for this assignment, and everything else except the README file, is due at 9 P.M. on Wednesday, March 31st, 2004. As usual, the README file is due at midnight the same day (i.e., the moment that Thursday starts). Refer to the homework policies page for general homework guidelines.

The primary purpose of this assignment is to get you used to writing C++ iterators. You will also be developing a preliminary list class. Both the list class and the iterator for it will be useful to you in later assignments.

NOTE: the list class you develop in this assignment will be central to assignments 8 and 9. MAKE SURE you develop it well and debug it thoroughly. If you blow this assignment off, you will do poorly on the following two assignments as well. A correct solution to this assignment will NOT be distributed to the class. It is your responsibility to be sure you have a properly working list class and iterator.

Overview

The problem description for this assignment is long. Much of it is background that is not critical for you to understand thoroughly. The short version of what you must do is:

You must make your list class work before you can try out the genetic algorithm. We strongly suggest that you write small test programs to help debug your list class before you try to plug it into the assignment. For example, here is a trivial program that will let you step through the pushTail and popHead functions with the debugger:

	#include "intlist.hh"
	#include <cassert>

	int main(int, char *[])
	{
	    IntList foo;

	    foo.pushTail(1);
	    foo.pushTail(2);
	    foo.pushTail(3);

	    assert(foo.length() == 3);

	    assert(foo.popHead() == 1);
	    assert(foo.popHead() == 2);
	    assert(foo.popHead() == 3);

	    assert(foo.length() == 0);

	    return 0;
	}

Genetic Algorithms

One of the more creative approaches to artificial intelligence is the genetic algorithm, invented by Prof. John Holland of the University of Michigan.

In brief, a genetic algorithm simulates the process of evolution by applying the usual rules of genetics to simulate natural selection. In real life, natural selection's primary goal is the continuation of the species, and organisms that achieve that goal tend to be propagated. In a genetic algorithm, on the other hand, the primary goal is to satisfy a "fitness function" chosen by the programmer. For example, a simple fitness function might interpret the genes of an organism as the value of x in a complicated equation. The natural-selection process could then be tuned to prefer organisms that generate an output near zero, so that the survivors would eventually produce a solution to the equation.

Genetic algorithms were the first step in the current research area called "artificial life," and they have been used to successfully solve many problems that were otherwise intractable.

In this assignment, we will create a program that uses a genetic algorithm to solve the Traveling Salesrep Problem (TSP), a well-known problem in computer science. Although it is simplified compared to a production implementation, the program demonstrates the basic outline and capabilities of a genetic algorithm.

There are three basic processes in evolution: mutation, crossover, and selection. Mutation involves selecting a gene site and modifying it in some fashion, usually by replacing it with another gene value. Mutation is very rare both in real life and in genetic algorithms. (However, the TSP benefits from having a higher-than-normal mutation rate.) Crossover is the most important process in generating new organisms. It involves taking two gene strings (usually from two parent organisms), cutting them both at the same point, and re-splicing them so that the head of the result comes from one parent and the tail from the other. Real genetic algorithms usually generate two children in this process, and may splice at more than one point, but we'll simplify things in our implementation.
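As a rough illustration of the cut-and-splice idea, here is a minimal sketch of single-point crossover on plain vectors. The function name crossoverAt and the use of std::vector are assumptions made for this example only. Note that this naive version does not guarantee that every gene appears exactly once in the child, so it is not what the provided Organism code does for the TSP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive single-point crossover (hypothetical helper, for illustration):
// genes before `cut` come from parentA, the rest come from parentB.
// Assumes both parents have the same length.
std::vector<int> crossoverAt(const std::vector<int>& parentA,
                             const std::vector<int>& parentB,
                             std::size_t cut)
{
    assert(parentA.size() == parentB.size());
    assert(cut <= parentA.size());

    // Head of the child: the first `cut` genes of parentA.
    std::vector<int> child(parentA.begin(), parentA.begin() + cut);
    // Tail of the child: everything from position `cut` onward in parentB.
    child.insert(child.end(), parentB.begin() + cut, parentB.end());
    return child;
}
```

With parents {1,2,3,4,5} and {6,7,8,9,10} and a cut position of 2, the child is {1,2,8,9,10}.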

The final step, selection, involves evaluating the organisms according to some criterion (the "fitness function") and choosing the ones that are most successful. In real life, selection is the harsh process of "survival of the fittest." In a genetic algorithm, the same method is used: the least fit organisms are discarded (i.e., killed) without being allowed to reproduce. As in real life, there is some randomness, so that a somewhat unfit organism has a chance of surviving even when a more fit one is discarded. This randomness turns out to be important to the success of the method, since any two slightly unfit parents might (through crossover) generate an extremely fit child.

Because we will not have time to implement an entire genetic algorithm from scratch, much of the code has been provided for you, although you will have to finish it. You must also supply the underlying data structure (a linked list).

Scenario

To pay some bills over the summer, you've taken a job at Lamarkian Enterprises, whose eventual goal is to use powerful computers to evolve master mathematicians. Their lead programmer, Ginny F. O'Fenugreek, began exploring the problem space by writing a genetic algorithm to solve the classic Traveling Salesrep Problem. Unfortunately, she left the company under mysterious circumstances after a visit from two soft-spoken representatives of the Amalgamated Union of Computer Theoreticians and Practitioners. Management looked into what Ginny had left behind and discovered that her code was fairly complete, but one function and one class (a linked list) remained unimplemented. Your manager has handed you Ginny's code to clean up. You will have to:

Data Structures

Ginny's program uses a number of data structures to help represent the problem:

City
Represents one city in the tour.
Cities
Collects all the cities into a variable-length array so that they can be easily referenced by other classes.
Organism
Represents a single organism (TSP tour).
Colony
Represents a collection of organisms that participate in evolution.
IntList
Represents a singly-linked list of integers.

City

To solve the TSP, we must know the distance between each pair of cities. The City class describes a single city, storing its name, latitude, and longitude. There are I/O functions to read and write the city as needed. The class also provides a distanceFrom function that calculates the great-circle distance between two cities. (Question for consideration: would it have been better to implement this function by overloading the subtraction operator?)
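For readers who have not seen a great-circle calculation before, here is a minimal sketch using the haversine formula. It is a stand-alone illustration: the function names, the choice of kilometers, and the Earth radius are assumptions, and the provided distanceFrom function may use a different formula or different units.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical stand-alone helper; the provided City::distanceFrom may
// use a different formula, radius, or units.
const double EARTH_RADIUS_KM = 6371.0;   // mean Earth radius (assumed: km)
const double PI = 3.14159265358979323846;

double toRadians(double degrees)
{
    return degrees * PI / 180.0;
}

// Great-circle distance between two points given in degrees of
// latitude and longitude, computed with the haversine formula.
double greatCircleKm(double lat1, double lon1, double lat2, double lon2)
{
    double phi1 = toRadians(lat1);
    double phi2 = toRadians(lat2);
    double dPhi = toRadians(lat2 - lat1);
    double dLambda = toRadians(lon2 - lon1);

    double a = std::sin(dPhi / 2) * std::sin(dPhi / 2)
        + std::cos(phi1) * std::cos(phi2)
        * std::sin(dLambda / 2) * std::sin(dLambda / 2);

    // Clamp to 1 so that rounding error cannot push asin out of range.
    double root = std::min(1.0, std::sqrt(a));
    return 2.0 * EARTH_RADIUS_KM * std::asin(root);
}
```

Two points on the equator that are 90 degrees of longitude apart come out a quarter of the way around the globe, which is a handy sanity check.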

Cities

It is convenient to store the cities as an array, so that each city can be referred to by an integer index. The Cities class provides a self-expanding array and a way to read a list of cities into it. (One could also use the STL vector class to achieve the same effect. We have provided our own version so that you will have a reference for how automatic expansion can be done.)
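The general idea behind automatic expansion can be sketched as follows. This is a simplified, hypothetical class that stores ints and doubles its capacity when full; the provided Cities class stores City objects and may use a different growth policy, so treat this only as an outline of the technique.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical growable array of ints; the provided Cities class stores
// City objects and may grow by a different policy.
class GrowableArray
{
public:
    GrowableArray() : data_(0), size_(0), capacity_(0) {}
    ~GrowableArray() { delete[] data_; }

    // Add a value at the end, expanding the storage first if it is full.
    void append(int value)
    {
        if (size_ == capacity_)
            grow();
        data_[size_++] = value;
    }

    int& operator[](std::size_t index)
    {
        assert(index < size_);
        return data_[index];
    }

    std::size_t size() const { return size_; }

private:
    // Copying is disabled to keep the sketch short.
    GrowableArray(const GrowableArray&);
    GrowableArray& operator=(const GrowableArray&);

    // Allocate a larger block, copy the old elements, free the old block.
    void grow()
    {
        std::size_t newCapacity = (capacity_ == 0) ? 4 : 2 * capacity_;
        int* newData = new int[newCapacity];
        for (std::size_t i = 0; i < size_; ++i)
            newData[i] = data_[i];
        delete[] data_;
        data_ = newData;
        capacity_ = newCapacity;
    }

    int* data_;
    std::size_t size_;
    std::size_t capacity_;
};
```

Doubling the capacity, rather than adding a fixed amount, keeps the total copying work proportional to the number of elements appended.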

Organism

An Organism is at the heart of any genetic algorithm. Traditionally, the organism's genes are represented as integers, and the chromosomes are a sequence of integers. Ginny's program represents that sequence using a linked list.

For the Traveling Salesrep Problem, the chromosome is interpreted as a singly-linked list of city numbers, in the order in which the cities should be visited. This means that each integer should appear exactly once in the list. Achieving this goal is not easy in a genetic algorithm, and has been the subject of much research (for more information, see the Larrañaga paper referenced in the comments).
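When debugging, it can help to check the "exactly once" property directly. The sketch below does so for a tour stored in a std::vector. The function name isValidTour is hypothetical, and the code assumes that cities are numbered from 0 to cityCount - 1, which matches using the numbers as array indices but is not spelled out above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical debugging helper: returns true if `tour` contains each
// city number in [0, cityCount) exactly once.
bool isValidTour(const std::vector<int>& tour, std::size_t cityCount)
{
    if (tour.size() != cityCount)
        return false;               // a city is missing or duplicated

    std::vector<bool> seen(cityCount, false);
    for (std::size_t i = 0; i < tour.size(); ++i) {
        int city = tour[i];
        if (city < 0 || static_cast<std::size_t>(city) >= cityCount)
            return false;           // city number out of range
        if (seen[city])
            return false;           // city appears more than once
        seen[city] = true;
    }
    return true;
}
```

The same check can be written against a linked list by walking it with an iterator instead of indexing.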

The Organism supports functions to initialize a gene list, calculate and compare fitnesses, and print organisms. It also implements the two functions that create genetic diversity, crossover and mutate.

Colony

A group of Organisms is called a Colony. The Colony class supports an indexing operator ([]) to allow any organism to be accessed, plus several functions that emulate evolution. The most important are evolve, which simulates a number of generations, and naturalSelection, which chooses the organisms that will survive until the next generation. There is also a findBest function that returns the best organism created so far.

IntList

An organism is represented entirely by its gene sequence, which in turn is represented using a singly linked list. Each element in the list will contain only a single integer (represented by the C++ type int) between zero and the number of cities, plus a link to the next element. The list must have a separate header that is not a plain element, which means that you must implement two classes (the header and the element, or node). The cleanest approach is to make the element a nested private class of the header, so that only the header (IntList) is visible from outside. If the element is a nested private class, you can choose to make its data members public or even declare it as a struct.
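The header-plus-nested-element layout might look roughly like the sketch below. The class name SketchList and its data members are hypothetical; only the names pushTail, popHead, and length are taken from the test program shown earlier. This is an outline of the layout, not the required IntList interface, and it omits copying and iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the header-plus-nested-node layout; it is not
// the required IntList interface.
class SketchList
{
public:
    SketchList() : head_(0), tail_(0), length_(0) {}

    // Free every remaining node so the list does not leak memory.
    ~SketchList()
    {
        while (head_ != 0)
            popHead();
    }

    void pushTail(int value)
    {
        Node* node = new Node(value);
        if (tail_ == 0)
            head_ = node;           // list was empty
        else
            tail_->next_ = node;
        tail_ = node;
        ++length_;
    }

    int popHead()
    {
        assert(head_ != 0);         // popping an empty list is a caller bug
        Node* old = head_;
        int value = old->value_;
        head_ = old->next_;
        if (head_ == 0)
            tail_ = 0;              // list became empty
        delete old;
        --length_;
        return value;
    }

    std::size_t length() const { return length_; }

private:
    // The node type is private, so its members can safely be public.
    struct Node
    {
        explicit Node(int value) : value_(value), next_(0) {}
        int value_;
        Node* next_;
    };

    SketchList(const SketchList&);            // copying omitted from sketch
    SketchList& operator=(const SketchList&);

    Node* head_;
    Node* tail_;
    std::size_t length_;
};
```

Keeping a tail pointer in the header makes pushTail a constant-time operation even though the list is singly linked.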

You are not allowed to use a doubly linked list in this assignment.

Your linked list must be named IntList (so that it can be used by the main driver program) and must support the following operations at a minimum. Note that, since the code for the genetic algorithm has already been supplied, the function names cannot be changed.

The functions needed for your list class are:

Finally, you may find it helpful to implement a few other standard list functions: pushHead, isEmpty, and possibly popTail. Some of these functions will be useful in future assignments, and you will find it much easier to do those assignments if you implement the functions now, while your list class is simple, rather than waiting until later when you have converted it into a templated class. However, only the functions mentioned above are absolutely required.

IntListIterator

You must also implement an iterator for IntList, which must be named IntListIterator. The iterator must support the following functions at a minimum:

In addition, you may wish to support the postincrement operator. A reset function (to reset the iterator to the beginning of the list) could be useful in the future but isn't needed for this assignment (the Organism code achieves the same effect by using the copy constructor and assignment operator). It would not be appropriate to implement operator->, since int is not a class.
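If you do decide to support postincrement, the usual approach is to write it in terms of preincrement. The sketch below shows the pattern on a hypothetical iterator over a bare chain of nodes. DemoNode, DemoIterator, and atEnd are names invented for this illustration, and they are not the required IntListIterator interface.

```cpp
#include <cassert>

// Hypothetical node and iterator types, used only to illustrate the
// two increment operators; they are not the required IntListIterator.
struct DemoNode
{
    int value;
    DemoNode* next;
};

class DemoIterator
{
public:
    explicit DemoIterator(DemoNode* start) : current_(start) {}

    bool atEnd() const { return current_ == 0; }

    int& operator*() const
    {
        assert(current_ != 0);
        return current_->value;
    }

    // Preincrement: advance, then return this iterator by reference.
    DemoIterator& operator++()
    {
        assert(current_ != 0);
        current_ = current_->next;
        return *this;
    }

    // Postincrement: the unused int parameter selects this overload.
    // Save the old position, advance, and return the saved copy.
    DemoIterator operator++(int)
    {
        DemoIterator old(*this);
        ++*this;
        return old;
    }

private:
    DemoNode* current_;
};
```

Because postincrement must return a copy of the old position, preincrement is the cheaper of the two when the result is not used.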

What You Need to Do

You are provided with a number of files:

Makefile
A sample Makefile for the assignment.
assign_07.cc
The main driver for the genetic algorithm.
city.hh and city.cc
The City and Cities classes.
colony.hh and colony.cc
The Colony class.
organism.hh and organism.cc
The Organism class.
mtwist.c, mtwist.h, and mtwist.3
A fast random-number generation package. The ".3" file is the manual page; you can format it with "nroff -man mtwist.3 | less -s". For CS70 purposes, you can treat the random-number generator as an opaque package.
randistrs.c, randistrs.h, and randistrs.3
A package for generating random numbers from various distributions. For CS70 purposes, you can treat it as opaque.
tsp5.txt
A 5-city instance of the Traveling Salesrep Problem.
tsp15.txt
A 15-city instance of the Traveling Salesrep Problem.
tsp50.txt
A 50-city instance of the Traveling Salesrep Problem.
all_us_cities.txt
A list of 754 places in the United States that you can use to generate your own instances of the TSP.

As usual, you must get these files by using "cs70checkout hw07".

You must create or modify the following files:

organism.cc
This must be the file that you downloaded from this Web page, modified to add a working mutation function.
intlist.hh
This file will contain the interface definition for the IntList and IntListIterator classes. Note that both classes must be defined by this file, either by placing both definitions in the file, or by having it #include whatever file contains the remaining definitions.
intlist.cc
This file will contain the implementation of your integer-list class.
*.hh
Any other header files that you feel are necessary to implement your code. (There is no requirement that there be any other header files, but you might find it useful.)
*.cc
Any other source files that you feel are necessary to implement your code.

Since organism.cc is provided to you, you must maintain stylistic consistency when modifying that file. However, you are not required to use any specific coding style in the other files that you create. Since you are creating them from scratch, any good style is acceptable. In particular, you do not have to match the style of assign_07.cc in those files.

Emacs users may find it helpful to invoke C-c . stroustrup RET to choose the Stroustrup indentation style for assign_07.cc.

The mutate Function

Although the concept of a genetic algorithm is quite simple, it requires a fair amount of support code. The TSP is additionally complicated by the requirement that no genes be missing or duplicated. For that reason, we have provided most of the code for you. You only need to complete the mutate function.

To mutate a chromosome, you must choose two genes at random. This part of the code is provided for you. You must then search through the gene list, replacing gene1 with gene2 and vice versa. For example, if the chromosome is {5,4,1,2,3}, gene1 is 4, and gene2 is 3, the result after mutation should be {5,3,1,2,4}. Note that gene1 and gene2 are actual gene values, not positions.
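To make the value-versus-position distinction concrete, here is a sketch of the swap performed on a std::vector rather than on the assignment's linked list. The function name swapGeneValues is hypothetical; in your own code you would walk the gene list with your iterator instead of indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustration on a std::vector rather than the assignment's list:
// exchange the two gene *values* wherever they occur.
void swapGeneValues(std::vector<int>& chromosome, int gene1, int gene2)
{
    for (std::size_t i = 0; i < chromosome.size(); ++i) {
        if (chromosome[i] == gene1)
            chromosome[i] = gene2;
        else if (chromosome[i] == gene2)
            chromosome[i] = gene1;
    }
}
```

The else matters: without it, a gene that has just been changed from gene1 to gene2 would immediately be changed back.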

The genetic algorithm will work even without the mutation code; it just won't produce the same results as the samples. Because of that, you can test your integer list before you write the mutation.

A Note on the static Keyword

In the provided code, there is a function named Colony::compareFitness that is declared static. Putting static in front of a member function declaration means "this function will not be called on a specific object." In other words, if you have a class Foo with a static function bar, instead of writing:

    Foo x;
    x.bar(3);

you would write just:

    Foo::bar(3);

Usually, you need to use Foo:: to specify which function you are calling. A static function has no this pseudo-variable.

The compareFitness function is static because the qsort library routine needs a comparison function that it can call through an ordinary function pointer, without an object to call it on.
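The pattern can be sketched with a small hypothetical class (Score, compare, and sortAscending are names invented for this example and are not taken from the provided Colony code). Because the comparison function is static, it has no this pointer and can be handed to qsort directly.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical class, not the provided Colony: a static member function
// has no `this`, so it can be passed where a plain function is expected.
class Score
{
public:
    explicit Score(double value = 0.0) : value_(value) {}
    double value() const { return value_; }

    // Comparison function in the form qsort expects: negative, zero, or
    // positive depending on how the two elements should be ordered.
    static int compare(const void* left, const void* right)
    {
        const Score* a = static_cast<const Score*>(left);
        const Score* b = static_cast<const Score*>(right);
        if (a->value_ < b->value_)
            return -1;
        if (a->value_ > b->value_)
            return 1;
        return 0;
    }

    // Sort an array of Score objects in ascending order of value.
    static void sortAscending(Score* scores, std::size_t count)
    {
        std::qsort(scores, count, sizeof(Score), compare);
    }

private:
    double value_;
};
```

Note that a static member function can still read the private members of objects passed to it, which is why compare can use value_ directly.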

Submission Mechanics

As usual, you must check out the provided files by using "cs70checkout hw07".

When you have a working solution, you must submit your files with cs70submit. If you create any new files, you need to tell the submission system about them by mentioning them once on a cs70submit command line. For convenience, we have provided dummy versions of README, intlist.hh, and intlist.cc so that they will be sure to get submitted.

Make Depend

The provided Makefile contains a self-editing rule, which you can invoke with "make depend". This rule will automatically modify the Makefile to reflect dependencies among files. You should run "make depend" immediately after checking out your files, and whenever you add a #include statement to any file.

In the future, we won't be giving you sample Makefiles, so you should make sure you understand how make depend works so that you can duplicate it in subsequent assignments.

Testing

Testing is your responsibility. We will not provide exact test cases for you. You should test your program a number of times, under different conditions.

In its default condition, the program is nondeterministic (i.e., two successive runs may produce different results). To make testing easier, the program accepts a switch that makes it deterministic. If you use "-S n", where n is an integer, the random seed will be set to that value. Specifying the random seed will allow you to control the program's behavior so that you can reproduce bugs.

You may find it instructive to run the program with the -d switch, and to run it for many different values of the -g, -m, -p, -r, and -s switches. Judicious reading of the comments, together with experimentation, will reveal the purpose of these switches and how they interact.

We will not limit ourselves to running only simple test cases. You can expect that we will run stress tests in an attempt to break your program. We strongly suggest that you attempt to break it yourself, so that we won't be able to do so. In particular, make sure you run it with input and parameters that will cause it to take a fairly long time, and use "top" to watch its size over time. If it keeps growing, you have a memory leak.

Sample Runs

To make it clearer how the program is used, here are some sample runs. First, we can solve a 5-city problem. The "%" represents the command prompt.

% ./assign_07 -S 12345 tsp5.txt
Total distance: 8657.41
If we start with a different random seed, we get the same result, because in this case we are finding the optimal solution:
% ./assign_07 -S 54321 tsp5.txt
Total distance: 8657.41
If we try both seeds with a larger problem, we get different answers. In this case, the answers also differ depending on the machine you use (the reasons have to do with the representation of floating-point numbers). On an x86 (Intel) machine, we get:
% ./assign_07 -S 12345 tsp50.txt; ./assign_07 -S 54321 tsp50.txt
Total distance: 3694.28
Total distance: 3611.3
while on Sparc (Turing) we get:
% ./assign_07 -S 12345 tsp50.txt; ./assign_07 -S 54321 tsp50.txt
Total distance: 3687.05
Total distance: 3922.44
A third seed gives a notably worse result. On the x86:
% ./assign_07 -S 95 tsp50.txt
Total distance: 3874.1
And on the Sparc:
% ./assign_07 -S 95 tsp50.txt
Total distance: 3847.73
Finally, we can change the number of generations (-g), the mutation rate (-m), the population size (-p), the selection pool size (-s, which should be fairly small), and the number of randomly chosen survivors (-r, which should usually be very small), and run with debugging (-d) to watch things develop. On both the x86 and the Sparc we get:
% ./assign_07 -S 1 -g 10 -m 0.5 -p 100 -s 5 -r 0 -d tsp50.txt
10 generations
100 organisms
5 survive in each generation
0 survive randomly
0.5 probability of mutation
Generation 0: 16358.2
Generation 1: 15409.9
Generation 2: 13619.5
Generation 3: 13018.2
Generation 4: 12414.5
Generation 5: 11845
Generation 6: 10082
Generation 7: 9919.01
Generation 8: 9663.63
Generation 9: 9302.05
Total distance: 8859.05

In most situations, a mutation rate of 0.5 would be disastrously high. In the TSP, however, it seems to work better than smaller values.

Here are three more sample runs to help you ensure that your program still runs correctly after you've finished writing the mutation code. Your output should match exactly, including the debugging output. The first two runs are the same on the x86 and Sparc:

% ./assign_07 -S 5 -d -m 0.5 tsp5.txt
50 generations
1000 organisms
10 survive in each generation
2 survive randomly
0.5 probability of mutation
Generation 0: 8657.41
Total distance: 8657.41
% ./assign_07 -S 15 -d -m 0.5 tsp15.txt
50 generations
1000 organisms
10 survive in each generation
2 survive randomly
0.5 probability of mutation
Generation 0: 2175.42
Generation 1: 1875.53
Generation 2: 1623.9
Generation 3: 1501.38
Generation 4: 1481.76
Generation 5: 1459.5
Generation 6: 1439.88
Total distance: 1439.88
On the x86, the third run is:
% ./assign_07 -S 50 -d -m 0.5 tsp50.txt
50 generations
1000 organisms
10 survive in each generation
2 survive randomly
0.5 probability of mutation
Generation 0: 15232.9
Generation 1: 13513.8
Generation 2: 12140.4
Generation 3: 10537.1
Generation 4: 9254.08
Generation 5: 9043.5
Generation 6: 8405.13
Generation 7: 8041.08
Generation 8: 7592.47
Generation 9: 7196.44
Generation 10: 6627.14
Generation 11: 6430.52
Generation 12: 6064.1
Generation 13: 5604.52
Generation 14: 5543.29
Generation 15: 5415.08
Generation 16: 5291.31
Generation 17: 5164.77
Generation 18: 5025.49
Generation 19: 5023.07
Generation 20: 4882.69
Generation 21: 4773.38
Generation 22: 4727.13
Generation 23: 4675.62
Generation 24: 4572.27
Generation 25: 4467.9
Generation 27: 4415.64
Generation 28: 4381.66
Generation 29: 4349.21
Generation 30: 4283.03
Generation 31: 4240.98
Generation 32: 4210.4
Generation 33: 4208.93
Generation 34: 4094.23
Generation 35: 3890.84
Generation 36: 3829.2
Generation 37: 3829.2
Generation 38: 3829.2
Generation 39: 3786.61
Generation 40: 3778.44
Generation 41: 3744.7
Generation 42: 3736.53
Generation 43: 3736.53
Generation 44: 3684.42
Generation 45: 3675.56
Generation 46: 3668
Generation 47: 3668
Generation 48: 3665
Generation 49: 3637.12
Total distance: 3637.12
and on the Sparc it's:
% ./assign_07 -S 50 -d -m 0.5 tsp50.txt
50 generations
1000 organisms
10 survive in each generation
2 survive randomly
0.5 probability of mutation
Generation 0: 15232.9
Generation 1: 13513.8
Generation 2: 12140.4
Generation 3: 10537.1
Generation 4: 9254.08
Generation 5: 9043.5
Generation 6: 8405.13
Generation 7: 8041.08
Generation 8: 7592.47
Generation 9: 7196.44
Generation 10: 6627.14
Generation 11: 6430.52
Generation 12: 5988.95
Generation 13: 5727.26
Generation 14: 5621.47
Generation 15: 5471.94
Generation 16: 5249.62
Generation 17: 5127.85
Generation 18: 4938.2
Generation 19: 4862.17
Generation 20: 4759.58
Generation 21: 4668.55
Generation 22: 4628.35
Generation 23: 4573
Generation 24: 4542.33
Generation 25: 4504.74
Generation 26: 4469.18
Generation 27: 4448.42
Generation 28: 4391.63
Generation 29: 4285.26
Generation 30: 4245.51
Generation 31: 4154.64
Generation 32: 4136.18
Generation 33: 4136.18
Generation 35: 4019.71
Generation 36: 3994.79
Generation 37: 3972.97
Generation 38: 3948.05
Generation 39: 3948.05
Generation 40: 3934.75
Generation 41: 3854.51
Generation 42: 3843.43
Generation 43: 3843.43
Generation 44: 3821.27
Generation 45: 3811.55
Generation 46: 3811.55
Generation 48: 3796.62
Generation 49: 3796.62
Total distance: 3753.35

Note 1: you can think of the running time of the program as O(population size * number of generations). Don't use huge numbers or you'll wait all day! (You may want to try to analyze the complexity of the program yourself to determine whether the correct bound is different.)

Note 2: If you don't specify the -S switch, you will get different results every time you run the program. That's a feature, not a bug.

Note 3: The defaults are:

Tricky Stuff

As usual there are some tricky parts to this assignment. Some of them are:


© 2004, Geoff Kuenning

This page is maintained by Geoff Kuenning.