The program for this assignment, and everything else except the README file, is due at 9 P.M. on Wednesday, March 31st, 2004. As usual, the README file is due at midnight the same day (i.e., the moment that Thursday starts). Refer to the homework policies page for general homework guidelines.
The primary purpose of this assignment is to get you used to writing C++ iterators. You will also be developing a preliminary list class. Both the list class and the iterator for it will be useful to you in later assignments.
NOTE: the list class you develop in this assignment
will be central to assignments 8 and 9. MAKE
SURE you develop
it well and debug it thoroughly.
If you blow this assignment off, you will do poorly on the following
two assignments as well.
The problem description for this assignment is long. Much of it is background that is not critical for you to understand thoroughly. The short version of what you must do is:
You must make your list class work before you can try out the genetic
algorithm. We strongly suggest that you write small test
programs to help debug your list class before you try to plug it into
the assignment. For example, here is a trivial program that will let
you step through the pushTail
and popHead
functions with the debugger:
    #include "intlist.hh"
    #include <cassert>

    int main(int, char *[])
    {
        IntList foo;

        foo.pushTail(1);
        foo.pushTail(2);
        foo.pushTail(3);
        assert(foo.length() == 3);
        assert(foo.popHead() == 1);
        assert(foo.popHead() == 2);
        assert(foo.popHead() == 3);
        assert(foo.length() == 0);
        return 0;
    }
One of the more creative approaches to artificial intelligence is the genetic algorithm, invented by Prof. John Holland of the University of Michigan.
In brief, a genetic algorithm simulates the process of evolution by applying the usual rules of genetics to simulate natural selection. In real life, natural selection's primary goal is the continuation of the species, and organisms that achieve that goal tend to be propagated. In a genetic algorithm, on the other hand, the primary goal is to satisfy a "fitness function" chosen by the programmer. For example, a simple fitness function might interpret the genes of an organism as the value of x in a complicated equation. The natural-selection process could then be tuned to prefer organisms that generate an output near zero, so that the survivors would eventually produce a solution to the equation.
Genetic algorithms were the first step in the current research area called "artificial life," and they have been used to successfully solve many problems that were otherwise intractable.
In this assignment, we will create a program that uses a genetic algorithm to solve the Traveling Salesrep Problem (TSP), a well-known problem in computer science. Although it is simplified compared to a production implementation, the program demonstrates the basic outline and capabilities of a genetic algorithm.
There are three basic processes in evolution: mutation, crossover, and selection. Mutation involves selecting a gene site and modifying it in some fashion, usually by replacing it with another gene value. Mutation is very rare both in real life and in genetic algorithms. (However, the TSP benefits from having a higher-than-normal mutation rate.) Crossover is the most important process in generating new organisms. It involves taking two gene strings (usually from two parent organisms), cutting them both at the same point, and re-splicing them so that the head of the result comes from one parent and the tail from the other. Real genetic algorithms usually generate two children in this process, and may splice at more than one point, but we'll simplify things in our implementation.
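Plain single-point crossover can duplicate or drop cities when the gene string is a tour. The following is a minimal sketch of one common workaround, not necessarily the method used in the provided organism.cc: the head is copied from the first parent, and the tail is filled with the second parent's genes that are still missing. The function name crossoverSketch and the use of std::vector are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the provided code.  Builds one child
// from two parent tours.  Genes before `cut` come from parentA; the rest
// are taken from parentB, in parentB's order, skipping any city the
// child already contains.  Assumes both parents are permutations of
// 0..n-1.
std::vector<int> crossoverSketch(const std::vector<int>& parentA,
                                 const std::vector<int>& parentB,
                                 std::size_t cut)
{
    assert(parentA.size() == parentB.size() && cut <= parentA.size());

    std::vector<bool> used(parentA.size(), false);
    std::vector<int> child;
    child.reserve(parentA.size());

    // Head of the child: copied straight from the first parent.
    for (std::size_t i = 0; i < cut; ++i) {
        child.push_back(parentA[i]);
        used[parentA[i]] = true;
    }

    // Tail of the child: the second parent's genes that are still missing.
    for (int gene : parentB) {
        if (!used[gene]) {
            child.push_back(gene);
            used[gene] = true;
        }
    }
    return child;
}
```

With parents {0,1,2,3,4} and {4,3,2,1,0} and a cut at position 2, the child is {0,1,4,3,2}: every city still appears exactly once.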
The final step, selection, involves evaluating the organisms according to some criterion (the "fitness function") and choosing the ones that are most successful. In real life, selection is the harsh process of "survival of the fittest." In a genetic algorithm, the same method is used: the least fit organisms are discarded (i.e., killed) without being allowed to reproduce. As in real life, there is some randomness, so that a somewhat unfit organism has a chance of surviving even when a more fit one is discarded. This randomness turns out to be important to the success of the method, since any two slightly unfit parents might (through crossover) generate an extremely fit child.
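The selection step with its element of randomness can be sketched as follows. This is an illustration under stated assumptions, not the provided naturalSelection code: the function name selectSurvivors, its parameters, and the use of std::mt19937 are all hypothetical. The fittest organisms always survive, and a few more are drawn at random from the remainder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: returns the indices of the organisms that survive.
// distances[i] is the tour length of organism i (smaller is fitter).
// The keepBest fittest always survive; keepRandom more are drawn at
// random from the rest.
std::vector<std::size_t> selectSurvivors(const std::vector<double>& distances,
                                         std::size_t keepBest,
                                         std::size_t keepRandom,
                                         std::mt19937& rng)
{
    assert(keepBest + keepRandom <= distances.size());

    // Order the organism indices from fittest to least fit.
    std::vector<std::size_t> order(distances.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  return distances[a] < distances[b];
              });

    // Shuffle everything after the guaranteed survivors, so that the next
    // keepRandom entries are a random sample of the less fit organisms.
    std::shuffle(order.begin() + static_cast<std::ptrdiff_t>(keepBest),
                 order.end(), rng);

    order.resize(keepBest + keepRandom);
    return order;
}
```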
Because we will not have time to implement an entire genetic algorithm from scratch, much of the code has been provided for you, although you will have to finish it. You must also supply the underlying data structure (a linked list).
Specifically, you must write the Organism::mutate function and the IntList class (which can store a singly-linked list of integers) to complete the program.
Ginny's program uses a number of data structures to help represent the problem:
City
Cities
Organism
Colony
IntList
To solve the TSP, we must know the distance between each pair of cities. The City class describes a single city, storing its name, latitude, and longitude. There are I/O functions to read and write the city as needed. The class also provides a distanceFrom function that calculates the great-circle distance between two cities. (Question for consideration: would it have been better to implement this function by overloading the subtraction operator?)
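For readers curious about how a great-circle distance can be computed, here is a minimal sketch using the haversine formula. It is not taken from city.cc, which may use a different formula or different units; the function name greatCircleDistance and the default radius (the mean Earth radius in kilometers) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>

// Great-circle distance via the haversine formula.  Latitudes and
// longitudes are in degrees; the result is in the same unit as `radius`.
// The default radius (mean Earth radius in km) is an assumption.
double greatCircleDistance(double lat1, double lon1,
                           double lat2, double lon2,
                           double radius = 6371.0)
{
    const double degToRad = std::acos(-1.0) / 180.0;
    const double phi1 = lat1 * degToRad;
    const double phi2 = lat2 * degToRad;
    const double dPhi = (lat2 - lat1) * degToRad;
    const double dLambda = (lon2 - lon1) * degToRad;

    const double sinHalfPhi = std::sin(dPhi / 2.0);
    const double sinHalfLambda = std::sin(dLambda / 2.0);
    const double a = sinHalfPhi * sinHalfPhi
        + std::cos(phi1) * std::cos(phi2) * sinHalfLambda * sinHalfLambda;

    // The clamp guards against rounding pushing `a` slightly above 1.
    return 2.0 * radius * std::asin(std::sqrt(std::fmin(1.0, a)));
}
```

On a sphere of radius 1, two points on the equator 90 degrees apart are a quarter of the circumference apart, so the function returns pi/2 for them.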
It is convenient to store the cities as an array, so that each city
can be referred to by an integer index. The
Cities
class provides a self-expanding array and a
way to read a list of cities into it. (One could also use the STL
vector
class to achieve the same effect. We have
provided our own version so that you will have a reference for how
automatic expansion can be done.)
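As a rough illustration of automatic expansion, the sketch below doubles its storage whenever it fills up. It is a stand-in written for this explanation, not the provided Cities class: the name GrowingIntArray and its member functions are hypothetical, and copying is disabled to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a self-expanding array of ints.  When the
// storage is full, the capacity doubles, so n appends cost O(n) in total.
class GrowingIntArray {
public:
    GrowingIntArray() : data_(new int[1]), size_(0), capacity_(1) {}
    ~GrowingIntArray() { delete[] data_; }

    // Copying is disabled to keep the sketch short.
    GrowingIntArray(const GrowingIntArray&) = delete;
    GrowingIntArray& operator=(const GrowingIntArray&) = delete;

    void append(int value)
    {
        if (size_ == capacity_)
            grow();
        data_[size_++] = value;
    }

    int& operator[](std::size_t index)
    {
        assert(index < size_);
        return data_[index];
    }

    std::size_t size() const { return size_; }

private:
    // Allocates a block twice as large and copies the old elements over.
    void grow()
    {
        std::size_t newCapacity = capacity_ * 2;
        int* bigger = new int[newCapacity];
        for (std::size_t i = 0; i < size_; ++i)
            bigger[i] = data_[i];
        delete[] data_;
        data_ = bigger;
        capacity_ = newCapacity;
    }

    int* data_;
    std::size_t size_;
    std::size_t capacity_;
};
```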
An Organism
is at the heart of any genetic algorithm.
Traditionally, the organism's genes are represented as integers, and
the chromosomes are a sequence of integers. Ginny's program
represents that sequence using a linked list.
For the Traveling Salesrep problem, the chromosome is interpreted as a singly-linked list of city numbers, in the order in which the cities should be visited. This means that each integer should appear exactly once in the list. Achieving this goal is not easy in a genetic algorithm, and has been the subject of much research (for more information, see the Larrañaga paper referenced in the comments).
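While debugging, it can help to check that a chromosome really does contain each city exactly once. The following is a small sketch of such a check; the function name isValidTour is hypothetical, and std::vector stands in for the assignment's linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical debugging aid: returns true if `tour` contains each city
// number in 0..cityCount-1 exactly once.
bool isValidTour(const std::vector<int>& tour, std::size_t cityCount)
{
    if (tour.size() != cityCount)
        return false;

    std::vector<bool> seen(cityCount, false);
    for (int city : tour) {
        if (city < 0 || static_cast<std::size_t>(city) >= cityCount)
            return false;              // not a legal city number
        if (seen[city])
            return false;              // duplicated gene
        seen[city] = true;
    }
    // Right length, all in range, no duplicates: nothing can be missing.
    return true;
}
```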
The Organism
supports functions to initialize a gene
list, calculate and compare fitnesses, and print organisms. It also
implements the two functions that create genetic diversity,
crossover
and mutate
.
A group of Organism
s is called a Colony
.
The Colony
class supports an indexing operator
([]
) to allow any organism to be accessed, plus several
functions that emulate evolution. The most important are
evolve
, which simulates a number of generations, and
naturalSelection
, which chooses the organisms that will
survive until the next generation. There is also a
findBest
function that returns the best organism created
so far.
An organism is represented entirely by its gene sequence, which
in turn is represented using a singly linked list. Each element
in the list will contain only a single integer (represented by the C++
type int
) between zero and the number of cities, plus a
link to the next element. The list must have a separate header that
is not a plain element, which means that you must implement two
classes (the header and the element, or node). The cleanest approach
is to make the element a nested private class of the header, so that
only the header (IntList)
is visible from outside. If
the element is a nested private class, you can choose to make its data
members public
or even declare it as a
struct
.
You are not allowed to use a doubly linked list in this assignment.
Your linked list must be named IntList
(so that it can be
used by the main driver program) and must support
the following operations at a minimum. Note that, since the code for
the genetic
algorithm has already been supplied, the function names cannot be
changed.
The functions needed for your list class are:
A popHead function that removes and returns the first integer in the list.
A pushTail function that inserts a single integer at the tail of the list. The declaration of this function should be similar to the following:
    void pushTail(int value);
This function must operate in O(1) time, which implies that you must maintain a separate tail pointer for the list. You have already done a similar implementation in CS60.
A length function that operates in O(1) time.
Finally, you may find it helpful to implement a few other
standard list functions: pushHead
,
isEmpty
, and possibly popTail
. Some of
these functions will be useful in future assignments, and you will
find it much easier to do those assignments if you implement the
functions now, while your list class is simple, rather than waiting
until later when you have converted it into a templated class.
However, only the functions mentioned above are absolutely required.
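To make the O(1) requirements concrete, here is a minimal sketch of a singly linked list with a separate header, a nested private node type, a tail pointer, and a stored length. It is an illustration only and is deliberately not named IntList: the class name TailedList and its member names are hypothetical, and copying is disabled to keep the sketch short, whereas the real class needs a working copy constructor and assignment operator.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a singly linked list with a tail pointer.
class TailedList {
public:
    TailedList() : head_(nullptr), tail_(nullptr), length_(0) {}

    ~TailedList()
    {
        while (head_ != nullptr)
            popHead();
    }

    // Copying is disabled here only to keep the sketch short.
    TailedList(const TailedList&) = delete;
    TailedList& operator=(const TailedList&) = delete;

    // O(1): the tail pointer avoids walking the list.
    void pushTail(int value)
    {
        Node* node = new Node{value, nullptr};
        if (tail_ == nullptr)
            head_ = node;              // the list was empty
        else
            tail_->next = node;
        tail_ = node;
        ++length_;
    }

    // O(1): removes and returns the first integer.
    int popHead()
    {
        assert(head_ != nullptr);
        Node* old = head_;
        int value = old->value;
        head_ = old->next;
        if (head_ == nullptr)
            tail_ = nullptr;           // the list became empty
        delete old;
        --length_;
        return value;
    }

    // O(1): the count is maintained by pushTail and popHead.
    std::size_t length() const { return length_; }

private:
    // The node is nested and private, so only the header is visible.
    struct Node {
        int value;
        Node* next;
    };

    Node* head_;
    Node* tail_;
    std::size_t length_;
};
```

Note the two places where the tail pointer needs special care: pushing onto an empty list and popping the last element.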
You must also implement an iterator for IntList
, which
must be named IntListIterator
. The iterator must
support the following functions at a minimum:
IntList
to
be iterated over.
operator bool
that returns true
if the iterator is valid (i.e., the access function will work),
or false
if the iterator is expired.
operator++
).
operator* that returns an int& (so that the integer in the current position can be modified if necessary).
In addition, you may wish to support a postincrement operator. A reset function (to reset the iterator to the beginning of the list) could be useful in the future but isn't needed for this assignment (the Organism code achieves the same effect by using the copy constructor and assignment operator). It would not be appropriate to implement operator->, since int is not a class.
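The iterator operations described above can be sketched as follows. This is an illustration under several assumptions, not the required IntListIterator: the names SketchNode, SketchIterator, and addToAll are hypothetical, the constructor takes a node pointer rather than a list, and the conversion operator is marked explicit (a feature of newer C++ standards) where the assignment asks for a plain operator bool.

```cpp
#include <cassert>

// Hypothetical node type; in the assignment the node would be nested
// inside the list class.
struct SketchNode {
    int value;
    SketchNode* next;
};

class SketchIterator {
public:
    explicit SketchIterator(SketchNode* first) : current_(first) {}

    // True while the iterator refers to an element, false once expired.
    explicit operator bool() const { return current_ != nullptr; }

    // Preincrement: advance to the next element.
    SketchIterator& operator++()
    {
        assert(current_ != nullptr);
        current_ = current_->next;
        return *this;
    }

    // Access by reference, so the caller can modify the stored integer.
    int& operator*() const
    {
        assert(current_ != nullptr);
        return current_->value;
    }

private:
    SketchNode* current_;
};

// Example traversal: adds `delta` to every element and returns the sum
// of the updated values.
int addToAll(SketchNode* first, int delta)
{
    int sum = 0;
    for (SketchIterator it(first); it; ++it) {
        *it += delta;
        sum += *it;
    }
    return sum;
}
```

Because operator* returns a reference, the statement `*it += delta` changes the integer stored in the list itself rather than a copy.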
You are provided with a number of files:
Makefile
assign_07.cc
city.hh
and city.cc
City
and Cities
classes.
colony.hh
and colony.cc
Colony
class.
organism.hh
and
organism.cc
Organism
class.
mtwist.c
, mtwist.h
, and
mtwist.3
nroff -man
mtwist.3 | less -s
. For CS70 purposes, you can
treat the random-number generator as an opaque package.
randistrs.c
, randistrs.h
, and
randistrs.3
tsp5.txt
tsp15.txt
tsp50.txt
all_us_cities.txt
As usual, you must get these files by using "cs70checkout hw07".
You must create or modify the following files:
organism.cc
intlist.hh
IntList
and IntListIterator
classes. Note that both classes must be defined
by this file, either by placing both definitions in the
file, or by having it #include
whatever
file contains the remaining definitions.
intlist.cc
*.hh
*.cc
Since organism.cc
is provided to you, you must maintain
stylistic consistency when modifying that file. However, you are not
required to
use any specific coding style in the
other files that you create. Since you are creating them from scratch, any
good style is acceptable. In particular, you do not have to
match the style of assign_07.cc in those files.
Emacs
users may find it helpful to invoke C-c
. stroustrup RET
to choose the Stroustrup indentation style for
assign_07.cc
.
mutate
Function
Although the concept of a genetic algorithm is quite simple, it
requires a fair amount of support code. The TSP is additionally
complicated by the requirement that no genes be missing or
duplicated. For that reason, we have provided most of the code for
you. You only need to complete the mutate
function.
To mutate a chromosome, you must choose two genes at random. This
part of the code is provided for you. You must then search through
the gene list, replacing gene1
with gene2
and vice versa. For example, if the chromosome is {5,4,1,2,3},
gene1
is 4, and gene2
is 3, the result after
mutation should be {5,3,1,2,4}. Note that gene1
and
gene2
are actual gene values, not positions.
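The swap described above can be sketched on a std::vector, which stands in here for the assignment's list and iterator. The function name swapGenes is hypothetical. The point to notice is the single pass with an else branch, so that a gene that has just been changed is not immediately changed back.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: exchanges the values gene1 and gene2 wherever
// they occur.  Each element is visited through a reference so that it
// can be modified in place.
void swapGenes(std::vector<int>& chromosome, int gene1, int gene2)
{
    for (int& gene : chromosome) {
        if (gene == gene1)
            gene = gene2;
        else if (gene == gene2)        // `else` avoids undoing the swap
            gene = gene1;
    }
}
```

Applied to {5,4,1,2,3} with gene1 equal to 4 and gene2 equal to 3, this produces {5,3,1,2,4}, matching the example in the text.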
The genetic algorithm will work even without the mutation code; it just won't produce the same results as the samples. Because of that, you can test your integer list before you write the mutation.
static
Keyword
In the provided code, there is a function named Colony::compareFitness that is declared static. Putting static in front of a function declaration means "this function will not be called on a specific object." In other words, if you have a class Foo with a static function bar, instead of writing:
    Foo x;
    x.bar(3);
you would write just:
    Foo::bar(3);
Usually, you need to use Foo:: to specify which function you are calling. A static function has no this pseudo-variable.
The compareFitness
function is static because the
qsort
library routine needs a comparison function that
doesn't take objects.
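To illustrate why a static member function fits qsort, here is a small sketch. It is not the provided Colony code: the class Tour, its distance member, and the function sortTours are hypothetical. Because the comparison function has no this pointer, its address can be passed wherever a plain function pointer is expected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical class with a static comparison function.
class Tour {
public:
    double distance;

    // Static: there is no `this`, so the function has the plain
    // signature that qsort expects.
    static int compareDistance(const void* a, const void* b)
    {
        const Tour* left = static_cast<const Tour*>(a);
        const Tour* right = static_cast<const Tour*>(b);
        if (left->distance < right->distance)
            return -1;
        if (left->distance > right->distance)
            return 1;
        return 0;
    }
};

// Sorts an array of tours from shortest to longest.
void sortTours(Tour* tours, std::size_t count)
{
    std::qsort(tours, count, sizeof(Tour), Tour::compareDistance);
}
```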
As usual, you must check out the provided files by using "cs70checkout hw07".
When you have a working solution, you must submit your files with
cs70submit
. If you create any new files, you need to
tell the submission system about them by mentioning them once on a
cs70submit
command line.
For convenience, we have provided dummy versions of
README
, intlist.hh
, and
intlist.cc
so that they will be sure to get submitted.
The provided Makefile contains a self-editing rule, which you can
invoke with "make depend". This rule will automatically modify the
Makefile to reflect dependencies among files. You should run "make
depend" immediately after checking out your files, and whenever you
add a #include
statement to any file.
In the future, we won't be giving you sample Makefiles, so you should
make sure you understand how make depend
works so that
you can duplicate it in subsequent assignments.
Testing is your responsibility. We will not provide exact test cases for you. You should test your program a number of times, under different conditions.
In its default condition, the program is nondeterministic (i.e., two successive runs may produce different results). To make testing easier, the program accepts a switch that makes it deterministic. If you use "-S n", where n is an integer, the random seed will be set to that value. Specifying the random seed will allow you to control the program's behavior so that you can reproduce bugs.
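The effect of fixing the seed can be seen in isolation with a few lines of code. This sketch uses the standard library's std::mt19937 as a stand-in and does not use the provided mtwist package; the function name drawSequence is hypothetical.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical sketch: draws `count` values from a generator that has
// been seeded with `seed`.
std::vector<unsigned long> drawSequence(unsigned long seed, int count)
{
    std::mt19937 rng(seed);
    std::vector<unsigned long> values;
    for (int i = 0; i < count; ++i)
        values.push_back(rng());
    return values;
}
```

Two calls with the same seed return identical sequences, which is what makes a buggy run repeatable under the debugger.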
You may find it instructive to run the program with the
-d
switch, and to run it for many
different values of the -g
, -m
,
-p
, -r
, and -s
switches.
Judicious reading of the comments, together with experimentation, will
reveal the purpose of these switches and how they interact.
We will not limit ourselves to running only simple test cases.
You can expect that we will run stress tests in an
attempt to break your program. We strongly suggest that you attempt to
break it yourself, so that we won't be able to do so. In particular,
make sure you run it with input and parameters that will cause it to
take a fairly long time, and use "top" to watch its size over time. If it keeps growing, you have a memory leak.
To make it clearer how the program is used, here are some sample runs. First, we can solve a 5-city problem. The "%" represents the command prompt.
    % ./assign_07 -S 12345 tsp5.txt
    Total distance: 8657.41
If we start with a different random seed, we get the same result, because in this case we are finding the optimal solution:
    % ./assign_07 -S 54321 tsp5.txt
    Total distance: 8657.41
If we try both seeds with a larger problem, we get different answers. In this case, the answers also differ depending on the machine you use (the reasons have to do with the representation of floating-point numbers). On an x86 (Intel) machine, we get:
    % ./assign_07 -S 12345 tsp50.txt; ./assign_07 -S 54321 tsp50.txt
    Total distance: 3694.28
    Total distance: 3611.3
while on Sparc (Turing) we get:
    % ./assign_07 -S 12345 tsp50.txt; ./assign_07 -S 54321 tsp50.txt
    Total distance: 3687.05
    Total distance: 3922.44
A third seed gives a notably worse result. On the x86:
    % ./assign_07 -S 95 tsp50.txt
    Total distance: 3874.1
And on the Sparc:
    % ./assign_07 -S 95 tsp50.txt
    Total distance: 3847.73
Finally, we can change the number of generations (-g), the mutation rate (-m), the population size (-p), the selection pool size (-s, which should be fairly small), and the number of randomly-chosen survivors (-r, which should usually be very small), and run with debugging (-d) to watch things develop. On both the x86 and the Sparc we get:
    % ./assign_07 -S 1 -g 10 -m 0.5 -p 100 -s 5 -r 0 -d tsp50.txt
    10 generations
    100 organisms
    5 survive in each generation
    0 survive randomly
    0.5 probability of mutation
    Generation 0: 16358.2
    Generation 1: 15409.9
    Generation 2: 13619.5
    Generation 3: 13018.2
    Generation 4: 12414.5
    Generation 5: 11845
    Generation 6: 10082
    Generation 7: 9919.01
    Generation 8: 9663.63
    Generation 9: 9302.05
    Total distance: 8859.05
In most situations, a mutation rate of 0.5 would be disastrously high. In the TSP, however, it seems to work better than smaller values.
Here are three more sample runs to help you ensure that your program still runs correctly after you've finished writing the mutation code. Your output should match exactly, including the debugging output. The first two runs are the same on the x86 and Sparc:
    % ./assign_07 -S 5 -d -m 0.5 tsp5.txt
    50 generations
    1000 organisms
    10 survive in each generation
    2 survive randomly
    0.5 probability of mutation
    Generation 0: 8657.41
    Total distance: 8657.41
    % ./assign_07 -S 15 -d -m 0.5 tsp15.txt
    50 generations
    1000 organisms
    10 survive in each generation
    2 survive randomly
    0.5 probability of mutation
    Generation 0: 2175.42
    Generation 1: 1875.53
    Generation 2: 1623.9
    Generation 3: 1501.38
    Generation 4: 1481.76
    Generation 5: 1459.5
    Generation 6: 1439.88
    Total distance: 1439.88
On the x86, the third run is:
    % ./assign_07 -S 50 -d -m 0.5 tsp50.txt
    50 generations
    1000 organisms
    10 survive in each generation
    2 survive randomly
    0.5 probability of mutation
    Generation 0: 15232.9
    Generation 1: 13513.8
    Generation 2: 12140.4
    Generation 3: 10537.1
    Generation 4: 9254.08
    Generation 5: 9043.5
    Generation 6: 8405.13
    Generation 7: 8041.08
    Generation 8: 7592.47
    Generation 9: 7196.44
    Generation 10: 6627.14
    Generation 11: 6430.52
    Generation 12: 6064.1
    Generation 13: 5604.52
    Generation 14: 5543.29
    Generation 15: 5415.08
    Generation 16: 5291.31
    Generation 17: 5164.77
    Generation 18: 5025.49
    Generation 19: 5023.07
    Generation 20: 4882.69
    Generation 21: 4773.38
    Generation 22: 4727.13
    Generation 23: 4675.62
    Generation 24: 4572.27
    Generation 25: 4467.9
    Generation 27: 4415.64
    Generation 28: 4381.66
    Generation 29: 4349.21
    Generation 30: 4283.03
    Generation 31: 4240.98
    Generation 32: 4210.4
    Generation 33: 4208.93
    Generation 34: 4094.23
    Generation 35: 3890.84
    Generation 36: 3829.2
    Generation 37: 3829.2
    Generation 38: 3829.2
    Generation 39: 3786.61
    Generation 40: 3778.44
    Generation 41: 3744.7
    Generation 42: 3736.53
    Generation 43: 3736.53
    Generation 44: 3684.42
    Generation 45: 3675.56
    Generation 46: 3668
    Generation 47: 3668
    Generation 48: 3665
    Generation 49: 3637.12
    Total distance: 3637.12
and on the Sparc it's:
    % ./assign_07 -S 50 -d -m 0.5 tsp50.txt
    50 generations
    1000 organisms
    10 survive in each generation
    2 survive randomly
    0.5 probability of mutation
    Generation 0: 15232.9
    Generation 1: 13513.8
    Generation 2: 12140.4
    Generation 3: 10537.1
    Generation 4: 9254.08
    Generation 5: 9043.5
    Generation 6: 8405.13
    Generation 7: 8041.08
    Generation 8: 7592.47
    Generation 9: 7196.44
    Generation 10: 6627.14
    Generation 11: 6430.52
    Generation 12: 5988.95
    Generation 13: 5727.26
    Generation 14: 5621.47
    Generation 15: 5471.94
    Generation 16: 5249.62
    Generation 17: 5127.85
    Generation 18: 4938.2
    Generation 19: 4862.17
    Generation 20: 4759.58
    Generation 21: 4668.55
    Generation 22: 4628.35
    Generation 23: 4573
    Generation 24: 4542.33
    Generation 25: 4504.74
    Generation 26: 4469.18
    Generation 27: 4448.42
    Generation 28: 4391.63
    Generation 29: 4285.26
    Generation 30: 4245.51
    Generation 31: 4154.64
    Generation 32: 4136.18
    Generation 33: 4136.18
    Generation 35: 4019.71
    Generation 36: 3994.79
    Generation 37: 3972.97
    Generation 38: 3948.05
    Generation 39: 3948.05
    Generation 40: 3934.75
    Generation 41: 3854.51
    Generation 42: 3843.43
    Generation 43: 3843.43
    Generation 44: 3821.27
    Generation 45: 3811.55
    Generation 46: 3811.55
    Generation 48: 3796.62
    Generation 49: 3796.62
    Total distance: 3753.35
Note 1: you can think of the running time of the program as O(population size * number of generations). Don't use huge numbers or you'll wait all day! (You may want to try to analyze the complexity of the program yourself to determine whether the correct bound is different.)
Note 2: If you don't specify the -S switch, you will get different results every time you run the program. That's a feature, not a bug.
Note 3: The defaults are:
As usual there are some tricky parts to this assignment. Some of them are:
organism.cc
, before
you start, so that you understand the requirements placed on
the IntList
and IntListIterator
classes.
IntList
destructor, copy
constructor, and assignment operator are working before you
try to run the main program. Getting these functions right
can be quite difficult, and if you don't debug them in
isolation, you will experience strange bugs that will be hard
to find. WRITE A TEST PROGRAM TO CHECK THEM!
operator*
) must return an integer by reference
(int&
). Otherwise
the mutation operator won't work.
pushTail
and length
must both run in O(1) time.
Be sure to do a careful complexity analysis of both functions
to be sure
that they aren't O(N). You will be penalized if they are not O(1).
© 2004, Geoff Kuenning
This page is maintained by Geoff Kuenning.