ec.eval.README Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of ecj Show documentation
ECJ, A Java-based Evolutionary Computation Research System. ECJ is a research EC system written in Java. It was designed to be highly flexible, with nearly all classes (and all of their settings) dynamically determined at runtime by a user-provided parameter file. All structures in the system are arranged to be easily modifiable. Even so, the system was designed with an eye toward efficiency. ECJ is developed at George Mason University's ECLab Evolutionary Computation Laboratory. The software has nothing to do with its initials' namesake, Evolutionary Computation Journal. ECJ's sister project is MASON, a multi-agent simulation system which dovetails with ECJ nicely.
The newest version!
This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come on-line at any time, and you can add new processes at any time,
perhaps if new machines come available, or to replace slaves which have
died for some reason.  If a remote slave dies, ECJ gracefully handles
it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so have a single random number
generator each.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on
current wall clock time.  Each time a slave goes on-line, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of their seeds in their parameter files.

You can use the master/slave evaluator freely in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples for that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

java ec.Evolve -file foo.master.params

	(where foo.master.params might have parent.0 be normal ec
         parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

java ec.eval.Slave -file foo.slave.params

	(where foo.slave.params might have parent.0 be normal ec
         parameters, and have parent.1 = slave.params)

...and it should all nicely work!  The system works fine under
checkpointing as far as we know.





MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

eval.i-am-slave = true

The master sets up distributed evaluation by loading special class,
called a MasterProblem, which replaces the Problem during evaluation. 
The MasterProblem is defined by the following *master* parameter:

eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class. 
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

The MasterProblem's job is to send stuff to remote slaves.  Thus a slave
should not load it, but instead should load the regular Problem
instance.  The slave does this by checking the eval.i-am-slave
parameter.  If it's true, it simply ignores the eval.masterproblem
parameter entirely.

More information about the architecture may be found near the end of
this file.

The master listens on socket for new slaves to arrive and register
themselves. When slaves are fired up, they attempt to attach to this
socket and negotiate connection.  This means that both the master and
the slave needs to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream.  
Communication is by default compressed.  Note that Java's
compression routines are broken (they don't support PARTIAL_FLUSH), so we
have to rely on a third-party library called JZLIB.  You can get the jar
file for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
This is specified in the *master and slave* parameter:

eval.compression = true 

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

eval.slave-name = put-my-name-here

The master can print out various debug information if you turn on the
following parameter:

eval.masterproblem.debug-info = false




JOBS

A slave receives chunks of individuals to evaluate and return as a
group. This chunk is called a "job".  If you're doing non-coevolutionary
evolution, you can specify how many individuals the slave should receive
at one time with the *master* parameter (here set to 100):

eval.master-problem.job-size = 100

If you have very small individuals, or fast evaluation, this makes
better use of network bandwidth as it sends them as a group, evaluates 
them all, and then returns them all.  This is because more individuals 
can get packed into a TCP/IP packet before it's sent out, minimizing
overhead.  However, the primary effect of changing the job-size is
to modify the "slave evolution" population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave mantains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each of the slave
connections keep their TCP/IP buffers stuffed with as many of these
queued jobs as they can, so jobs are available at the Slaves before
they even realize it.  To set the queue size, you use the *master* 
parameter (here it's being set to 100):

eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this 
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep the max-jobs-per-slave
a bit low.

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.



SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates them and returns
them or their resulting fitnesses (see 'FITNESS VERSUS INDIVIDUAL'
below).  In 'evolve' mode, the slave evaluates its individuals; but if
it has some extra time on its hands, it then treats the individuals as a
little population and does some evolution on it.  When time is up, it
returns the most recent individuals in its little individuals in lieu of
the original individuals.  This is particularly useful when your
evaluation procedure is very fast compared to the amount of time spent
reading and writing individuals over the network.  Note that this only
works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

eval.runtime = 6000

Last, you'll need to turn this *slave* parameter on to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

eval.return-inds = true

The size of this mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

eval.master-problem.job-size = 100







FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However it is possible that some evaluations
of individuals literally change them during evaluation.  If you have
done such an evil thing, you'll need to have the modified individual
shipped back to the master for reinsertion.  Be sure to change the
appropriate *slave* parameter:

eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.



SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and particularly
its Population structure, look and feel as much like the Master as
possible.  Notably, Subpopulation classes, Species, Individual 
representations, and Fitnesses should be the same.  This is particularly
important if when you want to do evolution on the Slave.  The easiest way 
to do this is simply to derive the Slave's parameter file from the Master's
paramter file.  

However, if you're doing evolution on the Slave, this doesn't mean you have
to have the Slave set up the same way as the Master: just the Population 
and objects hanging off of it.  It might be preferable to do (for example) 
generational evolution on the Slave while asynchronous steady state 
evolution is happening on the Master.  You're free to change the high-level
evolution procedures; but you should maintain the same representations and
breeding mechanisms so that Individuals on the Slave are valid on the Master
when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the i-am-slave
parameter.  The Evaluator class will not set up a MasterProblem, even if one
is specified, if i-am-slave is true.




ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance. 

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearing house for remotely evaluating individuals handed it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs, in its own
thread, a worker thread which reads and writes jobs to/from the Slave. 
MasterProblems submit jobs to the SlaveMonitor, which in turn 
distributes them to an available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

	problem.prepareToEvaluate(...)
	for each individual
		problem.evaluate(individual,...)
	problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to bulk
up all the individuals as it likes.  The MasterProblem will then read in
a full job's worth of individuals, then send them out to a slave, then
read another job's worth of individuals, then send THEM out to another
slave, and so on.  When SimpleEvaluator calls finishEvaluating, the
remaining individuals are sent out as a (possibly short) job, and then
the MasterProblem blocks waiting until all the individuals have come
back.

The SteadyStateEvaluator performs evaluation in a different way:

	problem.prepareToEvaluate(...) // at the very beginning of evolution!
	loop forever
		if problem.canEvaluate(),
			create/breed individual
			problem.evaluate(individual, ...)
		individual = problem.getNextEvaluatedIndividual()
		if individual != null
			introduce individual to population
		sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, if
problem.evaluatedIndividualAvailable() returns true from the Master
Problem -- indicating that a job has come back with individuals for the
SteadyStateEvaluator, then getNextEvaluatedIndividual() returns the next
individual in that job.  This is sort of like the SteadyStateEvaluator
loading individuals onto a bus, and sending it out, and then when busses
come back to the station, the SteadyStateEvaluator gets the individuals
one by one as they disembark from the bus. 

A note on how individuals come back from the slaves.  You'd think that
the indivdiuals are sent out, and either revised fitnesses are read back
and replace their old fitnesses; or new individuals replace the old
individuals.  But that's not the case.  This is because the
MasterProblem actually has no idea where the individuals are stored that
it's receiving.  Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals to 'scratch individuals' a
second array.  If fitnesses come back, for each individual i in the
scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals are
corrupted and our whole evolutionary process is messed up.  That's why
we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did! Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it, using
Individual.writeIndividual(pipeOutputStream). We then read from that
same pipe into the original individual, using
Individual.readIndividual(pipeInputStream). It's a hack but a pretty
one.



MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave does in, the master creates a SlaveConnection
to handle the slave communication.  The SlaveConnection creates a read
thread and a write thread to keep the pipeline to the slave filled with
Individuals.

You have the option of using compressed streams.  Unfortunately,
Java has broken compression -- it doesn't support "partial flush", which
is critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/  and ECJ will detect and use it automatically.



-> unique number (starting at 0, progressing with new slaves)	writeInt
-> FLUSH
<- slave name							readUTF
	(note: the slave name can be anything, it doesn't have
	 to be unique.  However if you did not specify a slave
	 name in the parameter files, the slave will construct
	 one out of the concatenation of its IP address and
	 the unique number given it just now.)
-> random number generator seed					writeInt
	(note: the seed then increments by SEED_INCREMENT (7919)
	 and the Slave is free to use any integer values between
	 the seed and seed + SEED_INCREMENT - 1 to seed its
	 random number generators.  We hopve 7919 is enough :-)
-> FLUSH



-> job type							writeByte
	(Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
	 Slave.V_SHUTDOWN)
If Slave.V_SHUTDOWN,
	break connection
If Slave.V_EVALUATEGROUPED,
	-> countVictoriesOnly					writeBoolean
-> number of individuals in job					writeInt
Loop for number of individuals in job,
	-> individual						writeIndividual
	-> update ind's fitness					writeBoolean
-> FLUSH



<- arbitrary byte (to block on)					readByte
Loop for number of individuals in job,
	<- returning individual type                            readByte
        	(Slave.V_INDIVIDUAL or Slave.V_FITNESS)
	If Slave.V_INDIVIDUAL,
		<- individual					readIndividual
	Else if Slave.V_FITNESS,
		<- evaluated?					readBoolean
		<- fitness					readFitness
	Else if Slave.V_NOTHING,
		(nothing is sent)
<- FLUSH

Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal was received.