
CS 478 - Machine Learning: Genetic Algorithms (I)

Presentation Transcript


1. CS 478 - Machine Learning: Genetic Algorithms (I)

2. Darwin’s Origin of Species: Basic Principles (I)
• Individuals survive based on their ability to adapt to the pressures of their environment (i.e., their fitness)
• Fitter individuals tend to have more offspring, thus driving the population as a whole towards favorable traits
• During reproduction, the traits found in parents are passed on to their offspring
• In sexual reproduction, the chromosomes of the offspring are a mix of those of their parents
• The traits of offspring are partially inherited from their parents and partially the result of new genes/traits created during the process of reproduction
• Nature produces individuals with differing traits
• Over long periods, variations can accumulate, producing entirely new species whose traits make them especially suited to particular ecological niches

3. Darwin’s Origin of Species: Basic Principles (II)
• Evolution is effected via two main genetic mechanisms:
  • Crossover
    • Take 2 candidate chromosomes
    • “Randomly” choose 1 or 2 crossover points
    • Swap the respective components to create 2 new chromosomes
  • Mutation
    • Choose a single offspring
    • Randomly change some aspect of it

4. Intuition
• Essentially a pseudo-random walk through the space of candidate solutions, with the aim of maximizing some fitness function
• From a starting population:
  • Crossover ensures exploitation
  • Mutation ensures exploration
• GAs are based on these principles

5. Natural vs. Artificial
  Natural                  Artificial
  Individual               Candidate solution
  Population               Set of candidate solutions
  Fitness                  Measure of quality of solutions
  Chromosome               Encoding of candidate solutions
  Gene                     Part of the encoding of a solution
  Crossover and mutation   Search operators
  Natural selection        Re-use of good (sub-)solutions

6. Phenotype vs. Genotype
• In Genetic Algorithms (GAs), there is a clear distinction between phenotype (i.e., the actual individual or solution) and genotype (i.e., the individual's encoding or chromosome). The GA, as in nature, acts on genotypes only. Hence, the natural process of growth must be implemented as a genotype-to-phenotype decoding.
• The original formulation of genetic algorithms relied on a binary encoding of solutions, where chromosomes are strings of 0s and 1s. Individuals can then be anything, so long as there is a way of encoding/decoding them using binary strings.

7. Simple GA
• One often distinguishes between two types of genetic algorithms, based on whether there is a complete or partial replacement of the population between generations (i.e., whether or not there is overlap between generations).
• When there is complete replacement, the GA is said to be generational, whilst when replacement is only partial, the GA is said to be steady-state. If you look carefully at the algorithms below, you will notice that even the generational GA gives only partial replacement when cloning takes place (i.e., cloning causes overlap between generations). Moreover, if steady-state replacement is performed on the whole population (rather than on a proportion of fittest individuals), then the GA is generational.
• Hence, the distinction is more a matter of how reproduction takes place than a matter of overlap.

8. Generational GA
• Randomly generate a population of chromosomes
• While (termination condition not met)
  • Decode chromosomes into individuals
  • Evaluate fitness of all individuals
  • Select fittest individuals
  • Generate new population by cloning, crossover and mutation
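
A minimal Python sketch of this generational loop, assuming binary chromosomes, fitness-proportionate selection, 1-point crossover and bit-flip mutation; the one-max fitness function and all rate parameters are illustrative assumptions, not part of the slides.

```python
import random

def generational_ga(fitness, n_bits=20, pop_size=30, generations=50,
                    p_crossover=0.7, p_mutation=0.01):
    """Sketch of a generational GA over binary chromosomes (lists of 0/1)."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = sum(scores)

        def select():
            # fitness-proportionate (roulette-wheel) selection
            r = random.uniform(0, total)
            acc = 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if r <= acc:
                    return chrom
            return pop[-1]

        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if random.random() < p_crossover:              # 1-point crossover
                cut = random.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:                                           # cloning
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):                          # bit-flip mutation
                new_pop.append([b ^ 1 if random.random() < p_mutation else b
                                for b in child])
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

# Example: maximize the number of 1s ("one-max")
best = generational_ga(fitness=sum)
print(best, sum(best))
```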

9. Steady-state GA
• Randomly generate a population of chromosomes
• While (termination condition not met)
  • Decode chromosomes into individuals
  • Evaluate fitness of all individuals
  • Select fittest individuals
  • Produce offspring by crossover and mutation
  • Replace weakest individuals with offspring
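
A corresponding sketch of one steady-state step, under the simplifying assumption that the two fittest individuals breed and the two weakest are replaced; real steady-state GAs typically select parents stochastically.

```python
import random

def steady_state_step(pop, fitness, p_mutation=0.01):
    """One steady-state step: breed two offspring from the two fittest
    individuals and replace the two weakest with them."""
    ranked = sorted(pop, key=fitness, reverse=True)        # fittest first
    p1, p2 = ranked[0], ranked[1]
    cut = random.randint(1, len(p1) - 1)                   # 1-point crossover
    offspring = [p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]]
    offspring = [[b ^ 1 if random.random() < p_mutation else b for b in c]
                 for c in offspring]                        # bit-flip mutation
    return ranked[:-2] + offspring                          # drop the two weakest

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(8)]
for _ in range(200):
    pop = steady_state_step(pop, fitness=sum)
print(max(sum(c) for c in pop))
```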

10. Genetic Encoding / Decoding
• We focus on binary encodings of solutions
• We first look at single parameters (i.e., single-gene chromosomes) and then vectors of parameters (i.e., multi-gene chromosomes)

11. Integer Parameters
• Let p be the parameter to be encoded. There are three distinct cases to consider:
  • p takes values from {0, 1, ..., 2^N - 1} for some N
    • Then p can be encoded directly by its equivalent binary representation
  • p takes values from {M, M+1, ..., M + 2^N - 1} for some M and N
    • Then (p - M) can be encoded directly by its equivalent binary representation
  • p takes values from {0, 1, ..., L-1} for some L such that there exists no N for which L = 2^N
    • Then there are two possibilities: clipping or scaling

12. Clipping
• Clipping consists of taking N = ⌊log2(L)⌋ + 1 bits and encoding all parameter values 0 ≤ p ≤ L-2 by their equivalent binary representation, letting all other N-bit strings serve as encodings of p = L-1.
• For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L = 6. Then N = ⌊log2(6)⌋ + 1 = 3.
  • Here, not only is 101 an (expected) encoding of p = L-1 = 5, but so are 110 and 111
• Advantages: easy to implement.
• Disadvantages: strong representational bias, i.e., all parameter values between 0 and L-2 have a single encoding, whilst the single parameter value L-1 has 2^N - L + 1 encodings.
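
A small sketch of clipping in Python; the function names and the string-based representation are my own assumptions.

```python
from math import floor, log2

def clip_encode(p, L):
    """Encode p in {0, ..., L-1} on N = floor(log2(L)) + 1 bits."""
    N = floor(log2(L)) + 1
    return format(p, f'0{N}b')

def clip_decode(bits, L):
    """Clipping: any code whose value exceeds L-1 also decodes to p = L-1."""
    return min(int(bits, 2), L - 1)

# L = 6: '101' is the expected encoding of p = 5, but '110' and '111' decode to 5 too
print(clip_encode(5, 6))                                          # '101'
print([clip_decode(b, 6) for b in ('100', '101', '110', '111')])  # [4, 5, 5, 5]
```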

13. Scaling
• Scaling consists of taking N = ⌊log2(L)⌋ + 1 bits and encoding p by the binary representation of the integer value e such that p = e·(L-1)/(2^N - 1), rounded to the nearest integer
• For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L = 6. Then N = ⌊log2(6)⌋ + 1 = 3.
• Here, the binary encodings are not generally numerically equivalent to the integer values they code
• Advantages: easy to implement and smaller representational bias than clipping (each value of p has 1 or 2 encodings, with double encodings evenly spread over the values of p)
• Disadvantages: more computation needed and still a small representational bias
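
A matching sketch of scaling; the helper names are assumptions, and rounding is used in both directions of the mapping.

```python
from math import floor, log2

def scale_encode(p, L):
    """Encode p in {0, ..., L-1} as e = round(p * (2^N - 1) / (L - 1)) on N bits."""
    N = floor(log2(L)) + 1
    e = round(p * (2 ** N - 1) / (L - 1))
    return format(e, f'0{N}b')

def scale_decode(bits, L):
    """Decode by the inverse mapping p = round(e * (L - 1) / (2^N - 1))."""
    N = len(bits)
    return round(int(bits, 2) * (L - 1) / (2 ** N - 1))

# L = 6: the 8 codes are spread over 0..5, so some values of p get two codes
print([scale_decode(format(e, '03b'), 6) for e in range(8)])  # [0, 1, 1, 2, 3, 4, 4, 5]
```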

14. Real-valued Parameters (I)
• Real values may be encoded as fixed-point numbers or integers via scaling and quantization
• If p ranges over [min, max] and N bits are used, then p is encoded by the binary representation of the integer part of:
  (p - min) / (max - min) · (2^N - 1)
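
A sketch of this fixed-point encoding; the names `lo` and `hi` stand in for the slide's min and max, and the bit width is a parameter.

```python
def real_encode(p, lo, hi, n_bits):
    """Quantize p in [lo, hi] to the integer part of (p - lo)/(hi - lo) * (2^N - 1)."""
    e = int((p - lo) / (hi - lo) * (2 ** n_bits - 1))
    return format(e, f'0{n_bits}b')

def real_decode(bits, lo, hi):
    n_bits = len(bits)
    return lo + int(bits, 2) / (2 ** n_bits - 1) * (hi - lo)

print(real_encode(0.37, 0.0, 1.0, 8))     # '01011110'
print(real_decode('01011110', 0.0, 1.0))  # ~0.3686
```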

15. Real-valued Parameters (II)
• Real values may also be encoded using thermometer encoding
• Let T be an integer greater than 1
• Thermometer encoding of real values on T bits consists of normalizing all real values to the interval [0, 1] and converting each normalized value x to a bit-string of ⌊x·T⌋ (i.e., x·T rounded down) 1s followed by trailing 0s as needed.
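
A sketch of thermometer encoding; the clamp is my own safeguard against inputs slightly above 1.

```python
from math import floor

def thermometer_encode(x, T):
    """Encode a normalized value x in [0, 1] as floor(x*T) ones followed by zeros."""
    ones = min(floor(x * T), T)       # clamp so the result always fits on T bits
    return '1' * ones + '0' * (T - ones)

print(thermometer_encode(0.0, 5))    # '00000'
print(thermometer_encode(0.63, 5))   # '11100'
print(thermometer_encode(1.0, 5))    # '11111'
```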

16. Vectors of Parameters
• Vectors of parameters are encoded on multi-gene chromosomes by combining the encodings of each individual parameter
• Let e_i = [b_i0, ..., b_iN] be the encoding of the ith of M parameters
• There are two possibilities for combining the e_i's onto a chromosome:
  • Concatenating: Here, individual encodings simply follow each other in some pre-defined order, e.g., [b_10, ..., b_1N, ..., b_M0, ..., b_MN]
  • Interleaving: Here, the bits of each individual encoding are interleaved, e.g., [b_10, ..., b_M0, ..., b_1N, ..., b_MN]
• The order of parameters in the vector (resp., of genes on the chromosome) is important, especially for concatenated encodings
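
A sketch of the two combination schemes over bit-strings; the gene encodings are assumed to have equal length for interleaving.

```python
def concatenate(encodings):
    """[b_10..b_1N, ..., b_M0..b_MN]: gene encodings follow each other."""
    return ''.join(encodings)

def interleave(encodings):
    """[b_10, ..., b_M0, ..., b_1N, ..., b_MN]: bit j of every gene, then bit j+1."""
    return ''.join(bits[j] for j in range(len(encodings[0])) for bits in encodings)

genes = ['101', '011']        # two 3-bit parameter encodings
print(concatenate(genes))     # '101011'
print(interleave(genes))      # '100111'
```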

17. Gray Coding (I)
• A Gray code represents each number in the sequence of integers 0, 1, ..., 2^N - 1 as a binary string of length N, such that adjacent integers have representations that differ in only one bit position
• A number of different Gray codes exist. One simple algorithm to produce a Gray code starts with all bits set to 0 and successively flips the right-most bit that produces a new string

18. Gray Coding (II)
  Integer   Standard binary   Gray code
  0         000               000
  1         001               001
  2         010               011
  3         011               010
  4         100               110
  5         101               111
  6         110               101
  7         111               100

19. Gray Coding (III)
• Advantages: random bit-flips (e.g., during mutation) are more likely to produce small changes (i.e., there are no Hamming cliffs, since adjacent integers' representations differ by exactly one bit).
• Disadvantages: big changes are rare, but bigger than with standard binary codes.
• For example, consider the string 001. There are 3 possible bit flips, leading to the strings 000, 011 and 101
  • With standard binary encoding, 2 of the 3 flips lead to relatively large changes (from 001 (=1) to 011 (=3) and from 001 (=1) to 101 (=5), respectively)
  • With Gray coding, 2 of the 3 flips produce small changes (from 001 (=1) to 000 (=0) and from 001 (=1) to 011 (=2), respectively)
  • However, the less probable (1 out of 3) flip from 001 to 101 produces a bigger change under Gray coding (to 6) than under standard binary encoding (to 5)
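
A sketch of the conversion between standard binary and the reflected Gray code, which reproduces the table and the example values above.

```python
def binary_to_gray(n):
    """Standard reflected Gray code: adjacent integers differ in one bit."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Inverse mapping: XOR in all the higher-order bits."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Reproduce the 3-bit table: 000 001 011 010 110 111 101 100
print([format(binary_to_gray(i), '03b') for i in range(8)])
print(gray_to_binary(0b101))   # 6, as in the example above
```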

20. GA Operators
• We will restrict our discussion to binary strings
• The basic GA operators are:
  • Selection
  • Crossover
  • Mutation

21. Selection
• Selection is the operation by which chromosomes are selected for reproduction
• Chromosomes corresponding to individuals with a higher fitness have a higher probability of being selected
• There are a number of possible selection schemes (we discuss some here)
• Fitness-based selection makes the following assumptions:
  • There exists a known quality measure Q for the solutions of the problem
  • Finding a solution can be achieved by maximizing Q
  • For all potential solutions (good or bad), Q is positive
• A chromosome's fitness is taken to be the quality measure of the individual it encodes

22. Fitness-proportionate Selection
• This selection scheme is the most widely used in GAs
• Let f_i be the fitness value of individual i, let n be the population size, and let f_avg be the average population fitness
• Then, the probability of an individual i being selected is given by:
  p_i = f_i / Σ_j f_j = f_i / (n · f_avg)

23. Roulette Wheel
• Fitness-proportionate selection (FPS) can be implemented with the roulette-wheel algorithm
• A wheel is constructed with markers corresponding to fitness values
• For each fitness value f_i, the size of the marker (i.e., the proportion of the wheel's circumference) associated to f_i is given by p_i as defined above
• Hence, when the wheel is spun, the probability of the roulette landing on f_i (and thus selecting individual i) is given by p_i, as expected

24. Vector Representation
• A vector v of M elements from {1, ..., N} is constructed so that each i in {1, ..., N} has M·p_i entries in v
• A random index r from {1, ..., M} is selected and individual v(r) is selected
• Example:
  • 4 individuals such that f1 = f2 = 10, f3 = 15 and f4 = 25
  • If M = 12, then v = (1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
  • Generate r = 6; then individual v(6) = 3 is selected
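
A sketch of the vector-representation implementation; because the M·p_i counts are rounded, the vector length may differ slightly from M in general.

```python
import random

def build_selection_vector(fitnesses, M):
    """Vector v with roughly M * p_i entries for each individual i (1-indexed)."""
    total = sum(fitnesses)
    v = []
    for i, f in enumerate(fitnesses, start=1):
        v.extend([i] * round(M * f / total))
    return v

def vector_select(v):
    """Pick a random index r and return v(r)."""
    return random.choice(v)

v = build_selection_vector([10, 10, 15, 25], M=12)
print(v)                 # [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
print(vector_select(v))
```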

25. Cumulative Distribution
• A random real-valued number r in [0, 1] is chosen, and the individual i such that
  p_1 + ... + p_{i-1} < r ≤ p_1 + ... + p_i
  is selected (if i = 1, the lower-bound sum is 0).
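
A sketch of the cumulative-distribution implementation (i.e., the usual way the roulette wheel of slide 23 is coded); drawing r over the sum of raw fitnesses is equivalent to drawing over [0, 1] with the normalized p_i.

```python
import random
from itertools import accumulate

def roulette_select(fitnesses):
    """Draw r in [0, sum(f)] and return the first i whose running sum reaches r."""
    cumulative = list(accumulate(fitnesses))
    r = random.uniform(0, cumulative[-1])
    for i, c in enumerate(cumulative, start=1):
        if r <= c:
            return i
    return len(fitnesses)

counts = {i: 0 for i in (1, 2, 3, 4)}
for _ in range(10000):
    counts[roulette_select([10, 10, 15, 25])] += 1
print(counts)   # roughly proportional to 10 : 10 : 15 : 25
```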

26. Discussion (I)
• The implementation based on the cumulative distribution is effective but relatively inefficient, whilst the implementation based on the vector representation is efficient but its effectiveness depends on M (i.e., the value of M determines the level of quantization of the p_i's, and thus accuracy depends on M).
• Assume that N individuals have to be selected for reproduction. The expected number of copies of each individual i in the mating pool is:
  N_i = N · p_i = f_i / f_avg (when N is the population size)
• Hence, individuals with above-average fitness tend to have more than one copy in the mating pool, whilst individuals with below-average fitness tend not to be copied. This leads to problems with FPS.

27. Discussion (II)
• Premature convergence
  • Assume an individual X with f_i >> f_avg but f_i << f_max is produced in an early generation. As N_i >> 1, the genes of X quickly spread all over the population. At that point, crossover cannot generate any new solutions (only mutation can) and f_avg << f_max forever.
• Stagnation
  • Assume that at the end of a run (i.e., in one of the last generations) all individuals have a relatively high and similar fitness, i.e., f_i is almost f_max for all i. Then, N_i is almost 1 for all i and there is virtually no selective pressure.
• Both of these problems can be solved with fitness scaling techniques

28. Fitness Scaling
• Essentially, fitness values are scaled down at the beginning of a run and scaled up towards the end
• There are 3 general scaling methods:
  • Linear scaling
    • f is replaced by f' = a·f + b, where a and b are chosen such that:
      • f'_avg = f_avg (i.e., the scaled average is the same as the raw average)
      • f'_max = c·f_avg (c is the number of expected copies desired for the best individual; usually c = 2)
    • The scaled fitness function may take on negative values if there are a few bad individuals with fitness much lower than f_avg and f_avg is close to f_max. One solution is to arbitrarily assign the value 0 to all negative fitness values.
  • Sigma truncation
    • f is replaced by f' = f - (f_avg - c·σ), where σ is the population standard deviation, c is a reasonable multiple of σ (usually 1 ≤ c ≤ 3) and negative results are arbitrarily set to 0. Truncation removes the problem of scaling to negative values. (Note that truncated fitness values may also be scaled if desired.)
  • Power law scaling
    • f is replaced by f' = f^k for some suitable k. This method is not used very often. In general, k is problem-dependent and may require dynamic change to stretch or shrink the range as needed.
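
Sketches of linear scaling and sigma truncation; the choice c = 2 and the zeroing of negative values follow the slide, while the guard for a population with identical fitnesses is my own addition.

```python
from statistics import mean, pstdev

def linear_scale(fitnesses, c=2.0):
    """f' = a*f + b with f'_avg = f_avg and f'_max = c * f_avg; negatives set to 0."""
    f_avg, f_max = mean(fitnesses), max(fitnesses)
    if f_max == f_avg:                        # all fitnesses equal: nothing to stretch
        return list(fitnesses)
    a = (c - 1.0) * f_avg / (f_max - f_avg)
    b = f_avg * (1.0 - a)
    return [max(0.0, a * f + b) for f in fitnesses]

def sigma_truncate(fitnesses, c=2.0):
    """f' = f - (f_avg - c*sigma), with negative results set to 0."""
    f_avg, sigma = mean(fitnesses), pstdev(fitnesses)
    return [max(0.0, f - (f_avg - c * sigma)) for f in fitnesses]

print(linear_scale([2, 4, 6, 28]))     # best individual scaled to 2 * f_avg = 20
print(sigma_truncate([2, 4, 6, 28]))
```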

29. Rank Selection
• All individuals are sorted by increasing values of their fitness
• Then, each individual is assigned a probability p_i of being selected from some prior probability distribution
• Typical distributions include:
  • Linear: Here, p_i = a·i + b
  • Negative exponential: Here, p_i = a·e^(b·i) + c
• Rank selection (RS) has little biological plausibility. However, it has the following desirable features:
  • No premature convergence. Because of the ranking and the probability distribution imposed on it, even less fit individuals will be selected (e.g., let there be 3 individuals such that f1 = 90, f2 = 7, f3 = 3, and p_i = -0.4·i + 1.3. With FPS, p1 = 0.9 >> p2 = 0.07 and p3 = 0.03, so that individual 1 comes to saturate the population. With RS, p1 = 0.9, p2 = 0.5 and p3 = 0.1, so that individual 2 is also selected.)
  • No stagnation. Even at the end, N1 ≈ N2 ≈ ... (similar argument to the above).
  • Explicit fitness values not needed. To order individuals, only the ability to compare pairs of solutions is necessary.
• However: rank selection introduces a reordering overhead and makes a theoretical analysis of convergence difficult.
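
A sketch of linear rank selection; here the weights are simply the ranks 1..n (one particular choice of a and b), not the exact coefficients of the slide's example.

```python
import random

def linear_rank_select(fitnesses):
    """Sort by increasing fitness and select with probability proportional to rank
    (worst = 1, best = n); fitness values only matter through pairwise comparisons."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])   # indices, worst to best
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return random.choices(range(n), weights=ranks, k=1)[0]

# Even a dominant individual (fitness 90) cannot saturate the mating pool
picks = [linear_rank_select([90, 7, 3]) for _ in range(10000)]
print([picks.count(i) / 10000 for i in range(3)])   # roughly 0.50, 0.33, 0.17
```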

30. Tournament Selection
• Tournament selection can be viewed as a noisy version of rank selection.
• The selection process is two-stage:
  • Select a group of N (N ≥ 2) individuals
  • Select the individual with the highest fitness from the group and discard all others
• Tournament selection inherits the advantages of rank selection. In addition, it does not require global reordering and is more naturally inspired.
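
A sketch of tournament selection; the group size k and the fitness function are illustrative.

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick k individuals at random and keep the fittest of the group."""
    group = random.sample(population, k)
    return max(group, key=fitness)

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
winner = tournament_select(pop, fitness=sum, k=3)
print(winner, sum(winner))
```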

31. Elitist Selection
• The idea behind elitism is that at least one copy of the best individual in the population is always passed on to the next generation.
• The main advantage is that convergence is guaranteed (i.e., if the global maximum is discovered, the GA converges to that maximum). By the same token, however, there is a risk of being trapped in a local maximum.
• One alternative is to save the best individual found so far in some kind of register and, at the end of each run, to designate it as the solution instead of using the best individual of the last generation.
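
A sketch of one way to apply elitism between generations; overwriting the last offspring is an arbitrary choice made for illustration.

```python
def with_elitism(old_pop, new_pop, fitness):
    """Carry at least one copy of the current best individual into the next generation."""
    best = max(old_pop, key=fitness)
    if fitness(best) > max(fitness(c) for c in new_pop):
        new_pop = new_pop[:-1] + [best]   # overwrite one offspring (here, the last)
    return new_pop

old = [[1, 1, 1, 0], [0, 1, 0, 0]]
new = [[0, 0, 1, 0], [0, 1, 0, 0]]
print(with_elitism(old, new, fitness=sum))   # [[0, 0, 1, 0], [1, 1, 1, 0]]
```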

32. 1-point Crossover
• Here, the chromosomes of the parents are cut at some randomly chosen common point and the resulting sub-chromosomes are swapped
• For example:
  • P1 = 1010101010 and P2 = 1110001110
  • Crossover point between the 6th and 7th bits
  • Then the offspring are:
    • O1 = 1010101110
    • O2 = 1110001010
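
A sketch of 1-point crossover on bit-strings, reproducing the example above when the cut is fixed at 6.

```python
import random

def one_point_crossover(p1, p2, cut=None):
    """Swap the tails of two equal-length strings after a common cut point."""
    if cut is None:
        cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

print(one_point_crossover('1010101010', '1110001110', cut=6))
# ('1010101110', '1110001010')
```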

33. 2-point Crossover
• Here, the chromosomes are thought of as rings with the first and last gene connected (i.e., a wrap-around structure)
• The rings are cut at two sites and the resulting sub-rings are swapped
• For example:
  • P1 = 1010101010 and P2 = 1110001110
  • Crossover points are between the 2nd and 3rd bits, and between the 6th and 7th bits
  • Then the offspring are:
    • O1 = 1110101110
    • O2 = 1010001010
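
A sketch of 2-point crossover; with the cuts fixed after the 2nd and 6th bits it yields the same pair of offspring as the example above (listed in the other order).

```python
import random

def two_point_crossover(p1, p2, cuts=None):
    """Cut both strings at two common points and swap the middle segments."""
    if cuts is None:
        cuts = sorted(random.sample(range(1, len(p1)), 2))
    a, b = cuts
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

print(two_point_crossover('1010101010', '1110001110', cuts=(2, 6)))
# ('1010001010', '1110101110')
```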

34. Uniform Crossover
• Here, each gene of the offspring is selected randomly from the corresponding genes of the parents
• For example:
  • P1 = 1010101010 and P2 = 1110001110
  • Then the offspring could be:
    • O = 1110101110
• Note: produces a single offspring
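
A sketch of uniform crossover producing a single offspring, as on this slide; each bit comes from either parent with equal probability.

```python
import random

def uniform_crossover(p1, p2):
    """Each gene of the offspring is copied from a randomly chosen parent."""
    return ''.join(random.choice(pair) for pair in zip(p1, p2))

print(uniform_crossover('1010101010', '1110001110'))
# e.g. '1110101110'
```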

35. Mutation
• Mutation consists of making (usually small) alterations to the values of one or more genes in a chromosome
• In binary chromosomes, it consists of flipping random bits of the genotype. For example, 1010101010 may become 1011101010 if the 4th bit is flipped.
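
A sketch of bit-flip mutation; the per-bit flip probability is an illustrative parameter.

```python
import random

def mutate(chromosome, p_flip=0.05):
    """Flip each bit independently with probability p_flip."""
    return ''.join(str(1 - int(b)) if random.random() < p_flip else b
                   for b in chromosome)

print(mutate('1010101010'))   # e.g. '1011101010' if only the 4th bit flips
```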
