Lecture 4

  1. Lecture 4 Before we try to incorporate the effects of schema creation to obtain "exact" theorems, rather than lower bounds, both for GAs and for GPs, we will look at some Genetic Algorithm results, trying to extend our understanding and the set of mathematical models at our disposal. To this end, we will use some of the material in Mitchell's book, Ch. 4.

  2. Lecture 4 The Two-Armed Bandit Problem. This studies the trade-off between exploration (the search for new, useful adaptations) and exploitation (the use and propagation of these adaptations). The problem goes back at least to the 1950s (R. Bellman: Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961), or even to W. R. Thompson's paper of 1933 (quoted in [MacreadyWolpert1996]), and can be described as follows: a gambler is given N coins with which to play a slot machine with two arms (a conventional slot machine is a "one-armed bandit"). The arms are labeled A1 and A2; they have known mean per-trial payoff rates m1 and m2, with Gaussian distributions and known standard deviations s1 and s2. The payoff distributions of the two arms are stationary (= time-independent) and independent of one another: the payoff rates do not change over time.

  3. Lecture 4 The gambler does not know which distribution is associated with which arm (although both distributions are known): the only way to learn the correct assignment is to play coins on the two arms and observe the payoffs on each. There is no a priori information about which arm is better (there could be, and it would make no real difference to the methodology). The goal is to maximize the total payoff during the N trials. What should the strategy be for allocating trials to each arm, given the current estimates (from the payoffs received so far) of the m's and s's? Note that the goal is not just to guess which arm has the higher payoff rate, but to maximize payoff while gaining information through the allocation of samples to the two arms.
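To make the setup concrete, here is a minimal Python sketch of the environment just described. It is an illustration only; the class and method names (TwoArmedBandit, pull) are ours, not the lecture's.

import random

class TwoArmedBandit:
    # Stationary Gaussian payoffs. The player knows the two (mean, std)
    # pairs but not which arm carries which distribution.
    def __init__(self, m1, s1, m2, s2):
        self.params = [(m1, s1), (m2, s2)]
        random.shuffle(self.params)   # hide the assignment from the player

    def pull(self, arm):
        mean, std = self.params[arm]
        return random.gauss(mean, std)

bandit = TwoArmedBandit(m1=1.0, s1=0.5, m2=0.6, s2=0.5)
payoff = bandit.pull(0)   # one trial on arm 0; one coin spent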

  4. Lecture 4 This kind of strategy (or criterion) is called an "on-line strategy", to be contrasted with an "off-line strategy", which would use an "information gathering period" to decide which of the two arms has the more favorable distribution. The two are clearly different: the latter is much easier to implement and evaluate than the former.

  5. Lecture 4 The reason for the analysis of this problem was that John Holland wanted to claim that, as more and more information is obtained through sampling, the optimal strategy is to exponentially increase the probability of sampling the better-seeming arm relative to the probability of sampling the worse-seeming one: the 3^L schemata in an L-bit search space can be viewed as the 3^L arms of a multi-armed slot machine. The observed "payoff" of a schema is just its observed "average fitness" and, under the GA, the near-optimal strategy arises implicitly, leading to maximization of on-line performance. The analysis here is the simplified one in Mitchell's book - anybody interested in more detailed views should consult Holland's analysis in the second edition of his book, or some of the papers in the web directory.

  6. Lecture 4 Let A1 be the arm with the higher average payoff, m1; let A2 be the arm with the lower average payoff, m2. Let Ah(N, N - n) be the arm with the higher observed payoff after N trials, N - n of which have been assigned to it; let Al(N, n) be the other arm, with n of the N trials assigned to it. To find: the value 0 ≤ n = n* ≤ N that maximizes the expected profits (or minimizes the expected losses) over the N trials. Observation 1: maximizing the expected profits would require us to allocate all N trials to the true best arm, resulting in an expected payoff of N•m1.

  7. Lecture 4 Observation 2: there are two sources of profit loss: (1) the observed worse arm, Al(N, n), is actually the better arm, A1; in this case the gambler has lost (expected) profits on the N - n trials given to Ah(N, N - n), in the amount of (N - n)•(m1 - m2). (2) The observed worse arm, Al(N, n), is actually the worse arm, A2; in this case the gambler has lost expected profits over the n trials given to Al(N, n), in the amount of n•(m1 - m2). Let q(N - n, n) denote the probability that the observed worse arm, Al(N, n), is actually the better arm, A1, given N - n trials to Ah(N, N - n) and n trials to Al(N, n): q(N - n, n) = Pr(Al(N, n) = A1). Using q, we can compute the expected losses over the N trials: L(N - n, n) = q(N - n, n)•(N - n)•(m1 - m2) + (1 - q(N - n, n))•n•(m1 - m2).

  8. Lecture 4 We now want to find n = n* that minimizes the expression. Even though n is not a continuous variable, we can differentiate w.r.t. n, set the derivative to 0,

dL(N - n, n)/dn = 0,

and solve for n. Unfortunately, q = q(N - n, n) is itself a function of n, and not an easy function to compute explicitly - or at least explicitly enough that we can find a solution of the equation above. We can go forward (a bit) by first supposing (without loss of generality) that Al(N, n) = A1, and then denoting by S1,n the sum of the payoffs of the n trials given to A1, and by S2,N-n the sum of the payoffs of the N - n trials given to A2.

  9. Lecture 4 Since q(N - n, n) was defined as the probability that Al(N, n) = A1 (that the observed worse arm is actually the better one), the last set of definitions leads to the approximation

q(N - n, n) ≈ Pr(S1,n/n < S2,N-n/(N - n)).

Since S1,n and S2,N-n are random variables with well-defined distributions, their difference is a random variable with a well-defined distribution, and q(N - n, n) can be approximated by the area under the difference distribution corresponding to negative values.
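One way to make this usable: for Gaussian payoffs the difference of the two sample means is itself Gaussian, so q can be evaluated via a normal tail probability and the expected loss L(N - n, n) minimized numerically. A sketch under that assumption; the helper names are ours.

import math

def q_approx(n, N, m1, m2, s1, s2):
    # Pr(the observed worse arm is actually A1): the difference of sample
    # means S1,n/n - S2,N-n/(N-n) is Normal(m1 - m2, s1^2/n + s2^2/(N-n)),
    # and q is the probability mass below zero.
    var = s1**2 / n + s2**2 / (N - n)
    z = (m1 - m2) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # = Phi(-z)

def expected_loss(n, N, m1, m2, s1, s2):
    # L(N - n, n) from slide 7.
    q = q_approx(n, N, m1, m2, s1, s2)
    return (q * (N - n) + (1 - q) * n) * (m1 - m2)

N, m1, m2, s1, s2 = 1000, 1.0, 0.6, 0.5, 0.5
n_star = min(range(1, N), key=lambda n: expected_loss(n, N, m1, m2, s1, s2))
print(n_star)   # numerical minimization stands in for solving dL/dn = 0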

  10. Lecture 4 A computation by Dan Frantz (quoted in Mitchell's book as "personal communication", using methods beyond us at the moment, and "correcting" the original "theorem" by Holland) gives (c1, c2 and c3 are constants):

n* ≈ c1•ln( c2•N² / ln(c3•N²) ).

A bit of algebra (exponentiate, solve for N, and note that N - n* ≈ N for large N) leads to

N - n* ≈ sqrt( ln(c3•N²)/c2 ) • e^(n*/(2•c1)).

  11. Lecture 4 How do we interpret this? N - n* is just the optimal allocation of trials to the observed better arm. The right-hand side is dominated (for large n*) by the exponential, which says that the optimal allocation of trials to the observed better arm should increase exponentially with the number of trials allocated to the observed worse arm (= n*). But this is another way of stating the Schema Theorem: to chase an optimal allocation we must have that, over time, the number of trials allocated to the best schemata increases exponentially with respect to the number allocated to the worst ones.

  12. Lecture 4 Comment: the original attempt at a proof (Theorem 5.1 in Holland's book) is sufficiently flawed not to be worthwhile; the fix (in the later edition, pp. 181-183) is sufficiently telegraphic to be, essentially, worthless except to a specialist… The problem and its variants remain an active research topic with an evolving literature. A Google search will bring up quite a few papers published within the last few years. Most appear beyond a quick read at this moment, although [MacreadyWolpert1996] has some sections that are both readable and informative. Since it contains some interesting conclusions, we will be a little more explicit about it.

  13. Lecture 4 Realizable Strategies. It should be clear that no implementable strategy exists that will allocate trials to Al(N, n) as prescribed. One might then argue that, even assuming the result to be valid, it is not very useful. [Holland1975, 1992] claims that (given that the two distributions are known), for any N decided a priori, the strategy that assigns n* trials to each arm (in any order) and then N - 2n* trials to the arm with the highest observed payoff rate is asymptotically (as N → ∞) optimal: the ratio between the expected profit loss following it and that obtained following the optimal strategy goes to 1 in the limit. Again, we will not examine the details.

  14. Lecture 4 Interpretations. This is where things get complicated. It is certainly reasonable to use the 2-armed bandit problem as a paradigm problem in resource allocation under uncertainty. Its theoretical "conclusions" seem to give confirmation to the importance of the schema theorems, although there is plenty of "interpretation" and "analogy" linking the results (rather than hard mathematics). The GA implements this "exponential improvement" via "sampling in parallel", where each of the n individuals in the population can be viewed as a sample of 2^l different schemata (for l-bit strings). The schema theorems are interpreted to imply that, since the number of instances in the population of a schema H is related to H's average fitness (over the strings actually present in the finite population), we should have an exponential growth rate for highly fit schemata - and thus some kind of "optimality" for the strategy.

  15. Lecture 4 The Critique of Macready and Wolpert [1996: On 2-Armed Gaussian Bandits and Optimization, Santa Fe Institute Working Paper]. This takes Holland's analysis of the two-armed bandit, shows that the defining formulae are flawed from the beginning, and then shows that straightforward greedy strategies (NO EXPLORATION, ONLY EXPLOITATION) are optimal in a large number of cases. Greedy Algorithm: after n = n1 + n2 trials, of which n1 have been allocated to arm A1 and n2 to arm A2, use the existing information to determine which arm has the higher probability of being the high-payoff arm. Allocate the next trial to that arm. Repeat. Initial Action: if you have no a priori probability for which arm is the higher-payoff one, pick one randomly.
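A sketch of the greedy allocator, with one simplification on our part: "higher probability of being the high-payoff arm" is replaced by "higher observed mean payoff", which selects the same arm in the equal-variance, uniform-prior case treated by the theorem below. It assumes the hypothetical TwoArmedBandit object from the earlier sketch.

def greedy_play(bandit, N):
    # Pure exploitation: one forced pull per arm (standing in for the
    # random initial action), then always play the better-looking arm.
    totals, counts = [0.0, 0.0], [0, 0]
    for arm in (0, 1):
        totals[arm] += bandit.pull(arm)
        counts[arm] += 1
    for _ in range(N - 2):
        arm = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        totals[arm] += bandit.pull(arm)
        counts[arm] += 1
    return totals[0] + totals[1]   # total payoff over the N trials

# total = greedy_play(TwoArmedBandit(1.0, 0.5, 0.6, 0.5), N=100)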

  16. Lecture 4 Theorem ([Macready and Wolpert 1996]): for Gaussian distributions with means m1 and m2, if s1 = s2, the greedy algorithm is optimal. Proof: see the paper (a meaningful exercise only if you are willing to spend a lot of time learning). The obvious question, then, becomes: which distributions lead to optimal greedy algorithms? Although Macready and Wolpert do not provide a complete answer, the result they give proves the following. Theorem. If P denotes the prior probability that arm a (label the arms a and b) has mean m1 and standard deviation s1, and the a priori number of trials is N, the greedy strategy is optimal as long as P < 1/(N + 1) or P > N/(N + 1).

  17. Lecture 4 There are two fairly important observations associated with this theorem: • The probability distributions are irrelevant: ANY distributions will do. • The estimates for P are much too pessimistic - although the proof (in its present form) does not allow an extension to larger domains. The final "nail in the coffin" is given by a Monte Carlo simulation comparing Holland's claimed "asymptotically optimal" strategy with a greedy strategy. N was chosen to be 100, with P = 0.5. The results - for many choices of s1 and s2 - indicate that the greedy strategy performs much better.

  18. Lecture 4 Conclusion: 2- and multi-armed bandits may have something to contribute to our understanding of the performance characteristics of Genetic Algorithms, but we are far from being able to conclude what…

  19. Lecture 4 The Critique of Grefenstette & Baker (How genetic algorithms work: A critical look at implicit parallelism; in J. D. Schaffer, ed., Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, 1989). They start from the fitness function

f(x) = 2 if x is an instance of 111*…*; f(x) = 1 if x is an instance of 0*…*; f(x) = 0 otherwise.

Let u(H) be the static average fitness of schema H (the average over all instances of the schema in the search space), and let û(H, t) be the observed average fitness of H at time t. Since only one fourth of the instances of 1*…* have the form 111*…*, the static average fitness of 1*…* is u(1*…*) = 2•(1/4) = 1/2. Since all instances of 0*…* have fitness 1, u(0*…*) = 1•1 = 1.

  20. Lecture 4 So u(1*…*) < u(0*…*), and this should be reflected in the observed averages as t increases. It is not likely to be: under selection, the schema 111*…* will come to dominate, and all instances of 111*…* are instances of 1*…*, so 1*…* will also come to dominate, with û(1*…*, t) climbing toward 2 while û(0*…*, t) stays near 1 - the observed averages invert the static ordering. Why doesn't this work? Because in the two-armed bandit each arm is an independent random variable; the schemata of the GA are not. There are other interpretations that attempt to "fix" this problem - none of them appears adequate.
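The effect is easy to reproduce in simulation. The sketch below (string length, population size and generation count are illustrative choices of ours) runs fitness-proportional selection alone and prints the observed average fitness of the two order-1 schemata; recall the static averages are u(1*…*) = 1/2 and u(0*…*) = 1.

import random

L, POP, GENS = 10, 200, 30   # illustrative parameters

def fitness(x):
    # The Grefenstette & Baker fitness function reconstructed above.
    if x.startswith('111'):
        return 2.0
    if x.startswith('0'):
        return 1.0
    return 0.0

def observed_avg(pop, first_bit):
    # Observed average fitness of the schema fixing only the first bit.
    inst = [fitness(x) for x in pop if x[0] == first_bit]
    return sum(inst) / len(inst) if inst else float('nan')

pop = [''.join(random.choice('01') for _ in range(L)) for _ in range(POP)]
for _ in range(GENS):   # selection only: no crossover, no mutation
    pop = random.choices(pop, weights=[fitness(x) for x in pop], k=POP)

# The observed average of 1*...* climbs toward 2 (111*...* takes over),
# even though its static average is only 1/2.
print(observed_avg(pop, '1'), observed_avg(pop, '0'))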

  21. Lecture 4 Deception, and the notion of GA-hardness (Goldberg, 1989, pp. 46-50). What makes search hard in the context of GAs? In particular, what, if anything, would cause short, low-order schemata to lead to incorrect "solutions"? An incorrect solution, in this case, means "stopping far from the optimum" after any reasonable amount of computation, because the longer, higher-order building blocks generated by the application of the GA are incorrect (= suboptimal). As it turns out, even 2-bit problems can lead to deception… Another example of GA-hard/easy problems, and a discussion, can be found in [Altenberg 1997].

  22. Lecture 4 Suppose we have a set of order-2 schemata over two defining positions, each schema with an associated fitness value:

***0*****0*   f00
***0*****1*   f01
***1*****0*   f10
***1*****1*   f11

with O(H) = 2 and L(H) = 6, and f00, f01, f10 and f11 the average fitness values of the four schemata, assumed constant with zero variance (we are interested only in expected performance). Assume f11 is the global maximum: f11 > max(f00, f01, f10). Symmetry considerations make the specific choice of maximum irrelevant to our conclusions.

  23. Lecture 4 What is the deception? We want a problem where one or both of the suboptimal order-1 schemata are better than the optimal order-1 schemata. This means that we want one or both of the conditions f(0*) > f(1*), f(*0) > f(*1) to hold, where the notation shows only the two allele positions of interest. These give the inequalities

f00 + f01 > f10 + f11 and f00 + f10 > f01 + f11,

and, taken together, they contradict the hypothesis (adding them implies f11 < f00). We choose one of the inequalities, f(0*) > f(1*), and drop the other. This will be the "deception condition".

  24. Lecture 4 We normalize the fitness values w.r.t. the fitness of the complement of the global optimum:

r = f11/f00,  c = f01/f00,  c' = f10/f00.

The globality and deception conditions can be rewritten, respectively, as r > c, r > 1, r > c' and r < 1 + c - c', which lead to the inequalities c' < 1, c' < c. We conclude that there are two types of deceptive problems: Type I: f01 > f00 (c > 1); Type II: f00 ≥ f01 (c ≤ 1). In biological terms: an epistatic problem (= the effects are not additive). Furthermore, this is a (the) minimal deceptive problem (MDP), since no one-bit problem can be deceptive.
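The normalized conditions are mechanical to check. A small sketch (the function name is ours) that classifies a 2-bit problem:

def deception_type(f00, f01, f10, f11):
    # Normalization from the slide: r = f11/f00, c = f01/f00, c' = f10/f00.
    r, c, cp = f11 / f00, f01 / f00, f10 / f00
    global_opt = r > max(1.0, c, cp)   # f11 is the global maximum
    deceptive = r < 1 + c - cp         # the deception condition f(0*) > f(1*)
    if not (global_opt and deceptive):
        return None
    return 'Type I' if c > 1 else 'Type II'

print(deception_type(f00=0.9, f01=1.0, f10=0.5, f11=1.1))   # Type I
print(deception_type(f00=1.0, f01=0.9, f10=0.5, f11=1.1))   # Type II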

  25. Lecture 4 It can be shown that neither case can be written in the form of a linear combination of the allele values. We can obtain a pictorial understanding of the configurations:

  26. Lecture 4 The pictures: Type I deception on left, Type II on right (from Goldberg, p. 48).

  27. Lecture 4 We will now perform a full schema analysis of this problem. Using the schema theorem (under the no-mutation assumption), we expect difficulties when the deception condition established above, f(0*) > f(1*), holds. The first thing we will do is look at the exact effects of cross-over, and then derive a family of difference equations that will allow us to follow the "trajectory" from any initial population. In order to do so, we must examine the effects of cross-over in detail, accounting for losses and gains of genetic material.

  28. Lecture 4 The table (S denotes that the offspring is identical to the parents: two children per mating, complementary cross-over).

  29. Lecture 4

  30. Lecture 4 A necessary condition for convergence can be derived from these equations. The family of non-linear autonomous (= time-independent) difference equations can be solved numerically for a number of initial populations, leading to different outcomes; a simplified, selection-only sketch follows.
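As a deliberately simplified illustration (ours, not the lecture's full model: the crossover gain/loss terms - the content of the table of slide 28 - are omitted, leaving only selection pressure), one can iterate expected fitness-proportional selection on the four schema proportions:

def iterate_selection(p, f, steps):
    # p, f: dicts keyed by '00', '01', '10', '11' (proportions, fitnesses).
    # Selection-only difference equation: p'(H) = p(H) * f(H) / f_bar.
    for _ in range(steps):
        fbar = sum(p[k] * f[k] for k in p)
        p = {k: p[k] * f[k] / fbar for k in p}
    return p

p0 = {'00': 0.25, '01': 0.25, '10': 0.25, '11': 0.25}
f = {'00': 1.0, '01': 1.05, '10': 0.1, '11': 1.1}   # a Type I deceptive setup
print(iterate_selection(p0, f, 50))   # under pure selection, '11' dominates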

  31. Lecture 4 The convergent Type II case:

  32. Lecture 4 The divergent Type II case:

  33. Lecture 4 Royal Roads. Forrest, Holland and Mitchell (1992 to 1994 - and discussed in Mitchell's An Introduction to Genetic Algorithms) designed a family of "fitness landscapes" to study details of the Building Block Hypothesis. Two features appear important: a) the presence of short, low-order, highly fit schemata; b) the presence of intermediate "stepping stones" (intermediate-order, higher-fitness schemata that result from combinations of the lower-order schemata and that can combine to form even higher-fitness schemata). We can design fitness functions that contain these features: a Royal Road R is a fitness function that explicitly uses a sequence s1, …, sk of schemata, with corresponding coefficients ci, such that

R(x) = Σi ci•δi(x), where δi(x) = 1 if x is an instance of si and 0 otherwise.

  34. Lecture 4 If the building block hypothesis were valid, one would expect the building-block structure of R to provide a "royal road" for the GA in its search for an optimal string. One might even expect the GA to outperform other search algorithms. A last decision involves the reproductive scheme used: the expected number of offspring for an individual i is given by

ExpVal(i) = 1 + (Fi - M)/(2•s),

where Fi is i's fitness, M is the mean fitness of the population, and s is the standard deviation of the population fitness ("sigma scaling"). The number of expected offspring was cut off at 1.5 (to avoid premature convergence); the single-point crossover probability was 0.7 for each pair of parents; the bitwise mutation rate was 0.001. As it turned out, the best behavior was exhibited by what the authors called a "Random Mutation Hill-Climbing" algorithm.

  35. Lecture 4 Random Mutation Hill-Climbing (RMHC). Steps: 1. Choose a string at random; call it best. 2. Choose a locus (gene) at random and flip it; if the flip gives a string of equal or better fitness, set best to the new string. 3. Go to Step 2 until either you find an optimum or you exceed the computational resources allocated for the search. 4. Return the current value of best. With a set of schemata (s1, …, s8) consisting of eight contiguous blocks of 1s (64-bit chromosomes), and ci = order_of(si) (= 8), the mean number of function evaluations for the GA (population size 128; 200 independent runs, each stopping when the maximum was found) was 61,334; RMHC succeeded in a mean of 6,179 evaluations.
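A direct Python sketch of RMHC on the Royal Road function just described (the evaluation cap and helper names are ours):

import random

def royal_road_fitness(x, K=8, N=8):
    # R(x): each of the N contiguous K-bit blocks of 1s contributes c_i = K.
    return sum(K for b in range(N) if all(x[b*K:(b+1)*K]))

def rmhc(K=8, N=8, max_evals=100_000):
    best = [random.randint(0, 1) for _ in range(K * N)]   # Step 1
    best_f, evals = royal_road_fitness(best, K, N), 1
    while best_f < K * N and evals < max_evals:           # Step 3
        i = random.randrange(len(best))                   # Step 2
        best[i] ^= 1                                      # flip a random locus
        f = royal_road_fitness(best, K, N)
        evals += 1
        if f >= best_f:
            best_f = f      # keep equal-or-better strings (plateau drift)
        else:
            best[i] ^= 1    # worse: undo the flip
    return evals, best_f    # Step 4

# evals, _ = rmhc()   # the lecture reports a mean of 6,179 evaluations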

  36. Lecture 4 Analysis of RMHC (based on an example in W. Feller's An Introduction to Probability Theory and Its Applications, 2nd ed., p. 210). Let E(K, N) be the expected time (say, number of function evaluations) to find the optimum string of all 1s, where K is the size of each block of 1s and N is the number of blocks. Let E(K, 1) be the expected time to find a single block of K 1s. The time E(K, 2) is the expected time to find two blocks, which is the time to find a first block plus the time to find the second block given that the first was found. Note that after finding the first block, applying random mutation over the N blocks means that 1/N of the time we can expect the mutation (flipping a 1 to a 0 in the block already found) to be fitness-reducing.

  37. Lecture 4 The rest of the mutations are going to be either neutral or fitness-enhancing, so the fraction of the time spent on useful mutations will be (K•N - K)/(K•N) (the number of usefully changeable bits divided by the number of bits available). So the expected time to find another block must be E(K, 1)•(K•N/(K•N - K)) = (expected time to find a block)•(the reciprocal of the useful fraction, since only (K•N - K)/(K•N) = (N - 1)/N of the time is spent in useful pursuits): E(K, 2) = E(K, 1) + E(K, 1)•(N/(N - 1)). A similar discussion leads to E(K, 3) = E(K, 2) + E(K, 1)•(N/(N - 2)) = E(K, 1) + E(K, 1)•(N/(N - 1)) + E(K, 1)•(N/(N - 2)) = E(K, 1)•(1 + N/(N - 1) + N/(N - 2)).

  38. Lecture 4 Finally, E(K, N) = E(K, 1)•(1 + N/(N - 1) + N/(N - 2) + … + N/(N - (N - 1))) = E(K, 1)•N•(1 + 1/2 + 1/3 + … + 1/N) = E(K, 1)•N•HN, where HN is the Nth harmonic number. As has been well known since the late 18th century (Leonhard Euler),

HN ≈ ln(N) + γ, where γ ≈ 0.5772 is Euler's constant.

One can easily check that the approximation is, roughly, within 0.007 with N = 64, so HN can be replaced by ln(N) + γ without too much loss. One can show (we won't) that E(K, 1) is approximately 2^K - slightly above 2^K. As a very rough approximation: there are N distinct blocks of 1s that are of interest, and each block of 1s can occur in 2^(K•(N - 1)) ways.
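The quality of the harmonic-number approximation is easy to check numerically:

import math

gamma = 0.5772156649015329            # Euler's constant
H = sum(1.0 / k for k in range(1, 65))   # H_64
print(H - (math.log(64) + gamma))     # ~ 0.0078: H_N vs ln(N) + gamma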

  39. Lecture 4 The total number of strings of length K•N is just 2^(K•N), so the fraction of strings containing such blocks should be around N•2^(K•(N - 1))/2^(K•N) = N•2^(-K). The expected time to randomly hit one such string should therefore be of the order of 2^K/N. The fact that we are changing one bit at a time, starting from an initial random string, should not alter this estimate by much. For K = 8 one can compute (the value is given without derivation in Mitchell's book) E(8, 1) = 301.2 > 256. Applying these results and formulae to the data, one ends up with an expected time quite close to that obtained experimentally.

  40. Lecture 4 Why does the genetic algorithm do worse than "random bit twiddling"? Once an instance of a higher-order schema is discovered, its high fitness allows it to spread quickly in the population, but this does little to remove the zeros in the remaining unfilled blocks. This appears to slow the discovery of schemata in other positions: one ends up limiting the "implicit parallelism" that the building block hypothesis depends on…

  41. Lecture 4 Mitchell further introduces - to simplify the ideas involved and to estimate a “best possible situation” - an “Idealized Genetic Algorithm”, or IGA. The Algorithm goes as follows: • On each time step, choose a new string at random, with uniform probability for each bit. • The first time a string is found that contains one or more of the desired schemata, save the string. • When a string containing one or more not-yet-discovered schemata is found, cross the new string with the saved one, and save the result of the crossing.
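A sketch of the IGA for the Royal Road blocks used earlier. One idealization on our part: "crossing the new string with the saved one" is implemented as copying the newly discovered complete blocks into the saved string, which is exactly what that crossover achieves here.

import random

def iga(K=8, N=8, max_steps=1_000_000):
    saved = [0] * (K * N)   # the saved string; starts with nothing found
    found = set()           # indices of blocks discovered so far
    for step in range(1, max_steps + 1):
        s = [random.randint(0, 1) for _ in range(K * N)]   # fresh sample
        new = [b for b in range(N)
               if b not in found and all(s[b*K:(b+1)*K])]
        for b in new:       # idealized crossover: keep the new blocks
            saved[b*K:(b+1)*K] = [1] * K
            found.add(b)
        if len(found) == N:
            return step, saved
    return max_steps, saved

# steps, _ = iga()   # typically a few hundred samples for K = 8, N = 8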

  42. Lecture 4 What does this do for us? • Since each new string is chosen independently, all the schemata are sampled independently. • Selection is modeled by saving strings that contain desirable schemata. • Crossover is modeled by crossing desirable strings into a new, more desirable one. If we can find an estimate of the time it takes for this algorithm to converge, we should have a reasonable lower bound for the convergence of the regular GA. Note that we really know what we are looking for…

  43. Lecture 4 Consider a single desired schema H; let p be the probability of finding it in a random string (if N = 1, the probability is 1/2^K). Let q = 1 - p be the probability of not finding H. Let P1(t) be the probability that H will be found by time t: P1(t) = 1 - q^t. Consider the case with N desired schemata. Independence implies that the probability of finding all N by time t is PN(t) = (1 - q^t)^N. PN(t) is the probability that all N schemata will be found in the interval [0, t], which is not quite what we want: we want the expected time to find all N schemata. Thus we need the probability that the last of the N schemata is found exactly at time t; call it P'N(t). It is given by

P'N(t) = PN(t) - PN(t - 1) = (1 - q^t)^N - (1 - q^(t-1))^N.

  44. Lecture 4 We can obtain the expected time from this probability distribution:

EN = Σt=1..∞ t•P'N(t) = Σt=1..∞ t•[(1 - q^t)^N - (1 - q^(t-1))^N].

Using the binomial theorem, we have:

(1 - q^t)^N = Σi=0..N C(N, i)•(-1)^i•q^(t•i).

Subtracting the two expansions (the i = 0 terms cancel out):

(1 - q^t)^N - (1 - q^(t-1))^N = -C(N, 1)•(q^t - q^(t-1)) + C(N, 2)•(q^(2t) - q^(2(t-1))) - … ± C(N, N)•(q^(Nt) - q^(N(t-1))),

where the last sign depends on whether N is even or odd. Multiply by t and sum from 1 to ∞. Let's perform the summation for the first term; it should be enough to give us a template for the summation of any term between 1 and N.

  45. Lecture 4 For the first term,

Σt=1..∞ t•(q^t - q^(t-1)) = q/(1 - q)² - 1/(1 - q)² = -1/(1 - q),

so the i = 1 term contributes C(N, 1)/(1 - q) = N/(1 - q). All the summation/differentiation trickery works because the infinite sums are uniformly convergent for any 0 ≤ q ≤ q0 < 1.

  46. Lecture 4 A similar computation gives the n-th term of the sum:

(-1)^(n+1)•C(N, n)/(1 - q^n).

Since q = 1 - p and p = 1/2^K, we can assume p "small" in the rest of the discussion. In that case q^n = (1 - p)^n ≈ 1 - n•p, and we have the approximation

1/(1 - q^n) ≈ 1/(n•p).

For any explicit N and p we can grind this out via Maple or Mathematica (if you don't want to write your own version in C). We will make use of another obscure identity (hint from Mitchell: binomial theorem and integration of (1 + x)^N):

Σi=1..N C(N, i)•x^i/i = the integral from 0 to x of ((1 + u)^N - 1)/u du.

  47. Lecture 4 Letting x = -1 in the left-hand side (which gives the negative of the alternating sum just derived) we have, with HN the N-th harmonic number:

EN = Σi=1..N (-1)^(i+1)•C(N, i)/(1 - q^i) ≈ (1/p)•Σi=1..N (-1)^(i+1)•C(N, i)/i = HN/p = 2^K•HN ≈ 2^K•(ln(N) + γ),

which is a factor of N faster than Random-Mutation Hill-Climbing.
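The exact alternating sum and the HN/p approximation can be compared numerically (here for the K = 8, N = 8 Royal Road used earlier):

import math

K, N = 8, 8
p = 1.0 / 2**K
q = 1.0 - p
exact = sum((-1)**(i + 1) * math.comb(N, i) / (1 - q**i)
            for i in range(1, N + 1))
H_N = sum(1.0 / i for i in range(1, N + 1))
print(exact, H_N / p)   # both come out near 700 function evaluations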

  48. Lecture 4 Another analysis of Royal-Road-based algorithms, motivated by results of Vose and others (which we mention briefly in later slides), can be found in [NimwegenCrutchfieldMitchell1998]. The algorithm itself uses only mutation, with fitness-proportional selection. Some of its results seem to give credence to the idea of "punctuated equilibria" that had been championed by the late Stephen Jay Gould.

  49. Lecture 4 A Formalization. M. Vose and G. Liepins developed a formal model that has led to a substantial literature. Start with a random finite population P0 of binary strings of length l. The algorithm works as follows: 1. Calculate the fitness f(x) of each string x in Pi. 2. Use fitness-proportional selection to choose (with replacement) two parents from Pi. 3. Cross over the parents at a single randomly chosen point, forming two offspring; choose one of them at random. If there is no cross-over, the offspring are exact copies of the parents: choose one. 4. Mutate each bit of the selected offspring with probability pm; put the result in the new population. 5. Go to 2 until the new population Pi+1 is complete. 6. Go to 1, replacing i by i + 1.

  50. Lecture 4 Only one offspring from each crossover survives: for a population of size n, a total of n recombinations takes place. Each string in the search space is represented by an integer between 0 and 2^l - 1 (using the standard binary-decimal conversion). The population at generation t is represented by two real-valued vectors, p(t) and s(t), each of length 2^l. The i-th component of p(t), pi(t), is the proportion of the population at generation t consisting of string i, while the i-th component of s(t), si(t), is the probability that an instance of string i will be selected to be a parent at step 2 above. Example: let l = 2 and P0 = {11, 11, 01, 10}. Then p(0) = (0, 0.25, 0.25, 0.5) and, if fitness = number of 1s in the string, s(0) = (0, 0.1667, 0.1667, 0.6667).
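A small sketch (function name ours) that builds the two vectors and reproduces the example's numbers:

from collections import Counter

def population_vectors(pop, l, fitness):
    # p[i]: proportion of the population that is string i (binary-decimal
    # indexing); s[i]: probability that string i is chosen as a parent
    # under fitness-proportional selection.
    n = len(pop)
    counts = Counter(int(x, 2) for x in pop)
    p = [counts[i] / n for i in range(2**l)]
    total = sum(p[i] * fitness(i) for i in range(2**l))
    s = [p[i] * fitness(i) / total for i in range(2**l)]
    return p, s

ones = lambda i: bin(i).count('1')   # fitness = number of 1s
p, s = population_vectors(['11', '11', '01', '10'], 2, ones)
print(p)   # [0.0, 0.25, 0.25, 0.5]
print(s)   # [0.0, 0.1666..., 0.1666..., 0.6666...]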
