
Bloat and Universal Consistency in GP






Presentation Transcript


  1. Bloat and Universal Consistency in GP. { merve.amil, nicolas.bredeche, christian.gagne, sylvain.gelly, marc.schoenauer, olivier.teytaud } @lri.fr (TAO, Inria, LRI, University Paris-Sud). With thanks to William Langdon.

  2. Outline: 1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency 4. A penalized fitness against bloat 5. Conclusion

  3. Framework: symbolic regression (we use GP to mine a space of Turing-computable functions that fit the examples).

  4. Symbolic Regression. Examples:
     X1 = [ 0.14 ; 2.07 ; -1 ]    y1 = 0
     X2 = [ 1 ; 2 ; 31.5 ]        y2 = 1
     ...
     Xn = [ 1.5 ; -1 ; 10.0 ]     yn = 1
     Hypothesis: the (xi, yi) are independent, identically distributed ==> the law law(x, y) is unknown.
     Goal: find a Turing-computable f (on real numbers) maximizing P( f(x) = y ).
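
A minimal Python sketch of this setup (the hidden data-generating law, the sample size and the candidate program below are illustrative assumptions, not taken from the slides): draw n i.i.d. pairs (xi, yi) and measure how often a candidate program agrees with the labels.

    # Illustrative sketch of symbolic regression as classification.
    # The hidden law below is a made-up placeholder; in the slides it is unknown.
    import random

    def sample_pair():
        x = [random.uniform(-2.0, 2.0) for _ in range(3)]   # a 3-dimensional input
        y = 1 if x[0] + x[1] > 0 else 0                     # hypothetical hidden rule
        return x, y

    def accuracy(f, pairs):
        # Empirical counterpart of P( f(x) = y ): fraction of examples labeled correctly.
        return sum(1 for x, y in pairs if f(x) == y) / len(pairs)

    n = 1000
    sample = [sample_pair() for _ in range(n)]               # (xi, yi) i.i.d., i = 1..n

    # One candidate "program"; GP would search a space of Turing-computable functions.
    candidate = lambda x: 1 if x[0] > 0 else 0
    print(accuracy(candidate, sample))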

  5. Framework: symbolic regression. Consistency: is the function that we output a good function? (It could work only on the examples!) No-bloat: does the function we output grow infinitely as the number of examples increases?

  6. What happens usually? • We know P', the empirical distribution (the average of the Dirac masses at the (xi, yi)) • We do not know P (ill-posed problem) • Two troubles: sometimes P(f(x)=y) is disappointing; sometimes f is huge • We study the behavior as n → infinity

  7. What happens usually ? • We know P' the empirical distribution (average of the Dirac Masses at the (xi,yi) ) • We do not know P • We would like to maximize P(f(x)=y) • We can only maximize fitness = P'(f(x)=y) (with possibly complexity-penalization terms) Q: Assume that we perfectly optimize “fitness”. Does this lead to a good function ? Does this lead to no-bloat ?
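
As a small sketch of the two fitness functions contrasted here (reusing accuracy() from the sketch above; the penalization weight is a placeholder, since choosing it is precisely the topic of the later slides):

    # Empirical fitness P'( f(x) = y ), with and without a complexity penalization.
    def raw_fitness(f, pairs):
        return accuracy(f, pairs)                            # P'( f(x) = y )

    def penalized_fitness(f, pairs, complexity, weight=0.01):
        # 'complexity' stands for program length / VC dimension; 'weight' is a placeholder.
        return accuracy(f, pairs) - weight * complexity

    # Perfectly optimizing raw_fitness over all programs is what the next slides
    # analyze; the penalized variant is the cure proposed later against bloat.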

  8. Outline: 1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency 4. A penalized fitness against bloat 5. Conclusion

  9. What questions do we want to answer? • Universal Consistency: can we ensure that P(f(x)=y) → optimality as n → infinity? (at least if an optimal f exists!) • Bloat: can we ensure that bloat does not occur, i.e. if a correct program of bounded length exists, can we ensure that the length of the function we output does not run to infinity?

  10. Outline: 1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency (that does not work for bloat) 4. A penalized fitness against bloat 5. Conclusion

  11. The usual tools for Universal Consistency • VC-theory, and more generally statistical learning theory, can help us

  12. The usual tools for Universal Consistency • VC-bound (roughly): for f in a family F of functions, | P( f(x) = y ) - P'( f(x) = y ) | < (VCdim(F) / n)^(1/2) = o(1) ==> consistency of learning in F • Unfortunately, GP works on F = { Turing-computable functions } ==> VCdim(F) = infinity

  13. Good news: classical application of VC • However, VCdim( { functions with bounded length } ) is finite (under some constraints: bounded execution time) (see e.g. the book of Anthony & Bartlett)

  14. Good news: classical application of VC • Th 1: slowly increase the bound on length so that (VC(F)/n)^(1/2) = o(1) ==> | P( f(x) = y ) - P'( f(x) = y ) | = o(1) • Once the length is sufficient to allow a good f, and once the o(1) is small, P(f(x) = y) is good! • This is classical in statistical learning
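
A toy illustration of this schedule (the growth rate VC(F) ~ sqrt(n) below is an arbitrary assumption for the example, not the paper's choice): as long as the allowed capacity grows more slowly than n, the deviation term (VC(F)/n)^(1/2) still vanishes.

    # Toy check that a slowly increasing capacity keeps the VC deviation term o(1).
    import math

    for n in (10**2, 10**4, 10**6, 10**8):
        vc = math.sqrt(n)                  # capacity allowed at sample size n (illustrative)
        deviation = math.sqrt(vc / n)      # order of |P - P'| over that family
        print(n, vc, deviation)            # deviation -> 0 as n -> infinity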

  15. Bad news: bad for bloat! • Consider f' = argmax P'(f(x)=y) among F with length < L(n), where L(n) increases to infinity as a function of n • Ok, if L increases slowly enough, P(f'(x)=y) → optimality • But we show that for some simple P, the length of f' runs to infinity, even though a bounded-length function is optimal

  16. The counter-example. X uniform in [-2, 2], three areas: P(y=1 | x < -1) = 9/10, P(y=1 | x > 1) = 1/10, P(y=1 | -1 < x < 1) = 1/2. So for x < -1, y is probably 1; for x > 1, y is probably 0; for -1 < x < 1, y is a coin flip.
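
For concreteness, a sampler for this distribution (the sample size below is arbitrary):

    # Counter-example: x uniform on [-2, 2], the label noise depends on the area.
    import random

    def sample_counter_example():
        x = random.uniform(-2.0, 2.0)
        if x < -1:
            p1 = 0.9        # P(y=1 | x < -1) = 9/10
        elif x > 1:
            p1 = 0.1        # P(y=1 | x > 1) = 1/10
        else:
            p1 = 0.5        # P(y=1 | -1 < x < 1) = 1/2: pure noise
        y = 1 if random.random() < p1 else 0
        return x, y

    data = [sample_counter_example() for _ in range(1000)]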

  17. The counter-example. Optimal function: if (x < 0) then 1 else 0. Function found by GP without penalization: it fits most of the examples! => No consistency, and bloat!

  18. The counter-example. Optimal function: if (x < 0) then 1 else 0. Function found by GP with a small penalization: it still fits most of the examples! => Consistency, but bloat!

  19. The counter-example. Optimal function: if (x < 0) then 1 else 0. Function found by GP with a bigger penalization: ok! => Consistency, no bloat!

  20. The counter-example. Optimal function: if (x < 0) then 1 else 0. Function found by GP with a too strong penalization: too simple! => No consistency, no bloat!

  21. Outline: 1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency 4. A penalized fitness against bloat 5. Conclusion

  22. A solution • Consider f' = argmax over f of P'(f(x)=y) - penalization(f, n) • Choose penalization(f, n) just strong enough • Then, (i) universal consistency holds, and (ii) length(f') --> the minimal length of an optimal code
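
A sketch of this selection rule (reusing accuracy() from the first sketch; the shape of the penalization below, a constant times sqrt(VC(f) log n / n), is an assumption standing in for the paper's precise term: it dominates the (VC(f)/n)^(1/2) deviation yet still goes to 0):

    # Penalized selection: maximize P'( f(x) = y ) - penalization(VC(f), n).
    import math

    def penalization(vc, n, c=1.0):
        # Placeholder shape; the constant c and the log factor are illustrative choices.
        return c * math.sqrt(vc * math.log(n) / n)

    def select(candidates, pairs):
        # candidates: iterable of (program, vc_dimension) pairs; pairs: the (xi, yi) sample.
        n = len(pairs)
        return max(candidates,
                   key=lambda cand: accuracy(cand[0], pairs) - penalization(cand[1], n))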

  23. The proof (sketch) (1/3) • Consider f' = argmax over f of P'(f(x)=y) - penalization(VC(f), n) • We look for conditions on the penalization ensuring that P(f' ok) → maximum and length(f') → minimum

  24. The proof (sketch) (2/3) • f' = argmax over f of P'(f(x)=y) - penalization(VC(f), n) • Assume that there exists some good f* (optimal in size) • Let's show that for n large, f' = f* (roughly) • If we restrict our attention to functions with VC(f) < K(n), then |P' - P| = O( (VC(f)/n)^(1/2) ) • So, if we forbid functions with VC(f) > K(n), f' = argmax P(ok) - penalization + O( (VC(f)/n)^(1/2) )

  25. The proof (sketch) (3/3) • If the penalization is big in front of (VC(f)/n)^(1/2), then f' = argmax P(ok) - penalization • K(n) increases slowly => penalization → 0, P(f' ok) → maximum => consistency! • P(f' ok) - pen(f') ≥ P(f* ok) - pen(f*) + small terms • small terms ≥ P(f' ok) - P(f* ok) ≥ pen(f') - pen(f*) • So VC(f') is VC(f*) + o(1) => optimal size
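
Written out (eps_n is shorthand introduced here, not notation from the slides, for the uniform deviation term of order (K(n)/n)^(1/2); "f ok" abbreviates f(x)=y and pen(f) abbreviates penalization(VC(f), n)), the chain above reads:

    \begin{aligned}
    P'(f'\ \mathrm{ok}) - \mathrm{pen}(f')
      &\ge P'(f^*\ \mathrm{ok}) - \mathrm{pen}(f^*)
      && \text{(definition of } f' \text{)}\\
    P(f'\ \mathrm{ok}) - \mathrm{pen}(f')
      &\ge P(f^*\ \mathrm{ok}) - \mathrm{pen}(f^*) - 2\varepsilon_n
      && (|P - P'| \le \varepsilon_n \text{ on the allowed family})\\
    \mathrm{pen}(f') - \mathrm{pen}(f^*)
      &\le P(f'\ \mathrm{ok}) - P(f^*\ \mathrm{ok}) + 2\varepsilon_n \le 2\varepsilon_n
      && \text{(optimality of } f^* \text{)}
    \end{aligned}

The second line gives consistency, since pen(f*) and eps_n both go to 0; the third line gives pen(f') ≤ pen(f*) + o(1), hence VC(f') ≤ VC(f*) + o(1) when the penalization increases with VC: no bloat.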

  26. Outline: 1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency (that does not kill bloat) 4. A penalized fitness against bloat 5. Conclusion

  27. Conclusion • We use VC-theory to ensure consistency; this is a standard application of VC-theory, but as far as we know it is new in GP • We use VC-theory to ensure no-bloat • For this, we have introduced a precise size-penalization • Experiments confirm the results (thanks to Open BEAGLE)

  28. Limits • We deal with perfect fitness optimization, i.e. we consider that f' = argmax P'(f(x)=y) + ... => the optimization of P'(.) is not so easy in practice! => however, optimizing P'(...) + ... would be pointless if the ideal optimization of P'(...) did not work!

  29. Other elements in the paper • Can we use hold-out to choose a bound on the length of programs ? • Can we use cross-validation to choose a bound on the length of programs ?

  30. References • Many works proposing complexity penalization • Many works reporting bloat • Many papers explaining bloat (in particular « fitness causes bloat ») • Many good books about VC-theory => refs on the Dagstuhl site or on www.lri.fr/~teytaud

  31. A final remark • Runtime analysis of EAs uses Chernoff & Hoeffding bounds & ... • Statistical learning is the generalization of Chernoff & Hoeffding bounds to the evaluation of distributions => statistical learning is very promising, e.g. for EDA analysis

  32. Thanks for your attention We will be very grateful for any comment / suggestion / ... !
