
Evolutionary Learning: Genetic Algorithms & Knowledge Representations

This lecture explores the use of evolutionary learning techniques, specifically genetic algorithms, for machine learning tasks. It covers different knowledge representations and learning paradigms, and provides examples of complete evolutionary learning systems.


Presentation Transcript


  1. G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 3: Data Mining Lecture 2: Evolutionary Learning

  2. Outline of the lecture • Introduction and taxonomy • Genetic algorithms • Knowledge Representations • Paradigms • Two complete examples • GAssist • BioHEL • Resources

  3. Evolutionary Learning • Application of evolutionary computation methods (list follows) to machine learning tasks • Genetic Algorithms • Genetic Programming • Evolution Strategies • Ant Colony Optimization • Particle Swarm Optimization • Also known as • Genetics-Based Machine Learning (GBML) • Learning Classifier Systems (LCS) (a subset of it)

  4. Paradigms and representation • EL involves a huge mix of • Search methods (previous slide) • Representations • Learning paradigms • Learning paradigms: how the solution to the machine learning problem is generated • Representations: rules, decision trees, synthetic prototypes, hyperspheres, etc.

  5. Genetic Algorithm working cycle • [Diagram: populations A → B → C → D transformed in turn by evaluation, selection, crossover and mutation]

  6. Genetic Algorithms: terms • Population • Possible solutions of the problem • Traditionally represented as bit-strings (e.g. each bit associated with a feature, indicating whether it is selected or not) • Each bit of an individual is called a gene • The initial population is created at random • Evaluation • Giving a goodness value to each individual in the population • Selection • Process that rewards good individuals • Good individuals will survive, and get more than one copy in the next population. Bad individuals will disappear

  7. Genetic Algorithms • Crossover • Exchanging subparts of the solutions • The crossover stage takes two individuals from the population (parents) and, with a certain probability Pc, generates two offspring • [Diagrams: 1-point crossover and uniform crossover]
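A minimal Python sketch of the 1-point crossover described above, assuming bit-string individuals represented as lists (the function name and signature are illustrative, not from the slides):

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Cut both equal-length parents at the same random point and swap their tails."""
    assert len(parent_a) == len(parent_b)
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

In a full GA cycle this operator would only be applied to a selected pair of parents with probability Pc; otherwise the parents are copied unchanged.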

  8. Knowledge representations • For nominal attributes • Ternary representation • GABIL representation • For real-valued attributes • Hyperrectangles • Decision tree • Synthetic prototypes • Others

  9. Ternary representation • Used by XCS (Michigan LCS) • Three-letter alphabet {0,1,#} for binary problems • # means “don’t care”, that is, the attribute is irrelevant • If A1=0 and A2=1 and A3 is irrelevant → class 0, encoded as 01#|0 • For non-binary nominal attributes: • {0, 1, 2, …, n, #} • Crossover and mutation act as in a classic GA
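A matching function for the ternary representation can be sketched in a few lines of Python (a hypothetical helper, assuming the condition and the instance are strings of equal length):

```python
def ternary_match(rule, instance):
    """Match a {0,1,#} condition against a binary instance.
    '#' (don't care) matches any value of that attribute."""
    return all(r == '#' or r == x for r, x in zip(rule, instance))
```

For example, the condition 01# matches both 010 and 011, since the third attribute is irrelevant.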

  10. GABIL representation • Predicate → Class • Predicate: Conjunctive Normal Form (CNF) (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm) • Ai: ith attribute • Vij: jth value of the ith attribute • The rules can be mapped into a binary string 1100|0010|1001|1 • 2 Variables: • Sky = {clear, partially cloudy, dark clouds} • Pressure = {Low, Medium, High} • 2 Classes: {no rain, rain} • Rule: If [sky is (partially cloudy or has dark clouds)] and [pressure is low] then predict rain • Genotype: “011|100|1”
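The GABIL matching semantics ("for every attribute, the bit of the observed value must be 1") can be sketched as follows; the function and argument names are illustrative, not from GABIL itself:

```python
def gabil_match(genotype_bits, instance_values, attribute_sizes):
    """genotype_bits: one 0/1 bit per value of each attribute (class bit excluded).
    instance_values: index of the observed value for each attribute.
    The rule matches iff, for every attribute, the bit of the observed value is 1."""
    offset = 0
    for size, value in zip(attribute_sizes, instance_values):
        if genotype_bits[offset + value] != 1:
            return False
        offset += size
    return True
```

Using the weather rule from the slide (predicate bits 011|100), "partially cloudy, low pressure" matches, while "clear sky, low pressure" does not.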

  11. Hyper-rectangle representation • The rule’s predicate encodes an interval for each of the dimensions of the domain, effectively generating a hyperrectangle • Different ways of encoding the interval • X < value, X > value, X in [l,u] • Encoding the actual bounds (UBR, NAX) • Encoding the interval as center±spread (XCSR) • What if u < l? • Flipping them (UBR) • Declaring the attribute as irrelevant (NAX) • Example (from figure): If (X < 0.25 and Y < 0.25) then …
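An interval match supporting both policies for swapped bounds might look like this sketch (the keyword argument and its values are illustrative assumptions):

```python
def hyperrect_match(intervals, instance, on_swapped="flip"):
    """intervals: one (lower, upper) pair per attribute.
    When lower > upper, either flip the bounds (UBR-style) or treat the
    attribute as irrelevant (NAX-style)."""
    for (lo, hi), x in zip(intervals, instance):
        if lo > hi:
            if on_swapped == "ignore":   # NAX: attribute declared irrelevant
                continue
            lo, hi = hi, lo              # UBR: flip the bounds
        if not (lo <= x <= hi):
            return False
    return True
```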

  12. Decision tree representation • Each individual literally encodes a complete decision tree [Llora, 02] • Only suitable for the Pittsburgh approach • Decision tree can be axis-parallel or oblique • Crossover • Exchange of sub-branches of a tree between parents • Mutation • Change of the definition of a node/leaf • Total replacement of a tree’s sub-branch

  13. Synthetic Prototypes representation [Llora, 02] • Each individual is a set of synthetic instances • These instances are used as the core of a nearest-neighbor classifier • Example prototypes (from figure): (-0.125,0,yellow), (0.125,0,red), (0,-0.125,blue), (0,0.125,green)

  14. Other representations for continuous problems • Hyperellipsoid representation (XCS) • Each rule encodes a (hyper)ellipse over the search space • Smooth, non-linear frontiers • Arbitrary rotation • Encoded as • Center • Stretches across dimensions • Rotation angles • Neural representation (XCS) • Each individual is a complete MLP, and evolution can change both the weights and the network topology

  15. Learning Paradigms • Different ways of generating a solution • Is each individual a rule or a rule set? • Is the solution the best individual, or the whole population? • Is the solution generated in a single GA run? • The Pittsburgh approach • The Michigan approach • The Iterative Rule Learning approach

  16. The Pittsburgh Approach • Each individual is a complete solution to the classification problem • Traditionally this means that each individual is a variable-length set of rules • The final solution is the best individual from the population after the GA run • Fitness function is based on the rule set accuracy on the training set (usually also on complexity) • GABIL [De Jong & Spears, 91] is a classic example

  17. Pittsburgh approach: recombination • Crossover operator • Mutation operator: classic GA mutation of bit inversion • [Diagram: two parent rule sets recombined into offspring]

  18. The Michigan Approach • Each individual (classifier) is a single rule • The whole population cooperates to solve the classification problem • A reinforcement learning system is used to identify the good rules • A GA is used to explore the search space for more rules • XCS [Wilson, 95] is the most well-known Michigan LCS

  19. The Michigan approach • What is Reinforcement Learning? • “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] • Rules will be evaluated example by example, receiving a positive/negative reward • Rule fitness will be updated incrementally with this reward • After enough trials, good rules should have high fitness

  20. Michigan system’s working cycle

  21. Iterative Rule Learning approach • This approach implements the separate-and-conquer method of rule learning • Each individual is a rule • A GA run ends up generating a single good rule • Examples covered by the rule are removed from the training set, and the process starts again • First used in evolutionary learning in the SIA system [Venturini, 93]
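The separate-and-conquer loop can be sketched generically; here `learn_best_rule` is a hypothetical callback standing in for a full GA run that returns the best rule together with the examples it covers:

```python
def iterative_rule_learning(examples, learn_best_rule):
    """Separate-and-conquer sketch: repeatedly learn one good rule, remove
    the examples it covers, and append it to the decision list."""
    rules = []
    remaining = list(examples)
    while remaining:
        rule, covered = learn_best_rule(remaining)
        if not covered:          # guard against an infinite loop
            break
        rules.append(rule)
        remaining = [e for e in remaining if e not in covered]
    return rules
```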

  22. The GAssist Pittsburgh LCS [Bacardit, 04] • Genetic clASSIfier SysTem • Designed with three aims • (1) Generate compact and accurate solutions • (2) Run-time reduction • (3) Be able to cope with both continuous and discrete data • Objectives achieved by several components (numbers indicate the aim addressed) • ADI rule representation (3) • Explicit default rule mechanism (1) • ILAS windowing scheme (2) • MDL-based fitness function (1) • Initialization policies (1) • Rule deletion operator (1)

  23. GAssist components in the GA cycle • Representation • ADI representation • Explicit default rule mechanism • [Diagram: GA cycle (initialization → evaluation → selection → crossover → mutation), annotated with the initialization policies, the MDL fitness function, the ILAS windowing scheme and the standard crossover/mutation operators]

  24. GAssist: Default Rule mechanism • When we encode this rule set as a decision list we can observe an interesting behavior: the emergent generation of a default rule • Using a default rule can help generate a more compact rule set • Easier to learn (smaller search space) • Potentially less sensitive to overlearning • To maximize these benefits, the knowledge representation is extended with an explicit default rule

  25. GAssist: Default Rule mechanism • What class is assigned to the default rule? • Simple policies such as using the majority/minority class are not robust enough • Automatic determination of default class • The initial population contains individuals with all default classes • Evolution will choose the correct default class • In the first few iterations the different default classes are isolated: each forms a separate subpopulation • Different default classes learn at different rates • Afterwards, restrictions are lifted and the system is free to pick the best policy

  26. GAssist: Initialisation policy • Initialization policy • Probability of a rule matching a random instance • In GABIL each gene associated with a value of an attribute is independent of the other values • Therefore the probability of matching an attribute equals the probability of value 1 when initializing the chromosome • Probability of a rule set matching a random instance

  27. GAssist: Initialisation policy • Initialization policy • How can we derive a formula to adjust P1? • We use an explicit default rule mechanism • If we suppose an equal class distribution, we have to make sure that we match all but one of the classes

  28. GAssist: Initialisation policy • Covering operator • Each time a new rule has to be created, an instance is sampled from the training set • The rule is created as a generalized version of the example • Makes sure it matches the example • It covers not just the examples, but a larger area of the search space • Two methods of sampling instances from the training set • Uniform probability for each instance • Class-wise sampling probability
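A covering operator for interval rules might look like the following sketch; the spread parameter and the uniform widening are illustrative assumptions, not GAssist's exact mechanism:

```python
import random

def covering(instance, max_spread=0.25, rng=random):
    """Create a rule predicate guaranteed to match `instance`: each attribute
    gets an interval around its value, widened by a random spread so the rule
    also covers a larger area of the search space."""
    return [(x - rng.uniform(0.0, max_spread), x + rng.uniform(0.0, max_spread))
            for x in instance]
```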

  29. GAssist: Rule deletion operator • Operator applied after the fitness computation • Rules that do not match any training example are eliminated • The operator leaves a small number of ‘dead’ rules in each individual, acting as protective neutral code • If crossover is applied over a dead rule it does no harm, as it cannot break a good rule • However, if too many dead rules are present, exploration is inefficient and the population loses diversity

  30. GAssist: ILAS windowing scheme • Windowing: use of a subset of examples to perform fitness computations • Incremental Learning with Alternating Strata (ILAS) • The mechanism uses a different subset of training examples in each GA iteration • [Diagram: training set of Ex examples divided into strata of size Ex/n (0, Ex/n, 2·Ex/n, 3·Ex/n, …, Ex), alternated across GA iterations]
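The stratum rotation of ILAS can be sketched as follows (function name illustrative): each GA iteration evaluates fitness only on one stratum, cycling through the strata round-robin:

```python
def ilas_stratum(examples, n_strata, iteration):
    """Return the subset (stratum) of the training set used for fitness at a
    given GA iteration; strata are visited in a round-robin fashion."""
    size = len(examples) // n_strata
    s = iteration % n_strata
    start = s * size
    end = len(examples) if s == n_strata - 1 else start + size
    return examples[start:end]
```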

  31. BioHEL [Bacardit et al, 09] • BIO-inspired HiErarchical Learning • Successor of GAssist, but changing paradigms: uses the Iterative Rule Learning approach • Created to overcome the scalability limitations of GAssist • It still employs • Default Rule (no auto policy) • ILAS windowing scheme

  32. BioHEL: fitness function • The fitness function definition is trickier than in GAssist, as it is impossible to have global control over the solution • As in any separate-and-conquer method, the system should favor rules that are • Accurate (do not make mistakes) • General (cover many examples) • These two objectives are contradictory, especially in real-world problems: the easiest way of increasing accuracy is to create very specific rules • BioHEL redefines coverage as a piece-wise function, which rewards rules that cover at least a certain fraction of the training set

  33. BioHEL: fitness function • Coverage term penalizes rules that do not cover a minimum percentage of examples • Choice of the coverage break is crucial for the proper performance of the system
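The shape of such a piece-wise coverage term can be illustrated with the sketch below; the constants and the exact formula are assumptions chosen for illustration, not BioHEL's actual definition:

```python
def coverage_term(covered, total, cov_break=0.1):
    """Illustrative piece-wise coverage reward: below the coverage break the
    reward ramps up steeply; above it, it grows only slowly, pushing rules to
    cover at least a `cov_break` fraction of the training set."""
    c = covered / total
    if c < cov_break:
        return 0.9 * (c / cov_break)                          # steep ramp
    return 0.9 + 0.1 * (c - cov_break) / (1 - cov_break)      # gentle slope
```

Choosing `cov_break` plays the role of the coverage break mentioned above: too low and over-specific rules are not penalized, too high and accurate niche rules are suppressed.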

  34. BioHEL: ALKR • The Attribute List Knowledge Representation (ALKR) • This representation exploits a very frequent situation • In high-dimensionality domains it is usual that each rule only uses a very small subset of the attributes • Example of a rule for a bioinformatics dataset [Bacardit and Krasnogor, 2009] • Leu-2 ∈ [-0.51,7] and Glu ∈ [0.19,8] and Asp+1 ∈ [-5.01,2.67] and Met+1 ∈ [-3.98,10] and Pro+2 ∈ [-7,-4.02] and Pro+3 ∈ [-7,-1.89] and Trp+3 ∈ [-8,13] and Glu+4 ∈ [0.70,5.52] and Lys+4 ∈ [-0.43,4.94] → alpha • Only 9 attributes out of 300 actually appear in the rule

  35. BioHEL: ALKR • Naive match procedure:
      Function match(instance x, rule r)
        Foreach attribute att in the domain
          If att is relevant in rule r and (x.att < r.att.lower or x.att > r.att.upper)
            Return false
          EndIf
        EndFor
        Return true
      • Given the previous example of a rule, 293 iterations of this loop are wasted! • Can we get rid of them?

  36. BioHEL: ALKR • ALKR automatically identifies the relevant attributes in the domain for each rule and tracks only them
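An ALKR-style match over an attribute list can be sketched as follows; the `(attribute_index, lower, upper)` triple layout is an illustrative assumption about how the list might be stored:

```python
def alkr_match(attribute_list, instance):
    """ALKR-style match: the rule stores only its relevant attributes as
    (attribute_index, lower, upper) triples, so the loop touches a handful
    of attributes instead of the whole domain."""
    for idx, lo, hi in attribute_list:
        if instance[idx] < lo or instance[idx] > hi:
            return False
    return True
```

For the 300-attribute rule above, this loop would run 9 iterations instead of 300.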

  37. BioHEL’s ALKR • Simulated 1-point crossover

  38. BioHEL: ALKR • In ALKR two operators (specialize and generalize) add or remove attributes from the list with a given probability, hence exploring the rule-wise space of relevant attributes • The ALKR match process is more efficient; however, exploration is costlier and requires two extra operators • Since the ALKR chromosome only contains relevant information, the exploration process is more efficient

  39. BioHEL: CUDA-based fitness computation • NVIDIA’s Compute Unified Device Architecture (CUDA) is a parallel computing architecture that exploits the capacity of NVIDIA’s Graphics Processing Units • CUDA runs thousands of threads at the same time, following the Single Program, Multiple Data paradigm • In the last few years GPUs have been extensively used in the evolutionary computation field • Many papers and applications are available at http://www.gpgpgpu.com • Using GPGPUs in machine learning is more challenging because it deals with more data, but this also means it is potentially more parallelizable

  40. CUDA architecture

  41. CUDA memory management • Different types of memory with different access speeds • Global memory (slow and large) • Shared memory (block-wise; fast but quite small) • Constant memory (very fast but very small) • Memory is limited • Memory copy operations take a considerable amount of execution time • Since we aim to work with large-scale datasets, a good strategy to minimize execution time must manage memory usage carefully

  42. CUDA for matching a set of rules • The match process is the most computationally expensive stage • However, performing only the match inside the GPU means downloading from the card a structure of size O(N×M) (N = population size, M = training set size) • In most cases we don’t need to know the specific matches of a classifier, just how many (reduce the data) • Performing the second stage also inside the GPU allows the system to reduce the memory traffic to O(N)
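The O(N×M) → O(N) reduction idea can be illustrated in serial Python: return one match count per classifier instead of the full match matrix (names are illustrative; on the GPU the per-rule reduction would run as a second kernel before the copy back to the host):

```python
def match_counts(rules, examples, match):
    """Reduce each rule's row of the N x M match matrix to a single count,
    so only O(N) numbers need to leave the device."""
    return [sum(1 for x in examples if match(r, x)) for r in rules]
```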

  43. CUDA in BioHEL

  44. Performance of CUDA alone • We used CUDA on a Tesla C1060 card with 4GB of global memory, and compared the run-time to that of an Intel Xeon E5472 3.0GHz processor • Biggest speedups obtained in large problems (|T| or #Att), especially in domains with continuous attributes • Run time for the largest dataset reduced from 2 weeks to 8 hours

  45. CUDA fitness in combination with ILAS • The speedups of CUDA and ILAS are cumulative

  46. Resources • A very thorough survey on GBML is available here • Thesis of Martin Butz on XCS, including theoretical models and advanced exploration methods (later a book) • My thesis, about GAssist (code) • Complete description of BioHEL (code)

  47. Questions?
