Learning Bayesian Networks through evolution

Rotem Golan, Department of Computer Science, Ben-Gurion University of the Negev, Israel

Outline What is a Bayesian Network? Competition overview The three dimensions genetic algorithm Adding the fourth dimension The Big picture References

Definition • A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). • For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

A long story • You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. • You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. • John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. • Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether. • Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

A short representation
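The burglary story can be turned into numbers with a few lines of Python. This is a minimal sketch that assumes the CPT values commonly quoted from the Russell and Norvig textbook example; they are not given on the slides, and the helper names (bern, joint, posterior_burglary) are hypothetical.

```python
from itertools import product

# Textbook CPT values for the burglary network (assumed, not from the slides).
P_B = 0.001                                   # P(burglary)
P_E = 0.002                                   # P(earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(alarm | burglary, earthquake)
P_J = {True: 0.90, False: 0.05}               # P(John calls | alarm)
P_M = {True: 0.70, False: 0.01}               # P(Mary calls | alarm)


def bern(p, value):
    """Probability that a Boolean variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(b, e, a, j, m):
    """Joint probability as the product of one entry from each node's CPT."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def posterior_burglary(j, m):
    """P(burglary | John = j, Mary = m), summing out earthquake and alarm."""
    score = {b: sum(joint(b, e, a, j, m)
                    for e, a in product([True, False], repeat=2))
             for b in (True, False)}
    return score[True] / (score[True] + score[False])


p_both_call = posterior_burglary(True, True)
print(round(p_both_call, 3))
```

With these assumed values, the probability of a burglary when both neighbors call comes out at roughly 0.28, far above the 0.001 prior but still well below certainty.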

Observations • In our algorithm, all the values of the network are known except the genre value, which we would like to estimate. • The variables in our algorithm are continuous and not Boolean (except the genre variable). • We divide the possible values of each variable into fixed-size intervals. • The number of intervals changes throughout the evolution. • We refer to this process as the discretization of the variable. • We refer to the Conditional Probability Table of each variable (node) as its CPT.
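Fixed-size discretization can be sketched as follows. The function name discretize and the choice to clamp out-of-range values into the first or last interval are assumptions made for illustration; the slides do not say how the range of each variable is obtained.

```python
def discretize(value, lo, hi, n_intervals):
    """Map a continuous value to an interval index in 0 .. n_intervals - 1.

    [lo, hi] is the assumed range of the variable (for example, its minimum
    and maximum over the training set), split into equal-width intervals.
    """
    if hi <= lo:
        return 0  # degenerate range: everything falls into one interval
    width = (hi - lo) / n_intervals
    index = int((value - lo) / width)
    # Clamp so that value == hi, or a value outside the range, stays valid.
    return min(max(index, 0), n_intervals - 1)


bins = [discretize(v, 0.0, 10.0, 5) for v in (0.0, 1.9, 2.0, 9.99, 10.0, 12.5)]
print(bins)
```

Changing n_intervals for a variable is then the only thing the evolution needs to touch in order to change that variable's discretization.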

Naïve Bayesian Network

Bayesian Network construction • Once we have determined the chosen variables (amount and choice), their fixed discretization and the structure of the graph, we can easily compute the CPT values for each of the nodes in the graph (according to the training set). • For each vector in the training set, we update all the network’s CPTs by increasing the appropriate entry by one. • After this process, we divide each value by the sum of its row (normalization).
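The count-and-normalize procedure above can be sketched for the naive structure, where the genre node is the only parent of every variable. The function name build_cpts and the data layout (one list of interval indices per vector) are assumptions for illustration only.

```python
from collections import Counter


def build_cpts(vectors, genres, n_bins):
    """Build CPTs for a naive network by counting and normalizing.

    vectors: already-discretized feature vectors (lists of interval indices)
    genres:  the known genre of each vector
    n_bins:  number of intervals of each feature
    Returns (prior, cpts), where cpts[f][g][b] estimates P(feature f = b | genre g).
    """
    labels = sorted(set(genres))
    counts = [{g: [0] * n_bins[f] for g in labels} for f in range(len(n_bins))]
    for vec, g in zip(vectors, genres):
        for f, b in enumerate(vec):
            counts[f][g][b] += 1           # increase the matching entry by one
    # Normalization: divide each entry by the sum of its row.
    cpts = [{g: [c / sum(row) for c in row] for g, row in table.items()}
            for table in counts]
    prior = {g: n / len(genres) for g, n in Counter(genres).items()}
    return prior, cpts


prior, cpts = build_cpts([[0, 1], [0, 0], [1, 1]], ["Rock", "Rock", "Jazz"], [2, 2])
print(cpts[1]["Rock"])
```

Note that the Jazz rows in this toy example already contain zero entries, which is the issue the later slides on zeroes address.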

Exact Inference in Bayesian Networks • For each vector in the verification/test set, we compute six different probabilities (multiplying the appropriate entries of all the network’s CPTs) and choose the highest one as the genre of this vector. • Each probability is for a different assumption on the genre variable value (Rock, Pop, Blues, Jazz, Classical and Metal). • We will discuss the issue of zeroes in the CPTs later on.
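For the naive structure, the classification step can be sketched as below. The function name classify and the toy CPT values are hypothetical, and only two genres are used to keep the example short.

```python
def classify(vector, prior, cpts):
    """Return the genre whose product of CPT entries is the highest.

    vector: discretized feature values (interval indices)
    prior:  CPT of the genre node, genre -> probability
    cpts:   cpts[f][genre][b] = P(feature f = b | genre)
    """
    best_genre, best_score = None, -1.0
    for genre, p in prior.items():
        score = p
        for f, b in enumerate(vector):
            score *= cpts[f][genre][b]     # one entry from each node's CPT
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


# Toy network with two features and two genres (made-up numbers).
toy_prior = {"Rock": 0.5, "Jazz": 0.5}
toy_cpts = [{"Rock": [0.8, 0.2], "Jazz": [0.3, 0.7]},
            {"Rock": [0.6, 0.4], "Jazz": [0.1, 0.9]}]
labels = [classify(v, toy_prior, toy_cpts) for v in ([0, 0], [1, 1])]
print(labels)
```

With 191 candidate variables the product can become very small; summing logarithms instead of multiplying is a common variant, although the slides describe plain multiplication.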

Competition overview • A database of 60 music performers has been prepared for the competition. • The material is divided into six categories: classical music, jazz, blues, pop, rock and heavy metal. • For each of the performers 15-20 music pieces have been collected. • All music pieces are partitioned into 20 segments and parameterized. • The feature vector consists of 191 parameters.

Competition overview (Cont.) • Our goal is to estimate the music genre of newly given fragments of music tracks. • Input: • A training set of 12,495 vectors and their genre • A test set of 10,269 vectors without their genre • Output: 10,269 labels (Classical, Jazz, Rock, Blues, Metal or Pop). One for each vector in the test set. • The metric used for evaluating the solutions is standard accuracy, i.e. the ratio of the correctly classified samples to the total number of samples.

Preprocessing • I divided the training set into two sets. • A training set – used for constructing each Bayesian Network in the population. • A verification set – used for computing the fitness of each network in the population. • These sets have the same number of vectors for each category (Rock vectors, Pop vectors, etc.).
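One way to obtain such a split is to divide each category separately. The sketch below assumes an even 50/50 split and random shuffling; neither the ratio nor the function name stratified_split comes from the slides.

```python
import random
from collections import defaultdict


def stratified_split(vectors, genres, train_fraction=0.5, seed=0):
    """Split labeled vectors so that each genre is divided in the same ratio."""
    by_genre = defaultdict(list)
    for vec, g in zip(vectors, genres):
        by_genre[g].append(vec)
    rng = random.Random(seed)
    train, verify = [], []
    for g, items in by_genre.items():
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train += [(v, g) for v in items[:cut]]     # used to build the CPTs
        verify += [(v, g) for v in items[cut:]]    # used to compute fitness
    return train, verify


toy_vectors = [[i] for i in range(8)]
toy_genres = ["Rock"] * 4 + ["Pop"] * 4
train, verify = stratified_split(toy_vectors, toy_genres)
print(len(train), len(verify))
```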

The three dimensions genetic algorithm • The three dimensions are: • Variables amount. • Variables choice. • Fixed discretization of the variables. • Every network in the population is a Naïve Bayesian Network, which means that its structure is already determined.

Fitness function • In order to compute the fitness of a network, we estimate the genre of each vector in the verification set and compare it to its known genre. • The metric used for computing the fitness is standard accuracy, i.e. the ratio of the correctly classified vectors to the total number of vectors in the verification set.
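The accuracy metric itself is a one-liner. A minimal sketch, with the hypothetical function name fitness:

```python
def fitness(predicted, actual):
    """Standard accuracy: correctly classified vectors / all vectors."""
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


score = fitness(["Rock", "Pop", "Jazz", "Rock"], ["Rock", "Pop", "Blues", "Metal"])
print(score)
```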

Selection • In each generation, we choose population_size/2 different networks at most. • We prefer networks that have the highest fitness and are distinct from each other. • After choosing these networks we use them to build a fully sized population by mutating each one of them. • We use bitwise mutation to do so. • Notice that we may use a mutated network to generate a new mutated network.
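The selection step can be sketched as follows. Individuals are represented here as tuples of bits only to keep the example short; the names next_generation and flip_one, and the uniform choice of which network to mutate, are assumptions for illustration.

```python
import random


def next_generation(population, fitness_of, mutate, rng):
    """Keep at most half of the population, then refill it by mutation."""
    size = len(population)
    survivors = []
    # Prefer the fittest networks, skipping exact duplicates.
    for ind in sorted(population, key=fitness_of, reverse=True):
        if ind not in survivors:
            survivors.append(ind)
        if len(survivors) == size // 2:
            break
    new_population = list(survivors)
    while len(new_population) < size:
        # The parent may itself be a network created by mutation.
        parent = rng.choice(new_population)
        new_population.append(mutate(parent, rng))
    return new_population


def flip_one(ind, rng):
    """Toy mutation: flip one randomly chosen bit."""
    i = rng.randrange(len(ind))
    return ind[:i] + (1 - ind[i],) + ind[i + 1:]


pop = [(1, 1, 0), (1, 1, 0), (0, 0, 0), (1, 0, 0)]
new_pop = next_generation(pop, sum, flip_one, random.Random(1))
print(new_pop[:2])
```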

Mutation • Bitwise mutation. • Parent: • BitSet • Dis • Child: • BitSet • Dis
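A possible reading of bitwise mutation on the two arrays named on the slide is sketched below: BitSet marks which variables are used, and Dis holds the number of intervals of each variable. The mutation rate, the redraw of a Dis entry within its allowed range, and the rule that at least one variable stays selected are assumptions, not details from the slides.

```python
import random


def mutate(bitset, dis, rate, dis_range, rng):
    """Return a mutated copy of (bitset, dis); the parent is left unchanged."""
    lo, hi = dis_range
    # Flip each bit independently with probability `rate`.
    child_bits = [(not b) if rng.random() < rate else b for b in bitset]
    # Redraw each interval count independently with probability `rate`.
    child_dis = [rng.randint(lo, hi) if rng.random() < rate else d for d in dis]
    if not any(child_bits):
        # Assumed rule: a network must keep at least one variable.
        child_bits[rng.randrange(len(child_bits))] = True
    return child_bits, child_dis


parent_bits = [True, False, True, True]
parent_dis = [5, 7, 9, 15]
same_bits, same_dis = mutate(parent_bits, parent_dis, 0.0, (5, 15), random.Random(0))
child_bits, child_dis = mutate(parent_bits, parent_dis, 1.0, (5, 15), random.Random(0))
print(child_bits, child_dis)
```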

Crossover • Single point crossover. • Parent 1: • Parent 2: • Child 1: • Child2:
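Single-point crossover on two equal-length arrays can be sketched like this; the function name is hypothetical, and the same operation would apply to both the BitSet and the discretization array.

```python
import random


def single_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length lists at one random cut point."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have the same length of at least 2")
    point = rng.randrange(1, len(parent1))     # cut strictly inside the list
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


c1, c2 = single_point_crossover([1] * 6, [0] * 6, random.Random(0))
print(c1, c2)
```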

Results • Model - Naive Bayesian • Population size - 40 • Generations - 400 • Variables - [1,191] • discretization - [5,15] • First population score (verification set) - 0.7756 • Best score (verification set) - 0.8327 • Website’s score (test set) - 0.7031 • “Zeroes” = 0

Observation • Notice that as the number of discretization intervals increases, the CPTs of the network get bigger. • The number of vectors in the training set is fixed, so we get more zero entries in the CPTs. • These zeroes can harm the computation of the different genre probabilities: a single zero entry makes the whole product zero. • As a solution, for each node, we take the minimum nonzero value of its CPT, divide it by 10 and replace all zeroes in that CPT with the result.
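The zero-replacement rule can be sketched as below. The sketch assumes that the minimum is taken over the nonzero entries of the CPT, since the overall minimum of a CPT containing zeroes would itself be zero; the function name replace_zeroes is hypothetical.

```python
def replace_zeroes(cpt):
    """Replace every zero in a CPT by (smallest nonzero entry) / 10.

    cpt: list of rows, each row a list of probabilities.
    """
    nonzero = [p for row in cpt for p in row if p > 0]
    if not nonzero:
        return [list(row) for row in cpt]      # nothing to scale from
    floor = min(nonzero) / 10
    return [[p if p > 0 else floor for p in row] for row in cpt]


smoothed = replace_zeroes([[0.5, 0.5, 0.0], [0.2, 0.0, 0.8]])
print(smoothed)
```

After the replacement the rows no longer sum exactly to one. This does not affect which genre gets the highest product much, but it means the products are scores rather than exact probabilities.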

Results (Cont.) • Model - Naive Bayesian • Population size - 40 • Generations - 400 • Variables - [1,191] • discretization - [5,15] • First population score - 0.7878 • Best score - 0.8415 • Website’s score - 0.7317 • “Zeroes” = cpt_min/10

Observation • Notice that there is a difference of approximately 10% between my score on the verification set and the website’s score on the test set. • We will discuss this issue later on.

Adding the fourth dimension • The fourth dimension is the structure of the Bayesian Network. • Now, the population includes different Bayesian Networks, meaning networks with different structures, variable choices, variable amounts and discretization arrays.

Initial population • The networks in the initial population are distributed uniformly in the search space. • I’ve noticed that the algorithm tends to keep networks with a high number of variables. • Therefore, when generating the initial population, I increased the probability for getting networks with a low number of variables.
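The slides do not say how the probability of small networks was increased. One possible way, shown purely as an assumption, is to draw the number of variables from a distribution skewed toward low values by raising a uniform random number to a power; the names random_individual and skew are hypothetical.

```python
import random


def random_individual(n_variables, dis_range, rng, skew=2.0):
    """Draw (bitset, dis) with a bias toward few selected variables."""
    lo, hi = dis_range
    # u ** skew with u uniform in [0, 1) is concentrated near 0 for skew > 1.
    k = 1 + int(n_variables * rng.random() ** skew)     # 1 <= k <= n_variables
    chosen = set(rng.sample(range(n_variables), k))
    bitset = [i in chosen for i in range(n_variables)]
    dis = [rng.randint(lo, hi) for _ in range(n_variables)]
    return bitset, dis


rng = random.Random(0)
population = [random_individual(191, (2, 10), rng) for _ in range(200)]
sizes = sorted(sum(bits) for bits, _ in population)
print(sizes[0], sizes[len(sizes) // 2], sizes[-1])
```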

Evolution operations • The selection process is the same as in the previous algorithm. • The crossover and mutation are similar. • First, we start like the previous algorithm (Handling the BitSet and the discretization array) • Then, we add all the edges we can from the parent (mutation) or parents (crossover) to the child’s graph. • Finally, we make sure that the child’s graph is a connected acyclic graph.
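The edge-inheritance step can be sketched as follows: an edge from a parent is copied only if both endpoints are variables of the child and the edge does not close a directed cycle. The function names are hypothetical, and the final repair that makes the graph connected is not shown, since the slides do not describe it.

```python
def creates_cycle(edges, src, dst):
    """True if adding the edge src -> dst would close a directed cycle."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True                    # dst already reaches src
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for parent, child in edges if parent == node)
    return False


def inherit_edges(child_nodes, parent_edges):
    """Copy as many parent edges as possible while keeping the graph acyclic."""
    edges = []
    for src, dst in parent_edges:
        if src not in child_nodes or dst not in child_nodes:
            continue                       # endpoint was not selected in the child
        if (src, dst) in edges or creates_cycle(edges, src, dst):
            continue
        edges.append((src, dst))
    return edges


nodes = {"genre", "a", "b", "c"}
candidate_edges = [("genre", "a"), ("a", "b"), ("b", "a"),
                   ("b", "c"), ("c", "genre"), ("a", "d")]
child_edges = inherit_edges(nodes, candidate_edges)
print(child_edges)
```

For crossover, parent_edges would simply be the concatenation of the edge lists of both parents.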

Results • Model - Bayesian Network • Population size – 20 • Generations – Crashed on generation 104 • Variables - [1,191] • discretization - [2,6] • First population score - 0.4920 • Best score - ~0.8559 • Website’s score – Not available, since the run crashed.

Memory problems • The program was executed on amdsrv3, with a 4.5 GB memory limit. • Even though the discretization interval was only [2,6], the program crashed with a Java heap space error. • As a result, I decided to decrease the population size from 20 to 10.

Results (Cont.) • Model - Bayesian Network • Population size – 10 • Generations – 800 • Variables - [1,191] • discretization - [2,10] • First population score - 0.5463 • Best score - 0.8686 • Website’s score - 0.7085

Results (Cont.) • Model - Bayesian Network • Population size – 10 • Generations – 800 • Variables - [1,191] • discretization - [2,20] • First population score - 0.5978 • Best score - 0.8708 • Website’s score - 0.6972

Overfitting • As we increase the discretization interval, my score increases and the website’s score decreases. • One explanation can be that increasing the search space may cause the algorithm to find patterns that are strongly correlated with the specific input data I received but have no correlation at all with the real-life data. • One possible solution can be to replace the training set or the verification set while the algorithm is running. • The problem is that we don’t have enough input data to do so.

Final competition scores

My score

The big picture • In order to really find patterns that describe the real-life data, we have to find the best probabilistic model to represent this data. • Choosing the probabilistic model and building it are key factors in achieving such a goal. • The field of Data Mining suggests numerous techniques such as association rules, decision trees, frequent sequences, Markov Networks and clustering in order to build different classifiers.

The big picture (Cont.) • The Bayesian network classifier seems like a good tool at first, but it might miss some patterns that are vital for the perfect classifier. • These patterns might be identified using other classifiers.

Ideas for improvement • Increasing the parameters (population size, discretization range) may yield better results, but it also makes the program crash. • Therefore, more memory-efficient programming or maybe parallel computing might help overcome this problem. • Instead of using fixed-size discretization, we might want to use a more complex discretization technique such as clustering. • The idea is that dividing a variable into intervals of different sizes, or even into non-contiguous intervals, may let this variable improve the entire probabilistic model.

Ideas for improvement (Cont.) • We might want to use bitwise crossover instead of single point crossover, since the order of the vector’s variables is insignificant.
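Bitwise crossover, often called uniform crossover, decides for each position independently which parent it comes from. A minimal sketch, assuming a 50/50 choice per position and a hypothetical function name:

```python
import random


def uniform_crossover(parent1, parent2, rng):
    """Build two children by swapping each position with probability 0.5."""
    child1, child2 = [], []
    for g1, g2 in zip(parent1, parent2):
        if rng.random() < 0.5:
            g1, g2 = g2, g1                # swap this position only
        child1.append(g1)
        child2.append(g2)
    return child1, child2


u1, u2 = uniform_crossover([1] * 8, [0] * 8, random.Random(0))
print(u1, u2)
```

Unlike single point crossover, the result does not depend on which variables happen to be neighbors in the vector.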

References • Artificial Intelligence – A Modern Approach, Stuart Russell and Peter Norvig (Second edition). • Contest website: • http://tunedit.org/challenge/music-retrieval/genres • Battery power example: • http://www.bayesia.com
