
Bayesian Learning



  1. Bayesian Learning • Introduction • Bayes Theorem • Maximum Likelihood Estimation • Bayes Optimal Classifier and Naïve Bayes • Bayesian Belief Networks

  2. Bayesian Belief Networks A Bayesian Belief Network is a method to describe the joint probability distribution of a set of variables. Let x1, x2, …, xn be a set of variables or features. A Bayesian Belief Network (BBN) tells us the probability of any combination of values of x1, x2, …, xn. As with Naïve Bayes, we make some independence assumptions, but they are weaker than assuming that all variables are independent of one another.

  3. An Example A set of Boolean variables and their relations. [Figure: a directed acyclic graph over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup point to Campfire, Storm points to Lightning, Lightning points to Thunder, and Lightning, Storm, and Campfire point to ForestFire.]

  4. Conditional Probabilities The table gives P(Campfire | Storm, BusTourGroup) for each assignment of the parents (S = Storm, B = BusTourGroup, C = Campfire):

           S,B    S,~B   ~S,B   ~S,~B
      C    0.4    0.1    0.8    0.2
      ~C   0.6    0.9    0.2    0.8

  [Figure: Storm and BusTourGroup with arrows into Campfire.]
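One way to make the table concrete: a minimal Python sketch that stores the CPT above as a dictionary keyed by the parent assignment. The names (campfire_cpt, p_campfire) are ours, not from the slides.

```python
# The Campfire CPT from the slide, keyed by (Storm, BusTourGroup).
# Values are P(Campfire = True | parents); the False case is the complement.
campfire_cpt = {
    (True,  True):  0.4,   # S,  B
    (True,  False): 0.1,   # S,  ~B
    (False, True):  0.8,   # ~S, B
    (False, False): 0.2,   # ~S, ~B
}

def p_campfire(value, storm, bus_tour_group):
    """Return P(Campfire = value | Storm, BusTourGroup)."""
    p_true = campfire_cpt[(storm, bus_tour_group)]
    return p_true if value else 1.0 - p_true

print(p_campfire(True, True, True))   # 0.4, the S,B column
```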

  5. Conditional Independence We say x1 is conditionally independent of x2 given x3 if the probability of x1 is independent of x2 once x3 is given: P(x1|x2,x3) = P(x1|x3). The same can be said for sets of variables: x1,x2,x3 is conditionally independent of y1,y2,y3 given z1,z2,z3 if: P(x1,x2,x3|y1,y2,y3,z1,z2,z3) = P(x1,x2,x3|z1,z2,z3)
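To see the definition numerically, here is a small sketch with a made-up joint distribution over three Boolean variables that is conditionally independent by construction; it checks that P(x1|x2,x3) equals P(x1|x3).

```python
# Made-up distributions; x1 and x2 are conditionally independent
# given x3 by construction of the joint below.
p_x3 = {True: 0.3, False: 0.7}
p_x1_given_x3 = {True: 0.9, False: 0.2}   # P(x1=True | x3)
p_x2_given_x3 = {True: 0.6, False: 0.4}   # P(x2=True | x3)

def joint(x1, x2, x3):
    """P(x1, x2, x3) = P(x3) P(x1|x3) P(x2|x3)."""
    p1 = p_x1_given_x3[x3] if x1 else 1 - p_x1_given_x3[x3]
    p2 = p_x2_given_x3[x3] if x2 else 1 - p_x2_given_x3[x3]
    return p_x3[x3] * p1 * p2

# P(x1=True | x2=True, x3=True) comes out to 0.9 = P(x1=True | x3=True).
num = joint(True, True, True)
den = num + joint(False, True, True)
print(num / den)   # 0.9
```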

  6. Representation A BBN represents the joint probability distribution of a set of variables by explicitly indicating the assumptions of conditional independence through the following: • A directed acyclic graph, and • Local conditional probabilities. [Figure: the Storm/BusTourGroup/Campfire fragment, with one node marked "variable" and its table marked "conditional probabilities".]

  7. Representation Each variable is conditionally independent of its nondescendants given its immediate predecessors. We say x1 is a descendant of x2 if there is a directed path from x2 to x1. Example: the predecessors of Campfire are Storm and BusTourGroup (Campfire is a descendant of these two variables). Campfire is conditionally independent of Lightning given its predecessors.

  8. Joint Probability Distribution To compute the joint probability distribution of a set of variables given a Bayesian Belief Network, we simply use the following formula: P(x1,x2,…,xn) = Πi P(xi | Parents(xi)), where Parents(xi) are the immediate predecessors of xi in the graph.

  9. Joint Probability Distribution Example: P(Campfire, Storm, BusTourGroup, Lightning, Thunder, ForestFire) = P(Storm) P(BusTourGroup) P(Campfire|Storm,BusTourGroup) P(Lightning|Storm) P(Thunder|Lightning) P(ForestFire|Lightning,Storm,Campfire).
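A runnable sketch of this factorization follows. Only the Campfire CPT comes from the slides; every other number is an illustrative placeholder, since the slides do not give the remaining tables.

```python
# Local probabilities for the network. Only p_campfire matches the
# slides; the rest are made-up placeholders for illustration.
p_storm = 0.1                              # P(Storm = True), assumed
p_bus = 0.5                                # P(BusTourGroup = True), assumed
p_lightning = {True: 0.7, False: 0.01}     # P(Lightning=T | Storm), assumed
p_thunder = {True: 0.95, False: 0.05}      # P(Thunder=T | Lightning), assumed
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}   # from slide 4
p_fire = {(True, True, True): 0.9,  (True, True, False): 0.8,
          (True, False, True): 0.7, (True, False, False): 0.6,
          (False, True, True): 0.3, (False, True, False): 0.2,
          (False, False, True): 0.1, (False, False, False): 0.01}
# p_fire keys are (Lightning, Storm, Campfire); values are
# P(ForestFire=T | L, S, C), assumed.

def bernoulli(p_true, value):
    """P(X = value) when P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, b, c, l, t, f):
    """P(Storm=s, Bus=b, Campfire=c, Lightning=l, Thunder=t, Fire=f)."""
    return (bernoulli(p_storm, s) * bernoulli(p_bus, b)
            * bernoulli(p_campfire[(s, b)], c)
            * bernoulli(p_lightning[s], l)
            * bernoulli(p_thunder[l], t)
            * bernoulli(p_fire[(l, s, c)], f))

print(joint(True, True, True, True, True, True))   # one entry of the full joint
```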

  10. Joint Distribution, An Example [Figure: the full network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, annotated with its factorization:] P(Storm) P(BusTourGroup) P(Campfire|Storm,BusTourGroup) P(Lightning|Storm) P(Thunder|Lightning) P(ForestFire|Lightning,Storm,Campfire).

  11. Conditional Probabilities, An Example Using the Campfire table again:

           S,B    S,~B   ~S,B   ~S,~B
      C    0.4    0.1    0.8    0.2
      ~C   0.6    0.9    0.2    0.8

  Reading off the S,B column: P(Campfire=true | Storm=true, BusTourGroup=true) = 0.4.

  12. Inference and Learning What is the connection between a BBN and classification? Suppose one of the variables is the target variable. Can we compute the probability of the target variable given the other variables? In Naïve Bayes: [Figure: class node C with arrows to each feature x1, x2, …, xn.] P(x1,x2,…,xn,c) = P(c) P(x1|c) P(x2|c) … P(xn|c)
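As a small illustration of this factorization, the sketch below computes P(x1, …, xn, c) for two Boolean features; all probabilities here are made up.

```python
# Illustrative Naive Bayes parameters (not from the slides).
p_c = {True: 0.6, False: 0.4}          # class prior P(c)
p_x_given_c = [                        # one table per feature
    {True: 0.8, False: 0.3},           # P(x1=True | c)
    {True: 0.1, False: 0.5},           # P(x2=True | c)
]

def naive_bayes_joint(x, c):
    """P(x1, ..., xn, c) = P(c) * prod_i P(xi | c)."""
    prob = p_c[c]
    for xi, table in zip(x, p_x_given_c):
        p_true = table[c]
        prob *= p_true if xi else 1.0 - p_true
    return prob

# Classify by comparing the joint for c=True against c=False.
x = (True, False)
print(max([True, False], key=lambda c: naive_bayes_joint(x, c)))   # True
```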

  13. General Case In the general case we can use a BBN to specify independence assumptions among the variables. General case: [Figure: class node C and features x1, x2, x3, x4, with additional edges among the features.] P(x1,x2,x3,x4,c) = P(c) P(x1|c) P(x2|c) P(x3|x1,x2,c) P(x4|c)

  14. Learning Belief Networks We can learn a BBN in different ways. Two basic approaches follow: • Assume we know the network structure: we can then estimate the conditional probabilities for each variable from the data (a counting sketch follows below). • Assume we know part of the structure but some variables are missing: this is like learning hidden units in a neural network; one can use a gradient ascent method to train the BBN.
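For the first approach (known structure, complete data), estimating each local table reduces to counting. A minimal sketch with made-up records, estimating P(Campfire | Storm, BusTourGroup):

```python
from collections import Counter

# Made-up complete records of (storm, bus_tour_group, campfire).
data = [
    (True, True, True), (True, True, False), (True, True, True),
    (False, True, True), (False, True, True), (False, True, False),
]

parent_counts = Counter((s, b) for s, b, _ in data)      # N(S, B)
true_counts = Counter((s, b) for s, b, c in data if c)   # N(S, B, C=True)

def estimate_p_campfire(storm, bus):
    """Maximum likelihood estimate of P(Campfire=True | Storm, BusTourGroup)."""
    return true_counts[(storm, bus)] / parent_counts[(storm, bus)]

print(estimate_p_campfire(True, True))   # 2/3 on the toy data above
```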

  15. Learning Belief Networks • Assume nothing is known: we can learn both the structure and the conditional probabilities by searching the space of possible networks.

  16. The EM Algorithm • EM is a method to learn in the presence of unobserved variables. • It is based on the following idea. Let x1, x2, …, xn be a set of hidden variables whose actual values we do not know. • Start with some hypothesis h about the model. • Iterate: calculate the expected values of x1, x2, …, xn assuming hypothesis h is true; then calculate a new hypothesis h' assuming the value taken by each hidden variable is its expected value.
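A compact sketch of this loop on a classic toy problem: two coins of unknown bias, where the coin used in each trial is the hidden variable. The data, the starting hypothesis, and the equal prior over the two coins are all assumptions for illustration.

```python
import math

data = [9, 8, 4, 5, 7, 3, 8, 9, 2, 4]   # heads observed out of n=10 flips per trial
n = 10
theta = [0.6, 0.5]                      # initial hypothesis h = (bias of coin A, coin B)

def likelihood(heads, p):
    """Binomial probability of `heads` heads in n flips with bias p."""
    return math.comb(n, heads) * p**heads * (1 - p)**(n - heads)

for _ in range(20):
    # E-step: expected responsibility of coin A for each trial under h
    # (the two coins are assumed a priori equally likely).
    resp = []
    for heads in data:
        la, lb = likelihood(heads, theta[0]), likelihood(heads, theta[1])
        resp.append(la / (la + lb))
    # M-step: new hypothesis h' from the expected head and flip counts.
    heads_a = sum(r * k for r, k in zip(resp, data))
    heads_b = sum((1 - r) * k for r, k in zip(resp, data))
    flips_a = sum(r * n for r in resp)
    flips_b = n * len(data) - flips_a
    theta = [heads_a / flips_a, heads_b / flips_b]

print(theta)   # the two biases separate toward the clusters in the data
```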

  17. Summary • Bayesian Belief Networks provide a framework to compute the joint probability distribution of a set of variables. • A BBN explicitly indicates independence relations. • Applied to learning, a BBN can serve to predict the value of a target variable. • A BBN can be learned from data even if the network structure is not known.
