
  1. CIS664 KD&DM G.F. Cooper, E. Herskovits A Bayesian Method for the Induction of Probabilistic Networks from Data Presented by Uroš Midić Mar 21, 2007

  2. Introduction Bayesian Belief Network Learning Network Parameters Learning Network Structure Probability of Network Structure Given a Dataset Finding the Most Probable Network Structure K2 algorithm Experimental results Pros and cons

  3. Introduction Events A and B are independent if P(A∩B) = P(A)P(B), or equivalently, when P(A) > 0 and P(B) > 0, if P(A|B) = P(A) and P(B|A) = P(B).

  4. Introduction Discrete-valued random variables X and Y are independent if for any a, b, the events X=a and Y=b are independent, i.e. P(X=a ∩ Y=b) = P(X=a)P(Y=b), or equivalently (when P(Y=b) > 0) P(X=a|Y=b) = P(X=a).

  5. Introduction Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if for any a, b, c, P(X=a | Y=b, Z=c) = P(X=a | Z=c). This extends to sets of variables: we say that X1,…,Xl is conditionally independent of Y1,…,Ym given Z1,…,Zn if P(X1,…,Xl | Y1,…,Ym, Z1,…,Zn) = P(X1,…,Xl | Z1,…,Zn).
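As a concrete illustration of this definition, here is a minimal Python sketch (not from the slides; all numbers are made up). It builds a joint distribution over three binary variables that factorizes as P(Z)P(X|Z)P(Y|Z), so X is conditionally independent of Y given Z by construction, and then verifies the definition numerically.

```python
from itertools import product

# Joint distribution P(X, Y, Z) built as P(Z) * P(X|Z) * P(Y|Z),
# so X and Y are conditionally independent given Z by construction.
joint = {}
for x, y, z in product([0, 1], repeat=3):
    p_z = 0.5
    p_x_given_z = 0.8 if x == z else 0.2
    p_y_given_z = 0.7 if y == z else 0.3
    joint[(x, y, z)] = p_z * p_x_given_z * p_y_given_z

def prob(x=None, y=None, z=None):
    """Probability of a partial assignment; None means the variable is summed out."""
    return sum(p for (xv, yv, zv), p in joint.items()
               if (x is None or xv == x) and (y is None or yv == y) and (z is None or zv == z))

# Check P(X=a | Y=b, Z=c) == P(X=a | Z=c) for every a, b, c.
for a, b, c in product([0, 1], repeat=3):
    assert abs(prob(x=a, y=b, z=c) / prob(y=b, z=c)
               - prob(x=a, z=c) / prob(z=c)) < 1e-9
print("X is conditionally independent of Y given Z in this distribution")
```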

  6. Introduction Let X = {X1, …, Xn} be a set of discrete-valued variables, where each variable Xi has a defined set of possible values V(Xi). The joint space of the set of variables X is defined as V(X1) × V(X2) × … × V(Xn). The probability distribution over the joint space specifies the probability of each possible binding of values to the tuple (X1, …, Xn), and is called the joint probability distribution.

  7. Bayesian Belief Networks A Bayesian Belief Network represents the joint probability distribution for a set of variables. It specifies a set of conditional independence assumptions (in the form of a Directed Acyclic Graph) and a set of local conditional probabilities (in the form of tables). Each variable is represented by a node in the graph. The network arcs represent the assertion that each variable is conditionally independent of its non-descendants, given its immediate predecessors (its parents).

  8. Bayesian Belief Networks Example: a network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire (drawn as a DAG on the original slide), with the conditional probability table for Campfire given its parents Storm and BusTourGroup:

          S,B    S,¬B   ¬S,B   ¬S,¬B
     C    0.4    0.1    0.8    0.2
    ¬C    0.6    0.9    0.2    0.8

  9. Bayesian Belief Networks The same network and Campfire table as on the previous slide. Example: P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4, read directly from the (C; S,B) entry of the table.

  10. Bayesian Belief Networks For any assignment of values (x1,…,xn) to the variables (X1, …, Xn), we can compute the joint probability P(x1,…,xn) = ∏i=1..n P(xi|Parents(Xi)). The values P(xi|Parents(Xi)) come directly from the table associated with the respective Xi. We can therefore recover any probability of the form P(S1|S2), where S1 and S2 are subsets of X = {X1, …, Xn}.
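The factorization above is easy to evaluate in code. Below is a minimal Python sketch using the Storm/BusTourGroup/Campfire fragment of the example network; only the Campfire table is taken from the slide, while the priors for Storm and BusTourGroup are made-up values for illustration.

```python
# Each node maps to (tuple of parents, table of P(node=True | parent values)).
network = {
    "Storm":        ((), {(): 0.3}),   # assumed prior, not from the slide
    "BusTourGroup": ((), {(): 0.5}),   # assumed prior, not from the slide
    "Campfire":     (("Storm", "BusTourGroup"),
                     {(True, True): 0.4, (True, False): 0.1,
                      (False, True): 0.8, (False, False): 0.2}),  # table from the slide
}

def joint_probability(assignment):
    """P(x1,...,xn) = product over nodes of P(xi | Parents(Xi))."""
    p = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[par] for par in parents)]
        p *= p_true if assignment[node] else (1.0 - p_true)
    return p

print(joint_probability({"Storm": True, "BusTourGroup": True, "Campfire": True}))
# 0.3 * 0.5 * 0.4 = 0.06
```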

  11. Learning Network Parameters We are given a network structure. We are given a dataset such that every training example assigns values to all variables, i.e. we always get a full assignment (x1,…,xn) to the variables (X1, …, Xn). Learning the conditional probability tables is then straightforward: initialize all the counters in the tables to 0; for each training example, increase the appropriate counters in the tables; finally, convert the counts into probabilities.
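A minimal Python sketch of this counting procedure follows; the structure, variable names, and the four training cases are invented here purely for illustration.

```python
from collections import defaultdict

# Hypothetical structure: node -> list of parents.
structure = {"Storm": [], "BusTourGroup": [], "Campfire": ["Storm", "BusTourGroup"]}

# Hypothetical fully observed training cases.
cases = [
    {"Storm": True,  "BusTourGroup": True,  "Campfire": True},
    {"Storm": True,  "BusTourGroup": True,  "Campfire": False},
    {"Storm": False, "BusTourGroup": True,  "Campfire": True},
    {"Storm": False, "BusTourGroup": False, "Campfire": False},
]

def learn_cpts(structure, cases):
    # Step 1: counters for every (node, parent instantiation, value), initialized to 0.
    counts = {node: defaultdict(lambda: defaultdict(int)) for node in structure}
    # Step 2: one pass over the data, incrementing the appropriate counters.
    for case in cases:
        for node, parents in structure.items():
            parent_vals = tuple(case[p] for p in parents)
            counts[node][parent_vals][case[node]] += 1
    # Step 3: convert counts into conditional probabilities.
    return {node: {pv: {val: c / sum(vc.values()) for val, c in vc.items()}
                   for pv, vc in table.items()}
            for node, table in counts.items()}

for node, table in learn_cpts(structure, cases).items():
    print(node, table)
```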

  12. Learning Network Parameters We are given a network structure, but the dataset has missing values. In this case, learning the conditional probability tables is not straightforward, and usually involves a gradient-ascent procedure to maximize P(D|h), the probability of the dataset given the hypothesis h. Note that D (the dataset) is fixed and we search for the h that maximizes P(D|h).

  13. Learning Network Structure Learning the network parameters assumes that we know (or assume) a network structure. In simple cases, the network structure is constructed by a domain expert. In most cases, however, a domain expert is not available, or the domain is too complex for an expert to specify the structure by hand. Example: gene-expression microarray datasets have more than 10,000 variables.

  14. Probability of N.S. Given a Dataset We are given a dataset and two candidate network structures BS1 and BS2 over the variables x1, x2, x3 (drawn as DAGs on the original slide). Which of the networks is a better fit for this dataset?

    Case   x1   x2   x3
      1    +    -    -
      2    +    +    +
      3    -    -    +
      4    +    +    +
      5    -    -    -
      6    -    +    +
      7    +    +    +
      8    -    -    -
      9    +    +    +
     10    -    -    -

P(BS1|D) is much greater than P(BS2|D); BS1 is the structure that was used to generate the dataset.

  15. Probability of N.S. Given a Dataset We are given a dataset and two networks. How to calculate P(BS,D)?

  16. Probability of N.S. Given a Dataset Assumptions made in the paper: (1) the database variables are discrete; (2) cases occur independently, given a belief-network model; (3) there are no cases that have variables with missing values; (4) the prior probabilities (before observing the data) of the conditional probability assignments are uniform.

  17. Probability of N.S. Given a Dataset How to calculate P(BS,D)? Each variable xi has a set of parents πi. xi has ri possible values (vi1, …, viri). There are qi unique instantiations (wi1, wi2, …, wiqi) of πi in D. Nijk is the number of cases in D in which variable xi has the value vik and πi is instantiated as wij. Let Nij = ∑k=1..ri Nijk. With the four assumptions we get the formula P(BS, D) = P(BS) ∏i=1..n ∏j=1..qi [(ri−1)! / (Nij + ri − 1)!] ∏k=1..ri Nijk!
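The per-node factor of this formula can be computed in log space to avoid overflowing factorials, using log n! = lgamma(n+1). The Python sketch below is an illustration only (the data layout and names are assumptions, not the authors' implementation); parent instantiations that never occur in D contribute a factor of 1 and can be skipped.

```python
from collections import defaultdict
from math import lgamma

def log_g(i, parents, cases, values):
    """log of  prod_j [(ri-1)! / (Nij + ri - 1)!] * prod_k Nijk!  for variable i."""
    r = len(values[i])                                 # ri: number of values of xi
    counts = defaultdict(lambda: defaultdict(int))     # counts[wij][vik] = Nijk
    for case in cases:
        w = tuple(case[p] for p in parents)            # instantiation of the parents
        counts[w][case[i]] += 1
    total = 0.0
    for by_value in counts.values():
        n_ij = sum(by_value.values())
        total += lgamma(r) - lgamma(n_ij + r)          # log (ri-1)! - log (Nij + ri - 1)!
        total += sum(lgamma(n + 1) for n in by_value.values())  # sum_k log Nijk!
    return total

# Made-up database over two variables (0 and 1), each with values {0, 1}.
values = {0: [0, 1], 1: [0, 1]}
cases = [{0: 0, 1: 0}, {0: 0, 1: 0}, {0: 1, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 0}]
print(log_g(1, parents=(0,), cases=cases, values=values))
```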

  18. Probability of N.S. Given a Dataset How to calculate P(BS,D)? Surprisingly, with appropriate indexing and by bounding the constant factors, the time complexity of evaluating this formula is O(mn), i.e. linear in the number of variables and the number of cases.

  19. Probability of N.S. Given a Dataset We now have an efficient way to calculate P(BS, D). We could calculate every P(BSi|D) = P(BSi, D) / ∑BS P(BS, D) and find the optimal structure, but the set of possible BSi grows exponentially with n. However, if we had a set Y such that ∑BS∊Y P(BS, D) ≈ P(D), and Y is small enough, then we could efficiently approximate P(BS|D) for all BS∊Y.

  20. K2 algorithm We start from the formula for P(BS, D) given above. Searching for the most probable structure with it has time complexity O(m n² r 2ⁿ). However, if we assume that a node can have at most u parents, then the complexity is O(m u n r T(n,u)), where T(n,u) = ∑k=0..u choose(n,k).
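For concreteness, T(n, u) is just a sum of binomial coefficients, the number of candidate parent sets of size at most u chosen from n nodes; a tiny Python helper (an illustration, not from the paper):

```python
from math import comb

def T(n, u):
    """T(n, u) = sum_{k=0..u} choose(n, k)."""
    return sum(comb(n, k) for k in range(u + 1))

print(T(37, 2))   # e.g. a 37-node network with at most 2 parents per node
```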

  21. K2 algorithm Let πi be the set of parents of xi in BS, which we denote as πi→xi. Then we can rewrite P(BS) as P(BS) = ∏i=1..n P(πi→xi), and we get the formula P(BS, D) = ∏i=1..n P(πi→xi) ∏j=1..qi [(ri−1)! / (Nij + ri − 1)!] ∏k=1..ri Nijk!. If we assume an ordering of the variables, such that xi cannot be a parent of xj if i<j, then the number of possible πi-s for each xi is smaller, but the overall complexity is still exponential.

  22. K2 algorithm K2 is a heuristic algorithm. It takes as input a set of n nodes, an ordering on the nodes, an upper bound u on the number of parents a node may have, and a database D containing m cases. It starts from the assumption that a node has no parents, and then incrementally adds the parent whose addition most increases the probability of the resulting structure. When no single parent addition can increase the probability, the algorithm moves on to the next node. This procedure is applied to the nodes in the given ordering, starting with x1 (which, under the ordering above, cannot be a parent of any other node). The time complexity is polynomial, O(m u² n² r), which in the worst case, when u = n, becomes O(m n⁴ r).
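Below is a compact Python sketch of this greedy search, an illustration rather than the authors' code. It reuses the hypothetical log_g scoring function (and the toy cases and values) from the earlier sketch, and follows the convention of the original paper that a node's candidate parents are the nodes that precede it in the supplied ordering.

```python
def k2(ordering, max_parents, cases, values):
    """Greedy K2-style search; returns node -> tuple of chosen parents."""
    result = {}
    for pos, node in enumerate(ordering):
        parents = ()
        best = log_g(node, parents, cases, values)
        candidates = list(ordering[:pos])       # predecessors in the ordering
        improved = True
        while improved and len(parents) < max_parents:
            improved = False
            # Score every candidate not yet a parent and keep the single best addition.
            scored = [(log_g(node, parents + (c,), cases, values), c)
                      for c in candidates if c not in parents]
            if scored:
                score, c = max(scored)
                if score > best:
                    best, parents, improved = score, parents + (c,), True
        result[node] = parents
    return result

# Toy run on the made-up two-variable database from the scoring sketch.
print(k2(ordering=[0, 1], max_parents=1, cases=cases, values=values))   # {0: (), 1: (0,)}
```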

  23. Experimental results Using a predefined network structure with 37 nodes and the associated conditional probabilities, provided by an expert in the domain of medicine and describing potential problems with anaesthesia in the operating room, the authors generated a database of 10,000 cases. They ran the algorithm on the generated database with an ordering of the nodes that was consistent with the original structure. The algorithm almost completely reconstructed the original network: it missed one arc that was in the original network and added one arc that was not.

  24. Pros and cons + Any exact algorithm has exponential complexity, while this heuristic algorithm has polynomial complexity. + The preliminary results are promising. + The algorithm can be extended to handle databases with missing values; however, that extension is exponential in the number of missing values. – The algorithm still requires an ordering of the nodes/variables to be supplied.

  25. References G. F. Cooper and E. Herskovits, "A Bayesian Method for the Induction of Probabilistic Networks from Data", Machine Learning 9 (1992), pp. 309-347. T. Mitchell, Machine Learning, McGraw-Hill, 1997.
