## Sampling Bayesian Networks

**Sampling Bayesian Networks**
ICS 275b, 2005

**Approximation Algorithms**
Structural Approximations
• Eliminate some dependencies
  • Remove edges
  • Mini-Bucket Approach
Search
• Approach for optimization tasks: MPE, MAP
Sampling
• Generate random samples and compute values of interest from the samples, not the original network

**Sampling**
• Input: Bayesian network with set of nodes $X$
• Sample = a tuple with assigned values $s = (X_1 = x_1, X_2 = x_2, \dots, X_k = x_k)$
• Tuple may include all variables (except evidence) or a subset
• Sampling schemas dictate how to generate samples (tuples)
• Ideally, samples are distributed according to $P(X \mid E)$

**Sampling**
• Idea: generate a set of samples $T$
• Estimate $P(X_i \mid E)$ from samples
• Need to know:
  • How to generate a new sample?
  • How many samples $T$ do we need?
  • How to estimate $P(X_i \mid E)$?

**Sampling Algorithms**
• Forward Sampling
• Likelihood Weighting
• Gibbs Sampling (MCMC)
  • Blocking
  • Rao-Blackwellised
• Importance Sampling
• Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

**Forward Sampling**
• Forward Sampling
  • Case with No Evidence
  • Case with Evidence
• N and Error Bounds

**Forward Sampling No Evidence (Henrion 1988)**
Input: Bayesian network $X = \{X_1, \dots, X_N\}$, $N$ = number of nodes, $T$ = number of samples
Output: $T$ samples
Process nodes in topological order: first process the ancestors of a node, then the node itself:
1. For $t = 1$ to $T$
2. For $i = 1$ to $N$
3. $X_i \leftarrow$ sample $x_i^t$ from $P(x_i \mid pa_i)$

**Sampling a Value**
[Figure: the interval $[0, 1]$ split at 0.3, with a random point $r$]
What does it mean to sample $x_i^t$ from $P(X_i \mid pa_i)$?
• Assume $D(X_i) = \{0, 1\}$
• Assume $P(X_i \mid pa_i) = (0.3, 0.7)$
• Draw a random number $r$ from $[0, 1]$
  • If $r$ falls in $[0, 0.3]$, set $X_i = 0$
  • If $r$ falls in $(0.3, 1]$, set $X_i = 1$

**Sampling a Value**
• When we sample $x_i^t$ from $P(X_i \mid pa_i)$:
  • most of the time, we will pick the most likely value of $X_i$
  • occasionally, we will pick the unlikely value of $X_i$
• We want to find high-probability tuples. But:
• Choosing an unlikely value allows the sampler to "cross" low-probability tuples to reach high-probability tuples!

**Forward Sampling: Answering Queries**
Task: given $n$ samples $\{S_1, S_2, \dots, S_n\}$, estimate $P(X_i = x_i)$:
$$\hat{P}(X_i = x_i) = \frac{\#\text{samples}(X_i = x_i)}{n}$$
Basically, count the proportion of samples where $X_i = x_i$.

**Forward Sampling w/ Evidence**
Input: Bayesian network $X = \{X_1, \dots, X_N\}$, $N$ = number of nodes, $E$ = evidence, $T$ = number of samples
Output: $T$ samples consistent with $E$
1. For $t = 1$ to $T$
2. For $i = 1$ to $N$
3. $X_i \leftarrow$ sample $x_i^t$ from $P(x_i \mid pa_i)$
4. If $X_i \in E$ and $X_i \neq x_i$ (the sampled value disagrees with the evidence), reject the sample:
5. set $i = 1$ and go to step 2

**Forward Sampling: Illustration**
Let $Y$ be a subset of evidence nodes s.t. $Y = u$

**Forward Sampling: How many samples?**
Theorem: Let $s(y)$ be the estimate of $P(y)$ resulting from a randomly chosen sample set $S$ with $T$ samples. Then, to guarantee relative error at most $\epsilon$ with probability at least $1 - \delta$, it is enough to have:
[equation not preserved in the transcript]
Derived from Chebyshev's Bound.

**Forward Sampling: How many samples?**
Theorem: Let $s(y)$ be the estimate of $P(y)$ resulting from a randomly chosen sample set $S$ with $T$ samples. Then, to guarantee relative error at most $\epsilon$ with probability at least $1 - \delta$, it is enough to have:
[equation not preserved in the transcript]
Derived from Hoeffding's Bound (full proof is given in Koller).

**Forward Sampling: Performance**
Advantages:
• $P(x_i \mid pa(x_i))$ is readily available
• Samples are independent!
Drawbacks:
• If evidence $E$ is rare ($P(e)$ is low), then we will reject most of the samples!
• Since $P(y)$ in the bound on the number of samples is unknown, we must estimate $P(y)$ from the samples themselves!
• If $P(e)$ is small, $T$ will become very big!

**Problem: Evidence**
• Forward Sampling
  • High Rejection Rate
• Fix evidence values
  • Gibbs sampling (MCMC)
  • Likelihood Weighting
  • Importance Sampling

**Forward Sampling Bibliography**
• {henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp.
149-163, 1988.

**Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)**
"Clamping" evidence + forward sampling + weighing samples by evidence likelihood
Works well for likely evidence!

**Likelihood Weighting**
[equations not preserved in the transcript]

**Likelihood Convergence (Chebyshev's Inequality)**
• Assume $P(X = x \mid e)$ has mean $\mu$ and variance $\sigma^2$
• Chebyshev: [equation not preserved in the transcript]
• $\mu = P(x \mid e)$ is unknown => obtain it from samples!

**Error Bound Derivation**
$K$ is a Bernoulli random variable

**Likelihood Convergence 2**
• Assume $P(X = x \mid e)$ has mean $\mu$ and variance $\sigma^2$
• Zero-One Estimation Theory (Karp et al., 1989): [equation not preserved in the transcript]
• $\mu = P(x \mid e)$ is unknown => obtain it from samples!

**Local Variance Bound (LVB) (Dagum & Luby, 1994)**
• LVB of a binary-valued network: [equation not preserved in the transcript]

**LVB Estimate (Pradhan, Dagum, 1996)**
• Using the LVB, the Zero-One Estimator can be rewritten: [equation not preserved in the transcript]

**Importance Sampling Idea**
• In general, it is hard to sample from the target distribution $P(X \mid E)$
• Generate samples from a sampling (proposal) distribution $Q(X)$
• Weigh each sample against $P(X \mid E)$

**Importance Sampling Variants**
Importance sampling: forward, non-adaptive
• Nodes sampled in topological order
• Sampling distribution (for non-instantiated nodes) equal to the prior conditionals
Importance sampling: forward, adaptive
• Nodes sampled in topological order
• Sampling distribution adapted according to average importance weights obtained in previous samples [Cheng, Druzdzel 2000]

**AIS-BN**
• The most efficient variant of importance sampling to date is AIS-BN (Adaptive Importance Sampling for Bayesian networks).
• Jian Cheng and Marek J. Druzdzel.
AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.

**Gibbs Sampling**
• Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)
• Samples are dependent and form a Markov chain
• Samples directly from $P(X \mid e)$
• Guaranteed to converge when all $P > 0$
• Methods to improve convergence:
  • Blocking
  • Rao-Blackwellised
• Error Bounds
  • Lag-t autocovariance
  • Multiple Chains, Chebyshev's Inequality

**MCMC Sampling Fundamentals**
Given a set of variables $X = \{X_1, X_2, \dots, X_n\}$ with joint probability distribution $\pi(X)$ and some function $g(X)$, we can compute the expected value of $g(X)$:
$$E_\pi[g] = \sum_{x} g(x)\,\pi(x)$$

**MCMC Sampling From $\pi(X)$**
A sample $S^t$ is an instantiation:
$$S^t = \{x_1^t, x_2^t, \dots, x_n^t\}$$
Given independent, identically distributed (iid) samples $S^1, S^2, \dots, S^T$ from $\pi(X)$, it follows from the Strong Law of Large Numbers that, almost surely,
$$\frac{1}{T} \sum_{t=1}^{T} g(S^t) \to E_\pi[g] \quad \text{as } T \to \infty$$

**Gibbs Sampling (Pearl, 1988)**
• A sample $x^t$, $t \in \{1, 2, \dots\}$, is an instantiation of all variables in the network: $x^t = \{X_1 = x_1^t, X_2 = x_2^t, \dots, X_N = x_N^t\}$
• Sampling process
  • Fix values of observed variables $e$
  • Instantiate node values in sample $x^0$ at random
  • Generate samples $x^1, x^2, \dots, x^T$ from $P(x \mid e)$
  • Compute posteriors from samples

**Ordered Gibbs Sampler**
Generate sample $x^{t+1}$ from $x^t$, processing all variables in some order.
In short, for $i = 1$ to $N$:
$$X_i = x_i^{t+1} \leftarrow \text{sampled from } P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^{t}, \dots, x_N^{t}, e)$$

**Gibbs Sampling (cont'd) (Pearl, 1988)**
Markov blanket: the parents, children, and children's parents of $X_i$. Given its Markov blanket, $X_i$ is independent of all other nodes, so
$$P(x_i \mid x \setminus x_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$$

**Ordered Gibbs Sampling Algorithm**
Input: $X$, $E$
Output: $T$ samples $\{x^t\}$
• Fix evidence $E$
• Generate samples from $P(X \mid E)$:
  1. For $t = 1$ to $T$ (compute samples)
  2. For $i = 1$ to $N$ (loop through variables)
  3. $X_i \leftarrow$ sample $x_i^t$ from $P(X_i \mid markov^t \setminus X_i)$

**Answering Queries**
• Query: $P(x_i \mid e) = ?$
• Method 1: count the number of samples where $X_i = x_i$:
$$\hat{P}(X_i = x_i) = \frac{\#\text{samples}(X_i = x_i)}{T}$$
• Method 2: average probability (mixture estimator):
$$\hat{P}(X_i = x_i) = \frac{1}{T} \sum_{t=1}^{T} P(X_i = x_i \mid markov^t \setminus X_i)$$

**Gibbs Sampling Example - BN**
$X = \{X_1, X_2, \dots, X_9\}$, $E = \{X_9\}$
[Figure: Bayesian network over nodes $X_1, \dots, X_9$]

**Gibbs Sampling Example - BN**
Initial values: $X_1 = x_1^0$, $X_2 = x_2^0$, $X_3 = x_3^0$, $X_4 = x_4^0$, $X_5 = x_5^0$, $X_6 = x_6^0$, $X_7 = x_7^0$, $X_8 = x_8^0$
[Figure: Bayesian network over nodes $X_1, \dots, X_9$]

**Gibbs Sampling Example - BN**
$X_1 \leftarrow P(X_1 \mid x_2^0, \dots, x_8^0, x_9)$, $E = \{X_9\}$
[Figure: Bayesian network over nodes $X_1, \dots, X_9$]

**Gibbs Sampling Example - BN**
$X_2 \leftarrow P(X_2 \mid x_1^1, x_3^0, \dots, x_8^0, x_9)$, $E = \{X_9\}$
[Figure: Bayesian network over nodes $X_1, \dots, X_9$]

**Gibbs Sampling: Burn-In**
• We want to sample from $P(X \mid E)$
• But the starting point is random
• Solution: throw away the first $K$ samples
• Known as "Burn-In"
• What is $K$? Hard to tell. Use intuition.
• Alternative: sample the first sample's values from an approximate $P(x \mid e)$ (for example, run IBP first)

**Gibbs Sampling: Convergence**
• Converge to stationary distribution $\pi^*$: $\pi^* = \pi^* P$, where $P$ is a transition kernel, $p_{ij} = P(X_i \to X_j)$
• Guaranteed to converge iff the chain is:
  • irreducible
  • aperiodic
  • ergodic ($\forall i, j:\ p_{ij} > 0$)

**Irreducible**
• A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
• In other words, $\forall i, j\ \exists k : P^{(k)}_{ij} > 0$, where $k$ is the number of steps taken to get to state $j$ from state $i$.

**Aperiodic**
• Define $d(i) = \gcd\{n > 0 \mid \text{it is possible to go from } i \text{ to } i \text{ in } n \text{ steps}\}$. Here, gcd means the greatest common divisor of the integers in the set. If $d(i) = 1$ for all $i$, then the chain is aperiodic.

**Ergodicity**
• A recurrent state is a state to which the chain returns with probability 1: $\sum_n P^{(n)}_{ii} = \infty$
• Recurrent, aperiodic states are ergodic.
Note: an extra condition for ergodicity is that the expected recurrence time is finite. This holds for recurrent states in a finite-state chain.

**Gibbs Convergence**
• Gibbs convergence is generally guaranteed as long as all probabilities are positive!
• Intuition for the ergodicity requirement: if nodes $X$ and $Y$ are correlated s.t.
$X = 0 \Leftrightarrow Y = 0$, then:
  • once we sample and assign $X = 0$, we are forced to assign $Y = 0$;
  • once we sample and assign $Y = 0$, we are forced to assign $X = 0$;
  we will never be able to change their values again!
• Another problem: it can take a very long time to converge!
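
The ordered Gibbs sampler, the burn-in step, and the two query estimators (counting and mixture) described in the slides can be sketched in Python. This is a minimal illustration under stated assumptions, not the course's implementation: the three-node binary network A -> B, A -> C, its probability tables, and the names `gibbs`, `conditional_one`, `joint`, and `exact_posterior_a_given_c1` are all hypothetical and chosen only for this example. All probabilities are strictly positive, which is the condition under which convergence is generally guaranteed.

```python
import random

# Hypothetical toy network (not from the slides): A -> B, A -> C, all binary.
P_A = {1: 0.3, 0: 0.7}            # prior P(A = a)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}    # P(B = 1 | A = a)
P_C_GIVEN_A = {1: 0.8, 0: 0.1}    # P(C = 1 | A = a)


def bernoulli(p_one, value):
    """Probability that a binary variable equals `value` when P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x):
    """Joint probability of a full assignment x = {'A': a, 'B': b, 'C': c}."""
    return (P_A[x["A"]]
            * bernoulli(P_B_GIVEN_A[x["A"]], x["B"])
            * bernoulli(P_C_GIVEN_A[x["A"]], x["C"]))


def conditional_one(x, var):
    """P(var = 1 | all other variables in x), obtained by normalizing the joint.

    Factors that do not mention `var` cancel in the ratio, so this equals
    the Markov-blanket conditional.
    """
    p1 = joint({**x, var: 1})
    p0 = joint({**x, var: 0})
    return p1 / (p1 + p0)


def gibbs(evidence, query, T=20000, burn_in=1000, seed=0):
    """Ordered Gibbs sampling; returns (count estimate, mixture estimate)
    of P(query = 1 | evidence)."""
    rng = random.Random(seed)
    free = [v for v in ("A", "B", "C") if v not in evidence]
    x = dict(evidence)                 # fix evidence values
    for v in free:                     # random starting point x^0
        x[v] = rng.randint(0, 1)
    count = 0                          # Method 1: count samples with query = 1
    mixture = 0.0                      # Method 2: average the conditionals
    for t in range(burn_in + T):
        for v in free:                 # process variables in a fixed order
            p1 = conditional_one(x, v)
            if t >= burn_in and v == query:
                mixture += p1
            x[v] = 1 if rng.random() < p1 else 0
        if t >= burn_in:               # discard the first burn_in samples
            count += x[query]
    return count / T, mixture / T


def exact_posterior_a_given_c1():
    """Exact P(A = 1 | C = 1) by enumeration, for comparison."""
    num = P_A[1] * P_C_GIVEN_A[1]
    return num / (num + P_A[0] * P_C_GIVEN_A[0])
```

Calling `gibbs({"C": 1}, "A")` returns both estimates of $P(A = 1 \mid C = 1)$, which can be compared against `exact_posterior_a_given_c1()`. The burn-in length of 1000 is an arbitrary choice here, in line with the slides' remark that $K$ is hard to determine.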