Inference Algorithms for Bayes Networks
Outline
Bayes Nets are popular representations in AI, and researchers have developed many inference techniques for them. We will consider two types of algorithms:
• Exact inference (2 subtypes covered)
  • Enumeration
  • Variable elimination
  • Other techniques not covered: junction tree, loop-set conditioning, …
• Approximate inference via sampling (3 subtypes)
  • Rejection sampling
  • Likelihood weighting
  • Gibbs sampling
First: Notation
I’m going to assume all variables are binary. For a random variable A, I will write the event that A is true as +a, and the event that A is false as -a. Similarly for the other variables.
[Figure: Bayes net over the nodes A, B, C, D, E]
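Technique 1: Enumeration
This is the “brute-force” approach to BN inference. Suppose I want to know P(+a | +b, +e).
Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability: P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e).
2) Use marginalization to rewrite each marginal probability in terms of the joint probability, e.g., P(+a, +b, +e) = Σ_c Σ_d P(+a, +b, c, d, +e).
3) Use the Bayes Net equation (the product of each node’s probability given its parents) to determine the joint probability.
[Figure: Bayes net over the nodes A, B, C, D, E]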
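The three steps above can be sketched in a few lines of Python. This is only an illustration: the edge structure (A and B are parents of C; C is the parent of D and E) is read off the factors used in the variable elimination slides, and every number in the tables is a made-up placeholder, not a value from the slides. The helper names (`bern`, `joint`, `marginal`) are hypothetical.

```python
from itertools import product

# Hypothetical CPTs (placeholder numbers) for the net A -> C <- B, C -> D, C -> E.
P_A = 0.3                                           # P(+a)
P_B = 0.6                                           # P(+b)
P_C = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}     # P(+c | a, b)
P_D = {True: 0.7, False: 0.2}                       # P(+d | c)
P_E = {True: 0.8, False: 0.1}                       # P(+e | c)

NAMES = ["a", "b", "c", "d", "e"]


def bern(p_true, value):
    """Probability of a binary value, given the probability that it is true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """Step 3: the Bayes Net equation, one factor per node given its parents."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D[c], d) * bern(P_E[c], e))


def marginal(**fixed):
    """Step 2: sum the joint over every variable that is not fixed."""
    free = [n for n in NAMES if n not in fixed]
    total = 0.0
    for values in product([True, False], repeat=len(free)):
        total += joint(**dict(fixed, **dict(zip(free, values))))
    return total


# Step 1: definition of conditional probability.
p_a_given_b_e = marginal(a=True, b=True, e=True) / marginal(b=True, e=True)
print(round(p_a_given_b_e, 4))
```

The loop in `marginal` visits every assignment of the free variables, which is exactly why enumeration is exponential in the number of nodes.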
Speeding up Enumeration
Pulling out terms: a factor that does not depend on a summation variable can be moved outside that sum, so each term in the sum is faster to compute. But the total number of terms (things to add up) remains the same. In the worst case, this is still exponential in the number of nodes.
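As a worked illustration of pulling out terms, here is the marginal P(+a, +b, +e) for the five-node example. It assumes the factorization implied by the factors listed in the variable elimination slides, namely P(A), P(B), P(C | A, B), P(D | C), and P(E | C); the original equation from this slide is not available, so treat this as a reconstruction.

```latex
\begin{aligned}
P(+a, +b, +e)
  &= \sum_{c}\sum_{d} P(+a)\,P(+b)\,P(c \mid +a, +b)\,P(d \mid c)\,P(+e \mid c) \\
  &= P(+a)\,P(+b) \sum_{c} P(c \mid +a, +b)\,P(+e \mid c) \sum_{d} P(d \mid c)
\end{aligned}
```

P(+a) and P(+b) are multiplied once instead of once per term, but the sums still range over every value of c and d.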
Maximize Independence
If you can, it helps to create the BN so that it has as few edges as possible. Let’s re-create the original network, but start with the “John calls” node and gradually add more nodes and edges. Let’s see how many edges/dependencies we end up with.
[Figure: the original network over Earthquake, Burglary, Alarm, John calls, Mary calls]
[Figure sequence: the network is rebuilt by adding nodes in the order John calls, Mary calls, Alarm, Burglary, Earthquake; at each step, candidate edges from the nodes already present are marked “?”]
Causal Direction
Moral: Bayes Nets tend to be the most compact, and most efficient, when edges go from causes to effects.
[Figure: the original network (causal direction) beside the rebuilt network (non-causal direction)]
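To make “compact” concrete, the sketch below counts edges and conditional-probability entries for two parent assignments. Both structures are assumptions, since the figures are not reproduced here: `causal` is the usual alarm network suggested by the node names, and `non_causal` is the edge set one would typically end up with when adding nodes in the order John calls, Mary calls, Alarm, Burglary, Earthquake. The function names are hypothetical.

```python
def edge_count(parents):
    """Total number of edges: one per (parent, child) pair."""
    return sum(len(p) for p in parents.values())


def cpt_entries(parents):
    """Independent probabilities needed when every variable is binary:
    a node with k parents needs one number per parent assignment, i.e. 2**k."""
    return sum(2 ** len(p) for p in parents.values())


# Assumed structure: edges point from causes to effects.
causal = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

# Assumed result of rebuilding the net starting from the effects.
non_causal = {
    "JohnCalls": [],
    "MaryCalls": ["JohnCalls"],
    "Alarm": ["JohnCalls", "MaryCalls"],
    "Burglary": ["Alarm"],
    "Earthquake": ["Alarm", "Burglary"],
}

for name, net in [("causal", causal), ("non-causal", non_causal)]:
    print(name, edge_count(net), "edges,", cpt_entries(net), "CPT entries")
```

Under these assumed structures, the non-causal ordering needs more edges and more table entries to represent the same distribution.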
Technique 2: Variable Elimination
Suppose I want to know P(+a | +b, +e).
Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability.
2) For each marginal distribution, apply variable elimination to find that probability, e.g., for P(+a, +b, +e):
• Join C & D (multiplication)
• Eliminate D (marginalization)
• Join C & +e (multiplication)
• Eliminate C (marginalization)
• Join +a & +e (multiplication)
• Join +b & (+a, +e) (multiplication)
• Done.
[Figure: Bayes net over the nodes A, B, C, D, E]
Joining D & C
Bayes Net provides: P(C | +a, +b) and P(D | C).
Joining D & C will compute P(D, C | +a, +b).
For each c and each d, compute: P(d, c | +a, +b) = P(d | c) * P(c | +a, +b)
[Figure: nodes C and D are merged into a single node (C, D)]
Eliminating D
Bayes Net now provides: P(D, C | +a, +b).
Eliminating D will compute P(C | +a, +b).
For each c, compute: P(c | +a, +b) = Σ_d P(d, c | +a, +b)
[Figure: node (C, D) is reduced to node C]
Joining C and +e
Bayes Net now provides: P(C | +a, +b) and P(+e | C).
Joining C and +e will compute P(+e, C | +a, +b).
For each c, compute: P(+e, c | +a, +b) = P(c | +a, +b) * P(+e | c)
[Figure: nodes C and E are merged into a single node (C, E)]
Eliminating C
Bayes Net now provides: P(+e, C | +a, +b).
Eliminating C will compute P(+e | +a, +b).
Compute: P(+e | +a, +b) = Σ_c P(+e, c | +a, +b)
[Figure: node (C, E) is reduced to node E]
Joining +a, +b, and +e
Bayes Net now provides: P(+e | +a, +b), P(+a), and P(+b).
Joining +a, +b, and +e will compute P(+e, +a, +b).
Compute: P(+e, +a, +b) = P(+e | +a, +b) * P(+a) * P(+b)
[Figure: nodes A, B, and E are merged into a single node (A, B, E)]
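The join and eliminate steps from these slides can be traced in code. This is a minimal sketch that hard-codes the elimination order used above. The probability tables are placeholder numbers invented for illustration, and the function names are hypothetical.

```python
VALUES = (True, False)

# Hypothetical CPTs (placeholder numbers) for the net A -> C <- B, C -> D, C -> E.
P_A = 0.3                                           # P(+a)
P_B = 0.6                                           # P(+b)
P_C = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}     # P(+c | a, b)
P_D = {True: 0.7, False: 0.2}                       # P(+d | c)
P_E = {True: 0.8, False: 0.1}                       # P(+e | c)


def bern(p_true, value):
    """Probability of a binary value, given the probability that it is true."""
    return p_true if value else 1.0 - p_true


def p_e_a_b(a, b):
    """Compute P(+e, a, b) with the join/eliminate sequence from the slides."""
    # Join C & D: P(d, c | a, b) = P(d | c) * P(c | a, b)
    cd = {(c, d): bern(P_D[c], d) * bern(P_C[(a, b)], c)
          for c in VALUES for d in VALUES}
    # Eliminate D: P(c | a, b) = sum over d of P(d, c | a, b)
    c_only = {c: sum(cd[(c, d)] for d in VALUES) for c in VALUES}
    # Join C & +e: P(+e, c | a, b) = P(c | a, b) * P(+e | c)
    ce = {c: c_only[c] * P_E[c] for c in VALUES}
    # Eliminate C: P(+e | a, b) = sum over c of P(+e, c | a, b)
    e_given_ab = sum(ce.values())
    # Join with a and b: P(+e, a, b) = P(+e | a, b) * P(a) * P(b)
    return e_given_ab * bern(P_A, a) * bern(P_B, b)


# The conditional query needs the same marginal for both values of A.
numerator = p_e_a_b(True, True)
denominator = p_e_a_b(True, True) + p_e_a_b(False, True)
print(round(numerator / denominator, 4))
```

Each intermediate table has at most four entries, which is the point of eliminating a variable as soon as it has been joined.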
Notes on Time Complexity
For graphs that are trees with N nodes, variable elimination can perform inference in time O(N).
For general graphs, variable elimination can perform inference in time O(2^w), where w is the “tree-width” of the graph. (However, this depends on the order in which variables are eliminated, and it is hard to figure out the best order.) Intuitively, tree-width is a measure of how close a graph is to an actual tree.
In the worst case, this can mean a time complexity that is exponential in the size of the graph. Exact inference in BNs is known to be NP-hard.
Approximate Inference via Sampling
As the number of samples increases, our estimates should approach the true joint distribution. Conveniently, we get to decide how long we want to spend to figure out the probabilities.
Generating Samples from a BN
Sample generation algorithm:
For each variable X that has not been assigned, but whose parents have all been assigned:
1. r ← a random number in the range [0, 1]
2. If r < P(+x | parents(X)), then assign X ← +x
3. Else, assign X ← -x
[Figure: Bayes net over the nodes A, B, C, D]
For this example: At first, A is the only variable whose parents have been assigned (since it has no parents). r ← 0.3. Since 0.3 < P(+a), we assign A ← +a.
Generating Samples from a BN
The same algorithm is repeated until every variable is assigned:
Current sample: +a. Next, both B and C have all their parents assigned. Let’s choose B. r ← 0.9. Since 0.9 >= P(+b | +a), we set B ← -b.
Current sample: +a, -b. Quiz: what variable would be assigned next? If r = 0.4, what would this variable be assigned?
Current sample: +a, -b, -c. Now D has all its parents assigned. If r = 0.2, what would D be assigned?
Current sample: +a, -b, -c, +d. That completes this sample. We can now increase the count of (+a, -b, -c, +d) by 1, and move on to the next sample.
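The sampling procedure walked through above can be sketched as follows. The structure (A is the parent of B and C; B and C are the parents of D) is inferred from the conditional probabilities the slides mention. P(+b | +a) = 0.7 and P(+d | +b, -c) = 0.6 appear in later slides; every other number is a placeholder chosen only to be consistent with the walkthrough. Variables are visited in a fixed order that always assigns parents first.

```python
import random
from collections import Counter

# P(+b | +a) and P(+d | +b, -c) come from the slides; the rest are placeholders.
P_A = 0.5                                        # P(+a)
P_B = {True: 0.7, False: 0.2}                    # P(+b | a)
P_C = {True: 0.3, False: 0.6}                    # P(+c | a)
P_D = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.3}  # P(+d | b, c)


def sample_once(rng):
    """Draw one full assignment, assigning each variable after its parents."""
    a = rng.random() < P_A          # r < P(+a)          -> +a, else -a
    b = rng.random() < P_B[a]       # r < P(+b | a)      -> +b, else -b
    c = rng.random() < P_C[a]       # r < P(+c | a)      -> +c, else -c
    d = rng.random() < P_D[(b, c)]  # r < P(+d | b, c)   -> +d, else -d
    return a, b, c, d


def sample_counts(n, seed=0):
    """Generate n samples and count how often each assignment occurs."""
    rng = random.Random(seed)
    return Counter(sample_once(rng) for _ in range(n))


counts = sample_counts(20_000)
total = sum(counts.values())
p_plus_a = sum(k for (a, _, _, _), k in counts.items() if a) / total
print(round(p_plus_a, 3))
```

Dividing a count (or a sum of counts) by the total number of samples gives the estimate of the corresponding joint or marginal probability.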
Quiz: Approximating Queries
Suppose I generate a bunch of samples for a BN with variables A, B, C, and get these counts.
[Table: sample counts for each assignment of A, B, C]
What are these probabilities?
• P(+a, -b, -c)?
• P(+a, -c)?
• P(-a | -b, -c)?
• P(-b | +a)?
Technique 3: Rejection Sampling
Rejection sampling is the fancy name given to the procedure you just used to compute, e.g., P(-a | -b, -c). To compute this, you ignore (or “reject”) samples where B = +b or C = +c, since they don’t match the evidence in the query.
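A minimal sketch of rejection sampling for the query P(-a | -b, -c). The three-variable structure (A is the parent of both B and C) and all numbers are assumptions made for illustration; the quiz slide does not specify them. The function names are hypothetical.

```python
import random

# Assumed structure A -> B, A -> C with placeholder probabilities.
P_A = 0.5                       # P(+a)
P_B = {True: 0.7, False: 0.2}   # P(+b | a)
P_C = {True: 0.3, False: 0.6}   # P(+c | a)


def prior_sample(rng):
    """Draw one sample from the network, ignoring any evidence."""
    a = rng.random() < P_A
    b = rng.random() < P_B[a]
    c = rng.random() < P_C[a]
    return {"a": a, "b": b, "c": c}


def rejection_estimate(query, evidence, n, seed=0):
    """Estimate P(query | evidence) from n samples, rejecting mismatches."""
    rng = random.Random(seed)
    kept = matched = 0
    for _ in range(n):
        sample = prior_sample(rng)
        # Reject samples that disagree with the evidence.
        if any(sample[var] != val for var, val in evidence.items()):
            continue
        kept += 1
        if all(sample[var] == val for var, val in query.items()):
            matched += 1
    return matched / kept if kept else float("nan")


estimate = rejection_estimate({"a": False}, {"b": False, "c": False}, 50_000)
print(round(estimate, 3))
```

With these placeholder numbers, the exact answer is 0.16 / 0.265, so the printed estimate should land close to 0.60; only about a quarter of the samples survive the rejection step.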
Consistency
Rejection sampling is a consistent approximate inference technique. Consistency means that as the number of samples increases, the estimated value of the probability for a query approaches its true value. In the limit of infinite samples, consistent sampling techniques give the correct probabilities.
Room for Improvement
Efficiency of Rejection Sampling: If you’re interested in a query like P(+a | +b, +c), you’ll reject 5 out of 6 samples, since only 1 out of 6 samples has the right evidence (+b and +c). So most samples are useless for your query.
Technique 4: Likelihood Weighting
Query of interest: P(+c | +b, +d)
Sample generation algorithm:
Initialize: sample ← {}, P(sample) ← 1
For each variable X that has not been assigned, but whose parents have all been assigned:
1. If X is an evidence node:
a. assign X the value from the query
b. P(sample) ← P(sample) * P(X | parents(X))
2. Otherwise, assign X as normal, P(sample) unchanged
[Figure: Bayes net over the nodes A, B, C, D]
For this example: Sample: {}. P(sample): 1. At first, A is the only variable whose parents have been assigned (since it has no parents). r ← 0.3. Since 0.3 < P(+a), we assign A ← +a.
Likelihood Weighting
The same algorithm is repeated until every variable is assigned:
Sample: {+a}. P(sample): 1. B and C have their parents assigned. Let’s do B next. B is an evidence node, so we choose B ← +b (from the query). Also, P(+b | +a) = 0.7, so we update P(sample) ← 0.7.
Sample: {+a, +b}. P(sample): 0.7. C has its parents assigned. It is NOT an evidence node. r ← 0.8. Since 0.8 >= P(+c | +a), C ← -c. P(sample) is NOT UPDATED.
Sample: {+a, +b, -c}. P(sample): 0.7. D has its parents assigned. How do the sample and P(sample) change?
Sample: {+a, +b, -c, +d}. P(sample): 0.42.
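A minimal sketch of likelihood weighting for the query P(+c | +b, +d). P(+b | +a) = 0.7 and P(+d | +b, -c) = 0.6 are taken from the slides; every other number is a placeholder, and the function names are hypothetical. The estimate is the weighted fraction of samples in which C is true.

```python
import random

# P(+b | +a) and P(+d | +b, -c) come from the slides; the rest are placeholders.
P_A = 0.5                                        # P(+a)
P_B = {True: 0.7, False: 0.2}                    # P(+b | a)
P_C = {True: 0.3, False: 0.6}                    # P(+c | a)
P_D = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.3}  # P(+d | b, c)


def weighted_sample(rng):
    """One sample for evidence B = +b, D = +d, together with its weight."""
    weight = 1.0
    a = rng.random() < P_A      # non-evidence: sample as normal
    b = True                    # evidence: fix the value ...
    weight *= P_B[a]            # ... and multiply in P(+b | a)
    c = rng.random() < P_C[a]   # non-evidence: sample as normal
    d = True                    # evidence: fix the value ...
    weight *= P_D[(b, c)]       # ... and multiply in P(+d | +b, c)
    return (a, b, c, d), weight


def estimate_c_given_b_d(n, seed=0):
    """Weighted fraction of samples with C = +c."""
    rng = random.Random(seed)
    numerator = denominator = 0.0
    for _ in range(n):
        (_, _, c, _), weight = weighted_sample(rng)
        denominator += weight
        if c:
            numerator += weight
    return numerator / denominator


print(round(estimate_c_given_b_d(50_000), 3))
```

No sample is thrown away, but each one counts only in proportion to how well it explains the evidence. With random draws 0.3 and 0.8, `weighted_sample` reproduces the walkthrough: the sample (+a, +b, -c, +d) with weight 0.7 * 0.6 = 0.42.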
Likelihood Weighting vs. Rejection Sampling
Rejection Sampling: Needs LOTS of samples. Can answer any query.
Likelihood Weighting (for query P(+c | -a)): Requires fewer samples to get good estimates. But solves just one query at a time.
Both are consistent.
Further room for improvement
Example query of interest: P(+d | +b, +c)
If we generate samples using likelihood weighting, the choice of sample for D takes into account the evidence. However, the choice of sample for A does NOT take into account the evidence. So we may generate lots of samples that are very unlikely, and don’t contribute much to our overall counts.
Quiz: what is P(+a | +b, +c)? And P(-a | +b, +c)?
[Figure: Bayes net over the nodes A, B, C, D]
Technique 5: Gibbs Sampling
Named after physicist Josiah Gibbs (you may have heard of Gibbs Free Energy). This is a special case of a more general algorithm called Metropolis-Hastings, which is itself a special case of Markov-Chain Monte Carlo (MCMC) estimation.
Gibbs Sampling
Query of interest: P(-d | +b, -c)
Sample generation algorithm:
Initialize: sample ← {A ← random, +b, -c, D ← random}
Repeat:
1. Pick a non-evidence variable X
2. Get a random number r in the range [0, 1]
3. If r < P(+x | all other variables), set X ← +x
4. Otherwise, set X ← -x
5. Add 1 to the count for this new sample
[Figure: Bayes net over the nodes A, B, C, D]
For this example: Sample: {-a, +b, -c, +d}. A and D are non-evidence. Randomly choose D to re-set. r ← 0.7. P(+d | -a, +b, -c) = P(+d | +b, -c) = 0.6. Since r >= 0.6, D ← -d.
Gibbs Sampling
The same loop is repeated for the following samples:
Sample: {-a, +b, -c, -d}. A and D are non-evidence. Randomly choose D to re-set. r ← 0.9. P(+d | -a, +b, -c) = P(+d | +b, -c) = 0.6. Since r >= 0.6, D ← -d (no change).
Sample: {-a, +b, -c, -d}. A and D are non-evidence. Randomly choose A to re-set. r ← 0.3. P(+a | +b, -c, -d) = P(+a | +b, -c) = ? What is A after this step?
Details of Gibbs Sampling
• To compute P(X | all other variables), it is enough to consider only the Markov Blanket of X: X’s parents, X’s children, and the parents of X’s children.
• Everything else will be conditionally independent of X, given its Markov Blanket.
• Unlike Rejection Sampling and Likelihood Weighting, samples in Gibbs Sampling are NOT independent.
• Nevertheless, Gibbs Sampling is consistent.
• It is very common to discard the first N (often N ≈ 1000) samples from a Gibbs sampler. The first N samples are called the “burn-in” period.
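A minimal sketch of a Gibbs sampler for the query P(-d | +b, -c), using the Markov Blanket shortcut and a burn-in period. P(+b | +a) = 0.7 and P(+d | +b, -c) = 0.6 are taken from the slides; the other numbers, the 50/50 choice between A and D, and the function names are assumptions made for illustration.

```python
import random

# P(+b | +a) and P(+d | +b, -c) come from the slides; the rest are placeholders.
P_A = 0.5                                        # P(+a)
P_B = {True: 0.7, False: 0.2}                    # P(+b | a)
P_C = {True: 0.3, False: 0.6}                    # P(+c | a)
P_D = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.3}  # P(+d | b, c)


def bern(p_true, value):
    """Probability of a binary value, given the probability that it is true."""
    return p_true if value else 1.0 - p_true


def p_a_given_blanket(b, c):
    """P(+a | b, c). A's Markov Blanket is its children B and C, so D drops out."""
    w_true = P_A * bern(P_B[True], b) * bern(P_C[True], c)
    w_false = (1.0 - P_A) * bern(P_B[False], b) * bern(P_C[False], c)
    return w_true / (w_true + w_false)


def gibbs_estimate_not_d(n, burn_in=1000, seed=0):
    """Estimate P(-d | +b, -c) from n Gibbs steps after discarding burn_in steps."""
    rng = random.Random(seed)
    b, c = True, False                             # evidence stays fixed
    a, d = rng.random() < 0.5, rng.random() < 0.5  # random initial values
    hits = 0
    for step in range(burn_in + n):
        if rng.random() < 0.5:                     # pick a non-evidence variable
            a = rng.random() < p_a_given_blanket(b, c)
        else:
            d = rng.random() < P_D[(b, c)]         # D's blanket: parents B and C
        if step >= burn_in and not d:
            hits += 1                              # count the new sample
    return hits / n


print(round(p_a_given_blanket(True, False), 4))
print(round(gibbs_estimate_not_d(40_000), 3))
```

Because D depends only on its parents B and C, the exact answer here is 1 - 0.6 = 0.4, which gives a simple check on the estimate. Consecutive samples differ in at most one variable, which is why they are not independent.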