
Approximate Inference



  1. Approximate Inference. Edited from slides by Nir Friedman.

  2. Complexity of Inference. Thm: Computing P(X = x) in a Bayesian network is NP-hard. Not surprising, since we can simulate Boolean gates.

  3. Proof. We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT problem: • q1,…,qn are propositions, • φ1,…,φk are clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn • Φ = φ1 ∧ … ∧ φk. We will construct a Bayesian network s.t. P(X = t) > 0 iff Φ is satisfiable.

  4. The construction: root variables Q1, Q2, …, Qn, one node φi per clause φ1, …, φk, and a chain of simple binary AND gates A1, A2, …, Ak/2-1 that combines the clause nodes into a single output node X. P(Qi = true) = 0.5, and P(φi = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy the clause φi.

  5. It is easy to check • Polynomial number of variables • Each local probability table can be described by a small table (8 parameters at most) • P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn • Conclusion: polynomial reduction of 3-SAT

  6. Note: this construction also shows that computing P(X = t) is harder than NP • 2^n · P(X = t) is the number of satisfying assignments to Φ • Thus, it is #P-hard.
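
To make the counting claim concrete, here is a small brute-force sketch (mine, not from the slides): it enumerates all 2^n assignments of a toy formula, so P(X = t) can be computed exactly and compared against the satisfying-assignment count. The clause encoding (tuples of signed integers) is just an assumption for the example.

```python
from itertools import product

# A clause is a tuple of signed integers: +i stands for q_i, -i for (not q_i).
def satisfies(assignment, clauses):
    """True iff the boolean assignment (dict indexed from 1) satisfies every clause."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in clauses)

def p_x_true(n, clauses):
    """P(X = t) in the construction: each Q_i is a fair coin, X is the AND of all clause nodes."""
    sat = sum(satisfies(dict(enumerate(bits, start=1)), clauses)
              for bits in product([False, True], repeat=n))
    return sat / 2 ** n, sat

# Example: (q1 or q2 or not q3) and (not q1 or q3 or q4)
prob, count = p_x_true(4, [(1, 2, -3), (-1, 3, 4)])
print(prob, count)  # prob > 0 iff the formula is satisfiable; 2**n * prob == count
```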

  7. Hardness - Notes • We used deterministic relations in our construction • The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5 • Hardness does not mean we cannot solve inference • It implies that we cannot find a general procedure that works efficiently for all networks • For particular families of networks, we can have provably efficient procedures • We have seen such families in the course: HMMs, evolutionary trees.

  8. Approximation • Until now, we examined exact computation • In many applications, approximations are sufficient • Example: P(X = x|e) = 0.3183098861838 • Maybe P(X = x|e) ≈ 0.3 is a good enough approximation • e.g., we take action only if P(X = x|e) > 0.5 • Can we find good approximation algorithms?

  9. Types of Approximations: Absolute error • An estimate q of P(X = x | e) has absolute error ε if P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε, equivalently q - ε ≤ P(X = x|e) ≤ q + ε • Absolute error is not always what we want: • If P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable • If P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise

  10. Types of Approximations: Relative error • An estimate q of P(X = x | e) has relative error ε if P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε), equivalently q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε) • Sensitivity of the approximation depends on the actual value of the desired result
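
A tiny sketch of the two criteria as code (my own illustration, not from the slides); `within_absolute` and `within_relative` are hypothetical helper names:

```python
def within_absolute(q, p, eps):
    """q approximates p with absolute error eps: p - eps <= q <= p + eps."""
    return abs(q - p) <= eps

def within_relative(q, p, eps):
    """q approximates p with relative error eps: p*(1 - eps) <= q <= p*(1 + eps)."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# Absolute error can be meaningless for tiny probabilities:
print(within_absolute(0.001, 0.0001, 0.001))  # True, although q is 10x the true value
print(within_relative(0.001, 0.0001, 0.001))  # False
```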

  11. Complexity • Exact inference is NP-hard • Is approximate inference any easier? • Construction for exact inference: • Input: a 3-SAT problem Φ • Output: a BN such that P(X = t) > 0 iff Φ is satisfiable

  12. Complexity: Relative Error • Suppose that q is an ε-relative error estimate of P(X = t) = 0. Then 0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0, namely q = 0. Thus, ε-relative error and exact computation coincide for the value 0. • Theorem: Given ε, finding an ε-relative error approximation is NP-hard.

  13. Complexity: Absolute Error • Theorem: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard

  14. Proof • Recall our construction: roots Q1, Q2, …, Qn, clause nodes φ1, …, φk, and the chain of AND gates A1, A2, … leading to X

  15. Proof (cont.) • Suppose we can estimate with absolute error ε • Let p1 be our estimate of P(Q1 = t | X = t) • Assign q1 = t if p1 > 0.5, else q1 = f • Let p2 be our estimate of P(Q2 = t | X = t, Q1 = q1) • Assign q2 = t if p2 > 0.5, else q2 = f • … • Let pn be our estimate of P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1) • Assign qn = t if pn > 0.5, else qn = f

  16. Proof (cont.) • Claim: if Φ is satisfiable, then q1,…,qn is a satisfying assignment • Suppose Φ is satisfiable • By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi • Base case: • If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 - ε > 0.5, hence q1 = t • If Q1 = f in all satisfying assignments, then q1 = f • Otherwise, the statement holds for any choice of q1

  17. Proof (cont.) • Induction step: • If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 - ε > 0.5, hence qi+1 = t • If Qi+1 = f in all such satisfying assignments, then qi+1 = f • Otherwise, the statement holds for any choice of qi+1

  18. Proof (cont.) • We can efficiently check whether q1,…,qn is a satisfying assignment (linear time) • If it is, then Φ is satisfiable • If it is not, then Φ is not satisfiable • So suppose we have an approximation procedure with absolute error ε < 0.5. We can then decide 3-SAT with n procedure calls: generate an assignment as in the proof, and check satisfiability of the resulting assignment in linear time. If a satisfying assignment exists, we showed this procedure finds one; if none exists, it cannot find one. • Thus, approximation is NP-hard.
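
A sketch of the decision procedure described in the proof, assuming a hypothetical oracle `approx_posterior(i, fixed)` that returns an estimate of P(Qi = t | X = t, fixed assignments) with absolute error ε < 0.5; both the oracle and the clause encoding (tuples of signed integers) are assumptions for illustration:

```python
def decide_3sat(n, clauses, approx_posterior):
    """Decide satisfiability with n oracle calls plus one linear-time check."""
    fixed = {}                               # Q_1..Q_i already assigned
    for i in range(1, n + 1):
        p_i = approx_posterior(i, fixed)     # ~ P(Q_i = t | X = t, fixed), abs. error < 0.5
        fixed[i] = p_i > 0.5                 # round to the more probable value
    # Check the candidate assignment directly (clauses as tuples of signed ints).
    return all(any(fixed[abs(l)] == (l > 0) for l in clause) for clause in clauses)
```

As on the slide, this makes n oracle calls and one linear-time check, so an ε-absolute-error oracle with ε < 0.5 would decide 3-SAT in polynomial time.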

  19. When can we hope to approximate? Two situations: • “Peaked” distributions: improbable values are ignored • Highly stochastic distributions: “far” evidence is discarded (e.g., far markers in genetic linkage analysis)

  20. Stochastic Simulation • Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn) • What is the probability that a random sample <x1,…,xn> satisfies e? • This is exactly P(e) • We can view each sample as tossing a biased coin with probability P(e) of “Heads”

  21. Stochastic Sampling • Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate P(e) by the fraction of samples that satisfy e (each sample contributes 1 or 0) • The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
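
A minimal sketch of this “biased coin” view (not from the slides); `draw_sample` and `satisfies_evidence` are placeholder callables for whatever sampler and evidence predicate are available:

```python
def estimate_prob(draw_sample, satisfies_evidence, num_samples=10_000):
    """Fraction of samples that satisfy the evidence; converges to P(e) as N grows."""
    hits = sum(satisfies_evidence(draw_sample()) for _ in range(num_samples))
    return hits / num_samples
```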

  22. Sampling a Bayesian Network • If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? • YES: sample according to structure of the network: sample each variable given its sampled parents

  23.–27. Logic sampling example (Burglary network). Nodes: Burglary (B), Earthquake (E), Alarm (A), Radio (R), Call (C), with B → A ← E, E → R, A → C. CPTs shown on the slides: P(b) = 0.03, P(e) = 0.001, P(a | B, E) with entries 0.98, 0.7, 0.4, 0.01, P(r | e) = 0.3, P(r | ē) = 0.001, P(c | a) = 0.8, P(c | ā) = 0.05. The slides step through one sample, drawing each variable in turn given its already-sampled parents (B, then E, then A, then C, then R) and recording the resulting complete instance.

  28. Logic Sampling • Let X1, …, Xn be an order of the variables consistent with arc direction • for i = 1, …, n do • sample xi from P(Xi | pai) • (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to them) • return x1, …, xn
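
A possible Python rendering of this loop for the Burglary network above (my sketch; the CPT numbers are read off the slides, and the mapping of the four P(a | B, E) entries to parent combinations is a guess from the slide layout):

```python
import random

P_B = 0.03
P_E = 0.001
P_A = {(True, True): 0.98, (True, False): 0.7,   # P(a | B, E); cell mapping is a guess
       (False, True): 0.4, (False, False): 0.01}
P_R = {True: 0.3, False: 0.001}                  # P(r | E)
P_C = {True: 0.8, False: 0.05}                   # P(c | A)

def sample_once():
    """Draw one complete instance, each variable given its already-sampled parents."""
    b = random.random() < P_B
    e = random.random() < P_E
    a = random.random() < P_A[(b, e)]
    r = random.random() < P_R[e]
    c = random.random() < P_C[a]
    return {"B": b, "E": e, "A": a, "R": r, "C": c}

# Estimate, say, P(C = t) by simple counting over many complete samples.
N = 100_000
print(sum(sample_once()["C"] for _ in range(N)) / N)
```

Sampling one instance is linear in the number of variables, matching the next slide.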

  29. Logic Sampling • Sampling a complete instance is linear in number of variables • Regardless of structure of the network • However, if P(e) is small, we need many samples to get a decent estimate

  30. Can we sample from P(X1,…,Xn | e)? • If the evidence is at the roots of the network, we can sample as before (fix the roots to their observed values). • If the evidence is at the leaves of the network, we have a problem: our sampling method proceeds according to the order of nodes in the graph, so we must retain only those samples that match e, and matching e might be a rare event.
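
A small sketch (my illustration, not from the slides) of this retain-only-matching-samples idea and its cost; `draw_sample`, `matches_evidence`, and `query` are placeholder callables:

```python
def rejection_estimate(draw_sample, matches_evidence, query, num_samples=100_000):
    """Sample from the prior, keep only samples that agree with e, average the query."""
    kept = [s for s in (draw_sample() for _ in range(num_samples)) if matches_evidence(s)]
    if not kept:                    # when P(e) is small, almost everything is discarded
        return None
    return sum(query(s) for s in kept) / len(kept)
```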

  31. Likelihood Weighting • Can we ensure that all of our samples are used? • One wrong (but fixable) approach: • When we need to sample a variable that is assigned a value by e, use that observed value. • For example (two-node network X → Y): we know Y = 1 • Sample X from P(X) • Then take Y = 1 • This is NOT a sample from P(X, Y | Y = 1)!

  32. Likelihood Weighting • Problem: these samples of X are from P(X) • Solution: penalize samples in which P(Y = 1|X) is small • We now sample as follows: • Let x[i] be a sample from P(X) • Let w[i] be P(Y = 1|X = x[i])
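
A toy numerical check of this point (my own, with made-up numbers for P(X) and P(Y = 1 | X)): clamping Y = 1 while sampling X from P(X) recovers only P(X), whereas weighting each sample by P(Y = 1 | X = x[i]) recovers P(X | Y = 1).

```python
import random

P_X1 = 0.5                        # P(X = 1), made up for the example
P_Y1_GIVEN_X = {1: 0.9, 0: 0.1}   # P(Y = 1 | X), made up for the example

samples = [int(random.random() < P_X1) for _ in range(100_000)]
naive = sum(samples) / len(samples)                      # ~= P(X = 1) = 0.5, ignores Y = 1
weights = [P_Y1_GIVEN_X[x] for x in samples]             # w[i] = P(Y = 1 | X = x[i])
weighted = sum(w for x, w in zip(samples, weights) if x == 1) / sum(weights)
print(naive, weighted)  # weighted ~= 0.45 / 0.50 = 0.9 = P(X = 1 | Y = 1)
```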

  33.–37. Likelihood weighting example (same Burglary network, same CPTs). The evidence fixes the Alarm and Radio nodes to their observed values. The slides step through one weighted sample: B and E are sampled from their CPTs as before; when the walk reaches an evidence node, its observed value is used and the sample’s weight is multiplied by the corresponding CPT entry (the slides show the weight accumulating as 0.6 × 0.3); the remaining variable C is sampled from its CPT.

  38. Likelihood Weighting • Let X1, …, Xn be an order of the variables consistent with arc direction • w = 1 • for i = 1, …, n do • if Xi = xi has been observed: w ← w · P(Xi = xi | pai) • else: sample xi from P(Xi | pai) • return x1, …, xn, and w
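
A possible Python rendering of this procedure for the same Burglary network (my sketch; CPT numbers as in the earlier logic-sampling sketch, and the evidence C = t in the usage line is illustrative):

```python
import random

P_B, P_E = 0.03, 0.001
P_A = {(True, True): 0.98, (True, False): 0.7, (False, True): 0.4, (False, False): 0.01}
P_R = {True: 0.3, False: 0.001}
P_C = {True: 0.8, False: 0.05}

def weighted_sample(evidence):
    """One pass in topological order: sample unobserved variables, weight observed ones."""
    cpds = [("B", lambda v: P_B), ("E", lambda v: P_E),
            ("A", lambda v: P_A[(v["B"], v["E"])]),
            ("R", lambda v: P_R[v["E"]]), ("C", lambda v: P_C[v["A"]])]
    values, w = {}, 1.0
    for name, prob_true in cpds:
        p = prob_true(values)
        if name in evidence:
            values[name] = evidence[name]
            w *= p if evidence[name] else 1.0 - p   # w <- w * P(X_i = x_i | pa_i)
        else:
            values[name] = random.random() < p
    return values, w

# Estimate P(B = t | C = t) from weighted samples (the evidence here is illustrative).
pairs = [weighted_sample({"C": True}) for _ in range(100_000)]
print(sum(w for v, w in pairs if v["B"]) / sum(w for _, w in pairs))
```

The final line is the weighted estimate discussed on the next slide: total weight of samples with B = t divided by the total weight of all samples.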

  39. Likelihood Weighting • Why does this make sense? • When N is large, we expect about N·P(X = x) samples with x[i] = x • Thus, the total weight of those samples is about N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1), and dividing by the total weight of all samples (about N·P(Y = 1)) gives an estimate of P(X = x | Y = 1).

  40. Likelihood Weighting • What can we say about the quality of the answer? • Intuitively, the weight of a sample reflects its probability given the evidence. We need to collect enough weight mass for the samples to provide an accurate answer. • Another factor is the “extremeness” of the CPDs. • Theorem (Dagum & Luby, AIJ 1993): If P(Xi | Pai) ∈ [l, u] for all local probability tables, and the number of samples N is large enough, then with probability 1 - δ, the estimate is an ε-relative error approximation.

  41. END
