
CS B553: Algorithms for Optimization and Learning — Bayesian Networks


Presentation Transcript


  1. CS B553: Algorithms for Optimization and Learning Bayesian Networks

  2. Agenda • Bayesian networks • Chain rule for Bayes nets • Naïve Bayes models • Independence declarations • D-separation • Probabilistic inference queries

  3. Purposes of Bayesian Networks • Efficient and intuitive modeling of complex causal interactions • Compact representation of joint distributions: O(n) rather than O(2^n) • Algorithms for efficient inference with given evidence (more on this next time)

  4. Independence of random variables • Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A) • Knowing B doesn’t give you any information about A • [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent]
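
A minimal numeric sketch of this definition; the joint table below is hypothetical, built so that the product rule holds exactly:

```python
import itertools

p_a = {True: 0.3, False: 0.7}   # hypothetical P(A)
p_b = {True: 0.6, False: 0.4}   # hypothetical P(B)

# Independent by construction: P(A,B) = P(A) P(B) for every value pair
joint = {(a, b): p_a[a] * p_b[b]
         for a, b in itertools.product([True, False], repeat=2)}

for (a, b), p in joint.items():
    assert abs(p - p_a[a] * p_b[b]) < 1e-12
print("P(A,B) = P(A) P(B) holds for all four value combinations")
```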

  5. Significance of independence • If A and B are independent, then P(A,B) = P(A) P(B) • => The joint distribution over A and B can be defined as a product over the distribution of A and the distribution of B • => Store two much smaller probability tables rather than a large probability table over all combinations of A and B

  6. Conditional Independence • Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C) • Once you know C, learning B doesn’t give you any information about A • [again, this has to hold for all combinations of values that A, B, C can take on]

  7. Significance of Conditional independence • Consider Grade(CS101), Intelligence, and SAT • Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores • but good students are more likely to get good SAT scores, so they are not independent… • It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence

  8. Bayesian Network • Explicitly represent independence among propositions • Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly: P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I) [Figure: Intel. → Grade, Intel. → SAT] 6 probabilities, instead of 11
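
A small sketch of this factorization with hypothetical CPT numbers (binary variables for brevity, though the slide’s Grade may have more values); it builds the joint as P(I) P(G|I) P(S|I) and checks that Grade and SAT come out conditionally independent given Intelligence:

```python
import itertools

p_i = {True: 0.3, False: 0.7}   # hypothetical P(Intelligence = high)
p_g = {True: 0.9, False: 0.5}   # hypothetical P(Grade = high | I)
p_s = {True: 0.8, False: 0.4}   # hypothetical P(SAT = high | I)

# Joint built from the factorization on the slide: P(I) P(G|I) P(S|I)
joint = {(i, g, s): p_i[i] * (p_g[i] if g else 1 - p_g[i])
                           * (p_s[i] if s else 1 - p_s[i])
         for i, g, s in itertools.product([True, False], repeat=3)}

# Verify G ⊥ S | I: P(G,S|I) = P(G|I) P(S|I) for all value combinations
for i in [True, False]:
    pi = sum(v for (ii, g, s), v in joint.items() if ii == i)   # P(I=i)
    for g, s in itertools.product([True, False], repeat=2):
        p_gs = joint[(i, g, s)] / pi
        pg = sum(joint[(i, g, ss)] for ss in [True, False]) / pi
        ps = sum(joint[(i, gg, s)] for gg in [True, False]) / pi
        assert abs(p_gs - pg * ps) < 1e-12
print("Grade ⊥ SAT | Intelligence holds in this model")
```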

  9. Definition: Bayesian Network • Set of random variables X = {X1,…,Xn} with domains Val(X1),…,Val(Xn) • Each node has a set of parents PaX • Graph must be a DAG • Each node also maintains a conditional probability distribution (often, a table) • P(X|PaX) • 2^k entries for a binary-valued variable with k binary parents • Overall: O(n·2^k) storage for binary variables • Encodes the joint probability over X1,…,Xn
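
A minimal sketch of this definition: each node stores a parent tuple and a CPT with one row per parent assignment. The structure below is the burglary network from the next slides, with placeholder probabilities, used only to count parameters:

```python
import itertools

def make_cpt(k):
    """CPT skeleton for a binary node with k binary parents: 2^k rows
    of P(X = true | parent assignment) (placeholder 0.5s here)."""
    return {a: 0.5 for a in itertools.product([True, False], repeat=k)}

network = {                       # node -> (parents, CPT)
    "B": ((), make_cpt(0)),
    "E": ((), make_cpt(0)),
    "A": (("B", "E"), make_cpt(2)),
    "J": (("A",), make_cpt(1)),
    "M": (("A",), make_cpt(1)),
}

n_params = sum(len(cpt) for _, cpt in network.values())
print(n_params)   # 1+1+4+2+2 = 10 numbers, vs 2^5 - 1 = 31 for the full joint
```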

  10. [Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls] Calculation of joint probability: P(j∧m∧a∧¬b∧¬e) = ??

  11. [Figure: the burglary network] • P(j∧m∧a∧¬b∧¬e) = P(j∧m|a,¬b,¬e) P(a∧¬b∧¬e) = P(j|a,¬b,¬e) × P(m|a,¬b,¬e) × P(a∧¬b∧¬e) (J and M are independent given A) • P(j|a,¬b,¬e) = P(j|a) (J and B, and J and E, are independent given A) • P(m|a,¬b,¬e) = P(m|a) • P(a∧¬b∧¬e) = P(a|¬b,¬e) × P(¬b|¬e) × P(¬e) = P(a|¬b,¬e) × P(¬b) × P(¬e) (B and E are independent) • P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)

  12. [Figure: the burglary network] Calculation of joint probability: P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 = 0.00062

  13. [Figure: the burglary network] Calculation of joint probability: P(x1∧x2∧…∧xn) = ∏i=1,…,n P(xi | paXi) (this product is the full joint distribution). P(j∧m∧a∧¬b∧¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 = 0.00062

  14. Chain Rule for Bayes Nets • Joint distribution is a product of all CPTs • P(X1, X2, …, Xn) = ∏i=1,…,n P(Xi | PaXi)
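
A sketch of the chain rule applied to the burglary example from the previous slides. The five numbers (0.9, 0.7, 0.001, 0.999, 0.998) come straight from the slides; CPT rows not needed for this particular query are omitted:

```python
parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}

# CPT: maps a tuple of parent values -> P(node = True | parents)
cpt = {
    "B": {(): 0.001},             # P(b) = 1 - 0.999
    "E": {(): 0.002},             # P(e) = 1 - 0.998
    "A": {(False, False): 0.001}, # P(a | ¬b, ¬e); other rows unused here
    "J": {(True,): 0.9},          # P(j | a)
    "M": {(True,): 0.7},          # P(m | a)
}

def joint_prob(assignment):
    """Chain rule: P(x1,…,xn) = ∏_i P(xi | pa_xi)."""
    p = 1.0
    for node, value in assignment.items():
        pa_vals = tuple(assignment[par] for par in parents[node])
        p_true = cpt[node][pa_vals]
        p *= p_true if value else 1.0 - p_true
    return p

print(joint_prob({"J": True, "M": True, "A": True, "B": False, "E": False}))
# 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.00062 (the value on the slide)
```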

  15. Example: Naïve Bayes models • P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause) [Figure: Cause → Effect1, Effect2, …, Effectn]

  16. Advantages of Bayes Nets (and other graphical models) • More manageable # of parameters to set and store • Incremental modeling • Explicit encoding of independence assumptions • Efficient inference techniques

  17. Arcs do not necessarily encode causality [Figure: three different BNs over A, B, C] Two BNs with the same expressive power, and a third with greater power (exercise)

  18. Reading off independence relationships • Given B, does the value of A affect the probability of C? • P(C|B,A) = P(C|B)? • No! • C’s parent (B) is given, so C is independent of its non-descendants (A) • Independence is symmetric: C ⊥ A | B ⟹ A ⊥ C | B [Figure: network over A, B, C]

  19. Basic Rule • A node is independent of its non-descendants given its parents (and given nothing else)

  20. [Figure: the burglary network] What does the BN encode? Burglary ⊥ Earthquake; JohnCalls ⊥ MaryCalls | Alarm; JohnCalls ⊥ Burglary | Alarm; JohnCalls ⊥ Earthquake | Alarm; MaryCalls ⊥ Burglary | Alarm; MaryCalls ⊥ Earthquake | Alarm. A node is independent of its non-descendants, given its parents

  21. [Figure: the burglary network] Reading off independence relationships • How about Burglary ⊥ Earthquake | Alarm? • No! Why?

  22. [Figure: the burglary network] Reading off independence relationships • How about Burglary ⊥ Earthquake | Alarm? • No! Why? • P(B∧E|A) = P(A|B,E) P(B∧E) / P(A) = 0.00075 • P(B|A) P(E|A) = 0.086
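
A numeric check of the two quantities on this slide by enumerating the joint. The alarm CPT rows beyond P(a|¬b,¬e) = 0.001 are the standard textbook values (0.95, 0.94, 0.29), an assumption, but one consistent with the 0.00075 and 0.086 shown:

```python
import itertools

p_b, p_e = 0.001, 0.002                             # priors from the slides
p_a = {(True, True): 0.95, (True, False): 0.94,     # P(a|b,e): assumed rows
       (False, True): 0.29, (False, False): 0.001}  # 0.001 is on the slides

def prob(b, e, a):
    """P(B=b, E=e, A=a) via the chain rule."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    return (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e) * pa

vals = [True, False]
p_A = sum(prob(b, e, True) for b, e in itertools.product(vals, repeat=2))
p_BE_A = prob(True, True, True) / p_A                  # P(B,E | A)
p_B_A = sum(prob(True, e, True) for e in vals) / p_A   # P(B | A)
p_E_A = sum(prob(b, True, True) for b in vals) / p_A   # P(E | A)

print(p_BE_A)          # ≈ 0.00075
print(p_B_A * p_E_A)   # ≈ 0.086 -> not equal, so B and E are dependent given A
```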

  23. [Figure: the burglary network] Reading off independence relationships • How about Burglary ⊥ Earthquake | JohnCalls? • No! Why? • Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent

  24. Independence relationships • For polytrees, there exists a unique undirected path between A and B. For each node E on the path: • Evidence on a directed chain X → E → Y or X ← E ← Y makes X and Y independent • Evidence on a common cause X ← E → Y makes its descendants X and Y independent • Evidence on a “v” node, or below the v: X → E ← Y, or X → W ← Y with E a descendant of W, makes X and Y dependent (otherwise they are independent)
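
A sketch of these path-blocking rules for polytrees, with hypothetical function names; a path is blocked (and its endpoints independent) if any triple along it blocks:

```python
def _children(parents):
    ch = {n: set() for n in parents}
    for node, ps in parents.items():
        for p in ps:
            ch[p].add(node)
    return ch

def _descendants(parents, node):
    ch, out, stack = _children(parents), set(), [node]
    while stack:
        for c in ch[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def _path(parents, x, y):
    """Unique undirected path x..y in a polytree, found by DFS."""
    ch = _children(parents)
    stack, seen = [(x, [x])], {x}
    while stack:
        node, trail = stack.pop()
        if node == y:
            return trail
        for nb in parents[node] | ch[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append((nb, trail + [nb]))
    return None

def independent(parents, x, y, evidence):
    """True if x and y are d-separated given the evidence set."""
    trail = _path(parents, x, y)
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        if a in parents[b] and c in parents[b]:   # v-node: a -> b <- c
            blocked = (b not in evidence
                       and not (_descendants(parents, b) & evidence))
        else:                                     # chain or common cause
            blocked = b in evidence
        if blocked:
            return True
    return False

bn = {"B": set(), "E": set(), "A": {"B", "E"}, "J": {"A"}, "M": {"A"}}
print(independent(bn, "B", "E", set()))   # True  (slide 20)
print(independent(bn, "B", "E", {"A"}))   # False (slide 22)
print(independent(bn, "B", "E", {"J"}))   # False (slide 23)
print(independent(bn, "J", "M", {"A"}))   # True  (slide 20)
```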

  25. General case • Formal property in the general case: • D-separation: the above properties hold for all (acyclic) paths between A and B • D-separation ⇔ independence • That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation • The CPTs may indeed encode additional independences

  26. Probability Queries • Given: some probabilistic model over variables X • Find: distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y • P(Y|E=e) • Inference problem

  27. Answering Inference Problems with the Joint Distribution • Easiest case: Y = X \ E • P(Y|E=e) = P(Y,e)/P(e) • Denominator makes the probabilities sum to 1 • Determine P(e) by marginalizing: P(e) = Σy P(Y=y, e) • Otherwise, let Z = X \ (E ∪ Y) • P(Y|E=e) = Σz P(Y, Z=z, e) / P(e) • P(e) = Σy Σz P(Y=y, Z=z, e) • Inference with the joint distribution: O(2^|X\E|) for binary variables
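
A sketch of this sum-out-and-normalize recipe on the burglary network, for the query P(Burglary | JohnCalls = true, MaryCalls = true). Values not shown on the slides (0.95, 0.94, 0.29 for the alarm CPT; 0.05 and 0.01 for calls with no alarm) are assumed textbook numbers:

```python
import itertools

cpt_a = {(True, True): 0.95, (True, False): 0.94,    # P(a|b,e)  [assumed]
         (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a, j, m):
    """Chain rule over the five variables."""
    p = (0.001 if b else 0.999) * (0.002 if e else 0.998)
    pa = cpt_a[(b, e)]
    p *= pa if a else 1 - pa
    pj = 0.90 if a else 0.05   # P(j|a) from the slides; P(j|¬a) assumed
    pm = 0.70 if a else 0.01   # P(m|a) from the slides; P(m|¬a) assumed
    return p * (pj if j else 1 - pj) * (pm if m else 1 - pm)

def posterior_burglary(j=True, m=True):
    """P(B | j, m): sum out hidden Z = {E, A}, then normalize by P(e)."""
    num = {b: sum(joint(b, e, a, j, m)
                  for e, a in itertools.product([True, False], repeat=2))
           for b in [True, False]}
    z = sum(num.values())      # P(evidence), the normalizing denominator
    return {b: v / z for b, v in num.items()}

print(posterior_burglary())    # P(b | j, m) ≈ 0.284 with these numbers
```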

  28. Naïve Bayes Classifier [Figure: Class → Feature1, Feature2, …, Featuren; Class = Spam / Not Spam, English / French / Latin, …; features = word occurrences] • P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class) • Given features, what class? P(C|F1,…,Fn) = P(C,F1,…,Fn)/P(F1,…,Fn) = 1/Z P(C) ∏i P(Fi|C)

  29. Naïve Bayes Classifier • P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class) • Given some features, what is the distribution over class? P(C|F1,…,Fk) = 1/Z P(C,F1,…,Fk) = 1/Z Σfk+1…fn P(C,F1,…,Fk,fk+1,…,fn) = 1/Z P(C) Σfk+1…fn ∏i=1…k P(Fi|C) ∏j=k+1…n P(fj|C) = 1/Z P(C) ∏i=1…k P(Fi|C) ∏j=k+1…n Σfj P(fj|C) = 1/Z P(C) ∏i=1…k P(Fi|C) (each Σfj P(fj|C) = 1, so the unobserved features drop out)
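
A sketch of this derivation as a classifier: since the unobserved features drop out, the posterior needs only P(C) and P(Fi|C) for the observed features. All numbers below are hypothetical:

```python
priors = {"spam": 0.4, "ham": 0.6}            # hypothetical P(Class)
likelihood = {                                # hypothetical P(word | class)
    "spam": {"offer": 0.30, "meeting": 0.02, "winner": 0.25},
    "ham":  {"offer": 0.03, "meeting": 0.20, "winner": 0.01},
}

def classify(observed_words):
    """P(C | observed features) = (1/Z) P(C) ∏i P(Fi | C)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for w in observed_words:   # only observed features enter the product
            p *= likelihood[c][w]
        scores[c] = p
    z = sum(scores.values())       # normalization constant Z
    return {c: s / z for c, s in scores.items()}

print(classify(["offer", "winner"]))   # heavily favors "spam" here
```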

  30. For General Queries • For BNs and queries in general, it’s not that simple… more in later lectures. • Next class: skim 5.1-3, begin reading 9.1-4
