Learning Bayesian Networks from Data


Presentation Transcript


  1. Learning Bayesian Networks from Data. Nir Friedman (Hebrew U.), Daphne Koller (Stanford).

  2. Overview • Introduction • Parameter Estimation • Model Selection • Structure Discovery • Incomplete Data • Learning from Structured Data

  3. Bayesian Networks: compact representation of probability distributions via conditional independence • Qualitative part: a directed acyclic graph (DAG); nodes are random variables, edges denote direct influence • Quantitative part: a set of conditional probability distributions • Together: they define a unique distribution in factored form. [Figure: the network Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call, with the family of Alarm, P(A | E, B):]

      E   B    P(a)   P(¬a)
      e   b    0.9    0.1
      e   ¬b   0.2    0.8
      ¬e  b    0.9    0.1
      ¬e  ¬b   0.01   0.99
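
To make the factored form concrete, here is a minimal Python sketch (ours, not from the tutorial) that stores one CPD per family and evaluates the joint P(B, E, A, C, R) = P(B) · P(E) · P(A | E, B) · P(R | E) · P(C | A). The P(A | E, B) entries follow the CPT above; every other number is an illustrative assumption.

```python
# One factor per family of the Burglary network. Values for P(A | E, B)
# come from the slide's CPT; the remaining numbers are assumed.
p_b = {True: 0.01, False: 0.99}    # assumed P(Burglary)
p_e = {True: 0.001, False: 0.999}  # assumed P(Earthquake)
p_a = {(True, True): 0.9, (True, False): 0.2,    # P(A=true | E, B)
       (False, True): 0.9, (False, False): 0.01}
p_r_given_e = {True: 0.95, False: 0.002}  # assumed P(R=true | E)
p_c_given_a = {True: 0.7, False: 0.05}    # assumed P(C=true | A)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    """Factored joint distribution: one term per family in the DAG."""
    return (p_b[b] * p_e[e]
            * bernoulli(p_a[(e, b)], a)
            * bernoulli(p_r_given_e[e], r)
            * bernoulli(p_c_given_a[a], c))

print(joint(b=True, e=False, a=True, c=True, r=False))
```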

  4. Example: the “ICU Alarm” network. Domain: monitoring intensive-care patients • 37 variables • 509 parameters, instead of the 2^54 needed for a full joint distribution. [Figure: the 37-node ICU Alarm DAG; nodes include MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, ARTCO2, SAO2, CATECHOL, HR, CO, BP, and others.]

  5. Inference • Posterior probabilities: probability of any event given any evidence • Most likely explanation: scenario that explains the evidence • Rational decision making: maximize expected utility • Value of information • Effect of intervention. [Figure: the Burglary/Earthquake/Radio/Alarm/Call network, with Radio and Call as example evidence.]

  6. Why learning? • Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we don’t have an expert • Data is cheap: the amount of available information is growing rapidly, and learning allows us to construct models from raw data

  7. Why Learn Bayesian Networks? • Conditional independencies & graphical language capture structure of many real-world distributions • Graph structure provides much insight into domain • Allows “knowledge discovery” • Learned model can be used for many tasks • Supports all the features of probabilistic learning • Model selection criteria • Dealing with missing data & hidden variables

  8. Learning Bayesian networks • Input: data + prior information • The Learner outputs a Bayesian network: structure plus CPDs. [Figure: data and prior information feed into the Learner, which produces the E/B/R/A/C network together with its CPT for P(A | E, B).]

  9. Known Structure, Complete Data • Network structure is specified; the inducer needs to estimate the parameters • Data does not contain missing values, e.g. instances over E, B, A: <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>. [Figure: the Learner fills in the unknown CPT entries (“?”) of the given structure, recovering P(A | E, B).]

  10. Unknown Structure, Complete Data • Network structure is not specified; the inducer needs to select the arcs and estimate the parameters • Data does not contain missing values, e.g. instances over E, B, A: <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>. [Figure: the Learner recovers both the arcs and the CPTs from the data.]

  11. Known Structure, Incomplete Data • Network structure is specified • Data contains missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y> • Need to consider assignments to the missing values. [Figure: the Learner estimates the CPT entries of the fixed structure despite the missing entries.]

  12. Unknown Structure, Incomplete Data • Network structure is not specified • Data contains missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y> • Need to consider assignments to the missing values. [Figure: the Learner must recover both structure and parameters from the incomplete data.]

  13. Overview • Introduction • Parameter Estimation • Likelihood function • Bayesian estimation • Model Selection • Structure Discovery • Incomplete Data • Learning from Structured Data

  14. Learning Parameters • Training data has the form: D = { ⟨E[1], B[1], A[1], C[1]⟩, …, ⟨E[M], B[M], A[M], C[M]⟩ }, one complete assignment to the network variables per instance. [Figure: the four-variable network with A’s parents E, B and child C.]

  15. Likelihood Function • Assume i.i.d. samples • The likelihood function is L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ).

  16. Likelihood Function • By the definition of the network, we get L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | E[m], B[m] : Θ) · P(C[m] | A[m] : Θ).

  17. Likelihood Function • Rewriting terms, we get L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | E[m], B[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)], i.e. a separate product for each variable’s CPD.

  18. General Bayesian Networks Generalizing to any Bayesian network: L(Θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i) = ∏_i L_i(Θ_i : D) • Decomposition → independent estimation problems, one per CPD.
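
Below is a short sketch (our illustration, with an assumed data layout) of this decomposition: the total log-likelihood of complete data is a sum of independent per-family terms.

```python
import math

def family_log_likelihoods(data, parents, cpds):
    """data: list of {var: value} rows; parents: {var: tuple of parent names};
    cpds: {var: {(parent_values, value): probability}}.
    Returns one log-likelihood term per family; their sum is log L(Theta : D)."""
    ll = {var: 0.0 for var in parents}
    for row in data:
        for var, pa in parents.items():
            pa_vals = tuple(row[p] for p in pa)
            ll[var] += math.log(cpds[var][(pa_vals, row[var])])
    return ll

# Tiny two-variable example: X -> Y.
parents = {"X": (), "Y": ("X",)}
cpds = {"X": {((), "h"): 0.6, ((), "t"): 0.4},
        "Y": {(("h",), "h"): 0.9, (("h",), "t"): 0.1,
              (("t",), "h"): 0.2, (("t",), "t"): 0.8}}
data = [{"X": "h", "Y": "h"}, {"X": "t", "Y": "t"}]
print(family_log_likelihoods(data, parents, cpds))
```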

  19. Likelihood Function: Multinomials • The likelihood for the sequence H, T, T, H, H is L(Θ : D) = Θ · (1−Θ) · (1−Θ) · Θ · Θ = Θ³ (1−Θ)² • General case: L(Θ : D) = ∏_k Θ_k^{N_k}, where Θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D. [Figure: L(Θ : D) plotted as a function of Θ over [0, 1].]

  20. Bayesian Inference • Represent uncertainty about parameters using a probability distribution over parameters and data • Learning using Bayes’ rule: P(Θ | D) = P(D | Θ) P(Θ) / P(D), i.e. posterior = likelihood × prior / probability of the data.

  21. Bayesian Inference • Represent the Bayesian distribution as a Bayes net: Θ is a parent of each instance variable X[1], …, X[M] • The values of X are independent given Θ • P(x[m] = H | Θ) = Θ • Bayesian prediction is inference in this network. [Figure: node Θ with an arrow to each observed data node X[1], X[2], …, X[m].]

  22. Example: Binomial Data • Prior: uniform for Θ in [0, 1] → P(Θ | D) ∝ the likelihood L(Θ : D) • For (N_H, N_T) = (4, 1): the MLE for P(X = H) is 4/5 = 0.8, while the Bayesian prediction is P(X[M+1] = H | D) = (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.71. [Figure: the posterior density over Θ on [0, 1].]
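
A quick check of these numbers in plain Python (the uniform prior makes the Bayesian prediction the familiar “add-one” estimate):

```python
n_h, n_t = 4, 1
mle = n_h / (n_h + n_t)              # 4/5 = 0.8
bayes = (n_h + 1) / (n_h + n_t + 2)  # 5/7 ~ 0.714 under the uniform prior
print(mle, bayes)
```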

  23. Dirichlet Priors • Recall that the likelihood function is L(Θ : D) = ∏_k Θ_k^{N_k} • A Dirichlet prior with hyperparameters α_1, …, α_K has the form P(Θ) ∝ ∏_k Θ_k^{α_k − 1} → the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K.

  24. Dirichlet Priors - Example [Figure: Dirichlet(α_heads, α_tails) densities plotted against P(heads) on [0, 1], for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5).]

  25. Dirichlet Priors (cont.) • If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then P(X[1] = k) = ∫ Θ_k P(Θ) dΘ = α_k / Σ_l α_l • Since the posterior is also Dirichlet, we get P(X[M+1] = k | D) = (α_k + N_k) / Σ_l (α_l + N_l).
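
These two formulas translate directly into code; a minimal sketch (ours), using the coin data from slide 22:

```python
def dirichlet_predict(alphas, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    post = [a + n for a, n in zip(alphas, counts)]
    total = sum(post)
    return [p / total for p in post]

# Uniform Dirichlet(1, 1) prior, data (N_H, N_T) = (4, 1):
print(dirichlet_predict([1, 1], [4, 1]))  # [5/7, 2/7]
```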

  26. Bayesian Nets & Bayesian Prediction • Priors for each parameter group are independent • Data instances are independent given the unknown parameters. [Figure: the unrolled network, with parameter nodes Θ_X and Θ_{Y|X} pointing into the observed data X[1..M], Y[1..M] and the query X[M+1], Y[M+1]; the same model is shown in plate notation, with a plate over the m = 1..M instances.]

  27. Bayesian Nets & Bayesian Prediction • We can also “read” from the network: complete data → the posteriors on the parameters are independent • So we can compute the posterior over each parameter group separately! [Figure: Θ_X and Θ_{Y|X} with the observed data X[1..M], Y[1..M].]

  28. Learning Parameters: Summary • Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i) • Parameter estimation: MLE θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i); Bayesian (Dirichlet) θ̂_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i)) • Both are asymptotically equivalent and consistent • Both can be implemented in an online manner by accumulating sufficient statistics, as in the sketch below.
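
The sketch below (ours, with an assumed data layout) shows the online scheme: accumulate the counts N(x_i, pa_i) one instance at a time, then read off either estimate.

```python
from collections import defaultdict

class FamilyEstimator:
    """Online sufficient statistics for one CPD P(X | Pa), discrete values."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha                   # Dirichlet hyperparameter per entry
        self.counts = defaultdict(float)     # N(x, pa)
        self.pa_counts = defaultdict(float)  # N(pa)
        self.values = set()                  # values of X seen so far

    def update(self, x, pa):
        self.counts[(x, pa)] += 1
        self.pa_counts[pa] += 1
        self.values.add(x)

    def mle(self, x, pa):
        return self.counts[(x, pa)] / self.pa_counts[pa]

    def bayes(self, x, pa):
        k = len(self.values)  # number of outcomes of X seen so far
        return ((self.counts[(x, pa)] + self.alpha)
                / (self.pa_counts[pa] + self.alpha * k))

est = FamilyEstimator()
for a, e_b in [("y", ("y", "n")), ("n", ("y", "n")), ("y", ("y", "n"))]:
    est.update(a, e_b)
print(est.mle("y", ("y", "n")), est.bayes("y", ("y", "n")))
```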

  29. Learning Parameters: Case Study • Instances sampled from the ICU Alarm network; M′ is the strength of the prior. [Figure: KL divergence to the true distribution (0 to 1.4) versus number of instances (0 to 5000), for MLE and for Bayesian estimation with M′ = 5, 20, and 50.]

  30. Overview • Introduction • Parameter Learning • Model Selection • Scoring function • Structure search • Structure Discovery • Incomplete Data • Learning from Structured Data

  31. Why Struggle for Accurate Structure? • Missing an arc: wrong assumptions about domain structure that cannot be compensated for by fitting parameters • Adding an arc: increases the number of parameters to be estimated, and also embodies wrong assumptions about domain structure. [Figure: the true network (Earthquake and Burglary as parents of AlarmSet, which is a parent of Sound), next to one variant missing an arc and one with a spurious extra arc.]

  32. Score-based Learning • Define a scoring function that evaluates how well a structure matches the data, e.g. instances over E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> • Search for the structure that maximizes the score. [Figure: candidate structures over E, B, A being scored against the data.]

  33. Likelihood Score for Structure • score_L(G : D) = M Σ_i I(X_i ; Pa_i^G) − M Σ_i H(X_i), where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents and the entropy term does not depend on G • Larger dependence of X_i on Pa_i → higher score • Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z}), so the maximum score is attained by a fully connected network • Overfitting: a bad idea…

  34. Bayesian Score • Likelihood score: score_L(G : D) = log L(θ̂_G : D), evaluated at the maximum-likelihood parameters θ̂_G • Bayesian approach: deal with uncertainty by assigning probability to all possibilities: P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ, the marginal likelihood: the likelihood averaged over the prior on the parameters.

  35. Marginal Likelihood: Multinomials Fortunately, in many cases the integral has a closed form • P(Θ) is Dirichlet with hyperparameters α_1, …, α_K • D is a dataset with sufficient statistics N_1, …, N_K • Then: P(D) = [Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k))] · ∏_k [Γ(α_k + N_k) / Γ(α_k)].
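
This closed form is easy to evaluate numerically in log space; a minimal check (ours), using only the standard library:

```python
from math import lgamma

def log_marginal_likelihood(alphas, counts):
    """log P(D) for a Dirichlet prior and multinomial counts, per the formula above."""
    a, n = sum(alphas), sum(counts)
    out = lgamma(a) - lgamma(a + n)
    for ak, nk in zip(alphas, counts):
        out += lgamma(ak + nk) - lgamma(ak)
    return out

# Uniform Dirichlet(1, 1) prior, the coin data H,T,T,H,H -> counts (3, 2):
print(log_marginal_likelihood([1, 1], [3, 2]))  # log(1/60)
```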

  36. Marginal Likelihood: Bayesian Networks • Network structure determines the form of the marginal likelihood • Example: seven instances of a pair (X, Y), each value H or T. [Table: the seven (X, Y) instances.] • Network 1: X and Y with no edge → the marginal likelihood is a product of two Dirichlet marginal likelihoods: an integral over Θ_X (applied to the X column) and an integral over Θ_Y (applied to the Y column).

  37. Marginal Likelihood: Bayesian Networks • Same data, Network 2: X → Y → three Dirichlet marginal likelihoods: an integral over Θ_X, an integral over Θ_{Y|X=H} (the Y values of instances with X = H), and an integral over Θ_{Y|X=T} (the Y values of instances with X = T).

  38. Marginal Likelihood for Networks The marginal likelihood has the form: P(D | G) = ∏_i ∏_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · ∏_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))], i.e. one Dirichlet marginal likelihood for each multinomial P(X_i | pa_i); the N(..) are counts from the data, and the α(..) are hyperparameters for each family given G.
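
The network score is then a sum of these terms, one per (family, parent configuration). A minimal sketch (ours; the data layout is an illustrative assumption):

```python
from math import lgamma

def log_ml(alphas, counts):  # Dirichlet marginal likelihood, as in slide 35
    out = lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
    return out + sum(lgamma(a + n) - lgamma(a) for a, n in zip(alphas, counts))

def log_marginal_likelihood_network(family_counts, alphas):
    """family_counts[var][pa_config] = list of outcome counts N(x_i, pa_i);
    alphas mirrors that layout with the hyperparameters alpha(x_i, pa_i)."""
    score = 0.0
    for var, per_config in family_counts.items():
        for pa_config, counts in per_config.items():
            score += log_ml(alphas[var][pa_config], counts)
    return score

# The structure X -> Y from slide 37, with illustrative counts:
family_counts = {"X": {(): [4, 3]},
                 "Y": {("H",): [3, 1], ("T",): [0, 3]}}
alphas = {"X": {(): [1, 1]},
          "Y": {("H",): [1, 1], ("T",): [1, 1]}}
print(log_marginal_likelihood_network(family_counts, alphas))
```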

  39. Bayesian Score: Asymptotic Behavior • As M (the amount of data) grows, log P(D | G) ≈ log P(D | G, θ̂_G) − (log M / 2) · dim(G): a term that fits the dependencies in the empirical distribution, plus a complexity penalty • Increasing pressure to fit the dependencies in the distribution; the complexity term avoids fitting noise • Asymptotically equivalent to the MDL score • The Bayesian score is consistent: the observed data eventually overrides the prior.
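
In code, this asymptotic (BIC/MDL) form of a single family's score looks like the sketch below (ours, using the same count layout as the earlier estimators):

```python
from math import log

def bic_family_score(counts, pa_counts, m_total, n_values):
    """counts: {(x, pa): N(x, pa)}; pa_counts: {pa: N(pa)}; m_total: M;
    n_values: number of values of X. Family log-likelihood at the MLE
    minus (log M / 2) times the number of free parameters."""
    ll = sum(n * log(n / pa_counts[pa])
             for (x, pa), n in counts.items() if n > 0)
    n_params = (n_values - 1) * len(pa_counts)  # free CPT entries
    return ll - 0.5 * log(m_total) * n_params

# Illustrative counts for a binary X with one binary parent:
counts = {("y", ("e",)): 8, ("n", ("e",)): 2,
          ("y", ("~e",)): 1, ("n", ("~e",)): 9}
pa_counts = {("e",): 10, ("~e",): 10}
print(bic_family_score(counts, pa_counts, m_total=20, n_values=2))
```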

  40. Structure Search as Optimization Input: • Training data • Scoring function • Set of possible structures Output: • A network that maximizes the score Key Computational Property: Decomposability: score(G) = Σ_X score(family of X in G).

  41. Tree-Structured Networks Trees: • At most one parent per variable Why trees? • Elegant math: we can solve the optimization problem efficiently • Sparse parameterization: avoids overfitting. [Figure: a tree-structured network over the ICU Alarm variables.]

  42. Learning Trees • Let p(i) denote the parent of X_i • We can write the Bayesian score as Score(G : D) = Σ_i [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i): the score of the “empty” network plus, for each edge, its improvement over the “empty” network • Score = sum of edge scores + constant.

  43. Learning Trees • Set w(j→i) = Score(X_j → X_i) − Score(X_i) • Find the tree (or forest) with maximal weight: a standard maximum spanning tree algorithm, O(n² log n) • Theorem: this procedure finds the tree with the maximum score. A sketch follows.
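
Here is a sketch of that step (ours): given precomputed edge weights, Kruskal's algorithm with a union-find builds the maximum-weight tree; dropping non-positive edges yields a forest when some edges do not improve on the empty network.

```python
def max_spanning_tree(n, weights):
    """n: number of variables; weights: {(j, i): w(j->i)} with j < i, treated
    as undirected for tree construction (exact for the likelihood score,
    whose edge weight is the symmetric mutual information)."""
    parent = list(range(n))

    def find(u):  # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (j, i), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        rj, ri = find(j), find(i)
        if rj != ri and w > 0:  # skip cycle-creating and non-positive edges
            parent[rj] = ri
            tree.append((j, i))
    return tree

print(max_spanning_tree(4, {(0, 1): 2.0, (1, 2): 1.5, (0, 2): 0.3, (2, 3): 0.9}))
```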

  44. Beyond Trees When we consider more complex networks, the problem is not as easy • Suppose we allow at most two parents per node • A greedy algorithm is no longer guaranteed to find the optimal network • In fact, no efficient algorithm exists • Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.

  45. Heuristic Search • Define a search space: • search states are possible structures • operators make small changes to structure • Traverse space looking for high-scoring structures • Search techniques: • Greedy hill-climbing • Best first search • Simulated Annealing • ...

  46. Local Search • Start with a given network • empty network • best tree • a random network • At each iteration • Evaluate all possible changes • Apply change based on score • Stop when no modification improves score

  47. Heuristic Search • Typical operations: add an edge (e.g. add C → D), delete an edge (e.g. delete C → E), and reverse an edge (e.g. reverse C → E) • To update the score after a local change, only re-score the families that changed; e.g. after adding C → D, Δscore = S(D | {C, E}) − S(D | {E}). [Figure: a small network over S, C, E, D before and after each operation.]
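
A compact greedy hill-climbing sketch (ours; reverse moves are omitted for brevity, and family_score stands for any decomposable per-family scoring function, such as the BIC sketch above):

```python
import itertools

def creates_cycle(parents, child, new_parent):
    """Would adding new_parent -> child create a directed cycle?"""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(variables, family_score, max_iters=1000):
    parents = {v: set() for v in variables}
    score = {v: family_score(v, parents[v]) for v in variables}
    for _ in range(max_iters):
        best_delta, best_move = 0.0, None
        for x, y in itertools.permutations(variables, 2):
            if y in parents[x]:                     # candidate: delete y -> x
                delta = family_score(x, parents[x] - {y}) - score[x]
                move = ("del", y, x)
            elif not creates_cycle(parents, x, y):  # candidate: add y -> x
                delta = family_score(x, parents[x] | {y}) - score[x]
                move = ("add", y, x)
            else:
                continue
            if delta > best_delta:  # decomposability: one family re-scored
                best_delta, best_move = delta, move
        if best_move is None:
            break  # local maximum: no single-edge change improves the score
        op, y, x = best_move
        parents[x] = parents[x] | {y} if op == "add" else parents[x] - {y}
        score[x] = family_score(x, parents[x])
    return parents

# Toy score that rewards exactly the family A <- B (illustrative only):
toy = lambda x, pa: 1.0 if (x, frozenset(pa)) == ("A", frozenset({"B"})) else 0.0
print(hill_climb(["A", "B"], toy))
```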

  48. Learning in Practice: Alarm Domain [Figure: KL divergence to the true distribution (0 to 2) versus number of samples (0 to 5000), comparing “structure known, fit parameters” against “learn both structure and parameters”.]

  49. Local Search: Possible Pitfalls • Local search can get stuck in: • Local Maxima: • All one-edge changes reduce the score • Plateaux: • Some one-edge changes leave the score unchanged • Standard heuristics can escape both • Random restarts • TABU search • Simulated annealing

  50. Improved Search: Weight Annealing • Standard annealing process: take bad steps with probability ∝ exp(Δscore / t); the probability increases with the temperature t • Weight annealing: take uphill steps relative to a perturbed score; the perturbation increases with the temperature. [Figure: Score(G | D) plotted over the structure space G.]
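
For contrast, the standard annealing acceptance rule named above fits in a few lines (our sketch; the temperature schedule is left to the caller):

```python
import math, random

def accept(delta_score, temperature):
    """Always take improving steps; take a bad step with prob. exp(delta/t),
    which grows toward 1 as the temperature rises."""
    if delta_score >= 0:
        return True
    return random.random() < math.exp(delta_score / temperature)
```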
