
Learning CRFs with Hierarchical Features: An Application to Go

Scott Sanner (University of Toronto) · Thore Graepel, Ralf Herbrich, Tom Minka (Microsoft Research) · with thanks to David Stern and Mykel Kochenderfer



Presentation Transcript


  1. Learning CRFs with Hierarchical Features: An Application to Go – University of Toronto / Microsoft Research – Scott Sanner, Thore Graepel, Ralf Herbrich, Tom Minka (with thanks to David Stern and Mykel Kochenderfer)

  2. The Game of Go • Started about 4000 years ago in ancient China • About 60 million players worldwide • 2 players: Black and White • Board: 19×19 grid • Rules: • Turn: one stone placed on a vertex • Capture: a stone is captured by surrounding it • Aim: gather territory by surrounding it [Figure: example positions showing Black territory and White territory]

  3. Territory Prediction • Goal: predict the territory distribution given the board... • How to predict territory? • Could use simulated play: • Monte Carlo averaging is an excellent estimator • But with avg. ~180 legal moves/turn and >100 moves/game, simulation is costly • Instead, we learn to directly predict territory: • Learn from expert data
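The Monte Carlo baseline above can be sketched as follows. The playout engine is stubbed out with a hypothetical `toy_playout`; a real engine would play random legal moves to the end of the game and score the final position:

```python
import random

def estimate_territory(playout, n_vertices, n_playouts=5000):
    """Monte Carlo territory estimate: average each vertex's final
    ownership (+1 = Black, -1 = White) over many simulated games."""
    totals = [0.0] * n_vertices
    for _ in range(n_playouts):
        outcome = playout()              # one simulated game's final ownership
        for v, owner in enumerate(outcome):
            totals[v] += owner
    return [t / n_playouts for t in totals]

# Hypothetical stand-in for a real playout engine: vertex v ends up
# Black with fixed probability p[v].
p = [0.9, 0.5, 0.1]
def toy_playout():
    return [1 if random.random() < q else -1 for q in p]

random.seed(0)
estimate = estimate_territory(toy_playout, n_vertices=3)
```

The averages converge to the expected ownership 2·p[v] − 1 of each vertex, which is exactly the quantity the learned models are trained to predict directly, without the cost of running playouts at evaluation time.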

  4. Talk Outline • Hierarchical pattern features • Independent pattern-based classifiers • Best way to combine features? • CRF models • Coupling factor model (w/ patterns) • Best training / inference approximation to circumvent intractability? • Evaluation and Conclusions

  5. Hierarchical Patterns • Centered on a single position • Exact configuration of stones • Fixed match region (template) • 8 nested templates • 3.8 million patterns mined

  6. Models • Vertex variables • Unary pattern-based factors • Coupling factors • (a) Independent pattern-based classifiers • (b) CRF • (c) Pattern CRF

  7. Independent Pattern-based Classifiers

  8. Inference and Training • Up to 8 pattern sizes may match at any vertex • Which pattern to use? • Smallest pattern • Largest pattern • Or, combine all patterns: • Logistic regression • Bayesian model averaging…
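One of the combination schemes above, logistic regression over the predictions of all matching pattern sizes, might look like this sketch; using each pattern's log-odds as the input feature is an assumption for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine_patterns(pattern_probs, weights, bias=0.0):
    """Combine the territory predictions of every matching pattern
    size with a logistic regression: each pattern's log-odds is one
    input feature, weighted by a learned coefficient."""
    z = bias
    for prob, w in zip(pattern_probs, weights):
        prob = min(max(prob, 1e-6), 1.0 - 1e-6)   # clamp to avoid log(0)
        z += w * math.log(prob / (1.0 - prob))
    return sigmoid(z)
```

With fewer than 8 matching templates at a vertex (e.g. near the board edge), only the available predictions are passed in; the weights let training decide how much to trust small, frequently matching patterns versus large, specific ones.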

  9. Bayesian Model Averaging • Bayesian approach to combining models • Now examine the model “weight” • Each model must apply to all data!
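The standard Bayesian model averaging decomposition underlying this slide (written here in generic notation; the paper's exact parameterization may differ) is:

\[
P(y \mid x, \mathcal{D}) \;=\; \sum_i P(y \mid x, M_i)\, P(M_i \mid \mathcal{D}),
\qquad
P(M_i \mid \mathcal{D}) \;\propto\; P(\mathcal{D} \mid M_i)\, P(M_i),
\]

so each model's weight is its posterior probability given the training data. This is why each model \(M_i\) must assign a likelihood to *all* of the data, not just to the positions where its pattern happens to match, and it motivates the tree construction on the next slide.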

  10. Hierarchical Tree Models • Arrange patterns into decision trees • Each tree model provides predictions on all data

  11. CRF & Pattern CRF

  12. Inference and Training • Inference: • Exact inference is slow for 19×19 grids • Loopy BP is faster, but biased • Sampling is unbiased, but slower than Loopy BP • Training: • Maximum likelihood requires inference! • Other approximate methods…

  13. Pseudolikelihood • Standard log-likelihood: • Edge-based pseudo log-likelihood: • Then inference during training is purely local • Long-range effects are captured in the data • Note: only valid for training in the presence of fully labeled data
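In standard notation (a sketch; the paper's exact parameterization may differ), the two objectives contrast as follows. The full log-likelihood

\[
\ell(\theta) \;=\; \log P(\mathbf{y} \mid \mathbf{x}; \theta)
\]

requires the global partition function \(Z(\mathbf{x}; \theta)\) over all joint labelings, whereas an edge-based pseudo log-likelihood takes the form

\[
\ell_{\mathrm{PL}}(\theta) \;=\; \sum_{(i,j) \in E} \log P\bigl(y_i, y_j \mid \mathbf{y}_{N(i,j)}, \mathbf{x}; \theta\bigr),
\]

where \(N(i,j)\) denotes the neighboring vertices of edge \((i,j)\). Each term normalizes over only the two variables \(y_i, y_j\), so inference during training is purely local; but since each term conditions on the *observed* labels of the neighbors, the objective is only defined when the training data is fully labeled.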

  14. Local Training • Piecewise: • Shared Unary Piecewise:
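A sketch of the piecewise idea, following Sutton and McCallum's formulation (the shared-unary variant's exact weighting is an assumption): each factor \(a\) with potential \(\psi_a\) is trained as its own locally normalized model,

\[
\ell_{\mathrm{PW}}(\theta) \;=\; \sum_a \log \frac{\psi_a(\mathbf{y}_a; \mathbf{x}, \theta)}{\sum_{\mathbf{y}'_a} \psi_a(\mathbf{y}'_a; \mathbf{x}, \theta)},
\]

so no global (or even neighborhood-conditional) inference is needed at training time. In the shared-unary variant, each edge piece also includes the unary factors at its two endpoint vertices, so unary and coupling parameters are fit jointly at the local level rather than in isolation.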

  15. Evaluation

  16. Models & Algorithms • Model & algorithm specification: • Model / Training (/ Inference, if not obvious) • Models & algorithms evaluated: • Indep / {Smallest, Largest} Pattern • Indep / BMA-Tree {Uniform, Exp} • Indep / Log Regr • CRF / ML Loopy BP (/ Swendsen-Wang) • Pattern CRF / Pseudolikelihood (Edge) • Pattern CRF / (S. U.) Piecewise • Monte Carlo

  17. Training Time • Approximate time for various models and algorithms to reach convergence:

  18. Inference Time • Average time to evaluate for various models and algorithms on a 19×19 board:

  19. Performance Metrics • Vertex Error: (classification error) • Net Error: (score error) • Log Likelihood: (model fit)
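Plausible implementations of the first two metrics (the exact definitions are assumptions; labels here are +1 for Black territory and −1 for White):

```python
def vertex_error(pred, truth):
    """Fraction of vertices whose predicted owner differs from the
    true final territory (+1 = Black, -1 = White)."""
    wrong = sum(1 for p, t in zip(pred, truth) if p != t)
    return wrong / len(truth)

def net_error(pred, truth):
    """Absolute error in the net score (Black territory minus White).
    Opposite-sign vertex errors cancel in the sums, so a model can
    have high vertex error yet low net error, or vice versa."""
    return abs(sum(pred) - sum(truth))
```

This cancellation is one reason the two metrics can rank models differently, which matters for the CRF results on the following slides.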

  20. Performance Tradeoffs I

  21. Why is Vertex Error better for CRFs? • Coupling factors help realize stable configurations • Compare the earlier unary-only independent model to the unary-and-coupling model: • Independent models make inconsistent predictions • Loopy BP smooths these predictions (but perhaps too much?) [Figures: BMA-Tree model vs. coupling model with Loopy BP]

  22. Why is Net Error worse for CRFs? • Use sampling to examine the bias of Loopy BP • Sampling is unbiased inference in the limit • Feasible to run over all test data, but still too costly for training • Smoothing removes local inconsistencies • But errors reinforce each other! [Figures: Loopy Belief Propagation vs. Swendsen-Wang sampling]

  23. Bias of Local Training • Problems with piecewise training: • Very biased when used in conjunction with Loopy BP • Predictions are good (low Vertex Error), just saturated • This accounts for the poor Log Likelihood and Net Error… [Figures: ML-trained vs. piecewise-trained predictions]

  24. Performance Tradeoffs II

  25. Conclusions Two general messages: (1) CRFs vs. Independent Models: • Pattern CRFs should theoretically be better • However, time cost is high • Can save time with approximate training / inference • But then CRFs may perform worse than independent classifiers – depends on metric (2) For Independent Models: • Problem of choosing appropriate neighborhood can be finessed by Bayesian model averaging

  26. Thank you! Questions?
