
Learning Tree Conditional Random Fields

Presentation Transcript


  1. Learning Tree Conditional Random Fields. Joseph K. Bradley, Carlos Guestrin.

  2. Reading people’s minds. X: fMRI voxels; Y: semantic features (Metal? Manmade? Found in house? ...). Goal: predict Y from X (application from Palatucci et al., 2009). Predict each feature independently, Yi ~ X for all i? But the features are correlated, e.g., Person? & Live in water?, Colorful? & Yellow? We want to model conditional correlations. Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

  3. Conditional Random Fields (CRFs) (Lafferty et al., 2001). In fMRI, X ≈ 500 to 10,000 voxels. Pro: avoid modeling P(X).

  4–5. Conditional Random Fields (CRFs) encode conditional independence structure. [Figure: graph over Y1–Y4.] Pro: avoid modeling P(X).

  6. Conditional Random Fields (CRFs). [Figure: graph over Y1–Y4.] The normalization depends on X = x. Pro: avoid modeling P(X). Con: compute Z(x) for each inference.
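
The slides' equations are not reproduced in the transcript; for reference, the standard pairwise CRF of Lafferty et al. (2001) that this talk builds on can be written as below, which makes explicit why the normalizer Z(x) must be recomputed for every input x:

```latex
P(Y \mid X = x) \;=\; \frac{1}{Z(x)} \prod_{(i,j) \in E} \Phi_{ij}(Y_i, Y_j, x),
\qquad
Z(x) \;=\; \sum_{y} \prod_{(i,j) \in E} \Phi_{ij}(y_i, y_j, x),
```

where E is the model's edge set (a spanning tree over the Yi in this work).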

  7. Conditional Random Fields (CRFs). [Figure: graph over Y1–Y4.] Exact inference is intractable in general, and approximate inference is expensive, so use tree CRFs. Pro: avoid modeling P(X). Con: compute Z(x) for each inference.

  8. Conditional Random Fields (CRFs). [Figure: tree over Y1–Y4.] Use tree CRFs. Pros: fast, exact inference; avoid modeling P(X). Con: compute Z(x) for each inference.

  9. CRF Structure Learning. [Figure: tree over Y1–Y4.] Structure learning & feature selection. Tree CRFs: fast, exact inference; avoid modeling P(X).

  10. CRF Structure Learning. Use local inputs (scalable) instead of global inputs (not scalable). Tree CRFs: fast, exact inference; avoid modeling P(X).

  11. This work. Goals: structured conditional models P(Y|X); scalable methods; tree structures; local inputs Xij; max spanning trees. Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights; experiments (synthetic & fMRI).

  12. Related work, compared to this work along two axes: choice of edge weights, and use of local inputs. [Comparison table not reproduced in transcript.]

  13. Chow-Liu. For generative models: weight each candidate edge (Yi,Yj) by the mutual information I(Yi;Yj) and take a maximum spanning tree.
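
A minimal sketch of the Chow-Liu construction for generative models, since the slide's formula is not captured in the transcript: weight each pair of labels by their empirical mutual information and take a maximum spanning tree. Illustrative code only (not the authors' implementation), assuming numpy, networkx, and discrete (e.g., binary) data.

```python
import itertools

import networkx as nx
import numpy as np


def empirical_mi(a, b):
    """Empirical mutual information I(A;B) for two discrete 1-D arrays."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


def chow_liu_tree(Y):
    """Chow-Liu: weight each pair (Yi, Yj) by I(Yi;Yj); return a max spanning tree."""
    G = nx.Graph()
    for i, j in itertools.combinations(range(Y.shape[1]), 2):
        G.add_edge(i, j, weight=empirical_mi(Y[:, i], Y[:, j]))
    return sorted(nx.maximum_spanning_tree(G).edges())
```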

  14. Chow-Liu for CRFs? For CRFs with global inputs, weight each edge with Global CMI (Conditional Mutual Information), I(Yi;Yj | X). Pro: a “gold standard.” Con: I(Yi;Yj | X) is intractable for large X.
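
For concreteness, the global CMI edge weight referred to here is (my transcription of the standard definition, not the slide's rendering):

```latex
w_{ij} \;=\; I(Y_i ; Y_j \mid X)
\;=\; \mathbb{E}_{X}\!\left[\, \sum_{y_i, y_j} P(y_i, y_j \mid X)\,
      \log \frac{P(y_i, y_j \mid X)}{P(y_i \mid X)\, P(y_j \mid X)} \,\right],
```

which requires conditional distributions given the full input X, hence the intractability for hundreds or thousands of voxels.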

  15. Where now? Global CMI: pro, a “gold standard”; con, I(Yi;Yj | X) is intractable for large X. Algorithmic framework: given data {(y(i),x(i))} and an input mapping Yi → Xi, weight each potential edge (Yi,Yj) with Score(i,j) and choose a max spanning tree. Local inputs only!
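
A sketch of this framework as stated on the slide: score every candidate edge using only the local inputs mapped to its two labels, then take a max spanning tree. The names (learn_tree_structure, input_map, score) are illustrative, and networkx is an assumed dependency.

```python
import itertools

import networkx as nx


def learn_tree_structure(Y, X, input_map, score):
    """Weight each candidate edge (Yi, Yj) with Score(i, j) computed from
    local inputs only, then choose a max spanning tree over the labels.

    Y: (n, p) label matrix; X: (n, d) input matrix.
    input_map: dict mapping label index i to a list of input columns (Xi).
    score: callable(yi, yj, Xij) -> float, e.g. local CMI, PWL, or DCI.
    """
    G = nx.Graph()
    for i, j in itertools.combinations(range(Y.shape[1]), 2):
        Xij = X[:, input_map[i] + input_map[j]]  # local inputs for this edge
        G.add_edge(i, j, weight=score(Y[:, i], Y[:, j], Xij))
    return sorted(nx.maximum_spanning_tree(G).edges())
```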

  16. Generalized edge scores. Key step: weight edge (Yi,Yj) with Score(i,j). Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj. E.g., Local Conditional Mutual Information.
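
Written out (my transcription; the slide's formula is not in the transcript), the Local CMI score conditions only on the inputs mapped to the two labels, and it is indeed a linear combination of entropies over Yi, Yj, Xi, Xj:

```latex
\mathrm{Score}(i,j)
\;=\; I(Y_i ; Y_j \mid X_i, X_j)
\;=\; H(Y_i \mid X_i, X_j) + H(Y_j \mid X_i, X_j) - H(Y_i, Y_j \mid X_i, X_j).
```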

  17. Generalized edge scores. Key step: weight edge (Yi,Yj) with Score(i,j). Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj. Theorem: assume the true P(Y|X) is a tree CRF (with non-trivial parameters); then no Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).

  18. Heuristics. Outline: gold standard; max spanning trees; generalized edge weights; heuristic weights (piecewise likelihood, Local CMI, DCI); experiments (synthetic & fMRI).

  19. Piecewise likelihood (PWL). Sutton and McCallum (2005, 2007) proposed PWL for parameter learning; the main idea is to bound Z(x). For tree CRFs, the optimal parameters give an edge score with local inputs Xij that bounds the log likelihood. PWL fails on a simple counterexample and does badly in practice, but it helps explain the other edge scores.
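
As a reminder of the idea (reconstructed from Sutton and McCallum's piecewise training, not copied from the slide, and assuming their normalization conditions hold): replace the global normalizer by per-edge normalizers, which bound it from above, so the resulting objective lower-bounds the log likelihood and decomposes into per-edge terms that depend only on the local inputs Xij:

```latex
\log Z(x) \;\le\; \sum_{(i,j)} \log Z_{ij}(x_{ij}),
\qquad
Z_{ij}(x_{ij}) \;=\; \sum_{y_i, y_j} \Phi_{ij}(y_i, y_j, x_{ij}),
```

so each edge contributes a local term E[log Φ(Yi,Yj,Xij) - log Zij(Xij)] that can serve as an edge score.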

  20. Piecewise likelihood (PWL). [Figure: counterexample. The true P(Y,X) is a chain Y1–Y2–…–Yn with inputs X1…Xn; a strong potential makes PWL choose edges (Y2,Yj) over (Yj,Yk).]

  21. Local Conditional Mutual Information. A decomposable score with local inputs Xij. Theorem: Local CMI bounds the log likelihood gain. Does pretty well in practice, but can fail with strong potentials.
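
A small sketch of how a decomposable score like this can be estimated from data via the entropy decomposition above, for discrete labels and a small discrete local input Xij. Illustrative only; per the experiments slide below, the authors instead fit regularized regressions for P(Yij|Xij) rather than counting configurations.

```python
from collections import Counter

import numpy as np


def cond_entropy(target, cond):
    """Empirical conditional entropy H(target | cond); both arguments are
    sequences of hashable values (use tuples for multivariate quantities)."""
    n = len(target)
    joint, marg = Counter(zip(target, cond)), Counter(cond)
    return -sum((c / n) * np.log(c / marg[z]) for (_, z), c in joint.items())


def local_cmi(yi, yj, Xij):
    """Empirical I(Yi;Yj | Xij) = H(Yi|Xij) + H(Yj|Xij) - H(Yi,Yj|Xij)."""
    x = [tuple(row) for row in np.atleast_2d(Xij).reshape(len(yi), -1)]
    return (cond_entropy(list(yi), x) + cond_entropy(list(yj), x)
            - cond_entropy(list(zip(yi, yj)), x))
```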

  22. Local Conditional Mutual Information. [Figure: counterexample. The true P(Y,X) is a chain Y1–Y2–…–Yn with inputs X1…Xn; a strong potential causes Local CMI to choose a wrong edge.]

  23. Decomposable Conditional Influence (DCI). Derived from PWL. [Figure: edge gain among Y1, Y2, Y3.] An exact measure of gain for some edges; an edge score with local inputs Xij; succeeds on the counterexample; does best in practice.

  24. Experiments: algorithmic details. Given: data {(y(i),x(i))} and an input mapping Yi → Xi. Compute edge scores: regress P(Yij|Xij), with 10-fold CV to choose the regularization. Choose a max spanning tree. Parameter learning: conjugate gradient on the L2-regularized log likelihood, with 10-fold CV to choose the regularization.
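
As one concrete reading of the "regress P(Yij|Xij), 10-fold CV" step for binary labels: encode the pair (Yi,Yj) as a 4-class target and fit an L2-regularized logistic regression with a cross-validated regularization strength. A sketch under assumed tooling (scikit-learn), not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV


def fit_edge_conditional(yi, yj, Xij):
    """Fit P(Yi, Yj | Xij) for binary Yi, Yj by encoding the pair as one of
    four classes; 10-fold CV chooses the L2 regularization strength."""
    pair = 2 * np.asarray(yi, dtype=int) + np.asarray(yj, dtype=int)
    model = LogisticRegressionCV(cv=10, penalty="l2", max_iter=1000)
    return model.fit(Xij, pair)  # model.predict_proba(Xij) estimates P(Yi, Yj | Xij)
```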

  25. Synthetic experiments. [Figure: chain models for P(Y|X) and P(X) over Y1…Yn and X1…Xn.] Binary Y and X; tabular edge factors. Use the natural input mapping Yi → Xi.

  26. Synthetic experiments. [Figure: tractable vs. intractable joint P(Y,X) over Y1…Y5, X1…X5.] P(Y,X): tractable & intractable. Edge factors Φ(Yij,Xij). P(Y|X), P(X): chains & trees.

  27. Synthetic experiments. [Figure: P(Y|X) chain over Y1…Yn with cross factors to X1…Xn.] P(Y,X): tractable & intractable. Φ(Yij,Xij): with & without cross-factors; associative (all-positive & alternating +/-) and random factors. P(Y|X): chains & trees.

  28. Synthetic: vary # train exs.

  29–33. Synthetic: vary # train exs. [Results plots: tree structure, intractable P(Y,X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.]

  34. Synthetic: vary # train exs.

  35. Synthetic: vary model size Fixed 50 train exs., 1000 test exs.

  36. fMRI experiments. X (500 fMRI voxels) → predict → Y (218 semantic features: Metal? Manmade? Found in house? ...) → decode (hand-built map) → object (60 total: bear, screwdriver, ...). Data and setup from Palatucci et al. (2009). Zero-shot learning: can predict objects not in the training data (given the decoding). Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

  37. fMRI experiments. X: 500 fMRI voxels; Y: 218 semantic features to predict. Y and X are real-valued → Gaussian factors, with A and C, b regularized separately. CV for parameter learning is very expensive → do CV on subject 0 only. Input mapping: regressed Yi ~ Y-i, X and chose the top K inputs. Two methods: CRF1 (K=10) and CRF2 (K=20, with an added fixed term; the factor forms are not reproduced in the transcript).
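
A sketch of the input-mapping step described here: regress Yi on the remaining labels and the voxels, then keep the top K inputs. Ridge regression, the alpha value, and the function name are assumptions for illustration; the slide does not specify which regression was used.

```python
import numpy as np
from sklearn.linear_model import Ridge


def choose_top_k_inputs(Y, X, i, K=10, alpha=1.0):
    """For label Yi: regress Yi on [Y_{-i}, X] and keep the K candidate
    inputs with the largest-magnitude coefficients."""
    others = np.delete(np.arange(Y.shape[1]), i)
    Z = np.hstack([Y[:, others], X])  # candidates: other labels, then voxels
    coef = Ridge(alpha=alpha).fit(Z, Y[:, i]).coef_
    top = np.argsort(np.abs(coef))[::-1][:K]
    # Report each chosen input as ('Y', label index) or ('X', voxel index).
    return [("Y", int(others[t])) if t < len(others) else ("X", int(t - len(others)))
            for t in top]
```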

  38. fMRI experiments. Accuracy (for zero-shot learning): hold out objects i and j; predict Y(i)’ and Y(j)’; if ||Y(i) - Y(i)’||2 < ||Y(j) - Y(i)’||2, then we got i right.
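
The accuracy test on this slide, written as code (a direct transcription of the rule above; variable names are mine):

```python
import numpy as np


def zero_shot_pair_correct(y_true_i, y_true_j, y_pred_i):
    """Held-out object i counts as correct if its predicted semantic vector
    is closer (in L2) to the true Y(i) than to the true Y(j)."""
    return (np.linalg.norm(y_true_i - y_pred_i)
            < np.linalg.norm(y_true_j - y_pred_i))
```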

  39. fMRI experiments. Accuracy: CRFs a bit worse. [Results plot.]

  40. fMRI experiments. Accuracy: CRFs a bit worse. Log likelihood: CRFs better. [Results plot.]

  41–42. fMRI experiments. Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better. [Results plots.]

  43. Conclusion. Scalable learning of CRF structure: analyzed edge scores for spanning tree methods; Local Linear Entropy Scores are imperfect; the heuristics have pleasing theoretical properties and empirical success, and we recommend DCI. Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability. Thank you!

  44. Thank you! References:
  M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
  J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
  M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
  M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
  D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
  C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
  C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
  A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.

  45. (extra slides)

  46. B: Score Decay Assumption

  47. B: Example complexity

  48. Future work: Templated CRFs. Example: WebKB (Craven et al., 1998), with webpages {(Yi = page type, Xi = content)}. Learn a template, e.g., Score(i,j) = DCI(i,j), plus a parametrization; use the template to choose a tree over pages and to instantiate parameters, giving P(Y|X=x) = P(pages’ types | pages’ content). Requires local inputs; potentially very fast.

  49. Future work: learn the score. Given training queries, each consisting of data plus a ground-truth model (e.g., from an expensive structure learning method), learn a function Score(Yi,Yj) for the MST algorithm.
