Sequential Learning with Dependency Nets

Presentation Transcript


  1. Sequential Learning with Dependency Nets William W. Cohen 2/22

  2. CRFs: the good, the bad, and the cumbersome…
  • Good points:
    • Global optimization of the weight vector that guides decision making
    • Trades off decisions made at different points in the sequence
  • Worries:
    • Cost (of training)
    • Complexity (do we need all this math?)
    • Amount of context: the matrix used for the normalizer is |Y| × |Y|, so high-order models with many classes get expensive fast (a second-order model over 45 POS tags already behaves like a first-order model with 45² = 2,025 states).
    • Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best.

  3. Dependency Nets

  4. Proposed solution:
  • The parents of a node are its Markov blanket
    • like an undirected Markov net, this captures all “correlational associations”
  • One conditional probability for each node X, namely P(X | parents of X)
    • like a directed Bayes net: no messy clique potentials

  5. Example – bidirectional chains. [Figure: a chain of label nodes Y1, Y2, …, Yi, … linked in both directions over the words “When will dr Cohen post the notes”.]

  6. DN chains. [Figure: the bidirectional chain Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”, showing the current values of the nodes.]
  How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, … at random
  2. Pick some j, and compute P(Yj | current values of the other nodes)
  3. Set the new value of Yj according to this distribution
  4. Go back to (2)
  (A sketch of this loop follows below.)
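A minimal sketch of that loop in Python, assuming a toy chain of binary labels and a hypothetical hand-coded local conditional standing in for a learned model:

```python
import random

def p_local(label, neighbors):
    """Unnormalized P(Y_j = label | neighbors): a hypothetical
    conditional that simply prefers agreeing with chain neighbors."""
    return 1.0 + 2.0 * sum(1 for nb in neighbors if nb == label)

def gibbs_sweep(y):
    """One pass of steps (2)-(3): resample each Y_j from
    P(Y_j | current values of its neighbors)."""
    n = len(y)
    for j in range(n):
        neighbors = [y[k] for k in (j - 1, j + 1) if 0 <= k < n]
        scores = [p_local(lab, neighbors) for lab in (0, 1)]
        y[j] = 0 if random.random() < scores[0] / sum(scores) else 1

y = [random.randint(0, 1) for _ in range(6)]  # step (1): random start
for _ in range(100):                          # step (4): repeat
    gibbs_sweep(y)
print(y)
```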

  7. This is an MCMC process. Markov Chain Monte Carlo: a randomized process that changes y(t) into y(t+1), where the transition probability depends only on the current y(t), not on previous y's. [Figure: the transition probability in the general case, and one particular run of the chain.]
  How do we do inference? Iteratively:
  1. Pick values for Y1, Y2, … at random: y(0)
  2. Pick some j, and compute P(Yj | the current values y(t) of the other nodes)
  3. Set the new value of Yj according to this, giving y(1)
  4. Go back to (2) and repeat to get y(2), …, y(t), …

  8. This is an MCMC process. Claim: suppose Y(t) is drawn from a distribution D that is stationary under the transition probability, i.e., D(y) = Σy′ D(y′) · Pr(y(t+1) = y | y(t) = y′). Then Y(t+1) is also drawn from D (i.e., the random flip doesn't move us “away from” D).

  9. This is an MCMC process. Claim: if you wait long enough, then for some t, Y(t) will be drawn from a stationary distribution D as above, under certain reasonable conditions (e.g., the graph of potential edges is connected, …). So D is a “sink”: once the chain's “burn-in” period is over, samples come from D.

  10. This is an MCMC process. An algorithm:
  1. Run the MCMC chain for a long time t, and hope that Y(t) will be drawn from the target distribution D; the “burn-in” samples before t are discarded.
  2. Run the MCMC chain for a while longer and save the sample S = { Y(t), Y(t+1), …, Y(t+m) }; these are averaged for prediction.
  3. Use S to answer any probabilistic queries like Pr(Yj | X).
  (A sketch of this procedure follows below.)
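A sketch of that algorithm under the same toy assumptions (binary chain, hand-coded conditional in place of a learned one):

```python
import random

def resample(y, j):
    """Draw a new Y_j from its toy local conditional given its neighbors."""
    nbrs = [y[k] for k in (j - 1, j + 1) if 0 <= k < len(y)]
    s0 = 1.0 + 2.0 * sum(1 for nb in nbrs if nb == 0)
    s1 = 1.0 + 2.0 * sum(1 for nb in nbrs if nb == 1)
    y[j] = 1 if random.random() < s1 / (s0 + s1) else 0

n, burn_in, m = 6, 500, 2000
y = [random.randint(0, 1) for _ in range(n)]   # y(0)

for _ in range(burn_in):       # step 1: burn-in samples, discarded
    for j in range(n):
        resample(y, j)

counts = [0] * n
for _ in range(m):             # step 2: save S = { y(t), ..., y(t+m) }
    for j in range(n):
        resample(y, j)
    for j in range(n):
        counts[j] += y[j]

# step 3: answer queries from S, e.g. the marginal Pr(Y_j = 1)
print([round(c / m, 3) for c in counts])
```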

  11. More on MCMC
  • This particular process is Gibbs sampling: transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
  • MCMC is a very general-purpose inference scheme (and sometimes very slow).
  • On the plus side, learning is relatively cheap, since there's no inference involved (!)
  • A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood. Identical?
  • The statistical relational learning community has some proponents of this approach: Pedro Domingos, David Jensen, …
  • A big advantage is the generality of the approach: sparse learners (e.g., L1-regularized maxent, decision trees, …) can be used to infer the Markov blanket (NIPS 2006); see the sketch below.
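A sketch of the blanket-inference idea: fit an L1-regularized logistic regression for one node and read its blanket off the nonzero weights (scikit-learn; the data and variable names here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Toy data: X1 actually influences X0; X2 is pure noise.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
x0 = np.where(rng.random(n) < 0.9, x1, 1 - x1)  # X0 copies X1, 10% flips

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(np.column_stack([x1, x2]), x0)

# Features with nonzero weight form the estimated Markov blanket of X0.
blanket = [name for name, w in zip(["X1", "X2"], clf.coef_[0])
           if abs(w) > 1e-6]
print(blanket)  # expected: ['X1']
```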

  12. Examples. [Figure: a label chain Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”.]

  13. Examples. [Figure: two coupled chains over “When will dr Cohen post the notes”: a POS-tag chain Z1, Z2, …, Zi, … and a BIO chain Y1, Y2, …, Yi, ….]

  14. Examples. [Figure: another variant of the label chain Y1, Y2, …, Yi, … over “When will dr Cohen post the notes”.]

  15. Dependency nets
  • The bad and the ugly:
    • Inference is less efficient: MCMC sampling.
    • Can't reconstruct the joint probability via the chain rule.
    • Networks might be inconsistent, i.e., the local P(x | pa(x))'s don't define a pdf.
    • Exactly equal, representationally, to normal undirected Markov nets.

  16. Dependency nets
  • The good:
    • Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.
    • (You might not learn a consistent model, but you'll probably learn a reasonably good one.)
    • Inference can be sped up substantially over naïve Gibbs sampling.

  17. Dependency nets. Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X. [Figure: four label nodes y1–y4 over an observed input x, with local conditionals Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3).] Learning is local, but inference is not, and need not be unidirectional. (A sketch of the learning step follows below.)
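A sketch of that learning step for the four-node net above, one logistic-regression classifier per node (the features and labels here are random placeholders; in practice they come from the labeled training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Parent sets matching the conditionals above (0-indexed: y1 -> 0, etc.)
parents = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

rng = np.random.default_rng(1)
n = 1000
X = rng.random((n, 3))            # observed features x (placeholder)
Y = rng.integers(0, 2, (n, 4))    # fully observed labels y1..y4 (placeholder)

models = {}
for i, pa in parents.items():
    feats = np.column_stack([X, Y[:, pa]])  # inputs: x plus parent labels
    models[i] = LogisticRegression().fit(feats, Y[:, i])

# Learning is local: each classifier is trained independently, because all
# parents are observed in the training data -- no inference during training.
p = models[0].predict_proba(np.column_stack([X, Y[:, [1]]]))[:, 1]
print(p[:3])  # estimated Pr(y1 = 1 | x, y2) on the first few examples
```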

  18. Toutanova, Klein, Manning, Singer
  • Dependency nets for POS tagging vs. CMMs.
  • Maxent is used for the local conditional model.
  • Goals: an easy-to-train bidirectional model, and a really good POS tagger.

  19. Toutanova et al
  • Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the most likely sequence).
  • Example: with training data D = {11, 11, 11, 12, 21, 33}, the most likely (ML) state is 11, but scoring states by the product of local conditionals prefers 33:
    • P(a=1 | b=1) · P(b=1 | a=1) = (3/4) · (3/4) < 1
    • P(a=3 | b=3) · P(b=3 | a=3) = 1 · 1 = 1
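A quick check of those numbers, computing the empirical conditionals directly from D:

```python
from collections import Counter

D = ["11", "11", "11", "12", "21", "33"]
counts = Counter(D)

def cond(a_val, a_pos, b_val, b_pos):
    """Empirical P(position a_pos = a_val | position b_pos = b_val)."""
    num = sum(c for s, c in counts.items()
              if s[a_pos] == a_val and s[b_pos] == b_val)
    den = sum(c for s, c in counts.items() if s[b_pos] == b_val)
    return num / den

def dn_score(a, b):
    """Product of local conditionals, as used by the search."""
    return cond(a, 0, b, 1) * cond(b, 1, a, 0)

print(dn_score("1", "1"))  # 0.5625  (3/4 * 3/4 < 1)
print(dn_score("3", "3"))  # 1.0     -> outranks the ML state 11
```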

  20. Results with model. [Results table from the slide not transcribed.]

  21. Results with model. [Results table from the slide not transcribed.]

  22. Results with model. The “best” model includes some special unknown-word features, including “a crude company-name detector”.

  23. Results with model. Final test-set results:
  • MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
  • CRF+ (Lafferty et al., ICML 2001): 95.7, 76.4

  24. Other comments
  • Smoothing (quadratic regularization, aka a Gaussian prior) is important: it avoids overfitting effects reported elsewhere.
