Sequential Learning with Dependency Nets William W. Cohen 2/22
CRFs: the good, the bad, and the cumbersome…
• Good points:
  • Global optimization of the weight vector that guides decision making
  • Trades off decisions made at different points in the sequence
• Worries:
  • Cost (of training)
  • Complexity (do we need all this math?)
  • Amount of context: the matrix for the normalizer is |Y| * |Y|, so high-order models for many classes get expensive fast.
  • Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best.
Proposed solution: dependency networks
• The parents of a node are its Markov blanket
  • like an undirected Markov net, they capture all “correlational associations”
• One conditional probability for each node X, namely P(X|parents of X)
  • like a directed Bayes net: no messy clique potentials
Example – bidirectional chains
[Figure: a bidirectional chain of labels Y1, Y2, …, Yi, … over the word sequence “When will dr Cohen post the notes”]
DN chains
[Figure: a chain of labels Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”]
How do we do inference? Iteratively:
1. Pick values for Y1, Y2, … at random
2. Pick some j, and compute Pr(Yj | the current values of the other Y’s and x)
3. Set the new value of Yj according to this distribution
4. Go back to (2)
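The iterative procedure above is single-site Gibbs sampling over the chain. A minimal sketch, in which `local_conditional` is a hypothetical stand-in for a learned local model (in a real dependency net it would be, e.g., a maxent classifier over the neighbouring labels and the word):

```python
import random

def local_conditional(y_left, y_right, word):
    # Hypothetical local model: probability that Y_j = 1 given its
    # Markov blanket (the neighbouring labels) and the word at position j.
    score = 0.5 + 0.2 * ((y_left == 1) + (y_right == 1)) - 0.1 * (word == "the")
    return min(max(score, 0.05), 0.95)

def gibbs_sweep(y, words, rng):
    """One pass of single-site Gibbs updates over the chain (steps 2-4)."""
    for j in range(len(y)):
        left = y[j - 1] if j > 0 else 0
        right = y[j + 1] if j < len(y) - 1 else 0
        p1 = local_conditional(left, right, words[j])
        y[j] = 1 if rng.random() < p1 else 0
    return y

rng = random.Random(0)
words = "When will dr Cohen post the notes".split()
y = [rng.randint(0, 1) for _ in words]   # step 1: random initialisation
for _ in range(10):                      # repeat steps 2-4
    y = gibbs_sweep(y, words, rng)
```

Note that each update conditions only on the current values of the neighbours, which is exactly why no global normalizer is needed.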
This is an MCMC process
Markov Chain Monte Carlo: a randomized process for changing y(t) into y(t+1) whose transition probability depends only on y(t), not on earlier y’s.
One particular run, iteratively:
1. Pick values for Y1, Y2, … at random: y(0)
2. Pick some j, and compute Pr(Yj | the current values of the other Y’s)
3. Set the new value of Yj according to this: y(1)
4. Go back to (2) and repeat to get y(1), y(2), …, y(t), …
This is an MCMC process
Claim: suppose Y(t) is drawn from some distribution D that is invariant under the transition (i.e., applying one random flip to a draw from D yields another draw from D). Then Y(t+1) is also drawn from D: the random flip doesn’t move us “away from D”.
This is an MCMC process
“Burn-in”
Claim: if you wait long enough, then for some t, Y(t) will be drawn from such a stationary distribution D, under certain reasonable conditions (e.g., the graph of potential edges is connected, …). So D is a “sink”.
This is an MCMC process
An algorithm:
1. Run the MCMC chain for a long time t (“burn-in”, discarded), and hope that Y(t) will be drawn from the target distribution D.
2. Run the MCMC chain for a while longer and save the sample S = { Y(t), Y(t+1), …, Y(t+m) } (averaged for prediction).
3. Use S to answer probabilistic queries like Pr(Yj|X).
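The burn-in-then-average algorithm can be sketched on a toy model. Everything here is a made-up illustration: two binary variables, where `conditional` stands in for a learned local model, and the saved samples after burn-in answer a marginal query:

```python
import random

def conditional(y_other):
    # Toy local model over two binary variables:
    # P(Y_j = 1 | the other variable) - a stand-in for a learned classifier.
    return 0.8 if y_other == 1 else 0.3

def run_chain(t_burn, m, seed=0):
    rng = random.Random(seed)
    y = [rng.randint(0, 1), rng.randint(0, 1)]   # random initialisation
    samples = []
    for step in range(t_burn + m):
        for j in (0, 1):                          # one Gibbs sweep
            y[j] = 1 if rng.random() < conditional(y[1 - j]) else 0
        if step >= t_burn:                        # discard burn-in samples
            samples.append(tuple(y))
    return samples

S = run_chain(t_burn=500, m=5000)
# answer a probabilistic query, e.g. Pr(Y_0 = 1), from the saved sample
p = sum(y0 for y0, _ in S) / len(S)
```

The estimate `p` is a Monte Carlo average; its quality depends on the burn-in length and the number of retained samples, which is the “sometimes very slow” caveat on the next slide.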
More on MCMC
• This particular process is Gibbs sampling: transition probabilities are defined by sampling from the posterior of one variable Yj given the others.
• MCMC is a very general-purpose inference scheme (and sometimes very slow).
• On the plus side, learning is relatively cheap, since there’s no inference involved (!)
• A dependency net is closely related to a Markov random field learned by maximizing pseudo-likelihood. Identical?
• The statistical relational learning community has some proponents of this approach: Pedro Domingos, David Jensen, …
• A big advantage is the generality of the approach: sparse learners (e.g., L1-regularized maxent, decision trees, …) can be used to infer the Markov blanket (NIPS 2006).
Examples
[Figure: a chain of labels Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”]
Examples
[Figure: two coupled chains – POS tags Z1, Z2, …, Zi, … and BIO labels Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”]
Examples
[Figure: a chain of labels Y1, Y2, …, Yi, … over the words “When will dr Cohen post the notes”]
Dependency nets
The bad and the ugly:
• Inference is less efficient – MCMC sampling
• Can’t reconstruct the joint probability via the chain rule
• Networks might be inconsistent, i.e., the local P(x|pa(x))’s don’t define a pdf
• Exactly equal, representationally, to normal undirected Markov nets
Dependency nets
The good:
• Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
• (You might not learn a consistent model, but you’ll probably learn a reasonably good one.)
• Inference can be sped up substantially over naïve Gibbs sampling.
Dependency nets
• Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
[Figure: chain y1–y2–y3–y4 conditioned on x, with local models Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3)]
• Learning is local, but inference is not, and need not be unidirectional.
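Local learning can be sketched concretely: fit one conditional per node from fully observed data, with no inference in the loop. This is a toy version with made-up binary data and count-based (rather than maxent) local models, using the chain structure from the slide:

```python
from collections import Counter, defaultdict

# Hypothetical fully-observed training data over four binary labels y1..y4.
data = [
    (0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (0, 1, 1, 1), (1, 1, 1, 1), (0, 0, 1, 1),
]

# Markov blankets from the slide's chain: y1-y2, y2-y3, y3-y4.
blanket = {0: (1,), 1: (0, 2), 2: (1, 3), 3: (2,)}

def learn_local_conditionals(data, blanket):
    """Estimate P(y_j = 1 | blanket values) by counting, with add-one smoothing.
    Each node is learned independently - this is the 'local' part of DN learning."""
    counts = {j: defaultdict(Counter) for j in blanket}
    for row in data:
        for j, pa in blanket.items():
            ctx = tuple(row[p] for p in pa)
            counts[j][ctx][row[j]] += 1
    model = {}
    for j, ctxs in counts.items():
        model[j] = {ctx: (c[1] + 1) / (c[0] + c[1] + 2) for ctx, c in ctxs.items()}
    return model

model = learn_local_conditionals(data, blanket)
```

Any probabilistic classifier could replace the counting step; the point is that each P(y_j | pa(y_j)) is trained in isolation, which is why learning is cheap even though inference (Gibbs sampling over these conditionals) is not.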
Toutanova, Klein, Manning, Singer
• Dependency nets for POS tagging vs. CMMs.
• Maxent is used for the local conditional model.
• Goals:
  • An easy-to-train bidirectional model
  • A really good POS tagger
Toutanova et al.
• Don’t use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
• Example: D = {11, 11, 11, 12, 21, 33}, so the ML state is {11}. But under the learned local conditionals,
  P(a=1|b=1) P(b=1|a=1) < 1, while
  P(a=3|b=3) P(b=3|a=3) = 1,
  so the product-of-conditionals score prefers 33 over the actual ML state 11.
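The inconsistency in the example above can be checked directly by estimating the two empirical conditionals from D and comparing the product-of-conditionals scores:

```python
from collections import Counter

# The slide's dataset D over pairs (a, b): three copies of 11, then 12, 21, 33.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
pairs = Counter(D)

def p_a_given_b(a, b):
    # empirical P(a | b) from the data
    denom = sum(n for (x, y), n in pairs.items() if y == b)
    return sum(n for (x, y), n in pairs.items() if x == a and y == b) / denom

def p_b_given_a(b, a):
    # empirical P(b | a) from the data
    denom = sum(n for (x, y), n in pairs.items() if x == a)
    return sum(n for (x, y), n in pairs.items() if x == a and y == b) / denom

score11 = p_a_given_b(1, 1) * p_b_given_a(1, 1)   # 0.75 * 0.75 = 0.5625 < 1
score33 = p_a_given_b(3, 3) * p_b_given_a(3, 3)   # 1.0  * 1.0  = 1.0
```

The state 11 is the most likely under the data (3 of 6 pairs), yet 33 gets the higher product-of-conditionals score, illustrating why maximizing over the local conditionals need not recover the ML sequence.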
Results with model “Best” model includes some special unknown-word features, including “a crude company-name detector”
Results with model
Final test-set results:
• MXPost (Ratnaparkhi): 47.6, 96.4, 86.2
• CRF+ (Lafferty et al., ICML 2001): 95.7, 76.4
Other comments
• Smoothing (quadratic regularization, a.k.a. a Gaussian prior) is important: it avoids overfitting effects reported elsewhere.