Advanced Statistical Methods in NLP (Ling 572), March 6, 2012

Presentation Transcript

  1. EM. Advanced Statistical Methods in NLP, Ling 572, March 6, 2012. Slides based on F. Xia '11

  2. Roadmap • Motivation: • Unsupervised learning • Maximum Likelihood Estimation • EM: • Basic concepts • Main ideas • Example: Forward-backward algorithm

  3. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model

  4. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States:

  5. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations:

  6. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities:

  7. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities:

  8. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get:

  9. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get:

  10. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get: Phonetic labeling of lots of recorded audio • Can we train our model without the ‘hard to get’ part?
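
As a concrete illustration of what these parameters look like as data, here is a hypothetical toy sketch in Python (two made-up "phoneme" states and three coarse acoustic symbols; the names and numbers are invented for illustration and are not from the slides):

    # Hypothetical toy HMM parameter tables (illustration only).
    states = ["ae", "t"]                # phoneme states
    symbols = ["a1", "a2", "a3"]        # coarse acoustic observation symbols

    # Transition probabilities: P(next phoneme | current phoneme)
    transitions = {
        "ae": {"ae": 0.1, "t": 0.9},
        "t":  {"ae": 0.7, "t": 0.3},
    }

    # Emission probabilities: P(acoustic symbol | phoneme)
    emissions = {
        "ae": {"a1": 0.6, "a2": 0.3, "a3": 0.1},
        "t":  {"a1": 0.1, "a2": 0.2, "a3": 0.7},
    }

    # With phonetically labeled audio, these tables could be estimated by
    # counting; EM (Forward-Backward) estimates them from the audio alone.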

  11. Motivation • Task: Train a probabilistic context-free grammar • Model:

  12. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get:

  13. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get:

  14. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get: parse trees on lots of text sentences • Can we train our model without the ‘hard to get’ part?

  15. Approach • Unsupervised learning • EM approach: • Family of unsupervised parameter estimation techniques • General framework • Many specific algorithms implement it: • Forward-Backward, Inside-Outside, IBM MT models, etc.

  16. EM • Expectation-Maximization: • Two-step iterative procedure

  17. EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation

  18. EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method: • Based on Maximum Likelihood Estimation • General form provided by (Dempster, Laird & Rubin '77) • Unified framework • Specific instantiations predate it

  19. Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ Based on F. Xia '11

  20. Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood: L(Θ) = log P(X|Θ) Based on F. Xia '11

  21. Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood: L(Θ) = log P(X|Θ) • Maximum likelihood: • Θ_ML = argmax_Θ log P(X|Θ) Based on F. Xia '11

  22. MLE • Assume data X is independent and identically distributed (i.i.d.) • Difficulty of computing the max depends on the form of P(X|Θ) Based on F. Xia '11

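The formulas this slide points to are not in the transcript; under the i.i.d. assumption the standard factorization, in the notation above, is:

    P(X|Θ) = Π_i P(Xi|Θ)
    L(Θ) = log P(X|Θ) = Σ_i log P(Xi|Θ)

so maximizing L(Θ) means maximizing a sum of per-example log probabilities, and how hard that is depends on the form of P(Xi|Θ).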

  27. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X Based on F. Xia '11

  28. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X: Coin flip sequence, e.g. X = {H,T,H} • Parameter(s) Θ Based on F. Xia '11

  29. Simple Example • Coin flipping: • Single coin: Probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X: Coin flip sequence, e.g. X = {H,T,H} • Parameter(s) Θ: p • What value of p maximizes the probability of the data? Based on F. Xia '11

  30. Simple Example, Formally • L(Θ) = log P(X|Θ) Based on F. Xia '11

  31. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] Based on F. Xia '11

  32. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = Based on F. Xia '11

  33. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] • L(Θ) = log p^m + log (1-p)^(N-m) = m log p + (N-m) log(1-p) Based on F. Xia '11

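The transcript omits the rest of this derivation; the standard completion is to set the derivative of L with respect to p to zero and solve:

    dL/dp = m/p - (N-m)/(1-p) = 0
    => m(1-p) = (N-m)p
    => p_ML = m/N

so the maximum-likelihood estimate is just the observed fraction of heads, e.g. p = 2/3 for X = {H,T,H}.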

  37. EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ Based on F. Xia '11

  38. EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ • EM provides a method to compute: • θ_ML = argmax_θ L(θ) • = argmax_θ log P(X|θ) Based on F. Xia '11

  39. EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ • EM provides a method to compute: • θ_ML = argmax_θ L(θ) • = argmax_θ log P(X|θ) • In many cases, computing P(X|θ) is hard • However, computing P(X,Y|θ) can be easier Based on F. Xia '11
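
The connection between the two quantities is marginalization over the hidden data: P(X|θ) = Σ_Y P(X,Y|θ). EM exploits the fact that the 'easy' complete-data quantity P(X,Y|θ), summed over the possible values of the hidden Y, recovers the 'hard' observed-data likelihood.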

  40. Terminology • Z = (X,Y) • Z is the ‘complete’/’augmented’ data • X is the ‘observed’/’incomplete’ data • Y is the ‘hidden’/’missing’ data

  41. Terminology • Z = (X,Y) • Z is the ‘complete’/‘augmented’ data • X is the ‘observed’/‘incomplete’ data • Y is the ‘hidden’/‘missing’ data • Different articles mix these labels and terms
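
In the speech example from the motivation slides, for instance, X would be the recorded acoustic signal, Y the unobserved phoneme sequence, and Z = (X,Y) the phonetically labeled recording.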

  42. Forms of EM Based on F. Xia '11

  44. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set

  45. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y

  46. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y • Update the model parameters based on X,Y

  47. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y • Update the model parameters based on X,Y • Iterate until convergence
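
A minimal sketch of this loop in Python, using a coin-flipping setting like the earlier example but with two coins and an unobserved choice of coin per session. All data, names, and numbers here are hypothetical, and this is the generic EM recipe rather than the Forward-Backward algorithm itself:

    # Observed data X: (heads, tails) counts for several sessions of coin flips.
    sessions = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

    # Initial setting of the model: head probabilities for coins A and B.
    p_a, p_b = 0.6, 0.5

    for _ in range(100):
        # E-step: use the current model to estimate the hidden data Y, i.e. the
        # posterior probability that each session used coin A or coin B
        # (assuming a uniform prior over the two coins).
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in sessions:
            like_a = p_a ** h * (1 - p_a) ** t
            like_b = p_b ** h * (1 - p_b) ** t
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            heads_a += w_a * h; tails_a += w_a * t
            heads_b += w_b * h; tails_b += w_b * t

        # M-step: update the parameters from the expected ("soft") counts;
        # this is the MLE p = m/N applied to fractional counts.
        new_p_a = heads_a / (heads_a + tails_a)
        new_p_b = heads_b / (heads_b + tails_b)

        # Iterate until convergence.
        if abs(new_p_a - p_a) < 1e-9 and abs(new_p_b - p_b) < 1e-9:
            break
        p_a, p_b = new_p_a, new_p_b

    print(p_a, p_b)  # the estimates separate into a higher- and a lower-bias coin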

  48. Key Features of EM • General framework for ‘hidden’ data problems • General iterative methodology • Must be specialized to particular problems: • Forward-Backward for HMMs • Inside-Outside for PCFGs • IBM models for MT

  49. Main Ideas in EM

  50. Maximum Likelihood • EM performs parameter estimation for maximum likelihood: • Θ_ML = argmax_Θ L(Θ) • Θ_ML = argmax_Θ log P(X|Θ) • Introduces ‘hidden’ data Y to allow a more tractable solution
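
In the standard formulation (not spelled out on this slide), each EM iteration maximizes the expected complete-data log likelihood:

    E-step:  Q(θ | θ_t) = Σ_Y P(Y | X, θ_t) log P(X, Y | θ)
    M-step:  θ_(t+1) = argmax_θ Q(θ | θ_t)

and each iteration is guaranteed not to decrease the observed-data log likelihood L(θ) = log P(X|θ).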