Advanced Statistical Methods in NLP Ling 572 March 6, 2012

##### Presentation Transcript

1. EM Advanced Statistical Methods in NLP Ling 572 March 6, 2012 Slides based on F. Xia11

2. Roadmap • Motivation: • Unsupervised learning • Maximum Likelihood Estimation • EM: • Basic concepts • Main ideas • Example: Forward-backward algorithm

10. Motivation • Task: Train a speech recognizer • Approach: Build a Hidden Markov Model • States: Phonemes • Observations: Acoustic speech signal • Transition probabilities: Phone sequence probabilities • Emission probabilities: Acoustic model probabilities • Training data: • Easy to get: lots and lots of recorded audio • Hard to get: Phonetic labeling of lots of recorded audio • Can we train our model without the ‘hard to get’ part?

14. Motivation • Task: Train a probabilistic context-free grammar • Model: • Production rule probabilities • Probability of non-terminal rewriting • Training data: • Easy to get: lots and lots of text sentences • Hard to get: parse trees on lots of text sentences • Can we train our model without the ‘hard to get’ part?

15. Approach • Unsupervised learning • EM approach: • Family of unsupervised parameter estimation techniques • General framework • Many specific algorithms implement: • Forward-Backward, Inside-Outside, IBM MT models, etc

18. EM • Expectation-Maximization: • Two-step iterative procedure • General parameter estimation method, based on Maximum Likelihood Estimation • General form provided by Dempster, Laird & Rubin (’77) as a unified framework • Specific instantiations predate it

21. Maximum Likelihood Estimation • MLE: • Given data: X = {X1, X2, …, Xn} • Parameters: Θ • Likelihood: P(X|Θ) • Log likelihood: L(Θ) = log P(X|Θ) • Maximum likelihood estimate: ΘML = argmaxΘ log P(X|Θ)

22. MLE • Assume the data X are independent and identically distributed (i.i.d.), so L(Θ) = log P(X|Θ) = log Πi P(Xi|Θ) = Σi log P(Xi|Θ) • Difficulty of computing the max depends on the form of P(Xi|Θ)
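
For i.i.d. data, taking the log turns the product of per-observation probabilities into a sum of logs. A small numerical check of this identity (the probability values here are hypothetical stand-ins, not from the slides):

```python
import math

# For i.i.d. data, log of the product of per-observation probabilities
# equals the sum of their logs; the sum form also avoids underflow
# when there are many observations.
probs = [0.1, 0.02, 0.3, 0.05]  # hypothetical values for P(Xi | theta)

log_of_product = math.log(math.prod(probs))
sum_of_logs = sum(math.log(p) for p in probs)
print(abs(log_of_product - sum_of_logs) < 1e-12)  # → True
```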

29. Simple Example • Coin flipping: • Single coin: probability of heads: p; tails: 1-p • Consider a sequence of N coin flips, m of which are heads • Data X: coin flip sequence, e.g. X = {H, T, H} • Parameter(s) Θ: p • What value of p maximizes the probability of the data?

33. Simple Example, Formally • L(Θ) = log P(X|Θ) = log [p^m (1-p)^(N-m)] • L(Θ) = m log p + (N-m) log(1-p) • Setting dL/dp = m/p - (N-m)/(1-p) = 0 gives the maximum likelihood estimate p = m/N
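
The log-likelihood L(p) = m log p + (N-m) log(1-p) peaks at p = m/N, which can be checked numerically. A minimal sketch (the values of N and m are hypothetical):

```python
import math

# Evaluate the coin-flip log-likelihood on a grid over (0, 1) and confirm
# that the best p is m / N.
N, m = 10, 3  # hypothetical: 10 flips, 3 heads

def log_lik(p):
    return m * math.log(p) + (N - m) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # p values in (0, 1)
best = max(grid, key=log_lik)
print(best)  # → 0.3, i.e. m / N
```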

39. EM • General setting: • Data X = {X1, X2, …, Xn} • Parameter vector θ • EM provides a method to compute θML = argmaxθ L(θ) = argmaxθ log P(X|θ) • In many cases, computing P(X|θ) is hard • However, computing P(X,Y|θ) can be easier
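
The connection between the two quantities can be made explicit: the observed-data likelihood is the complete-data likelihood with the hidden data Y summed out, and the sum sits inside the log:

```latex
\log P(X \mid \theta) \;=\; \log \sum_{Y} P(X, Y \mid \theta)
```

That sum inside the log couples all hidden assignments, which is what makes direct maximization hard, while log P(X,Y|θ) typically factors into simple terms.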

41. Terminology • Z = (X,Y) • Z is the ‘complete’/’augmented’ data • X is the ‘observed’/’incomplete’ data • Y is the ‘hidden’/’missing’ data • Different articles mix these labels and terms

42. Forms of EM

47. Bird’s Eye View of EM • Start with some initial setting of the model • Small random values • Parameters trained on small hand-labeled set • Use current model to estimate hidden data Y • Update the model parameters based on X,Y • Iterate until convergence
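
The loop above can be sketched on a toy problem. This is a minimal illustration of the generic recipe on a mixture of two biased coins, not the forward-backward instantiation from the slides; the data and initial parameter values are hypothetical:

```python
# Toy EM: each of 5 trials of 10 flips uses one of two coins (hidden);
# we observe only the number of heads per trial.
heads = [5, 9, 8, 4, 7]  # hypothetical observed heads counts
n = 10                   # flips per trial

def lik(h, p):
    # Binomial likelihood of h heads in n flips (the binomial coefficient
    # cancels in the E-step, so it is omitted).
    return p**h * (1 - p)**(n - h)

theta_a, theta_b = 0.6, 0.5  # initial parameter guesses
for _ in range(50):
    # E-step: use the current model to estimate the hidden data —
    # the posterior probability that each trial came from coin A.
    resp = []
    for h in heads:
        la, lb = lik(h, theta_a), lik(h, theta_b)
        resp.append(la / (la + lb))
    # M-step: update the parameters from responsibility-weighted counts.
    theta_a = sum(r * h for r, h in zip(resp, heads)) / sum(n * r for r in resp)
    theta_b = sum((1 - r) * h for r, h in zip(resp, heads)) / sum(n * (1 - r) for r in resp)

print(round(theta_a, 3), round(theta_b, 3))
```

Initialization matters: EM converges to a local maximum of the likelihood, which is why the slide suggests either small random values or parameters trained on a small hand-labeled set.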

48. Key Features of EM • General framework for ‘hidden’ data problems • General iterative methodology • Must be specialized to particular problems: • Forward-Backward for HMMs • Inside-Outside for PCFGs • IBM models for MT

49. Main Ideas in EM

50. Maximum Likelihood • EM performs parameter estimation by maximum likelihood: • ΘML = argmaxΘ L(Θ) = argmaxΘ log P(X|Θ) • It introduces ‘hidden’ data Y to allow a more tractable solution
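
In this notation, the two steps of the Dempster–Laird–Rubin formulation cited earlier take their standard form (not spelled out on the slide):

```latex
% E-step: expected complete-data log-likelihood under the current parameters
Q(\theta \mid \theta^{(t)}) \;=\; \sum_{Y} P(Y \mid X, \theta^{(t)}) \,\log P(X, Y \mid \theta)
% M-step: re-estimate the parameters
\theta^{(t+1)} \;=\; \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)})
```

Each iteration is guaranteed not to decrease the observed-data log-likelihood log P(X|θ).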