## Advanced Statistical Methods in NLP Ling 572 March 6, 2012


*Slides based on F. Xia.*

**Roadmap**

- Motivation:
  - Unsupervised learning
- Maximum Likelihood Estimation
- EM:
  - Basic concepts
  - Main ideas
- Example: the Forward-Backward algorithm

**Motivation**

- Task: train a speech recognizer
- Approach: build a Hidden Markov Model
  - States: phonemes
  - Observations: the acoustic speech signal
  - Transition probabilities: phone sequence probabilities
  - Emission probabilities: acoustic model probabilities
- Training data:
  - Easy to get: lots and lots of recorded audio
  - Hard to get: phonetic labeling of lots of recorded audio
- Can we train our model without the 'hard to get' part?

**Motivation**

- Task: train a probabilistic context-free grammar
- Model:
  - Production rule probabilities
  - Probability of each non-terminal rewriting
- Training data:
  - Easy to get: lots and lots of text sentences
  - Hard to get: parse trees for lots of text sentences
- Can we train our model without the 'hard to get' part?

**Approach**

- Unsupervised learning
- The EM approach:
  - A family of unsupervised parameter estimation techniques
  - A general framework
  - Many specific algorithms implement it: Forward-Backward, Inside-Outside, the IBM MT models, etc.

**EM**

- Expectation-Maximization:
  - A two-step iterative procedure
- A general parameter estimation method:
  - Based on Maximum Likelihood Estimation
- General form provided by Dempster, Laird & Rubin (1977):
  - A unified framework
  - Specific instantiations predate it

**Maximum Likelihood Estimation**

- MLE:
  - Given data: X = {X1, X2, …, Xn}
  - Parameters: Θ
  - Likelihood: P(X|Θ)
  - Log likelihood: L(Θ) = log P(X|Θ)
  - Maximum likelihood estimate: Θ_ML = argmax_Θ log P(X|Θ)

**MLE**

- Assume the data X are independent and identically distributed (i.i.d.), so P(X|Θ) = ∏_i P(Xi|Θ) and L(Θ) = Σ_i log P(Xi|Θ)
- How hard the maximization is depends on the form of P(X|Θ)

**Simple Example**

- Coin flipping:
  - A single coin: probability of heads p, of tails 1 − p
  - Consider a sequence of N coin flips, m of which are heads
- Data X: the coin flip sequence, e.g. X = {H, T, H}
- Parameter(s) Θ: p
- What value of p maximizes the probability of the data?

**Simple Example, Formally**

- L(Θ) = log P(X|Θ) = log [p^m (1 − p)^(N−m)]
- L(Θ) = log p^m + log (1 − p)^(N−m) = m log p + (N − m) log(1 − p)
- Setting dL/dp = m/p − (N − m)/(1 − p) = 0 gives p_ML = m/N, the empirical fraction of heads

**EM**

- General setting:
  - Data X = {X1, X2, …, Xn}
  - Parameter vector θ
- EM provides a method to compute:
  - θ_ML = argmax_θ L(θ) = argmax_θ log P(X|θ)
- In many cases, computing P(X|θ) directly is hard
- However, computing P(X, Y|θ) for suitable hidden data Y can be easier

**Terminology**

- Z = (X, Y):
  - Z is the 'complete' / 'augmented' data
  - X is the 'observed' / 'incomplete' data
  - Y is the 'hidden' / 'missing' data
- Articles mix these labels and terms

**Forms of EM**

**Bird's Eye View of EM**

- Start with some initial setting of the model parameters:
  - Small random values, or
  - Parameters trained on a small hand-labeled set
- Use the current model to estimate the hidden data Y
- Update the model parameters based on X and Y
- Iterate until convergence

**Key Features of EM**

- A general framework for 'hidden' data problems
- A general iterative methodology
- Must be specialized to particular problems:
  - Forward-Backward for HMMs
  - Inside-Outside for PCFGs
  - The IBM models for MT

**Maximum Likelihood**

- EM performs parameter estimation by maximum likelihood:
  - Θ_ML = argmax_Θ L(Θ) = argmax_Θ log P(X|Θ)
- It introduces 'hidden' data Y to allow a more tractable solution
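The closed-form answer for the coin-flip example, p_ML = m/N, is easy to check numerically. The sketch below (the numbers m = 7, N = 10 are illustrative and not from the slides) scans a grid of candidate values of p and confirms that the log likelihood m log p + (N − m) log(1 − p) peaks at m/N.

```python
import math

def coin_log_likelihood(p, m, n):
    """L(p) = m log p + (n - m) log(1 - p): log probability of observing
    m heads in n independent flips of a coin with heads-probability p."""
    return m * math.log(p) + (n - m) * math.log(1 - p)

m, n = 7, 10  # illustrative data: 7 heads in 10 flips
grid = [i / 1000 for i in range(1, 1000)]  # candidate p values in (0, 1)
best_p = max(grid, key=lambda p: coin_log_likelihood(p, m, n))
# best_p comes out at m/n = 0.7, matching the closed form p_ML = m/N
```

Because the log likelihood is concave in p, the grid maximum sits at the grid point closest to the true maximizer, which here is exactly 0.7.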
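The initialize / E-step / M-step / iterate loop can be made concrete with a small sketch. This is not the Forward-Backward algorithm the slides point to; it is a minimal two-coin mixture, chosen because both steps fit in a few lines: each observation is a count of heads out of n flips, and the hidden data Y is which of two biased coins produced it. All names and numbers are illustrative.

```python
import math

# Observed data X: (heads, flips) for each of five sessions. Which coin
# produced each session is the hidden data Y.
data = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]

def log_likelihood(theta_a, theta_b, data):
    """Marginal log likelihood log P(X|theta), summing out the hidden
    coin choice (uniform prior over the two coins)."""
    ll = 0.0
    for h, n in data:
        pa = theta_a ** h * (1 - theta_a) ** (n - h)
        pb = theta_b ** h * (1 - theta_b) ** (n - h)
        ll += math.log(0.5 * pa + 0.5 * pb)
    return ll

def em(data, theta_a=0.6, theta_b=0.5, iters=20):
    """Run `iters` rounds of the E/M loop; return the two coin biases."""
    for _ in range(iters):
        # E-step: posterior responsibility of coin A for each session,
        # accumulated into expected head/tail counts for each coin.
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, n in data:
            pa = theta_a ** h * (1 - theta_a) ** (n - h)
            pb = theta_b ** h * (1 - theta_b) ** (n - h)
            w = pa / (pa + pb)  # P(coin = A | this session)
            heads_a += w * h
            tails_a += w * (n - h)
            heads_b += (1 - w) * h
            tails_b += (1 - w) * (n - h)
        # M-step: re-estimate each coin's bias from the expected counts,
        # i.e. the closed-form MLE p = m/N from the single-coin example.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b
```

Each iteration provably does not decrease the marginal log likelihood, which is the guarantee that makes the "iterate until convergence" step of the bird's-eye view well defined.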