
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 15 • 3/2/2013

  2. Recommended reading • Jurafsky & Martin • Chapter 5 through 5.7, Part of Speech Tagging • Chapter 6 through 6.5, Hidden Markov Models • Toutanova & Manning (2000): error analysis for POS tagging

  3. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  4. Parameters of an HMM • States: a set of states S = s_1, …, s_n • Initial state distribution: π_i is the probability that s_i is a start state • Transition probabilities: A = a_{1,1}, a_{1,2}, …, a_{n,n}. Each a_{i,j} is the prob. of transitioning from state s_i to state s_j • Emission probabilities: B = b_1(o_t), b_2(o_t), …, b_n(o_t). b_i(o_t) is the prob. of emitting observation o_t at state s_i
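A minimal sketch of how these parameters might be stored in code for the toy DT / VB / NN example used below. The numbers are illustrative assumptions, not the exact values from the lecture's diagram:

    # Hypothetical toy HMM over three POS states; all probabilities are made-up examples.
    states = ["DT", "VB", "NN"]

    # Initial state distribution: pi[i] = p(first state is i)
    pi = {"DT": 0.3, "VB": 0.4, "NN": 0.3}

    # Transition probabilities: A[i][j] = p(next state is j | current state is i)
    A = {
        "DT": {"DT": 0.05, "VB": 0.05, "NN": 0.90},
        "VB": {"DT": 0.50, "VB": 0.10, "NN": 0.40},
        "NN": {"DT": 0.10, "VB": 0.60, "NN": 0.30},
    }

    # Emission probabilities: B[i][o] = p(observation o | state i), over a tiny vocabulary
    B = {
        "DT": {"the": 1.0, "heat": 0.0, "oil": 0.0},
        "VB": {"the": 0.0, "heat": 0.7, "oil": 0.3},
        "NN": {"the": 0.0, "heat": 0.2, "oil": 0.8},
    }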

  5. Toy HMM (will work through computations later) • [Diagram: a three-state HMM over the POS tags DT, VB, and NN, with transition and emission probabilities labeled on the arcs]

  6. Same, in terms of HMM parameters • [The same diagram, with the states indexed 0 (DT), 1 (VB), and 2 (NN)]

  7. Path through HMM • A path through the HMM generates an observation sequence O and a hidden state sequence S. • Training data: pairs of (O, S) • POS tagging: given O, infer the most likely S

  8. POS tagging using an HMM • We are given W = w_1, …, w_n, and want to find the tag sequence T = t_1, …, t_n that maximizes p(T | W): argmax_T p(T | W) = argmax_T p(W | T) · p(T) / p(W) = argmax_T p(W | T) · p(T) • According to the parameters of the HMM: p(W | T) · p(T) = Π_{i=1..n} b_{t_i}(w_i) · a_{t_{i-1}, t_i}, i.e. a product of emission and transition probabilities, with the first transition taken from the initial distribution π

  9. The three basic HMM problems • Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we compute p(O | λ), the prob. of O given the model? • A = state transition probs. • B = symbol emission probs. • π = initial state probs. • Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we find argmax_S p(S | O, λ), the state sequence S that best explains the observations?

  10. The three basic HMM problems • Problem 3 (Learning): How do we set the model parameters λ = (A, B, π), to maximize p(O| λ)? • If supervised tagging: • Read parameters directly off of labeled training data • If unsupervised tagging: • We are given observation sequences only, and don’t know the corresponding hidden sequences • Apply Baum-Welch algorithm

  11. Example • Evaluation: what is the probability of “heat the oil” given the HMM? p(heat the oil | λ) • Problem: many possible POS tag state sequences for this observation sequence • Decoding: what is the most likely POS tag state sequence for “heat the oil”? argmax_T p(T | heat the oil) = argmax_T p(heat the oil | T) · p(T) = argmax_T p(heat the oil, T)

  12. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  13. Problem 1: probability of an observation sequence • What is p(O | λ)? • The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM that generate this observation sequence. • For example, p(‘heat the oil’ | λ) is the sum of probabilities of all possible tag sequences for the sentence, such as VB DT NN and NN DT NN

  14. Problem 1: probability of an observation sequence • What is p(O | λ)? • The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM that generate this observation sequence. • Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. • Even for small HMMs: e.g., when T = 10 and N = 10, there are 10^10 = 10 billion different paths • Solution: use dynamic programming • Can calculate either the forward probability or the backward probability

  15. Dynamic programming • General algorithmic strategy for problems where: • There is an exponential number of cases to consider • The problem can be broken apart into subproblems • Recursively computed solutions to subproblems can be used to solve larger problems • Solutions to subproblems are used multiple times

  16. HMM trellis: shows the set of states at each time t • s_{i,j} = state j at time i

  17. Forward probabilities • Given an HMM λ, what is the probability that at time t the state is i, and the partial observation sequence o_1, …, o_t is generated? α_t(i) = p(o_1, …, o_t, q_t = s_i | λ)

  18. Forward probability recurrence • α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, α: forward prob. for a state, b: emission prob.

  19. Forward Algorithm • Initialization: α_1(i) = π_i · b_i(o_1), for 1 ≤ i ≤ N • Induction: α_t(j) = [Σ_{i=1..N} α_{t-1}(i) · a_{i,j}] · b_j(o_t), for 2 ≤ t ≤ T • Termination: the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_{i=1..N} α_T(i)
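A sketch of the forward algorithm in Python, following the initialization / induction / termination steps above. It assumes the dictionary layout for π, A, B from the earlier toy sketch; it is an illustration, not the lecture's own code:

    def forward(obs, states, pi, A, B):
        """Forward probabilities alpha[t][j] and the total probability p(O | lambda)."""
        T = len(obs)
        alpha = [{} for _ in range(T)]
        # Initialization: alpha_1(i) = pi_i * b_i(o_1)
        for i in states:
            alpha[0][i] = pi[i] * B[i][obs[0]]
        # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        for t in range(1, T):
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
        # Termination: p(O | lambda) = sum_i alpha_T(i)
        return alpha, sum(alpha[T - 1][i] for i in states)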

  20. Forward algorithm complexity • Naïve approach (enumerating all possible paths) is O(2T · N^T) • Forward algorithm is O(N^2 · T) because of dynamic programming

  21. Example HMM • [The same three-state DT / VB / NN diagram as before]

  22. Prob. of “heat the oil” given the HMM • Marginalize over final states: p(O | λ) = Σ_i p(O, q_T = s_i | λ) • Rewrite in terms of alpha: p(O | λ) = Σ_i α_T(i), with T = 3 here

  23. Work through on board: Compute p(heat the oil | λ) • Inputs: • HMM λ = (π, a, b) • o = heat the oil • Procedure: • Draw trellis • T = 3 because there are 3 observations in o • Compute forward probabilities α_t(i) for each time step t and each state i, given the observations o • Use forward algorithm • Return Σ_i α_3(i)
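Under the made-up toy parameters and the forward() sketch given earlier (both are illustrative assumptions), the board computation corresponds to:

    # p(heat the oil | lambda) for the hypothetical toy HMM
    alpha, prob = forward(["heat", "the", "oil"], states, pi, A, B)
    print(prob)   # approx. 0.107 with the toy numbers above; the real answer depends on the actual HMM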

  24. Forward Algorithm again • Initialization: α_1(i) = π_i · b_i(o_1) • Induction: α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • Termination: the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_i α_T(i)

  25. (optional) Backward probabilities • Given an HMM λ, what is the probability that, at time t, the state is i, and the partial observation sequence o_{t+1}, …, o_T is generated? β_t(i) = p(o_{t+1}, …, o_T | q_t = s_i, λ) • Analogous to the forward probability, just in the other direction

  26. (optional) Backward probability recurrence • β_t(i) = Σ_j a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, β: backward prob. for a state, b: emission prob.

  27. (optional) Backward algorithm • Initialization: β_T(i) = 1 • Induction: β_t(i) = Σ_j a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j), for t = T-1, …, 1 • Termination: the prob. of the observation is the sum of the probs. of all backward paths for that observation: p(O | λ) = Σ_i π_i · b_i(o_1) · β_1(i)
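A matching sketch of the backward algorithm, again assuming the dictionary representation of π, A, B from the earlier toy example:

    def backward(obs, states, pi, A, B):
        """Backward probabilities beta[t][i] and the total probability p(O | lambda)."""
        T = len(obs)
        beta = [{} for _ in range(T)]
        # Initialization: beta_T(i) = 1
        for i in states:
            beta[T - 1][i] = 1.0
        # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
        # Termination: p(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
        return beta, sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in states)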

  28. Forward and backward algorithms • At termination, the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_i α_T(i) • At termination, the prob. of the observation is the sum of the probs. of all backward paths for that observation: p(O | λ) = Σ_i π_i · b_i(o_1) · β_1(i)

  29. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  30. Problem 2: decoding • Given the observation sequence O = o_1, …, o_T and an HMM, how do we find the state sequence that best explains the observations? • We want to find the state sequence Q = q_1, …, q_T that maximizes its conditional probability given: • O, the observation sequence, and • λ, the parameters of the HMM • Example: argmax_{Q’} p(Q’ | ‘heat the oil’, λ) = most likely tag sequence for the sentence

  31. Viterbi algorithm: probability of most likely path given an observation • Similar to computing forward probabilities, but instead of summing over transitions from incoming states, compute the maximum • Forward: α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • Viterbi: δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t)

  32. Core idea of Viterbi algorithm • δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, δ: Viterbi prob. for a state, b: emission prob.

  33. Which path was most likely? • Viterbi algorithm: probability of most likely path given an observation • Doesn’t tell you which path was most likely • Need to also keep track of Ψ_t(j) = i • For the current state j at time t, which previous state i led to the most likely path to the current state • Backpointers • Compute q_t: after path probabilities have been computed, recursively follow backpointers to recover the most likely path through time t

  34. Viterbi algorithm • Initialization: δ_1(i) = π_i · b_i(o_1), Ψ_1(i) = 0 • Induction: δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t), Ψ_t(j) = argmax_i δ_{t-1}(i) · a_{i,j} • Termination: p* = max_i δ_T(i), q*_T = argmax_i δ_T(i) • Read out path: q*_t = Ψ_{t+1}(q*_{t+1}), for t = T-1, …, 1
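A sketch of the Viterbi algorithm with backpointers, in the same style as the forward() sketch (same assumed dictionary layout for π, A, B):

    def viterbi(obs, states, pi, A, B):
        """Most likely state sequence for obs, plus its probability."""
        T = len(obs)
        delta = [{} for _ in range(T)]   # delta[t][j]: prob. of the best path ending in state j at time t
        psi = [{} for _ in range(T)]     # psi[t][j]: backpointer to the best previous state
        # Initialization
        for i in states:
            delta[0][i] = pi[i] * B[i][obs[0]]
            psi[0][i] = None
        # Induction: take the max over incoming states instead of a sum
        for t in range(1, T):
            for j in states:
                best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
                delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
                psi[t][j] = best_i
        # Termination: best final state, then follow backpointers to read out the path
        last = max(states, key=lambda i: delta[T - 1][i])
        path = [last]
        for t in range(T - 1, 0, -1):
            path.append(psi[t][path[-1]])
        path.reverse()
        return path, delta[T - 1][last]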

  35. Example HMM • [The same three-state DT / VB / NN diagram as before]

  36. Demonstrate Viterbi algorithm on board
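For reference, the same demonstration with the viterbi() sketch and the hypothetical toy parameters from earlier:

    path, prob = viterbi(["heat", "the", "oil"], states, pi, A, B)
    print(path, prob)   # ['VB', 'DT', 'NN'] with the toy numbers, i.e. verb, determiner, noun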

  37. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  38. Let’s toss a coin • 1000 trials • 300 heads • 700 tails • p(heads) = .3 • This is the maximum likelihood estimate • What does this mean?

  39. Let’s toss a coin 3 times, twice • Suppose we know how the coin is weighted: p(head) = .3 • Therefore, p(tail) = 1 - p(head) = .7 • Two independent trials: • Toss it 3 times: get head, tail, tail p(head, tail, tail) = .3 * .7 * .7 = .147 • Toss it 3 times: get tail, tail, head p(tail, tail, head) = .7 * .7 * .3 = .147

  40. Likelihood • Prob. of a set of observations O: p(O) • To calculate this, we need the parameters of the model λ that generated the observations • Rewrite this as the likelihood function, the joint prob. of our observations given the model params: L(O, λ) = p(O | λ)

  41. Likelihood for our set of observations • Suppose: O = { <head, tail, tail>, <tail, tail, head> }, λ: p(head) = .3 • L(O, λ) = p(O | λ) = (.3*.7*.7) * (.7*.7*.3) = .147 * .147 ≈ 0.0216

  42. Maximum likelihood estimate • Suppose we don’t know λ • p(head) is unknown. • We can make a guess about λ and then calculate the likelihood of the data. • The “best” guess is the maximum likelihood estimate, which is the set of parameter values λ that maximizes L(O, λ)

  43. Value of λ that maximizes the likelihood is the true probability • O = { <head, tail, tail>, <tail, tail, head> } • Make a guess about λ • If we guess that p(head) = 0.2, then L(O, λ) = (.2*.8*.8) * (.8*.8*.2) ≈ 0.0164 • If we guess that p(head) = 0.3, then L(O, λ) = (.3*.7*.7) * (.7*.7*.3) ≈ 0.0216 ← maximum among these guesses • If we guess that p(head) = 0.4, then L(O, λ) = (.4*.6*.6) * (.6*.6*.4) ≈ 0.0207 • If we guess that p(head) = 0.5, then L(O, λ) = (.5*.5*.5) * (.5*.5*.5) ≈ 0.0156
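This grid of guesses is easy to reproduce in code. A small sketch, trying only the candidate values listed on the slide:

    # Likelihood of the two three-toss trials as a function of p(head)
    O = [["H", "T", "T"], ["T", "T", "H"]]

    def likelihood(p_head, observations):
        total = 1.0
        for trial in observations:
            for toss in trial:
                total *= p_head if toss == "H" else (1.0 - p_head)
        return total

    for p in [0.2, 0.3, 0.4, 0.5]:
        print(p, likelihood(p, O))   # 0.3 gives the largest likelihood among these guesses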

  44. Maximization of log L(O, λ) also maximizes L(O, λ) • O = { <head, tail, tail>, <tail, tail, head> } • Make a guess about λ (natural log below) • If we guess that p(head) = 0.2, then log L(O, λ) = log((.2*.8*.8) * (.8*.8*.2)) ≈ -4.11 • If we guess that p(head) = 0.3, then log L(O, λ) = log((.3*.7*.7) * (.7*.7*.3)) ≈ -3.83 ← maximum among these guesses • If we guess that p(head) = 0.4, then log L(O, λ) = log((.4*.6*.6) * (.6*.6*.4)) ≈ -3.88 • If we guess that p(head) = 0.5, then log L(O, λ) = log((.5*.5*.5) * (.5*.5*.5)) ≈ -4.16 • Equivalently, minimize the negative log-likelihood

  45. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  46. The three basic HMM problems • Problem 3 (Learning): How do we set the model parameters λ = (A, B, π), to maximize p(O| λ)? • If doing supervised learning: • Just read parameters directly off of labeled training data • If doing unsupervised learning : • We are given observation sequences only, and don’t know their corresponding hidden sequences • Apply Baum-Welch algorithm

  47. Problem 3: unsupervised learning • So far we’ve assumed that we know the underlying model λ = (A, B, π) • Often these parameters are estimated on annotated training data, but: • Annotation is often difficult and/or expensive • Testing data may be different from training data • We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ’ such that λ’ = argmax_λ p(O | λ)

  48. p(O | λ): Likelihood • O: corpus of observations • An individual observation o can occur multiple times; let c(o) be its count • Likelihood of a corpus: L(O, λ) = Π_o p(o | λ)^c(o) • Log-likelihood: log L(O, λ) = Σ_o c(o) · log p(o | λ)
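A sketch of this corpus log-likelihood in code; p_of(o, lam), which returns p(o | λ), is a stand-in for whatever the model provides (e.g. the forward probability of a sentence under an HMM):

    import math
    from collections import Counter

    def log_likelihood(corpus, p_of, lam):
        """Sum of c(o) * log p(o | lambda) over the distinct observations in the corpus."""
        counts = Counter(corpus)
        return sum(c * math.log(p_of(o, lam)) for o, c in counts.items())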

  49. Want to maximize the likelihood • We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ’ such that λ’ = argmax_λ p(O | λ) • Plug in different models λ and choose the best one, i.e. the one that maximizes the (log-)likelihood of the data

  50. Optimization of model parameters • Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ’ such that λ’ = argmax_λ p(O | λ) • But it is possible to find a local maximum • Given an initial model λ, we can always find a model λ’ such that p(O | λ’) ≥ p(O | λ)
