
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 15 • 3/2/2013

  2. Recommended reading • Jurafsky & Martin • Chapter 5 through 5.7, Part of Speech Tagging • Chapter 6 through 6.5, Hidden Markov Models • Toutanova & Manning (2000): error analysis for POS tagging

  3. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  4. Parameters of an HMM • States: a set of states S = s_1, …, s_n • Initial state distribution: π_i is the probability that s_i is a start state • Transition probabilities: A = a_{1,1}, a_{1,2}, …, a_{n,n}. Each a_{i,j} is the prob. of transitioning from state s_i to state s_j • Emission probabilities: B = b_1(o_t), b_2(o_t), …, b_n(o_t). b_i(o_t) is the prob. of emitting observation o_t at state s_i
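A minimal sketch of how these parameters might be stored in code for the toy DT / VB / NN example used below. The numbers are illustrative assumptions, not the exact values from the lecture's diagram:

    # Hypothetical toy HMM over three POS states; all probabilities are made-up examples.
    states = ["DT", "VB", "NN"]

    # Initial state distribution: pi[i] = p(first state is i)
    pi = {"DT": 0.3, "VB": 0.4, "NN": 0.3}

    # Transition probabilities: A[i][j] = p(next state is j | current state is i)
    A = {
        "DT": {"DT": 0.05, "VB": 0.05, "NN": 0.90},
        "VB": {"DT": 0.50, "VB": 0.10, "NN": 0.40},
        "NN": {"DT": 0.10, "VB": 0.60, "NN": 0.30},
    }

    # Emission probabilities: B[i][o] = p(observation o | state i), over a tiny vocabulary
    B = {
        "DT": {"the": 1.0, "heat": 0.0, "oil": 0.0},
        "VB": {"the": 0.0, "heat": 0.7, "oil": 0.3},
        "NN": {"the": 0.0, "heat": 0.2, "oil": 0.8},
    }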

  5. Toy HMM (will work through computations later) • [Diagram: a three-state HMM over the POS tags DT, VB, and NN, with transition and emission probabilities labeled on the arcs]

  6. Same, in terms of HMM parameters • [The same diagram, with the states indexed 0 (DT), 1 (VB), and 2 (NN)]

  7. Path through HMM • A path through the HMM generates an observation sequence O and a hidden state sequence S. • Training data: pairs of (O, S) • POS tagging: given O, infer the most likely S

  8. POS tagging using an HMM • We are given W = w_1, …, w_n, and want to find the tag sequence T = t_1, …, t_n that maximizes p(T | W): argmax_T p(T | W) = argmax_T p(W | T) · p(T) / p(W) = argmax_T p(W | T) · p(T) • According to the parameters of the HMM: p(W | T) · p(T) = Π_{i=1..n} b_{t_i}(w_i) · a_{t_{i-1}, t_i}, i.e. a product of emission and transition probabilities, with the first transition taken from the initial distribution π

  9. The three basic HMM problems • Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we compute p(O | λ), the prob. of O given the model? • A = state transition probs. • B = symbol emission probs. • π = initial state probs. • Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we find argmax_S p(S | O, λ), the state sequence S that best explains the observations?

  10. The three basic HMM problems • Problem 3 (Learning): How do we set the model parameters λ = (A, B, π), to maximize p(O| λ)? • If supervised tagging: • Read parameters directly off of labeled training data • If unsupervised tagging: • We are given observation sequences only, and don’t know the corresponding hidden sequences • Apply Baum-Welch algorithm

  11. Example • Evaluation: what is the probability of “heat the oil” given the HMM? p(heat the oil | λ) • Problem: many possible POS tag state sequences for this observation sequence • Decoding: what is the most likely POS tag state sequence for “heat the oil”? argmax_T p(T | heat the oil) = argmax_T p(heat the oil | T) · p(T) = argmax_T p(heat the oil, T)

  12. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  13. Problem 1: probability of an observation sequence • What is p(O | λ)? • The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM that generate this observation sequence. • For example, p(‘heat the oil’ | λ) is the sum of probabilities of all possible tag sequences for the sentence, such as VB DT NN and NN DT NN

  14. Problem 1: probability of an observation sequence • What is p(O | λ)? • The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM that generate this observation sequence. • Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. • Even for small HMMs: e.g., when T = 10 and N = 10, there are 10^10 = 10 billion different paths • Solution: use dynamic programming • Can calculate either the forward probability or the backward probability

  15. Dynamic programming • General algorithmic strategy for problems where: • There is an exponential number of cases to consider • The problem can be broken apart into subproblems • Recursively computed solutions to subproblems can be used to solve larger problems • Solutions to subproblems are used multiple times

  16. HMM trellis: shows the set of states at each time t • s_{i,j} = state j at time i

  17. Forward probabilities • Given an HMM λ, what is the probability that at time t the state is i, and the partial observation sequence o_1, …, o_t is generated? α_t(i) = p(o_1, …, o_t, q_t = s_i | λ)

  18. Forward probability recurrence • α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, α: forward prob. for a state, b: emission prob.

  19. Forward Algorithm • Initialization: α_1(i) = π_i · b_i(o_1), for 1 ≤ i ≤ N • Induction: α_t(j) = [Σ_{i=1..N} α_{t-1}(i) · a_{i,j}] · b_j(o_t), for 2 ≤ t ≤ T • Termination: the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_{i=1..N} α_T(i)
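A sketch of the forward algorithm in Python, following the initialization / induction / termination steps above. It assumes the dictionary layout for π, A, B from the earlier toy sketch; it is an illustration, not the lecture's own code:

    def forward(obs, states, pi, A, B):
        """Forward probabilities alpha[t][j] and the total probability p(O | lambda)."""
        T = len(obs)
        alpha = [{} for _ in range(T)]
        # Initialization: alpha_1(i) = pi_i * b_i(o_1)
        for i in states:
            alpha[0][i] = pi[i] * B[i][obs[0]]
        # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        for t in range(1, T):
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
        # Termination: p(O | lambda) = sum_i alpha_T(i)
        return alpha, sum(alpha[T - 1][i] for i in states)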

  20. Forward algorithm complexity • Naïve approach (enumerating all possible paths) is O(2T · N^T) • Forward algorithm is O(N^2 · T) because of dynamic programming

  21. Example HMM • [The same three-state DT / VB / NN diagram as before]

  22. Prob. of “heat the oil” given the HMM • Marginalize over final states: p(O | λ) = Σ_i p(O, q_T = s_i | λ) • Rewrite in terms of alpha: p(O | λ) = Σ_i α_T(i), with T = 3 here

  23. Work through on board: Compute p(heat the oil | λ) • Inputs: • HMM λ = (π, a, b) • o = heat the oil • Procedure: • Draw trellis • T = 3 because there are 3 observations in o • Compute forward probabilities α_t(i) for each time step t and each state i, given the observations o • Use forward algorithm • Return Σ_i α_3(i)
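Under the made-up toy parameters and the forward() sketch given earlier (both are illustrative assumptions), the board computation corresponds to:

    # p(heat the oil | lambda) for the hypothetical toy HMM
    alpha, prob = forward(["heat", "the", "oil"], states, pi, A, B)
    print(prob)   # approx. 0.107 with the toy numbers above; the real answer depends on the actual HMM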

  24. Forward Algorithm again • Initialization: α_1(i) = π_i · b_i(o_1) • Induction: α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • Termination: the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_i α_T(i)

  25. (optional) Backward probabilities • Given an HMM λ, what is the probability that, at time t, the state is i, and the partial observation sequence o_{t+1}, …, o_T is generated? β_t(i) = p(o_{t+1}, …, o_T | q_t = s_i, λ) • Analogous to the forward probability, just in the other direction

  26. (optional) Backward probability recurrence • β_t(i) = Σ_j a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, β: backward prob. for a state, b: emission prob.

  27. (optional) Backward algorithm • Initialization: β_T(i) = 1 • Induction: β_t(i) = Σ_j a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j), for t = T-1, …, 1 • Termination: the prob. of the observation is the sum of the probs. of all backward paths for that observation: p(O | λ) = Σ_i π_i · b_i(o_1) · β_1(i)
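A matching sketch of the backward algorithm, again assuming the dictionary representation of π, A, B from the earlier toy example:

    def backward(obs, states, pi, A, B):
        """Backward probabilities beta[t][i] and the total probability p(O | lambda)."""
        T = len(obs)
        beta = [{} for _ in range(T)]
        # Initialization: beta_T(i) = 1
        for i in states:
            beta[T - 1][i] = 1.0
        # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
        # Termination: p(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
        return beta, sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in states)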

  28. Forward and backward algorithms • At termination, the prob. of the observation is the sum of the probs. of all forward paths for that observation: p(O | λ) = Σ_i α_T(i) • At termination, the prob. of the observation is the sum of the probs. of all backward paths for that observation: p(O | λ) = Σ_i π_i · b_i(o_1) · β_1(i)

  29. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  30. Problem 2: decoding • Given the observation sequence O = o_1, …, o_T and an HMM, how do we find the state sequence that best explains the observations? • We want to find the state sequence Q = q_1, …, q_T that maximizes its conditional probability given: • O, the observation sequence, and • λ, the parameters of the HMM • Example: argmax_{Q’} p(Q’ | ‘heat the oil’, λ) = most likely tag sequence for the sentence

  31. Viterbi algorithm: probability of most likely path given an observation • Similar to computing forward probabilities, but instead of summing over transitions from incoming states, compute the maximum • Forward: α_t(j) = [Σ_i α_{t-1}(i) · a_{i,j}] · b_j(o_t) • Viterbi: δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t)

  32. Core idea of Viterbi algorithm • δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t) • o: observation, t: time, i: source state, j: destination state, a: prob. of state transition, δ: Viterbi prob. for a state, b: emission prob.

  33. Which path was most likely? • Viterbi algorithm: probability of most likely path given an observation • Doesn’t tell you which path was most likely • Need to also keep track of Ψ_t(j) = i • For the current state j at time t, which previous state i led to the most likely path to the current state • Backpointers • Compute q_t: after path probabilities have been computed, recursively follow backpointers to recover the most likely path through time t

  34. Viterbi algorithm • Initialization: δ_1(i) = π_i · b_i(o_1), Ψ_1(i) = 0 • Induction: δ_t(j) = [max_i δ_{t-1}(i) · a_{i,j}] · b_j(o_t), Ψ_t(j) = argmax_i δ_{t-1}(i) · a_{i,j} • Termination: p* = max_i δ_T(i), q*_T = argmax_i δ_T(i) • Read out path: q*_t = Ψ_{t+1}(q*_{t+1}), for t = T-1, …, 1
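A sketch of the Viterbi algorithm with backpointers, in the same style as the forward() sketch (same assumed dictionary layout for π, A, B):

    def viterbi(obs, states, pi, A, B):
        """Most likely state sequence for obs, plus its probability."""
        T = len(obs)
        delta = [{} for _ in range(T)]   # delta[t][j]: prob. of the best path ending in state j at time t
        psi = [{} for _ in range(T)]     # psi[t][j]: backpointer to the best previous state
        # Initialization
        for i in states:
            delta[0][i] = pi[i] * B[i][obs[0]]
            psi[0][i] = None
        # Induction: take the max over incoming states instead of a sum
        for t in range(1, T):
            for j in states:
                best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
                delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
                psi[t][j] = best_i
        # Termination: best final state, then follow backpointers to read out the path
        last = max(states, key=lambda i: delta[T - 1][i])
        path = [last]
        for t in range(T - 1, 0, -1):
            path.append(psi[t][path[-1]])
        path.reverse()
        return path, delta[T - 1][last]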

  35. Example HMM • [The same three-state DT / VB / NN diagram as before]

  36. Demonstrate Viterbi algorithm on board
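For reference, the same demonstration with the viterbi() sketch and the hypothetical toy parameters from earlier:

    path, prob = viterbi(["heat", "the", "oil"], states, pi, A, B)
    print(path, prob)   # ['VB', 'DT', 'NN'] with the toy numbers, i.e. verb, determiner, noun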

  37. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  38. Let’s toss a coin • 1000 trials • 300 heads • 700 tails • p(heads) = .3 • This is the maximum likelihood estimate • What does this mean?

  39. Let’s toss a coin 3 times, twice • Suppose we know how the coin is weighted: p(head) = .3 • Therefore, p(tail) = 1 - p(head) = .7 • Two independent trials: • Toss it 3 times: get head, tail, tail p(head, tail, tail) = .3 * .7 * .7 = .147 • Toss it 3 times: get tail, tail, head p(tail, tail, head) = .7 * .7 * .3 = .147

  40. Likelihood • Prob. of a set of observations O: p(O) • To calculate this, we need the parameters of the model λ that generated the observations • Rewrite this as the likelihood function, the joint prob. of our observations given the model params: L(O, λ) = p(O | λ)

  41. Likelihood for our set of observations • Suppose: O = { <head, tail, tail>, <tail, tail, head> }, λ: p(head) = .3 • L(O, λ) = p(O | λ) = (.3*.7*.7) * (.7*.7*.3) = .147 * .147 ≈ 0.0216

  42. Maximum likelihood estimate • Suppose we don’t know λ • p(head) is unknown. • We can make a guess about λ and then calculate the likelihood of the data. • The “best” guess is the maximum likelihood estimate, which is the set of parameter values λ that maximizes L(O, λ)

  43. Value of λ that maximizes the likelihood is the true probability • O = { <head, tail, tail>, <tail, tail, head> } • Make a guess about λ • If we guess that p(head) = 0.2, then L(O, λ) = (.2*.8*.8) * (.8*.8*.2) ≈ 0.0164 • If we guess that p(head) = 0.3, then L(O, λ) = (.3*.7*.7) * (.7*.7*.3) ≈ 0.0216 ← maximum among these guesses • If we guess that p(head) = 0.4, then L(O, λ) = (.4*.6*.6) * (.6*.6*.4) ≈ 0.0207 • If we guess that p(head) = 0.5, then L(O, λ) = (.5*.5*.5) * (.5*.5*.5) ≈ 0.0156
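This grid of guesses is easy to reproduce in code. A small sketch, trying only the candidate values listed on the slide:

    # Likelihood of the two three-toss trials as a function of p(head)
    O = [["H", "T", "T"], ["T", "T", "H"]]

    def likelihood(p_head, observations):
        total = 1.0
        for trial in observations:
            for toss in trial:
                total *= p_head if toss == "H" else (1.0 - p_head)
        return total

    for p in [0.2, 0.3, 0.4, 0.5]:
        print(p, likelihood(p, O))   # 0.3 gives the largest likelihood among these guesses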

  44. Maximization of log L(O, λ) also maximizes L(O, λ) • O = { <head, tail, tail>, <tail, tail, head> } • Make a guess about λ (natural log below) • If we guess that p(head) = 0.2, then log L(O, λ) = log((.2*.8*.8) * (.8*.8*.2)) ≈ -4.11 • If we guess that p(head) = 0.3, then log L(O, λ) = log((.3*.7*.7) * (.7*.7*.3)) ≈ -3.83 ← maximum among these guesses • If we guess that p(head) = 0.4, then log L(O, λ) = log((.4*.6*.6) * (.6*.6*.4)) ≈ -3.88 • If we guess that p(head) = 0.5, then log L(O, λ) = log((.5*.5*.5) * (.5*.5*.5)) ≈ -4.16 • Equivalently, minimize the negative log-likelihood

  45. Outline • HMM review • The 3 HMM problems: evaluation • The 3 HMM problems: decoding • Maximum likelihood estimation (optional) • The 3 HMM problems: learning (optional) • Evaluation of POS tagging

  46. The three basic HMM problems • Problem 3 (Learning): How do we set the model parameters λ = (A, B, π), to maximize p(O| λ)? • If doing supervised learning: • Just read parameters directly off of labeled training data • If doing unsupervised learning : • We are given observation sequences only, and don’t know their corresponding hidden sequences • Apply Baum-Welch algorithm

  47. Problem 3: unsupervised learning • So far we’ve assumed that we know the underlying model λ = (A, B, π) • Often these parameters are estimated on annotated training data, but: • Annotation is often difficult and/or expensive • Testing data may be different from training data • We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ’ such that λ’ = argmax_λ p(O | λ)

  48. p(O | λ): Likelihood • O: corpus of observations • An individual observation o can occur multiple times; let c(o) be its count • Likelihood of a corpus: L(O, λ) = Π_o p(o | λ)^c(o) • Log-likelihood: log L(O, λ) = Σ_o c(o) · log p(o | λ)
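A sketch of this corpus log-likelihood in code; p_of(o, lam), which returns p(o | λ), is a stand-in for whatever the model provides (e.g. the forward probability of a sentence under an HMM):

    import math
    from collections import Counter

    def log_likelihood(corpus, p_of, lam):
        """Sum of c(o) * log p(o | lambda) over the distinct observations in the corpus."""
        counts = Counter(corpus)
        return sum(c * math.log(p_of(o, lam)) for o, c in counts.items())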

  49. Want to maximize the likelihood • We want to maximize the parameters with respect to the current data, i.e., we’re looking for a model λ’ such that λ’ = argmax_λ p(O | λ) • Plug in different models λ and choose the best one, i.e. the one that maximizes the (log-)likelihood of the data

  50. Optimization of model parameters • Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ’ such that λ’ = argmax_λ p(O | λ) • But it is possible to find a local maximum • Given an initial model λ, we can always find a model λ’ such that p(O | λ’) ≥ p(O | λ)
