Advanced Techniques for Language Modeling and Text Probability Computation

Three Basic Problems
• Compute the probability of a text (observation sequence): $P_m(W_{1,N})$
  • Language modeling: evaluate alternative texts and models
• Compute the maximum-probability tag (state) sequence: $\arg\max_{T_{1,N}} P_m(T_{1,N} \mid W_{1,N})$
  • Tagging / classification
• Compute the maximum-likelihood model: $\arg\max_m P_m(W_{1,N})$
  • Training / parameter estimation

Compute Text Probability
• Recall: $P(W,T) = \prod_i P(t_{i-1} \to t_i)\, P(w_i \mid t_i)$
• Text probability: $P(W) = \sum_T P(W,T)$, a sum over all possible tag sequences, whose number is exponential in the text length
• Dynamic programming approach, similar to the Viterbi algorithm
• Will also be used for estimating model parameters from an untagged corpus

Forward Algorithm
Define: $A_i(k) = P(w_{1,k},\, t_k = t_i)$; $N_t$ is the total number of tags
• For $i = 1$ to $N_t$: $A_i(1) = m(t_0 \to t_i)\, m(w_1 \mid t_i)$
• For $k = 2$ to $N$; for $j = 1$ to $N_t$:
  • $A_j(k) = \left[\sum_i A_i(k-1)\, m(t_i \to t_j)\right] m(w_k \mid t_j)$
• Then: $P_m(W_{1,N}) = \sum_i A_i(N)$
Complexity: $O(N_t^2 N)$ (like Viterbi, with $\sum$ instead of $\max$)
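The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `init`, `trans`, and `emit` and the two-tag toy model with made-up probabilities are assumptions introduced here, and positions are 0-based in the code.

```python
def forward(words, init, trans, emit):
    """Return (A, prob), where A[k][i] is the forward value for tag i at position k."""
    n_tags = len(init)
    # Initialization: A_i(1) = m(t0 -> ti) * m(w1 | ti)
    A = [[init[i] * emit[i].get(words[0], 0.0) for i in range(n_tags)]]
    for k in range(1, len(words)):
        prev = A[-1]
        row = []
        for j in range(n_tags):
            # Sum over all predecessor tags (Viterbi would take max here)
            total = sum(prev[i] * trans[i][j] for i in range(n_tags))
            row.append(total * emit[j].get(words[k], 0.0))
        A.append(row)
    # Termination: P(W) = sum_i A_i(N)
    return A, sum(A[-1])


# Hypothetical toy model: tag 0 and tag 1, vocabulary {"fish", "sleep"}
init = [0.6, 0.4]                      # m(t0 -> ti)
trans = [[0.3, 0.7], [0.8, 0.2]]       # trans[i][j] = m(ti -> tj)
emit = [{"fish": 0.5, "sleep": 0.5},   # emit[i][w] = m(w | ti)
        {"fish": 0.4, "sleep": 0.6}]

A, prob = forward(["fish", "sleep"], init, trans, emit)
print(prob)  # about 0.2542
```

The two nested loops over tags inside the loop over positions give the $O(N_t^2 N)$ cost stated above.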

[Figure: Forward Algorithm trellis for words $w_1, w_2, w_3$ over tags $t_1, \dots, t_5$. Each column holds the values $A_i(1)$, $A_i(2)$, $A_i(3)$; arcs are labeled with the initial probabilities $m(t_0 \to t_i)$ and the transitions $m(t_i \to t_1)$; the final column yields $P_m(W_{1,3})$.]

Backward Algorithm
Define: $B_i(k) = P(w_{k+1,N} \mid t_k = t_i)$
• For $i = 1$ to $N_t$: $B_i(N) = 1$
• For $k = N-1$ down to $1$; for $j = 1$ to $N_t$:
  • $B_j(k) = \sum_i m(t_j \to t_i)\, m(w_{k+1} \mid t_i)\, B_i(k+1)$
• Then: $P_m(W_{1,N}) = \sum_i m(t_0 \to t_i)\, m(w_1 \mid t_i)\, B_i(1)$
Complexity: $O(N_t^2 N)$
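A matching Python sketch of the backward pass, under the same assumptions as any toy example: `init`, `trans`, `emit` and the probabilities below are illustrative, not part of the slides, and positions are 0-based.

```python
def backward(words, init, trans, emit):
    """Return (B, prob), where B[k][i] is the backward value for tag i at position k."""
    n_tags, n = len(init), len(words)
    # Initialization: B_i(N) = 1 for every tag
    B = [[1.0] * n_tags for _ in range(n)]
    for k in range(n - 2, -1, -1):
        for j in range(n_tags):
            # B_j(k) = sum_i m(tj -> ti) * m(w_{k+1} | ti) * B_i(k+1)
            B[k][j] = sum(trans[j][i] * emit[i].get(words[k + 1], 0.0) * B[k + 1][i]
                          for i in range(n_tags))
    # Termination: P(W) = sum_i m(t0 -> ti) * m(w1 | ti) * B_i(1)
    prob = sum(init[i] * emit[i].get(words[0], 0.0) * B[0][i] for i in range(n_tags))
    return B, prob


# Hypothetical toy model with made-up numbers
init = [0.6, 0.4]
trans = [[0.3, 0.7], [0.8, 0.2]]
emit = [{"fish": 0.5, "sleep": 0.5},
        {"fish": 0.4, "sleep": 0.6}]

B, prob = backward(["fish", "sleep"], init, trans, emit)
print(prob)  # about 0.2542
```

For a given model and text, the backward pass yields the same $P_m(W_{1,N})$ as the forward pass, which makes a convenient sanity check.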

[Figure: Backward Algorithm trellis for words $w_1, w_2, w_3$ over tags $t_1, \dots, t_5$. Each column holds the values $B_i(1)$, $B_i(2)$, $B_i(3)$; arcs are labeled with the initial probabilities $m(t_0 \to t_i)$ and the transition probabilities; the first column, combined with $m(t_0 \to t_i)$, yields $P_m(W_{1,3})$.]

Estimation from Untagged Corpus: EM (Expectation-Maximization)
• Start with some initial model
• Compute the probability of (virtually) each state sequence given the current model
• Use this probabilistic tagging to produce probabilistic counts for all parameters, and use these counts to estimate a revised model; the likelihood of the observed output W does not decrease from one iteration to the next
• Repeat until convergence
Note: No labeled training data is required. Initialize using lexicon constraints on the possible POS for each word (cf. “noisy counting” for PPs)

Notation
• $a_{ij}$ = estimate of $P(t_i \to t_j)$
• $b_{jk}$ = estimate of $P(w_k \mid t_j)$
• $A_i(k) = P(w_{1,k},\, t_k = t_i)$ (from the Forward algorithm)
• $B_i(k) = P(w_{k+1,N} \mid t_k = t_i)$ (from the Backward algorithm)

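Estimating transition probabilities
Define $p_k(i,j)$ as the probability of traversing the arc $t_i \to t_j$ at time $k$, given the observations:
$p_k(i,j) = P(t_k = t_i,\, t_{k+1} = t_j \mid W) = \frac{P(t_k = t_i,\, t_{k+1} = t_j,\, W)}{P(W)} = \frac{A_i(k)\, a_{ij}\, b_j(w_{k+1})\, B_j(k+1)}{P(W)} = \frac{A_i(k)\, a_{ij}\, b_j(w_{k+1})\, B_j(k+1)}{\sum_l A_l(N)}$
Here $b_j(w_{k+1})$ denotes the current estimate of $P(w_{k+1} \mid t_j)$.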

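Expected transitions
• Define $g_i(k) = P(t_k = t_i \mid W)$, then: $g_i(k) = \sum_{j=1}^{N_t} p_k(i,j) = \frac{A_i(k)\, B_i(k)}{P(W)}$ (the first equality holds for $k < N$)
• Now note that:
  • Expected number of transitions from tag $i$ = $\sum_{k=1}^{N-1} g_i(k)$
  • Expected number of transitions from tag $i$ to tag $j$ = $\sum_{k=1}^{N-1} p_k(i,j)$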
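The posteriors $g_i(k)$ and $p_k(i,j)$ can be computed directly from the forward and backward tables. The sketch below is illustrative only: the helper names, the list/dict model layout, and the toy probabilities are assumptions, and positions are 0-based.

```python
def forward_backward(words, init, trans, emit):
    """Return forward table A, backward table B, and P(W)."""
    T, N = len(init), len(words)
    e = lambda i, k: emit[i].get(words[k], 0.0)
    A = [[init[i] * e(i, 0) for i in range(T)]]
    for k in range(1, N):
        A.append([sum(A[k - 1][i] * trans[i][j] for i in range(T)) * e(j, k)
                  for j in range(T)])
    B = [[1.0] * T for _ in range(N)]
    for k in range(N - 2, -1, -1):
        B[k] = [sum(trans[j][i] * e(i, k + 1) * B[k + 1][i] for i in range(T))
                for j in range(T)]
    return A, B, sum(A[-1])


def posteriors(words, init, trans, emit):
    """Return g[k][i] = P(t_k = i | W) and p[k][i][j] = P(t_k = i, t_{k+1} = j | W)."""
    T, N = len(init), len(words)
    A, B, pw = forward_backward(words, init, trans, emit)
    g = [[A[k][i] * B[k][i] / pw for i in range(T)] for k in range(N)]
    p = [[[A[k][i] * trans[i][j] * emit[j].get(words[k + 1], 0.0) * B[k + 1][j] / pw
           for j in range(T)] for i in range(T)] for k in range(N - 1)]
    return g, p


# Hypothetical toy model with made-up numbers
init = [0.6, 0.4]
trans = [[0.3, 0.7], [0.8, 0.2]]
emit = [{"fish": 0.5, "sleep": 0.5},
        {"fish": 0.4, "sleep": 0.6}]
words = ["fish", "sleep", "fish"]

g, p = posteriors(words, init, trans, emit)
# Expected number of transitions out of tag 0, and from tag 0 to tag 1
expected_from_0 = sum(g[k][0] for k in range(len(words) - 1))
expected_0_to_1 = sum(p[k][0][1] for k in range(len(words) - 1))
```

Each `g[k]` sums to 1 over tags, and summing `p[k][i][j]` over `j` recovers `g[k][i]`, which mirrors the identity on the slide.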

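Re-estimation of Maximum Likelihood Parameters
• $a'_{ij} = \frac{\text{expected number of transitions from tag } i \text{ to tag } j}{\text{expected number of transitions from tag } i} = \frac{\sum_{k=1}^{N-1} p_k(i,j)}{\sum_{k=1}^{N-1} g_i(k)}$
• $b'_{ik} = \frac{\text{expected number of times tag } i \text{ emits word } w_k}{\text{expected number of times in tag } i} = \frac{\sum_{n:\ \text{word at position } n \text{ is } w_k} g_i(n)}{\sum_{n=1}^{N} g_i(n)}$
In $b'_{ik}$ the index $k$ ranges over vocabulary words, while $n$ ranges over text positions.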

EM Algorithm
• Choose an initial model = <a, b, g(1)>
• Repeat until the results no longer improve (much):
  • Compute $p_k$ based on the current model, using the Forward and Backward algorithms to compute A and B (Expectation for counts)
  • Compute the new model <a', b', g'(1)> (Maximization of parameters)
Note: The output likelihood is guaranteed not to decrease in any iteration, but it might converge to a local maximum!
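One way to sketch the loop above in Python, for a single untagged sentence. Everything named here (`forward_backward`, `em_step`, the toy model, the fixed count of 10 iterations) is an assumption for illustration; a practical trainer would pool counts over a whole corpus, work in log space or with scaling, and stop on a likelihood threshold.

```python
def forward_backward(words, init, trans, emit):
    """Return forward table A, backward table B, and P(W)."""
    T, N = len(init), len(words)
    e = lambda i, k: emit[i].get(words[k], 0.0)
    A = [[init[i] * e(i, 0) for i in range(T)]]
    for k in range(1, N):
        A.append([sum(A[k - 1][i] * trans[i][j] for i in range(T)) * e(j, k)
                  for j in range(T)])
    B = [[1.0] * T for _ in range(N)]
    for k in range(N - 2, -1, -1):
        B[k] = [sum(trans[j][i] * e(i, k + 1) * B[k + 1][i] for i in range(T))
                for j in range(T)]
    return A, B, sum(A[-1])


def em_step(words, init, trans, emit):
    """One EM iteration; returns the new model and P(W) under the old model."""
    T, N = len(init), len(words)
    A, B, pw = forward_backward(words, init, trans, emit)
    # E-step: posterior tag and arc probabilities
    g = [[A[k][i] * B[k][i] / pw for i in range(T)] for k in range(N)]
    p = [[[A[k][i] * trans[i][j] * emit[j].get(words[k + 1], 0.0) * B[k + 1][j] / pw
           for j in range(T)] for i in range(T)] for k in range(N - 1)]
    # M-step: normalize the expected counts
    new_init = list(g[0])
    new_trans, new_emit = [], []
    for i in range(T):
        out_i = sum(g[k][i] for k in range(N - 1))
        new_trans.append([sum(p[k][i][j] for k in range(N - 1)) / out_i
                          for j in range(T)] if out_i > 0 else list(trans[i]))
        in_i = sum(g[k][i] for k in range(N))
        counts = {}
        for k, w in enumerate(words):
            counts[w] = counts.get(w, 0.0) + g[k][i]
        new_emit.append({w: c / in_i for w, c in counts.items()}
                        if in_i > 0 else dict(emit[i]))
    return new_init, new_trans, new_emit, pw


# Hypothetical starting model with made-up numbers
words = ["fish", "sleep", "fish", "fish", "sleep"]
init = [0.6, 0.4]
trans = [[0.3, 0.7], [0.8, 0.2]]
emit = [{"fish": 0.5, "sleep": 0.5},
        {"fish": 0.4, "sleep": 0.6}]

likelihoods = []
for _ in range(10):
    init, trans, emit, pw = em_step(words, init, trans, emit)
    likelihoods.append(pw)
```

The recorded `likelihoods` should never go down from one iteration to the next, which is the guarantee noted on the slide; where they level off depends on the starting model.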

Initialize Model by Dictionary Constraints
• Training should be directed to correspond to the linguistic perception of POS (recall the local-maximum problem)
• Achieved by a dictionary listing the possible POS for each word
• Word-based initialization:
  • P(w|t) = 1 / (number of listed POS for w) for the listed POS, and 0 for unlisted POS
• Class-based initialization (Kupiec, 1992):
  • Group all words with the same possible POS into a ‘metaword’
  • Estimate parameters and perform tagging for metawords
  • Frequent words are handled individually
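The word-based initialization can be sketched as follows. The lexicon, the function names, and the optional per-tag renormalization step are assumptions added for illustration; the slide itself only specifies the 1 / (number of listed POS) rule.

```python
def init_emissions(lexicon, tags):
    """P(w|t) = 1 / (number of listed POS for w) for listed POS; unlisted pairs are absent (0)."""
    emit = {t: {} for t in tags}
    for word, allowed in lexicon.items():
        for t in allowed:
            emit[t][word] = 1.0 / len(allowed)
    return emit


def normalize_per_tag(emit):
    """Optional step (an assumption, not on the slide): rescale so each tag's values sum to 1."""
    out = {}
    for t, dist in emit.items():
        total = sum(dist.values())
        out[t] = {w: v / total for w, v in dist.items()} if total > 0 else {}
    return out


# Hypothetical three-word dictionary
lexicon = {"the": ["DT"], "reply": ["NN", "VB"], "might": ["MD", "NN"]}
tags = ["DT", "MD", "NN", "VB"]

emit = init_emissions(lexicon, tags)
print(emit["NN"])  # {'reply': 0.5, 'might': 0.5}
print(emit["DT"])  # {'the': 1.0}
normalized = normalize_per_tag(emit)
```

Because unlisted word/tag pairs start at 0 and EM re-estimates by multiplying through current parameters, those pairs stay at 0 throughout training, which is how the dictionary steers EM toward linguistically meaningful tags.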

Some extensions for HMM POS tagging
• Higher-order models: trigrams, possibly interpolated with bigrams
• Incorporating text features:
  • Output prob = $P(w_i, f_j \mid t_k)$, where f is a vector of features (capitalized, ends in -d, etc.)
  • Features are useful for handling unknown words
• Combining labeled and unlabeled training (initialize with labeled data, then run EM)

Transformation-Based Learning (TBL) for Tagging
• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic regularities via transformation rules, each consisting of a triggering environment and a rewrite rule
• Tagger:
  • Construct an initial tag sequence for the input: the most frequent tag for each word
  • Iteratively refine the tag sequence by applying “transformation rules” in rank order
• Learner:
  • Construct an initial tag sequence for the training corpus
  • Loop until done:
    • Try all possible rules and compare the results to the known tags; apply the best rule r* to the sequence and add it to the rule ranking

Some examples
1. Change NN to VB if the previous tag is TO
  • to/TO conflict/NN with → VB
2. Change VBP to VB if MD is among the previous three tags
  • might/MD vanish/VBP → VB
3. Change NN to VB if MD is among the previous two tags
  • might/MD reply/NN → VB
4. Change VB to NN if DT is among the previous two tags
  • the/DT reply/VB → NN
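Rules of this shape can be applied with a short function. The sketch below is illustrative: `apply_rule` and its parameters are hypothetical names, and it checks the trigger against the tags as they were before the rule started firing, which is only one of the possible application strategies.

```python
def apply_rule(tags, from_tag, to_tag, trigger, window):
    """Change from_tag to to_tag wherever `trigger` occurs among the previous `window` tags."""
    out = list(tags)
    for i, t in enumerate(tags):
        # Look back at most `window` positions in the original tag sequence
        if t == from_tag and trigger in tags[max(0, i - window):i]:
            out[i] = to_tag
    return out


# Rule 1: change NN to VB if the previous tag is TO (to/TO conflict/NN with/IN)
print(apply_rule(["TO", "NN", "IN"], "NN", "VB", "TO", 1))  # ['TO', 'VB', 'IN']

# Rule 4: change VB to NN if DT is among the previous two tags (the/DT reply/VB)
print(apply_rule(["DT", "VB"], "VB", "NN", "DT", 2))        # ['DT', 'NN']
```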

Transformation Templates
• Specify which transformations are possible
For example, change tag A to tag B when:
• The preceding (following) tag is Z
• The tag two before (after) is Z
• One of the two previous (following) tags is Z
• One of the three previous (following) tags is Z
• The preceding tag is Z and the following tag is W
• The preceding (following) tag is Z and the tag two before (after) is W

Lexicalization
New templates that include a dependency on the surrounding words (not just tags). Change tag A to tag B when:
• The preceding (following) word is w
• The word two before (after) is w
• One of the two preceding (following) words is w
• The current word is w
• The current word is w and the preceding (following) word is v
• The current word is w and the preceding (following) tag is X (notice: a word-tag combination)
• etc.

Initializing Unseen Words
• How to choose the most likely tag for unseen words?
Transformation-based approach:
• Start with NP for capitalized words, NN for others
• Learn “morphological” transformations of the form: change tag from X to Y if:
  • Deleting the prefix (suffix) x results in a known word
  • The first (last) characters of the word are x
  • Adding x as a prefix (suffix) results in a known word
  • Word W ever appears immediately before (after) the word
  • Character Z appears in the word
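A small Python sketch of the initial guess plus one template, "the last characters of the word are x". The function names and the example rule (NN to VBG on the suffix "ing") are hypothetical and chosen only to show the mechanics; a real tagger would learn such rules from data.

```python
def initial_tag(word):
    """Initial guess for an unseen word: NP if capitalized, NN otherwise."""
    return "NP" if word[:1].isupper() else "NN"


def apply_suffix_rule(word, tag, from_tag, to_tag, suffix):
    """Template: change from_tag to to_tag if the last characters of the word are `suffix`."""
    if tag == from_tag and word.endswith(suffix):
        return to_tag
    return tag


for word in ["London", "blorping"]:
    tag = initial_tag(word)
    # Hypothetical learned rule: NN -> VBG when the word ends in "ing"
    tag = apply_suffix_rule(word, tag, "NN", "VBG", "ing")
    print(word, tag)
# London NP
# blorping VBG
```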

[Figure: TBL Learning Scheme. Unannotated input text → setting the initial state → annotated text; the learning algorithm takes the annotated text together with the ground truth for the input text and outputs rules.]

Greedy Learning Algorithm
• Initial tagging of the training corpus: most frequent tag per word
• At each iteration:
  • Compute the “error reduction” for each transformation rule:
    • #errors fixed - #errors introduced
  • Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
    • Apply the best rule to the training corpus
    • Append the best rule to the ordered list of transformations
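The greedy loop can be sketched in Python as below. This is a simplified illustration under stated assumptions: only one template is used ("change A to B if the preceding tag is Z"), rules are encoded as hypothetical `(from_tag, to_tag, prev_tag)` tuples, and the six-word training corpus is made up.

```python
from collections import Counter


def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag): change from_tag to to_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn(words, gold, threshold=0):
    # Initial tagging: most frequent gold tag for each word
    counts = {}
    for w, g in zip(words, gold):
        counts.setdefault(w, Counter())[g] += 1
    current = [counts[w].most_common(1)[0][0] for w in words]
    tagset = sorted(set(gold))
    candidates = [(a, b, z) for a in tagset for b in tagset if a != b for z in tagset]
    learned = []
    while True:
        base = errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in candidates:
            # Error reduction = errors fixed - errors introduced
            gain = base - errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain <= threshold:
            break
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned, current


# Made-up training data: "reply" is NN twice and VB once
words = ["the", "reply", "to", "reply", "the", "reply"]
gold = ["DT", "NN", "TO", "VB", "DT", "NN"]
rules, final_tags = learn(words, gold)
print(rules)  # [('NN', 'VB', 'TO')]
```

Each accepted rule lowers the training error by at least one, so the loop terminates; raising `threshold` makes the learner stop earlier, which is the overfitting guard mentioned above.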

Morphological Richness
• Parts of speech really include features:
  • NN2 → Noun(type=common, num=plural)
This is more visible in other languages with richer morphology:
• Hebrew nouns: number, gender, possession
• German nouns: number, gender, case, …
• And so on…
