
Presentation Transcript


  1. Statistical NLP Winter 2009 Lecture 8: Part-of-speech tagging Roger Levy Thanks to Dan Klein, Chris Manning, and Jason Eisner for slides

  2. Parts-of-Speech (English) • One basic kind of linguistic structure: syntactic word classes • Open class (lexical) words: Nouns (Proper: IBM, Italy; Common: cat / cats, snow), Verbs (Main: see, registered), Adjectives (yellow), Adverbs (slowly), Numbers (122,312, one), … more • Closed class (functional) words: Determiners (the, some), Modals (can, had), Prepositions (to, with), Conjunctions (and, or), Particles (off, up), Pronouns (he, its), … more

  3. The Tagging Task Input: the lead paint is unsafe Output: the/Det lead/N paint/N is/V unsafe/Adj • Uses: • text-to-speech (how do we pronounce “lead”?) • can write regexps like (Det) Adj* N+ over the output • preprocessing to speed up parser (but a little dangerous) • if you know the tag, you can back off to it in other tasks 600.465 - Intro to NLP - J. Eisner

  4. Why Do We Care? • The first statistical NLP task • Been done to death by different methods • Easy to evaluate (how many tags are correct?) • Canonical finite-state task • Can be done well with methods that look at local context • Though should “really” do it by parsing! Input: the lead paint is unsafe Output: the/Det lead/N paint/N is/V unsafe/Adj 600.465 - Intro to NLP - J. Eisner

  5. Current Performance • How many tags are correct? • About 97% currently • But baseline is already 90% • Baseline is performance of stupidest possible method • Tag every word with its most frequent tag • Tag unknown words as nouns Input: the lead paint is unsafe Output: the/Det lead/N paint/N is/V unsafe/Adj 600.465 - Intro to NLP - J. Eisner

  6. Why POS Tagging? • Useful in and of itself • Text-to-speech: record, lead • Lemmatization: saw[v] → see, saw[n] → saw • Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS} • Useful as a pre-processing step for parsing • Less tag ambiguity means fewer parses • However, some tag choices are better decided by parsers: The/DT Georgia/NNP branch/NN had/VBD taken/VBN on/RP loan/NN commitments/NNS … (is "on" RP or IN?); The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD … (is "offered" VBD or VBN?)

  7. Part-of-Speech Ambiguity • Example: Fed raises interest rates 0.5 percent • Possible tags per word: Fed {NNP, VBN, VBD}, raises {NNS, VBZ, VB}, interest {NN, VBP}, rates {NNS, VBZ}, 0.5 {CD}, percent {NN} • Two basic sources of constraint: • Grammatical environment • Identity of the current word • Many more possible features: • … but we won’t be able to use them for a while

  8. What Should We Look At? Bill directed a cortege of autos through the dunes • Correct tags: PN Verb Det Noun Prep Noun Prep Det Noun • Some possible tags for each word (maybe more): most words also admit other candidates, such as Verb, Noun, Adj, and Prep readings • Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too … 600.465 - Intro to NLP - J. Eisner


  11. Today’s main goal • Introduce Hidden Markov Models (HMMs) for part-of-speech tagging/sequence classification • Cover the three fundamental questions of HMMs: • How do we fit the model parameters of an HMM? • Given an HMM, how do we efficiently calculate the likelihood of an observation w? • Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w? • Next time: we’ll cover richer models for sequence classification

  12. HMMs [diagram: state chain s0 s1 s2 … sn emitting words w1 w2 … wn] • We want a model of sequences s and observations w • Assumptions: • States are tag n-grams • Usually a dedicated start and end state / word • Tag/state sequence is generated by a Markov model • Words are chosen independently, conditioned only on the tag/state • These are totally broken assumptions: why?
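
Under these assumptions, the joint probability of a state/tag sequence s and a word sequence w factorizes as below (a reconstruction of the formula the slide implies, with s0 a dedicated start state):

\[
P(s, w) \;=\; \prod_{i=1}^{n} P(s_i \mid s_{i-1})\, P(w_i \mid s_i)
\]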

  13. Transitions and Emissions

  14. Transitions [diagram: the HMM state chain shown twice, once with bigram states and once with trigram states] • Transitions P(s|s’) encode well-formed tag sequences • In a bigram tagger, states = tags: <start>, <t1>, <t2>, …, <tn> • In a trigram tagger, states = tag pairs: <start,start>, <start,t1>, <t1,t2>, …, <tn-1,tn>

  15. Estimating Transitions • Use standard smoothing methods to estimate transitions: • Can get a lot fancier (e.g. KN smoothing), but in this case it doesn’t buy much • One option: encode more into the state, e.g. whether the previous word was capitalized (Brants 00)
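
As a concrete illustration of the kind of smoothing the slide has in mind, here is a minimal Python sketch of linearly interpolated trigram transition estimation; the function name, the start/stop padding symbols, and the fixed interpolation weights are illustrative assumptions, not taken from the slides (TnT-style taggers tune such weights on held-out data).

    from collections import Counter

    def estimate_transitions(tag_sequences, l1=0.6, l2=0.3, l3=0.1):
        # Interpolated trigram transitions:
        # P(t | t2, t1) ~ l1 * ML_trigram + l2 * ML_bigram + l3 * ML_unigram
        tri, hist2, bi, hist1, uni = Counter(), Counter(), Counter(), Counter(), Counter()
        total = 0
        for tags in tag_sequences:
            padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
            for i in range(2, len(padded)):
                t2, t1, t = padded[i - 2], padded[i - 1], padded[i]
                tri[(t2, t1, t)] += 1      # trigram counts
                hist2[(t2, t1)] += 1       # trigram-history counts
                bi[(t1, t)] += 1           # bigram counts
                hist1[t1] += 1             # bigram-history counts
                uni[t] += 1                # unigram counts
                total += 1

        def p_trans(t, t1, t2):
            p3 = tri[(t2, t1, t)] / hist2[(t2, t1)] if hist2[(t2, t1)] else 0.0
            p2 = bi[(t1, t)] / hist1[t1] if hist1[t1] else 0.0
            p1 = uni[t] / total if total else 0.0
            return l1 * p3 + l2 * p2 + l3 * p1

        return p_trans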

  16. Estimating Emissions • Emissions are trickier: • Words we’ve never seen before • Words which occur with tags we’ve never seen • One option: break out the Good-Turing smoothing • Issue: words aren’t black boxes: • Unknown words can usually be broken into word classes, e.g. 343,127.23 → D+,D+.D+ ; 11-year → D+-x+ ; Minteria → Xx+ ; reintroducibly → x+“ly” • Another option: decompose words into features and use a maxent model along with Bayes’ rule

  17. Better Features • Can do surprisingly well just looking at a word by itself: • Word: the → DT • Lowercased word: Importantly → importantly → RB • Prefixes: unfathomable: un- → JJ • Suffixes: Importantly: -ly → RB • Capitalization: Meridian: CAP → NNP • Word shapes: 35-year: d-x → JJ • Then build a maxent (or whatever) model to predict the tag • Maxent P(t|w): 93.7% / 82.6% (known / unknown words)
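
To make the feature list concrete, here is a small Python sketch of a per-word feature extractor in the spirit of this slide; the feature names and the exact shape-collapsing rule are illustrative assumptions.

    import re

    def word_features(word):
        # Word shape: map uppercase -> X, lowercase -> x, digits -> d, then collapse
        # repeated characters, so "35-year" -> "d-x" and "Meridian" -> "Xx".
        shape = re.sub(r"[A-Z]", "X", word)
        shape = re.sub(r"[a-z]", "x", shape)
        shape = re.sub(r"[0-9]", "d", shape)
        shape = re.sub(r"(\w)\1+", r"\1", shape)
        return {
            "word": word,
            "lower": word.lower(),
            "prefix2": word[:2],
            "suffix2": word[-2:],
            "suffix3": word[-3:],
            "is_cap": word[:1].isupper(),
            "has_digit": any(c.isdigit() for c in word),
            "has_hyphen": "-" in word,
            "shape": shape,
        }

These features would then feed a maxent model P(t|w) of the kind the slide scores at 93.7% / 82.6%.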

  18. Disambiguation • Given these two multinomials, we can score any word / tag sequence pair • In principle, we’re done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence) • Example: Fed raises interest rates 0.5 percent . tagged NNP VBZ NN NNS CD NN . passes through the trigram states <•,•> <•,NNP> <NNP,VBZ> <VBZ,NN> <NN,NNS> <NNS,CD> <CD,NN> … <STOP> and is scored as P(NNP|<•,•>) P(Fed|NNP) P(VBZ|<•,NNP>) P(raises|VBZ) P(NN|VBZ,NNP) … • Candidate sequences: NNP VBZ NN NNS CD NN → logP = -23; NNP NNS NN NNS CD NN → logP = -29; NNP VBZ VB NNS CD NN → logP = -27

  19. Finding the Best Trajectory • Too many trajectories (state sequences) to list • Option 1: Beam Search • A beam is a set of partial hypotheses • Start with just the single empty trajectory • At each derivation step: • Consider all continuations of previous hypotheses • Discard most, keeping the top k, or those within a factor of the best (or some combination) • Beam search works relatively well in practice • … but sometimes you want the optimal answer • … and you need optimal answers to validate your beam search • [diagram: beam over partial hypotheses: <> → {Fed:NNP, Fed:VBN, Fed:VBD} → {Fed:NNP raises:NNS, Fed:NNP raises:VBZ, Fed:VBN raises:NNS, Fed:VBN raises:VBZ}]
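
A minimal Python sketch of the beam search described above; the local-score interface and the "<s>" start symbol are assumptions of this sketch rather than anything specified on the slide.

    def beam_search(words, tags, local_score, k=5):
        # local_score(prev_tag, tag, word) should return a local log-score,
        # e.g. log P(tag | prev_tag) + log P(word | tag) for a bigram HMM.
        beam = [(0.0, ["<s>"])]                  # partial hypotheses: (log-score, tag sequence)
        for w in words:
            candidates = []
            for logp, seq in beam:
                for t in tags:                   # consider all continuations
                    candidates.append((logp + local_score(seq[-1], t, w), seq + [t]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beam = candidates[:k]                # discard most, keep the top k
        best_logp, best_seq = max(beam, key=lambda c: c[0])
        return best_seq[1:], best_logp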

  20. The goal: efficient exact inference • We want a way of computing exactly: • (1) The probability of any sequence w given the model • (2) The most likely state sequence given the model and some w • BUT! There are exponentially many possible state sequences: |T|^n • We can, however, efficiently calculate (1) and (2) using dynamic programming (DP) • The DP algorithm used for efficient state-sequence inference uses a trellis of paths through state space • It is an instance of what in NLP is called the Viterbi Algorithm

  21. HMM Trellis

  22. The Viterbi Algorithm • Dynamic program for computing • The score of a best path up to position i ending in state s • Also store a back-trace: most likely previous state for each state • Iterate on i, storing partial results as you go
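
A compact Python sketch of the Viterbi recursion for a bigram HMM tagger; the log_trans / log_emit interfaces and the "<s>" start symbol are assumptions made for the sketch.

    def viterbi(words, tags, log_trans, log_emit):
        # delta[i][t]: best log-score of any tag sequence for words[0..i] ending in tag t
        # back[i][t]:  the best previous tag on that path (the back-trace)
        delta = [{t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            delta.append({})
            back.append({})
            for t in tags:
                prev = max(tags, key=lambda p: delta[i - 1][p] + log_trans(p, t))
                delta[i][t] = delta[i - 1][prev] + log_trans(prev, t) + log_emit(t, words[i])
                back[i][t] = prev
        # read off the best final state, then follow back-pointers
        last = max(tags, key=lambda t: delta[-1][t])
        seq = [last]
        for i in range(len(words) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))

Each trellis cell is filled once per state, so the cost is O(n |T|^2) for a bigram tagger instead of enumerating all |T|^n sequences.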

  23. A simple example • [enter handout/whiteboard mini-example]

  24. So How Well Does It Work? • Choose the most common tag: • 90.3% with a bad unknown word model • 93.7% with a good one • TnT (Brants, 2000): • A carefully smoothed trigram tagger • Suffix trees for emissions • 96.7% on WSJ text (state of the art is ~97.2%) • Noise in the data: • Many errors in the training and test corpora, e.g. chief executive officer annotated variously as JJ JJ NN, NN JJ NN, JJ NN NN, and NN NN NN, or The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD … • Probably about 2% guaranteed error from noise (on this data)

  25. Theoretical recap • An HMM is generative and is easy to train from supervised labeled sequence data • With a trained HMM, we can use Bayes’ rule for noisy-channel evaluation of P(s|w) (s=states, w=words): • But just like with TextCat, we might want to do discriminative classification (directly estimate P(s|w))
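
The noisy-channel step the slide alludes to is just Bayes' rule; reconstructed in the HMM's notation:

\[
P(s \mid w) \;=\; \frac{P(w \mid s)\, P(s)}{P(w)} \;\propto\; \prod_i P(w_i \mid s_i) \;\prod_i P(s_i \mid s_{i-1})
\]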

  26. Overview: Accuracies • Roadmap of (known / unknown) accuracies: • Most freq tag: ~90% / ~50% • Trigram HMM: ~95% / ~55% • Maxent P(t|w): 93.7% / 82.6% • TnT (HMM++): 96.2% / 86.0% • MEMM tagger: 96.9% / 86.9% • Bidirectional dependencies: 97.2% / 89.0% • Upper bound: ~98% (human agreement) Most errors on unknown words

  27. How to improve supervised results? • Better features! • Error: They/PRP left/VBD as/IN soon/RB as/IN he/PRP arrived/VBD ./. (the first “as” should be RB); we could fix this with a feature that looked at the next word • Error: Intrinsic/NNP flaws/NNS remained/VBD undetected/VBN ./. (“Intrinsic” should be JJ); we could fix this by linking capitalized words to their lowercase versions • More general solution: Maximum-entropy Markov models • Reality check: • Taggers are already pretty good on WSJ journal text… • What the world needs is taggers that work on other text!

  28. Discriminative sequence classification • We will look once again at Maximum-Entropy models as a basis for sequence classification • We want to calculate the conditional distribution P(s|w), but it is too sparse to estimate directly • We will once again use a Markov assumption • As before, we can identify states with tag tuples to obtain dependence on the previous n-1 tags • e.g., tag trigrams:
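
The tag-trigram decomposition the slide points to can be reconstructed as:

\[
P(s \mid w) \;\approx\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}, w)
\]

Each local factor is what the maxent model of the following slides estimates.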

  29. Maxent Markov Models (MEMMs) • This is sometimes called a maxent tagger, or MEMM • Key point: the next-tag distribution conditions on the entire word sequence! • Why can this be done? • How is this in principle useful?

  30. MEMM taggers: graphical visualization • The HMM tagger had words generated from states: • The MEMM tagger commonly has states conditioned on same-position words only: • It’s also possible to condition states on all words:

  31. MaxEnt & Feature Templates • In MaxEnt, we directly estimate conditional distributions • Of course, there is too much sparsity to do this directly • So we featurize the history (just like for text categorization)

  32. MaxEnt Classifiers • Given the featurization, how to estimate conditional probabilities? • Exponential (log-linear, maxent, logistic, Gibbs) models turn the feature votes into a probability distribution over the next tag, conditioned on the history: exponentiation makes the votes positive and the denominator normalizes them • For any weight vector {λi}, we get a conditional probability model P(c | d, λ) • We want to choose parameters {λi} that maximize the conditional (log) likelihood of the data
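
Reconstructing the model the slide annotates ("makes votes positive", "normalizes votes"): with features f_i and weights λ_i,

\[
P(c \mid d, \lambda) \;=\; \frac{\exp\big(\sum_i \lambda_i f_i(c, d)\big)}{\sum_{c'} \exp\big(\sum_i \lambda_i f_i(c', d)\big)}
\]

where c ranges over possible next tags and d is the conditioning history.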

  33. Building a Maxent Model • How to define features: • Features are patterns in the input which we think the weighted vote should depend on • Usually features added incrementally to target errors • If we’re careful, adding some mediocre features into the mix won’t hurt (but won’t help either) • How to learn model weights? • Maxent just one method • Use a numerical optimization package • Given a current weight vector, need to calculate (repeatedly): • Conditional likelihood of the data • Derivative of that likelihood wrt each feature weight

  34. Featurizing histories • Features corresponding to the generative HMM: • An “emission feature”: <wi=future, ti=JJ> • A “transition feature”: <ti=JJ, ti-1=DT> • Feature templates: <wi, ti> <ti, ti-1> • But we can be much more flexible: • Mixed feature templates: • < ti-1, wi, ti > • Or feature templates picking out subregularities: • <wi matches /[A-Z].*/, ti==NNP or ti==NNPS>

  35. Maximum-likelihood parameter estimation • The (log) conditional likelihood is a function of the iid data (C,D) and the parameters λ: • If there aren’t many values of c, it’s easy to calculate: • We can separate this into two components:
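
The objective referred to here, reconstructed: the conditional log-likelihood sums over the (c, d) pairs in the training data and separates into a linear term and a normalization term,

\[
\log P(C \mid D, \lambda) \;=\; \sum_{(c,d)} \log P(c \mid d, \lambda)
\;=\; \sum_{(c,d)} \sum_i \lambda_i f_i(c, d) \;-\; \sum_{(c,d)} \log \sum_{c'} \exp\Big(\sum_i \lambda_i f_i(c', d)\Big)
\]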

  36. Likelihood maximization • We have a function to optimize: • It turns out that this function is convex • We know the function’s derivatives: • Ready to feed it into a numerical optimization package… • [also turns out to be the maximum entropy sol’n]
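
The derivative the slide refers to has the standard maxent form: for each feature, the empirical count minus the count the model expects,

\[
\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i}
\;=\; \sum_{(c,d)} f_i(c, d) \;-\; \sum_{(c,d)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c', d)
\]

Setting this to zero is exactly the constraint that model feature expectations match the empirical ones, which is why the solution is also the maximum-entropy one.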

  37. Smoothing: Issues of Scale • Lots of features: • NLP maxent models can have over 1M features. • Even storing a single array of parameter values can have a substantial memory cost. • Lots of sparsity: • Overfitting very easy – need smoothing! • Many features seen in training will never occur again at test time. • Optimization problems: • Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.

  38. Smoothing: Issues • Assume the following empirical distribution: • Features: {Heads}, {Tails} • We’ll have the following model distribution: • Really, only one degree of freedom (λ = λH − λT)
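
With one feature firing on Heads and one on Tails, the model distribution the slide has in mind is

\[
p_{\mathrm{H}} \;=\; \frac{e^{\lambda_{\mathrm{H}}}}{e^{\lambda_{\mathrm{H}}} + e^{\lambda_{\mathrm{T}}}},
\qquad
p_{\mathrm{T}} \;=\; \frac{e^{\lambda_{\mathrm{T}}}}{e^{\lambda_{\mathrm{H}}} + e^{\lambda_{\mathrm{T}}}},
\]

which depends only on the difference λ = λH − λT, the single degree of freedom mentioned above.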

  39. Smoothing: Issues • The data likelihood in this model is: • [plots of the log-likelihood log P as a function of λ]

  40. Smoothing: Priors (MAP) • What if we had a prior expectation that parameter values wouldn’t be very large? • We could then balance evidence suggesting large (or infinite) parameters against our prior. • The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!). • We can do this explicitly by changing the optimization objective to maximum posterior likelihood: • This is alternatively known as penalization in the non-Bayesian statistical literature, and the use of a prior in the Bayesian literature
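
The modified objective, reconstructed from the slide's posterior / prior / evidence annotation:

\[
\log P(C, \lambda \mid D) \;=\; \underbrace{\log P(\lambda)}_{\text{prior}} \;+\; \underbrace{\log P(C \mid D, \lambda)}_{\text{evidence}}
\]

(the log-posterior, up to a constant that does not depend on λ).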

  41. 22=  22= 10 22=1 Smoothing: Priors • Gaussian, or quadratic, priors: • Intuition: parameters shouldn’t be large. • Formalization: prior expectation that each parameter will be distributed according to a gaussian with mean  and variance 2. • Penalizes parameters for drifting to far from their mean prior value (usually =0). • 22=1 works surprisingly well (better to set using held-out data, though)

  42. 22=  22= 10 22=1 Smoothing: Priors • If we use gaussian priors: • Trade off some expectation-matching for smaller parameters. • When multiple features can be recruited to explain a data point, the more common ones generally receive more weight. • Accuracy generally goes up! • Change the objective: • Change the derivative:

  43. Decoding • Decoding maxent taggers: • Just like decoding HMMs • Viterbi, beam search, posterior decoding • Viterbi algorithm (HMMs): • Viterbi algorithm (Maxent):
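
The two recurrences the slide contrasts can be reconstructed as follows; the only change from the HMM is that the local factor becomes the maxent next-tag distribution, which may condition on the whole word sequence:

\[
\text{HMM:}\quad \delta_i(s) \;=\; \max_{s'} \,\delta_{i-1}(s')\, P(s \mid s')\, P(w_i \mid s),
\qquad
\text{MEMM:}\quad \delta_i(s) \;=\; \max_{s'} \,\delta_{i-1}(s')\, P(s \mid s', w, i).
\]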

  44. Iteratively improving your model • With MaxEnt models, you need feature templates • There are no a priori principles for choosing which templates to include in a model • Two basic approaches: • Hand-crafted feature templates • Search in the space of possible feature templates (feature/model selection) • We saw an example of the latter approach in TextCat • Now we’ll look at an example of the former: Toutanova & Manning 2000
