
Part-of-speech tagging


Presentation Transcript


  1. Part-of-speech tagging A simple but useful form of linguistic analysis Christopher Manning

  2. Parts of Speech • Perhaps starting with Aristotle in the West (384–322 BCE), there was the idea of having parts of speech • a.k.a. lexical categories, word classes, “tags”, POS • The idea that there are 8 parts of speech, still with us today, comes from Dionysius Thrax of Alexandria (c. 100 BCE) • But his 8 aren’t exactly the ones we are taught today • Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun • School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection

  3. Open class (lexical) words • Nouns – Proper (IBM, Italy), Common (cat / cats, snow) • Verbs – Main (see, registered) • Adjectives (old, older, oldest) • Adverbs (slowly) • Numbers (122,312; one) • … more Closed class (functional) words • Modals (can, had) • Determiners (the, some) • Prepositions (to, with) • Conjunctions (and, or) • Particles (off, up) • Pronouns (he, its) • Interjections (Ow, Eh) • … more

  4. Open vs. Closed classes • Closed: • determiners: a, an, the • pronouns: she, he, I • prepositions: on, under, over, near, by, … • Why “closed”? (new members are added only very rarely) • Open: • Nouns, Verbs, Adjectives, Adverbs.

  5. POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.

  6. POS Tagging • Input: Plays well with others • Ambiguity: Plays: NNS/VBZ; well: UH/JJ/NN/RB; with: IN; others: NNS • Output: Plays/VBZ well/RB with/IN others/NNS • Uses: • Text-to-speech (how do we pronounce “lead”?) • Can write regexps like (Det) Adj* N+ over the output to find phrases, etc. • As input to, or to speed up, a full parser • If you know the tag, you can back off to it in other tasks • (Tags here are Penn Treebank POS tags)
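
The regexp-over-tags idea can be illustrated with a short sketch (not from the slides; the sentence, pattern, and output are only for illustration) that runs a (Det)? Adj* N+ style pattern over the tag sequence of a tagged sentence:

```python
# A toy illustration; a real system would keep the pattern aligned to whole
# tag tokens rather than raw characters.
import re

tagged = [("the", "DT"), ("old", "JJ"), ("cat", "NN"),
          ("sleeps", "VBZ"), ("on", "IN"), ("warm", "JJ"), ("snow", "NN")]

tag_string = " ".join(tag for _, tag in tagged)        # "DT JJ NN VBZ IN JJ NN"
np_pattern = re.compile(r"(DT )?(JJ )*(NN[SP]*\s?)+")  # (Det)? Adj* N+

for m in np_pattern.finditer(tag_string):
    start = tag_string[:m.start()].count(" ")          # word index of the match
    length = len(m.group(0).split())
    print([w for w, _ in tagged[start:start + length]])
# -> ['the', 'old', 'cat'] and ['warm', 'snow']
```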

  7. POS tagging performance • How many tags are correct? (Tag accuracy) • About 97% currently • But baseline is already 90% • Baseline is performance of stupidest possible method • Tag every word with its most frequent tag • Tag unknown words as nouns • Partly easy because • Many words are unambiguous • You get points for them (the, a, etc.) and for punctuation marks!
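
A minimal sketch of the ~90% baseline described on this slide (illustrative code, not from the lecture): tag every word with its most frequent training tag, and tag unknown words as nouns.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)                      # word -> Counter of its tags
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_frequent_tag):
    # unknown words fall back to NN (noun), as on the slide
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

# Toy usage (real counts would come from a corpus such as the Penn Treebank):
train = [[("the", "DT"), ("back", "NN"), ("door", "NN")],
         [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")]]
model = train_baseline(train)
print(tag_baseline(["the", "back", "gate"], model))
# 'back' gets its most frequent training tag; unseen 'gate' falls back to NN
```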

  8. Deciding on the correct part of speech can be difficult even for people • Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG • All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN • Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

  9. How difficult is POS tagging? • About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech • But they tend to be very common words. E.g., that • I know that he is honest = IN • Yes, that play was nice = DT • You can’t go that far = RB • 40% of the word tokens are ambiguous

  10. Part-of-speech tagging revisited A simple but useful form of linguistic analysis Christopher Manning

  11. Sources of information • What are the main sources of information for POS tagging? • Knowledge of neighboring words • Bill saw that man yesterday – candidate tags: Bill NNP/VB, saw NN/VB(D), that DT/IN, man NN/VB, yesterday NN • Knowledge of word probabilities • man is rarely used as a verb… • The latter proves the most useful, but the former also helps

  12. More and Better Features → Feature-based tagger • Can do surprisingly well just looking at a word by itself: • Word: the → DT • Lowercased word: Importantly → importantly → RB • Prefixes: unfathomable → un- → JJ • Suffixes: Importantly → -ly → RB • Capitalization: Meridian → CAP → NNP • Word shapes: 35-year → d-x → JJ • Then build a maxent (or whatever) model to predict tag • Maxent P(t|w): 93.7% overall / 82.6% unknown
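
The per-word features listed above can be sketched roughly as follows (the feature names and the word-shape scheme are assumptions for illustration, not Manning's exact implementation); the resulting feature dictionaries would then feed a maxent / logistic-regression classifier P(t|w):

```python
import re

def word_features(word):
    # crude word shape: digits -> d, uppercase runs -> X, lowercase runs -> x
    shape = re.sub(r"[a-z]+", "x", re.sub(r"[A-Z]+", "X", re.sub(r"\d+", "d", word)))
    return {
        "word=" + word: 1,
        "lower=" + word.lower(): 1,
        "prefix3=" + word[:3]: 1,            # cf. un- in 'unfathomable'
        "suffix2=" + word[-2:]: 1,           # cf. -ly in 'importantly'
        "is_cap": int(word[:1].isupper()),   # cf. 'Meridian' -> NNP
        "shape=" + shape: 1,                 # '35-year' -> 'd-x'
    }

print(word_features("Importantly"))   # includes 'suffix2=ly' and 'is_cap': 1
print(word_features("35-year"))       # includes 'shape=d-x'
```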

  13. Overview: POS Tagging Accuracies • Rough accuracies (overall / unknown words): • Most freq tag: ~90% / ~50% • Trigram HMM: ~95% / ~55% • Maxent P(t|w): 93.7% / 82.6% • TnT (HMM++): 96.2% / 86.0% • MEMM tagger: 96.9% / 86.9% • Bidirectional dependencies: 97.2% / 90.0% • Upper bound: ~98% (human agreement) • Most errors are on unknown words

  14. How to improve supervised results? • Build better features! • Example errors: • They left as soon as he arrived . – the tagger confuses RB and IN on “as”; we could fix this with a feature that looked at the next word • Intrinsic flaws remained undetected . – the capitalized “Intrinsic” is confused between JJ and NNP; we could fix this by linking capitalized words to their lowercase versions

  15. Tagging Without Sequence Information • Baseline model: predict t0 from w0 alone • “Three words” model: predict t0 from w-1, w0, w1 • Using words only in a straight classifier works as well as a basic (HMM or discriminative) sequence model!

  16. Summary of POS Tagging • For tagging, the change from generative to discriminative model does not by itself result in great improvement • One profits from models that can specify dependence on overlapping features of the observation, such as spelling, suffix analysis, etc. • An MEMM allows integration of rich features of the observations, but can suffer strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words • This additional power (of the MEMM, CRF, and perceptron models) has been shown to result in improvements in accuracy • The higher accuracy of discriminative models comes at the price of much slower training

  17. Introduction to Natural Language Processing (600.465): Tagging, Tagsets, and Morphology Dr. Jan Hajič CS Dept., Johns Hopkins Univ. hajic@cs.jhu.edu www.cs.jhu.edu/~hajic

  18. The task of (Morphological) Tagging • Formally: A+ → T • A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes) • often: phonemes ~ letters • T is the set of tags (the “tagset”) • Recall: 6 levels of language description: phonetics ... phonology ... morphology ... syntax ... meaning ... – tagging is a step aside • Recall: A+ → 2^(L,C1,C2,...,Cn) → T: the first arrow is morphology, the second (disambiguation, ~ “select”) is tagging

  19. Tagging Examples • Word form: A+ → 2^(L,C1,C2,...,Cn) → T • He always books the violin concert tickets early. • MA: books → {(book-1,Noun,Pl,-,-), (book-2,Verb,Sg,Pres,3)} • tagging (disambiguation): ... → (Verb,Sg,Pres,3) • ...was pretty good. However, she did not realize... • MA: However → {(however-1,Conj/coord,-,-,-), (however-2,Adv,-,-,-)} • tagging: ... → (Conj/coord,-,-,-) • [æ n d] [g i v] [i t] [t u:] [j u:] (“and give it to you”) • MA: [t u:] → {(to-1,Prep), (two,Num), (to-2,Part/inf), (too,Adv)} • tagging: ... → (Prep)

  20. Tagsets • General definition: • tag ~ (c1,c2,...,cn) • often thought of as a flat list T = {t_i}, i = 1..n, with some assumed 1:1 mapping T ↔ (C1,C2,...,Cn) • English tagsets (see MS): • Penn Treebank (45 tags) (VBZ: Verb, Pres, 3, Sg; JJR: Adj., Comp.) • Brown Corpus (87), CLAWS C5 (62), London-Lund (197)

  21. Other Language Tagsets • Differences: • size (10..10k) • categories covered (POS, Number, Case, Negation, ...) • level of detail • presentation (short names vs. structured (“positional”)) • Example (Czech positional tag): AGFS3----1A---- • each position encodes one category: POS, SUBPOS, GENDER, NUMBER, CASE, POSSG, POSSN, PERSON, TENSE, DCOMP, NEG, VOICE, ..., VAR
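
A hedged sketch of reading such a positional tag (the position order follows the common Prague positional scheme and should be treated as an assumption here):

```python
# Position order is an assumption based on the standard Prague positional tagset.
POSITIONS = ["POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSG", "POSSN",
             "PERSON", "TENSE", "DCOMP", "NEG", "VOICE",
             "RESERVE1", "RESERVE2", "VAR"]

def decode_positional_tag(tag):
    # '-' marks a category that does not apply to this word form
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}

print(decode_positional_tag("AGFS3----1A----"))
# {'POS': 'A', 'SUBPOS': 'G', 'GENDER': 'F', 'NUMBER': 'S', 'CASE': '3',
#  'DCOMP': '1', 'NEG': 'A'}
```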

  22. Tagging Inside Morphology • Do tagging first, then morphology: • Formally: A+ → T → (L,C1,C2,...,Cn) • Rationale: • have |T| < |(L,C1,C2,...,Cn)| (thus, less work for the tagger) and keep the mapping T → (L,C1,C2,...,Cn) unique • Possible for some languages only (“English-like”) • Same effect within “regular” A+ → 2^(L,C1,C2,...,Cn) → T: • mapping R: (C1,C2,...,Cn) → T_reduced, then (new) unique mapping U: A+ × T_reduced → (L,T)

  23. Lemmatization • Full morphological analysis: MA: A+ → 2^(L,C1,C2,...,Cn) (recall: a lemma l ∈ L is a lexical unit (~ dictionary entry ref)) • Lemmatization: reduced MA: • L: A+ → 2^L; w → {l: (l,t1,t2,...,tn) ∈ MA(w)} • again, need to disambiguate (want: A+ → L) (a special case of word sense disambiguation, WSD) • “classic” tagging does not deal with lemmatization (assumes lemmatization is done afterwards somehow)

  24. Morphological Analysis: Methods • Word form list • books: book-2/VBZ, book-1/NNS • Direct coding • endings: verbreg:s/VBZ, nounreg:s/NNS, adje:er/JJR, ... • (main) dictionary: book/verbreg, book/nounreg, nic/adje:nice • Finite state machinery (FSM) • many “lexicons”, with continuation links: reg-root-lex → reg-end-lex • phonology included but (often) clearly separated • CFG, DATR, Unification, ... • address linguistic rather than computational phenomena • in fact, better suited for morphological synthesis (generation)

  25. Word Lists • Works for English • “input” problem: repetitive hand coding • Implementation issues: • search trees • hash tables (Perl!) • (letter) trie, e.g. paths a → “a”/Art, a-t → “at”/Prep, a-n-d → “and”/Conj, a-n-t → “ant”/NN • Minimization?
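
A minimal letter-trie sketch for this kind of word-list lookup (illustrative; a real analyzer would store full morphological information at each node):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # letter -> TrieNode
        self.analyses = []        # analyses ending exactly at this node

def insert(root, word, analysis):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.analyses.append(analysis)

def lookup(root, word):
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return []
    return node.analyses

root = TrieNode()
for w, tag in [("a", "Art"), ("at", "Prep"), ("and", "Conj"), ("ant", "NN")]:
    insert(root, w, (w, tag))
print(lookup(root, "ant"))   # [('ant', 'NN')]
print(lookup(root, "an"))    # [] - a prefix of stored words, but not a word itself
```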

  26. Word-internal Segmentation (Direct) • Strip prefixes: (un-, dis-, ...) • Repeat for all plausible endings: • Split rest: root + ending (for every possible ending) • Find root in a dictionary, keep dictionary information • in particular, keep inflection class (such as reg, noun-irreg-e, ...) • Find ending, check inflection+prefix class match • If match found: • Output root-related info (typically, the lemma(s)) • Output ending-related information (typically, the tag(s)) • (Note: word segmentation, as in Japanese or speech in general, is a different problem.)
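
A simple sketch of the root + ending procedure above, with a toy dictionary and ending table (both are assumptions for illustration; phonological adjustments such as nice+er → nicer are left to the two-level rules on the next slides):

```python
DICTIONARY = {"book": ["noun-reg", "verb-reg"]}        # root -> inflection classes
ENDINGS = {"s":  [("noun-reg", "NNS"), ("verb-reg", "VBZ")],
           "ed": [("verb-reg", "VBD")],
           "":   [("noun-reg", "NN"), ("verb-reg", "VB")]}
PREFIXES = ["un", "dis", ""]

def analyze(word):
    analyses = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for split in range(len(rest) + 1):             # every root/ending split
            root, ending = rest[:split], rest[split:]
            for infl_class in DICTIONARY.get(root, []):
                for cls, tag in ENDINGS.get(ending, []):
                    if cls == infl_class:              # inflection classes must match
                        analyses.append((root, tag))   # root-related info + tag
    return analyses

print(analyze("books"))    # [('book', 'NNS'), ('book', 'VBZ')]
print(analyze("booked"))   # [('book', 'VBD')]
```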

  27. Finite State Machinery • Two-level Morphology (TLM) – KIMMO • phonology + “morphotactics” (= morphology) • Both components use finite-state automata: • phonology: “two-level rules”, converted to FSA • e:0 ⇔ _ +:0 e:e r:r (nice+er → nicer) • morphology: linked lexicons • root-dic: book/“book” ⇒ end-noun-reg-dic • end-noun-reg-dic: +s/“NNS” • Integration of the two is possible (and simple)

  28. Finite State Transducer (for phonology) • An FST is an FSA where • symbols are pairs (r:s) from finite alphabets R and S • “Checking” run: • input data: a sequence of pairs; output: Yes/No (accept / do not) • used as a plain FSA • Analysis run: • input data: a sequence of only s ∈ S (TLM: surface symbols); • output: a sequence of r ∈ R (TLM: lexical symbols), plus lexicon “glosses” (POS tag, etc.) • Synthesis (generation) run: • same as analysis except the roles are switched: S ↔ R
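
A toy sketch of the checking and analysis runs (the transducer here encodes only book+s, and the dictionary-of-transitions representation is an assumption, not KIMMO's actual format):

```python
TRANSITIONS = {                  # (state, (lexical, surface)) -> next state
    (0, ("b", "b")): 1, (1, ("o", "o")): 2, (2, ("o", "o")): 3,
    (3, ("k", "k")): 4, (4, ("+", "")): 5, (5, ("s", "s")): 6,
}
FINAL = {6}

def check(pairs):
    # "checking" run: accept or reject a full sequence of lexical:surface pairs
    state = 0
    for pair in pairs:
        state = TRANSITIONS.get((state, pair))
        if state is None:
            return False
    return state in FINAL

def analyze(surface):
    # analysis run: given surface symbols only, recover the lexical symbols
    def walk(state, i, lexical):
        if i == len(surface):
            return [lexical] if state in FINAL else []
        results = []
        for (s, (lex, sur)), nxt in TRANSITIONS.items():
            if s == state and sur in (surface[i], ""):   # "" is a zero surface symbol
                results += walk(nxt, i + (sur != ""), lexical + lex)
        return results
    return walk(0, 0, "")

print(check([("b", "b"), ("o", "o"), ("o", "o"), ("k", "k"), ("+", ""), ("s", "s")]))  # True
print(analyze("books"))   # ['book+s']
```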

  29. Parallel Rules, Zero Symbols • Parallel Rules: • Each rule ~ one FST • Run in parallel • If any of them fails ⇒ the path fails • Zero symbols (on one side only, even though 0:0 is o.k.) • behave like any other symbol • (nice+er → nicer: the e:0 and +:0 transitions use zero surface symbols)

  30. The Lexicon (for morphotactics) • Ordinary FSA (“lexical” alphabet only) • Used for analysis only • additional constraint: • the lexical string must pass through the linked lexicon list • Implemented as an FSA; compiled from lists of strings and lexicon links • Example: paths b-a-n-k → “bank” and b-o-o-k → “book”, each continuing through +s → “NNS”

  31. TLM Analysis Example • Bücher: • suppose each surface letter corresponds to the same symbol at the lexical level, except that surface ü might be either ü or u lexically; plus zeroes (+:0), (0:0) • Use the FST as before • Use lexicons: root: Buch “book” ⇒ end-reg-uml; Bündni “union” ⇒ end-reg-s; end-reg-uml: +0 “NNomSg”, +er “NNomPl” • Derivation (lexical:surface): B:B ⇒ Bu:Bü ⇒ Buc:Büc ⇒ Buch:Büch ⇒ Buch+e:Büch0e ⇒ Buch+er:Büch0er (the competing lexical path Bü:Bü ⇒ Büc:Büc ... fails in the lexicon)

  32. TLM: Generation • Do not use the lexicon (well you have to put the “right” lexical strings together somehow!) • Start with a lexical string L. • Generate all possible pairs l:s for every symbol in L. • Find all (hopefully only 1!) traversals through the FST which end in a final state. • From all such traversals, print out the sequence of surface letters.

  33. TLM: Some Remarks • Parallel FSTs (incl. the final lexicon FSA) • can be compiled into a (gigantic) FST • maybe not so gigantic (XLT – Xerox Language Tools) • “Double-leveling” the lexicon: • allows for generation from lemma + tag • needs: rules with strings of unequal length • Rule Compiler • Karttunen, Kay • PC-KIMMO: free version from www.sil.org (Unix, too)

  34. Introduction to Natural Language Processing (600.465): Tagging (disambiguation): An Overview Dr. Jan Hajič CS Dept., Johns Hopkins Univ. hajic@cs.jhu.edu www.cs.jhu.edu/~hajic

  35. Rule-based Disambiguation • Example after-morphology data (using the Penn tagset): I: {NN, PRP}; watch: {NN, VB, VBP}; a: {DT, NN}; fly: {NN, VB, VBP}; .: {.} • Rules using • word forms, from context & current position • tags, from context and current position • tag sets, from context and current position • combinations thereof

  36. Example Rules • If-then style: • DT(eq,-1,Tag) ⇒ NN • (implies NN(in,0,Set) as a condition) • PRP(eq,-1,Tag) and DT(eq,+1,Tag) ⇒ VBP • {DT,NN}(sub,0,Set) ⇒ DT • {VB,VBZ,VBP,VBD,VBG}(inc,+1,Tag) ⇒ not DT • Regular expressions: • not(<*,*,DT> <*,*,notNN>) • not(<*,*,PRP>, <*,*,notVBP>, <*,*,DT>) • not(<*, {DT,NN}sub, notDT>) • not(<*,*,DT>, <*,*,{VB,VBZ,VBP,VBD,VBG}>) • (Applied to the example data from the previous slide: I {NN,PRP}, watch {NN,VB,VBP}, a {DT,NN}, fly {NN,VB,VBP}, . {.})
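
A brute-force sketch of how such categorical rules prune readings (the slides do this with parallel FSAs; enumerating all tag-sequence paths, as below, is exponential but shows the same filtering idea):

```python
from itertools import product

words   = ["I", "watch", "a", "fly", "."]
tagsets = [{"NN", "PRP"}, {"NN", "VB", "VBP"}, {"DT", "NN"},
           {"NN", "VB", "VBP"}, {"."}]

def allowed(path):
    for i in range(len(path)):
        # R1: a DT must be followed by NN
        if path[i] == "DT" and i + 1 < len(path) and path[i + 1] != "NN":
            return False
        # R2: PRP, then something that is not VBP, then DT is blocked
        if (path[i] == "PRP" and i + 2 < len(path)
                and path[i + 1] != "VBP" and path[i + 2] == "DT"):
            return False
        # R3: a word whose tag set is a subset of {DT, NN} must be tagged DT
        if tagsets[i] <= {"DT", "NN"} and path[i] != "DT":
            return False
        # R4: DT followed by VB/VBZ/VBP/VBD/VBG is blocked (subsumed by R1 here)
        if (path[i] == "DT" and i + 1 < len(path)
                and path[i + 1] in {"VB", "VBZ", "VBP", "VBD", "VBG"}):
            return False
    return True

surviving = [p for p in product(*tagsets) if allowed(p)]
print(len(surviving))                                  # 4 of the 36 candidate paths remain
print(("PRP", "VBP", "DT", "NN", ".") in surviving)    # True: I/PRP watch/VBP a/DT fly/NN ./.
```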

  37. Implementation • Finite State Automata • parallel (each rule ~ one automaton); • algorithm: keep all paths which all automata accept (“say yes”) • or compile into a single FSA (intersection) • Algorithm: • a version of Viterbi search, but: • no probabilities (“categorical” rules) • multiple input: • keep track of all possible paths

  38. Example: the FSA • R1: not(<*,*,DT> <*,*,notNN>) • R2: not(<*,*,PRP>, <*,*,notVBP>, <*,*,DT>) • R3: not(<*, {DT,NN}sub, notDT>) • R4: not(<*,*,DT>, <*,*,{VB,VBZ,VBP,VBD,VBG}>) • [Figure: the automaton for R1 (states F1, N3, F2; transitions on <*,*,DT>, <*,*,notNN>, and “anything”/“anything else” loops) and for R3 (states F1, N2; transition on <*,{DT,NN}sub,notDT>, plus “anything”/“anything else” loops)]

  39. Applying the FSA • Example data: I {NN,PRP}, watch {NN,VB,VBP}, a {DT,NN}, fly {NN,VB,VBP}, . {.} • R1: not(<*,*,DT> <*,*,notNN>) blocks a/DT fly/VB and a/DT fly/VBP; remains: a/DT fly/NN, or a/NN fly/{NN,VB,VBP} • R2: not(<*,*,PRP>, <*,*,notVBP>, <*,*,DT>) blocks I/PRP watch/{NN,VB} a/DT; remains e.g.: I/PRP watch/VBP a/DT, and more • R3: not(<*, {DT,NN}sub, notDT>) blocks a/NN; remains only: a/DT • R4 ⊂ R1 (adds nothing here)

  40. Applying the FSA (Cont.) • Combine the surviving readings: a/DT fly/NN (from R1), I/PRP watch/VBP a/DT (from R2), a/DT (from R3) • Result: I/PRP watch/VBP a/DT fly/NN ./.

  41. Tagging by Parsing • Build a parse tree from the multiple input: • Track down rules: e.g., NP → DT NN: extract (a/DT fly/NN) • More difficult than tagging itself; results mixed • [Figure: parse tree with S, VP, NP nodes over the ambiguously tagged sentence “I watch a fly .”]

  42. Statistical Methods (Overview) • “Probabilistic”: • HMM • Merialdo and many more (XLT) • Maximum Entropy • DellaPietra et al., Ratnaparkhi, and others • Rule-based: • TBEDL (Transformation Based, Error Driven Learning) • Brill’s tagger • Example-based • Daelemans, Zavrel, others • Feature-based (inflective languages) • Classifier Combination (Brill’s ideas)

  43. Introduction to Natural Language Processing (600.465): HMM Tagging Dr. Jan Hajič CS Dept., Johns Hopkins Univ. hajic@cs.jhu.edu www.cs.jhu.edu/~hajic

  44. Review • Recall: • tagging ~ morphological disambiguation • tagset V_T ⊂ (C1,C2,...,Cn) • Ci – morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ... • a mapping w → {t ∈ V_T} exists [just word tagging!] • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet and L is the set of lemmas • extension to punctuation, sentence boundaries (treated as words)

  45. The Setting • Noisy Channel setting: input (tags, e.g. NNP VBZ DT ...) → the channel (adds “noise”) → output (words, e.g. John drinks the ...) • Goal (as usual): discover the “input” to the channel (T, the tag sequence) given the “output” (W, the word sequence) • p(T|W) = p(W|T) p(T) / p(W) • p(W) is fixed (W is given)... argmax_T p(T|W) = argmax_T p(W|T) p(T)

  46. The Model • Two models (d = |W| = |T|, the sequence length): • p(W|T) = ∏_{i=1..d} p(w_i | w_1,...,w_{i-1}, t_1,...,t_d) • p(T) = ∏_{i=1..d} p(t_i | t_1,...,t_{i-1}) • Too many parameters (as always) • Approximation using the following assumptions: • words do not depend on the context • the tag depends on a limited history: p(t_i | t_1,...,t_{i-1}) ≈ p(t_i | t_{i-n+1},...,t_{i-1}) • an n-gram tag “language” model • a word depends on its tag only: p(w_i | w_1,...,w_{i-1}, t_1,...,t_d) ≈ p(w_i | t_i)

  47. The HMM Model Definition • (Almost) the general HMM: • output (words) emitted by states (not arcs) • states: (n-1)-tuples of tags if an n-gram tag model is used • five-tuple (S, s_0, Y, P_S, P_Y), where: • S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state, • Y = {y_1, y_2, ..., y_V} is the output alphabet (the words), • P_S(s_j|s_i) is the set of probability distributions of transitions: P_S(s_j|s_i) = p(t_i | t_{i-n+1},...,t_{i-1}); s_j = (t_{i-n+2},...,t_i), s_i = (t_{i-n+1},...,t_{i-1}) • P_Y(y_k|s_i) is the set of output (emission) probability distributions • another simplification: P_Y(y_k|s_i) = P_Y(y_k|s_j) if s_i and s_j contain the same tag as the rightmost element: P_Y(y_k|s_i) = p(w_i | t_i)
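
Decoding such an HMM is usually done with Viterbi search; below is a compact bigram-tag sketch with toy probabilities (an illustration, not the lecture's formulation, which also allows higher-order tag models):

```python
import math

def viterbi(words, tags, p_start, p_trans, p_emit):
    # delta[i][t] = best log-probability of any tag sequence ending in t at word i
    delta = [{t: math.log(p_start.get(t, 1e-12))
                 + math.log(p_emit.get((t, words[0]), 1e-12)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: delta[i - 1][s]
                       + math.log(p_trans.get((s, t), 1e-12)))
            delta[i][t] = (delta[i - 1][prev]
                           + math.log(p_trans.get((prev, t), 1e-12))
                           + math.log(p_emit.get((t, words[i]), 1e-12)))
            back[i][t] = prev
    path = [max(tags, key=lambda t: delta[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy probabilities (illustrative only):
tags    = ["DT", "NN", "VBZ"]
p_start = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
p_trans = {("DT", "NN"): 0.8, ("NN", "VBZ"): 0.5, ("NN", "NN"): 0.3}
p_emit  = {("DT", "the"): 0.7, ("NN", "dog"): 0.01, ("VBZ", "barks"): 0.005}
print(viterbi(["the", "dog", "barks"], tags, p_start, p_trans, p_emit))   # ['DT', 'NN', 'VBZ']
```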

  48. Supervised Learning (Manually Annotated Data Available) • Use MLE • p(w_i | t_i) = c_wt(t_i, w_i) / c_t(t_i) • p(t_i | t_{i-n+1},...,t_{i-1}) = c_tn(t_{i-n+1},...,t_{i-1},t_i) / c_t(n-1)(t_{i-n+1},...,t_{i-1}) • Smooth (both!) • p(w_i | t_i): “Add 1” for all possible tag,word pairs using a predefined dictionary (thus some zeros are kept!) • p(t_i | t_{i-n+1},...,t_{i-1}): linear interpolation: • e.g. for a trigram model: p'_λ(t_i | t_{i-2}, t_{i-1}) = λ3 p(t_i | t_{i-2}, t_{i-1}) + λ2 p(t_i | t_{i-1}) + λ1 p(t_i) + λ0 / |V_T|
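
A small sketch of the MLE estimates and the linear interpolation above (counts and λ weights are toy values; the λ's would normally be tuned on held-out data, e.g. with EM):

```python
def mle_emission(c_tw, c_t):
    # p(w_i | t_i) = c_wt(t_i, w_i) / c_t(t_i)
    return {(t, w): c / c_t[t] for (t, w), c in c_tw.items()}

def smoothed_trigram(t, t1, t2, p3, p2, p1, tagset_size,
                     lambdas=(0.6, 0.25, 0.1, 0.05)):
    # t1 = previous tag, t2 = tag before that
    # p'_lambda(t|t2,t1) = l3*p(t|t2,t1) + l2*p(t|t1) + l1*p(t) + l0/|V_T|
    l3, l2, l1, l0 = lambdas
    return (l3 * p3.get((t2, t1, t), 0.0)
            + l2 * p2.get((t1, t), 0.0)
            + l1 * p1.get(t, 0.0)
            + l0 / tagset_size)

# Toy usage:
p_emit = mle_emission({("DT", "the"): 900, ("DT", "a"): 600}, {"DT": 1500})
print(p_emit[("DT", "the")])                                            # 0.6
p3, p2, p1 = {("DT", "JJ", "NN"): 0.5}, {("JJ", "NN"): 0.4}, {"NN": 0.15}
print(smoothed_trigram("NN", "JJ", "DT", p3, p2, p1, tagset_size=45))   # ~0.416
```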

  49. Unsupervised Learning • Completely unsupervised learning is impossible • at least if we have the tagset given – how else would we associate words with tags? • Assumed (minimal) setting: • tagset known • dictionary / morphological analysis available (providing the possible tags for any word) • Use: Baum-Welch algorithm • “tying”: output (state-emitting only; the same distribution from two states with the same “final” tag)

  50. Comments on Unsupervised Learning • Initialization of Baum-Welch • if some annotated data are available, use them • keep 0 for impossible output probabilities • Beware of: • degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!) • use heldout data for cross-checking • Supervised is almost always better
