
Chapter 5: POS Tagging



  1. Chapter 5: POS Tagging Heshaam Faili hfaili@ut.ac.ir University of Tehran

  2. Part of Speech tag • POS, word classes, morphological classes, lexical tags • Verb vs. Noun • noun, verb, pronoun, preposition, adverb, conjunction, participle, and article • possessive pronouns vs. personal pronouns • Useful for pronunciation: CONtent (noun) vs. conTENT (adjective) • Useful for stemming for information retrieval • Useful for word sense disambiguation • Useful for shallow parsing to detect names, dates, places, …

  3. POS tags • Based on syntactic and morphological function • Divided into • closed class types: pronouns, prepositions, determiners (articles), conjunctions, auxiliary verbs, particles, numerals • open class types: nouns, verbs, adjectives, adverbs • Proper noun vs. common noun • Count noun vs. mass noun • Main verb vs. auxiliary verb • Directional (locative) adverbs, degree adverbs, manner adverbs, temporal adverbs • Particles and phrasal verbs

  4. Statistics • the: 1,071,676 • a: 413,887 • an: 59,359

  5. Statistics

  6. English language • Pronouns: personal, possessive, WH-pronouns • Auxiliary verbs: modal verbs, the copula

  7. Example English Part-of-Speech Tag sets • Brown corpus - 87 tags • Allows compound tags • “I'm” tagged as PPSS+BEM • PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm" • Others have derived their work from Brown Corpus • LOB Corpus: 135 tags • Lancaster UCREL Group: 165 tags • London-Lund Corpus: 197 tags • BNC – 61 tags (C5) • PTB – 45 tags • Other languages have developed other tagsets

  8. PTB Tagset (36 main tags + punctuation tags)

  9. Some examples from Brown • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • There/EX are/VBP 70/CD children/NNS there/RB • Although/IN preliminary/JJ findings/NNS were/VBD reported/VBN more/RBR than/IN a/DT year/NN ago/IN ,/, the/DT latest/JJS results/NNS appear/VBP in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ,/, • Difficulty in tagging ‘around’: particle, preposition, adverb • Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG • All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN • Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

  10. POS Tagging Problem • Given a sentence W1…Wn and a tag set of lexical categories, find the most likely tags C1…Cn for the words in the sentence • Note that many of the words may have unambiguous tags • Example: Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • But enough words are either ambiguous or unknown that it’s a nontrivial task • Tokenization: punctuation marks (period, comma, etc.) must be separated off from the words.

  11. More details of the problem • How ambiguous? • Most words in English have only one Brown Corpus tag • Unambiguous (1 tag): 35,340 word types • Ambiguous (2-7 tags): 4,100 word types = 11.5% • 7 tags: 1 word type (“still”) • But many of the most common words are ambiguous • Over 40% of Brown corpus tokens are ambiguous • Obvious strategies may be suggested based on intuition • to/TO race/VB • the/DT race/NN • will/MD race/VB • This leads to hand-crafted rule-based POS tagging (J&M, 8.4) • Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ

  12. POS tagging

  13. POS tagging methods • First, we’ll look at how POS-annotated corpora are constructed • Then, we’ll narrow in on various POS tagging methods, which rely on POS-annotated corpora • Supervised learning: use POS-annotated corpora as training material • HMMs, TBL, Maximum Entropy, MBL • Unsupervised learning: use training material which is unannotated • HMMs, TBL • Two main categories for tagging: rule-based, stochastic

  14. Constructing POS-annotated corpora • By examining how POS-annotated corpora are created, we will understand the task even better • Basic steps • 1. Tokenize the corpus • 2. Design a tagset, along with guidelines • 3. Apply the tagset to the text • Knowing what issues the corpus annotators had to deal with in hand-tagging tells us the kind of issues we’ll have to deal with in automatic tagging

  15. 1. Tokenize the corpus • The corpus is first segmented into sentences and then each sentence is tokenized into tokens (words) • Punctuation is usually split off • Figuring out sentence-ending periods vs. abbreviation periods is not always easy, etc. • Can become tricky: • Proper nouns and other multi-word units: one token or separate tokens? e.g., Jimmy James, in spite of • Contractions are often split into separate words: can’t → ca n’t (because these are two different POSs) • Hyphenations are sometimes split, sometimes kept together
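To make the tokenization step concrete, here is a minimal sketch assuming a purely regex-based approach; the contraction list and punctuation rule are illustrative only (not the rules of any particular corpus), and the sketch is deliberately naive, so it also reproduces the abbreviation-period problem mentioned above.

```python
import re

# Toy tokenizer in the spirit of the steps above: split off punctuation and
# split common contractions into two tokens.  The rules are illustrative.
CONTRACTIONS = re.compile(
    r"\b(ca|do|does|did|is|was|were|wo|should|could|would)(n't)\b", re.I)

def tokenize(sentence):
    s = CONTRACTIONS.sub(r"\1 \2", sentence)        # can't -> ca n't
    s = re.sub(r'([.,;:?!"()])', r" \1 ", s)        # split punctuation off words
    return s.split()

print(tokenize("Mrs. Shaefer never got around to joining, can't she?"))
# ['Mrs', '.', 'Shaefer', ..., 'joining', ',', 'ca', "n't", 'she', '?']
# Note that the abbreviation period of "Mrs." is (wrongly) split off, which is
# exactly the sentence-boundary ambiguity discussed above.
```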

  16. Some other preprocessing issues • Should other word form information be included? • Lemma: the base, or root, form of a word. So, cat and cats have the same lemma, cat. • Sometimes words are lemmatized by stemming, other times by morphological analysis, using a dictionary & morphological rules • Sometimes it makes a big difference – roughly vs. rough • Fold case or not (usually folded)? The, the, THE; Mark versus mark • One may need, however, to regenerate the original case when presenting output to the user
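As a small illustration of the stemming vs. lemmatization point, the sketch below contrasts NLTK's Porter stemmer with a tiny hand-made lemma dictionary; the dictionary is invented for this example, and a real lemmatizer would use a full lexicon plus morphological rules.

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

# Tiny stand-in for dictionary-based lemmatization (illustrative only).
LEMMAS = {"cats": "cat", "roughly": "roughly", "rough": "rough"}

for w in ["cats", "roughly", "rough"]:
    print(w, "| stem:", stemmer.stem(w), "| lemma:", LEMMAS.get(w, w))
# A stemmer may mangle forms (e.g. "roughly" -> a truncated stem) that a
# dictionary-based lemmatizer keeps intact, which is why the choice can make
# a big difference (roughly vs. rough).
```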

  17. 2. Design the Tagset • It is not obvious what the parts of speech are in any language, so a tagset has to be designed. • Criteria: • External: what do we need to capture the linguistic properties of the text? • Internal: what distinctions can be made reliably in a text? • Some tagsets are compositional: each character in the tag has a particular meaning: e.g., Vr3s-f = verb-present-3rd-singular-<undefined_for_case>-female • What do you do with multi-token words, e.g. in terms of?

  18. Tagging Guidelines • The tagset should be clearly documented and explained, with a “case law” of what to do in problematic cases • If you don’t do this, you will have problems in applying the tagset to the text • If you have good documentation, you can have high interannotator agreement and a corpus of high quality.

  19. 3. Apply Tagset to Text: PTB Tagging Process • 1. Tagset developed • 2. Automatic tagging by rule-based and statistical POS taggers, error rates of 2-6%. • 3. Human correction using a text editor • Takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour • Inter-annotator disagreement (4 annotators, eight 2000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task

  20. POS Tagging Methods • Two basic ideas to build from: • Establishing a simple baseline with unigrams • Hand-coded rules • Machine learning techniques: • Supervised learning techniques • Unsupervised learning techniques

  21. A Simple Strategy for POS Tagging • Choose the most likely tag for each ambiguous word, independent of previous words • i.e., assign each token the POS category it occurred as most often in the training set • E.g., race – which POS is more likely in a corpus? • This strategy gives you about 90% accuracy in controlled tests • So, any proposed tagger must always be compared against this “unigram baseline”

  22. Example of the Simple Strategy • Which POS is more likely in a corpus (1,273,000 tokens)? Counts for "race": NN 400, VB 600, total 1000 • P(NN|race) = P(race&NN) / P(race), by the definition of conditional probability • P(race) ≈ 1000/1,273,000 = .0008 • P(race&NN) ≈ 400/1,273,000 = .0003 • P(race&VB) ≈ 600/1,273,000 = .0005 • And so we obtain: • P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .4 • P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .6
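A quick way to sanity-check the arithmetic above in code; the counts are the ones given on the slide, and the script is just a re-computation, not part of the original material.

```python
# Recompute the slide's example: which tag is more likely for "race"?
corpus_tokens = 1_273_000
counts = {"NN": 400, "VB": 600}              # counts of race/NN and race/VB
p_race = sum(counts.values()) / corpus_tokens        # about .0008
for tag, c in counts.items():
    p_joint = c / corpus_tokens                      # P(race & tag)
    print(tag, round(p_joint / p_race, 2))           # P(tag | race)
# NN 0.4
# VB 0.6
# With these counts, the unigram baseline would tag every "race" as VB.
```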

  23. Hand-coded rules • Two-stage system: • Dictionary assigns all possible tags to a word • Rules winnow the list down to a single tag (or sometimes multiple tags are left, if it cannot be determined) • Can also use some probabilistic information • These systems can be highly effective, but writing the rules of course takes time. • We’ll see an example later of trying to learn the rules automatically (transformation-based learning)
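A minimal sketch of the two-stage idea, assuming a toy lexicon and a single hand-written constraint (both invented here for illustration; real systems such as ENGTWOL, described next, use large lexicons and thousands of constraints).

```python
# Stage 1: a (toy) dictionary assigns all possible tags to each word.
LEXICON = {"the": {"DT"}, "to": {"TO"}, "will": {"MD", "NN"}, "race": {"NN", "VB"}}

# Stage 2: hand-coded rules winnow the candidates.  The single rule below is
# illustrative: after TO, keep the verb reading if one is available.
def winnow(words):
    tags = []
    for i, word in enumerate(words):
        cands = set(LEXICON[word])
        if len(cands) > 1 and i > 0 and tags[-1] == "TO" and "VB" in cands:
            cands = {"VB"}
        # If more than one tag survives, keep them all (some constraint-based
        # taggers leave such words ambiguous rather than guess).
        tags.append(next(iter(cands)) if len(cands) == 1 else "/".join(sorted(cands)))
    return tags

print(winnow(["to", "race"]))     # ['TO', 'VB']
print(winnow(["the", "race"]))    # ['DT', 'NN/VB']  (left ambiguous)
```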

  24. Hand-coded Rules: ENGTWOL System • Based on two-level morphology • Uses 56,000-word lexicon which lists parts-of-speech for each word • Each entry is annotated with a set of morphological and syntactic features

  25. ENGTWOL Lexicon

  26. ENGTWOL tagger • First stage: return all possible POS tags • Input: “Pavlov had shown that salivation” • Output: • Pavlov PAVLOV N NOM SG PROPER • had HAVE V PAST VFIN SVO HAVE PCP2 SVO • shown SHOW PCP2 SVOO SVO SV • that ADV PRON DEM SG DET CENTRAL DEM SG CS • salivation N NOM SG

  27. ENGTWOL tagger • Large set of constraints: 3744 constraints

  28. Machine Learning • Machines can learn from examples • Learning can be supervised or unsupervised • Given training data, machines analyze the data, and learn rules which generalize to new examples • Can be sub-symbolic (rule may be a mathematical function) e.g., neural nets • Or it can be symbolic (rules are in a representation that is similar to representation used for hand-coded rules) • In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora

  29. Ways of learning • Supervised learning: • A machine learner learns the patterns found in an annotated corpus • Unsupervised learning: • A machine learner learns the patterns found in an unannotated corpus • Often uses another database of knowledge, e.g., a dictionary of possible tags • Techniques used in supervised learning can be adapted to unsupervised learning, as we will see.

  30. HMMs • Using probabilities in tagging was first tried in 1965 but only fully developed in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988) • HMM is a special case of Bayesian inference

  31. Transition/word likelihood probabilities • P(NN|DT) and P(JJ|DT) are high while P(DT|JJ) is low

  32. HMMs: A Probabilistic Approach • What you want to do is find the “best sequence” of POS tags T=T1..Tn for a sentence W=W1..Wn (here T1 is pos_tag(W1)) • Find a sequence of POS tags T that maximizes P(T|W) • Using Bayes’ Rule, we can say P(T|W) = P(W|T)*P(T)/P(W), where P(W|T) is the word likelihood and P(T) the transition (tag sequence) probability • We want to find the value of T which maximizes the RHS → the denominator can be discarded (it’s the same for every T) → find T which maximizes P(W|T) * P(T) • Example: He will race • W = W1 W2 W3 = He will race, T = T1 T2 T3 • Possible sequences: He/PP will/MD race/NN, He/PP will/NN race/NN, He/PP will/MD race/VB, He/PP will/NN race/VB (i.e., T = PP MD NN, PP NN NN, PP MD VB, PP NN VB)

  33. N-gram Models • POS problem formulation • Given a sequence of words, find a sequence of categories that maximizes P(T1…Tn | W1…Wn) • i.e., that maximizes P(W1…Wn | T1…Tn) * P(T1…Tn) (by Bayes’ Rule) • Chain Rule of probability: P(W|T) = ∏i=1..n P(Wi | W1…Wi-1, T1…Ti) (prob. of this word based on previous words & tags) and P(T) = ∏i=1..n P(Ti | W1…Wi, T1…Ti-1) (prob. of this tag based on previous words & tags) • But we don’t have sufficient data for this, and we would likely overfit the data, so we make some assumptions to simplify the problem …

  34. Independence Assumptions • Assume that the current event is based only on the previous n-1 events (i.e., for a bigram model, based on the previous event) • P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1) • assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag • From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies • P(W1…Wn | T1…Tn) ≈ ∏i=1..n P(Wi | Ti) • assumes that the event of a word appearing in a category is independent of the event of any surrounding word or tag, except for the tag at this position.
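Taken together with Bayes' Rule from slide 32, these two assumptions give the bigram objective that the following slides use. The LaTeX fragment below is my rendering of the slides' formulas; T0 is assumed to be a start-of-sentence pseudo-tag.

```latex
\documentclass{article}
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
% Bigram HMM objective: Bayes' Rule plus the two independence assumptions.
% T_0 denotes a start-of-sentence pseudo-tag.
\begin{align*}
\hat{T} &= \argmax_{T_1 \ldots T_n} P(T_1 \ldots T_n \mid W_1 \ldots W_n) \\
        &\approx \argmax_{T_1 \ldots T_n} \prod_{i=1}^{n} P(W_i \mid T_i)\, P(T_i \mid T_{i-1})
\end{align*}
\end{document}
```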

  35. Independence Assumptions • P(Ti|Ti-1) represents the probability of a tag given the previous tag. • Determiners are very likely to precede adjectives and nouns, as in sequences like that/DT flight/NN and the/DT yellow/JJ hat/NN. • P(NN|DT) and P(JJ|DT) are high but P(DT|JJ) is low

  36. Hidden Markov Models • Linguists know both these assumptions are incorrect! • But, nevertheless, statistical approaches based on these assumptions work pretty well for part-of-speech tagging • In particular, one called a Hidden Markov Model (HMM) • Very widely used in both POS-tagging and speech recognition, among other problems • A Markov model, or Markov chain, is just a weighted Finite State Automaton

  37. HMMs = Weighted FSAs • Nodes (or states): tag options • Edges: Each edge from a node has a (transition) probability to another state • The sum of transition probabilities of all edges going out from a node is 1 • The probability of a path is the product of the probabilities of the edges along the path. • Why multiply? Because of the assumption of independence (Markov assumption). • The probability of a string is the sum of the probabilities of all the paths for the string. • Why add? Because we are considering the union of all interpretations. • A string that isn’t accepted has probability zero.
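A tiny numeric illustration of the two rules just stated (multiply along a path, add over alternative paths for the same string); the two paths and their edge weights are a made-up toy, not taken from the slides.

```python
from math import prod  # Python 3.8+

def path_prob(edge_probs):
    # Multiply edge probabilities along a path (Markov independence).
    return prod(edge_probs)

def string_prob(paths):
    # Add the probabilities of all accepting paths for the same string.
    return sum(path_prob(p) for p in paths)

paths = [(1.0, 0.8, 0.4),    # one path through the automaton
         (1.0, 0.2, 0.3)]    # an alternative path for the same string
print([path_prob(p) for p in paths])   # approx. [0.32, 0.06]
print(string_prob(paths))              # approx. 0.38
```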

  38. example

  39. HMM • Weighted finite-state automaton • Markov Chain: a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through • It uses only observed events; in the POS tagging problem these are the words • By itself it is not useful for POS tagging, because of ambiguity • Hidden Markov Model: uses both • observed events (the words that we see in the sentence) • hidden events (the part-of-speech tags)

  40. Hidden Markov Models • Why the word “hidden”? • Called “hidden” because one cannot tell what state the model is in for a given sequence of words. • e.g., will could be generated from NN state, or from MD state, with different probabilities. • That is, looking at the words will not tell you directly what tag state you’re in.

  41. POS Tagging Based on Bigrams • Problem: Find T which maximizes P(W | T) * P(T) • Here W=W1…Wn and T=T1…Tn • Using the bigram model, we get: • Transition probabilities (prob. of transitioning from one state/tag to another): P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1) • Emission probabilities (prob. of emitting a word at a given state): P(W1…Wn | T1…Tn) ≈ ∏i=1..n P(Wi | Ti) • So, we want to find the value of T1…Tn which maximizes ∏i=1..n P(Wi | Ti) * P(Ti | Ti-1)

  42. Using POS bigram probabilities: transitions • P(T1…Tn) ≈ ∏i=1..n P(Ti | Ti-1) • Example: He will race • Choices for T=T1..T3: T = PRP MD NN, T = PRP NN NN, T = PRP MD VB, T = PRP NN VB • POS bigram probabilities estimated from a training corpus can be used for P(T), e.g. P(PRP-MD-NN) = 1 * .8 * .4 = .32 • POS bigram probs P(column | row), with <s> the start of the sentence:
           MD    NN    VB    PRP
   <s>      -     -     -     1
   PRP     .8    .2     -     -
   MD       -    .4    .6     -
   NN       -    .3    .7     -
  [The slide also draws these transitions as a weighted FSA over the states <s>, PRP, MD, NN, VB.]

  43. Factoring in lexical generation probabilities • From the training corpus, we need to find the Ti which maximizes ∏i=1..n P(Wi | Ti) * P(Ti | Ti-1) • So, we’ll need to factor in the lexical generation (emission) probabilities, somehow • Lexical generation probs P(word | tag):
           MD    NN    VB    PRP
   he       0     0     0     1
   will    .8    .2     0     0
   race     0    .4    .6     0
  [The slide overlays these emission probabilities on the transition FSA from the previous slide.]

  44. Adding emission probabilities • The transition FSA from slide 42 is combined with the lexical generation probs from slide 43, so each state also emits words (e.g. will|MD .8, will|NN .2, race|NN .4, race|VB .6, he|PRP) • Like a weighted FSA, except that there are output probabilities associated with each node.
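The sketch below scores the four candidate tag sequences for "He will race" by brute force, using the transition and emission values as reconstructed from slides 42 and 43 (reading P(he|PRP) = 1 from the lexical generation table); the code itself is mine, not the course's.

```python
# Transition probabilities P(tag | previous tag); "<s>" is the start state.
TRANS = {("<s>", "PRP"): 1.0,
         ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
         ("MD", "NN"): 0.4,  ("MD", "VB"): 0.6,
         ("NN", "NN"): 0.3,  ("NN", "VB"): 0.7}
# Lexical generation (emission) probabilities P(word | tag).
EMIT = {("he", "PRP"): 1.0,
        ("will", "MD"): 0.8, ("will", "NN"): 0.2,
        ("race", "NN"): 0.4, ("race", "VB"): 0.6}

def score(words, tags):
    """P(W|T) * P(T) under the bigram transition / emission assumptions."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= TRANS.get((prev, t), 0.0) * EMIT.get((w, t), 0.0)
        prev = t
    return p

words = ["he", "will", "race"]
for tags in [("PRP", "MD", "NN"), ("PRP", "NN", "NN"),
             ("PRP", "MD", "VB"), ("PRP", "NN", "VB")]:
    print(tags, score(words, tags))
# ('PRP', 'MD', 'VB') scores highest (about 0.23), so "race" gets tagged VB.
```

Enumerating every tag sequence like this is exponential in sentence length, which is exactly why the dynamic-programming (Viterbi) approach on the next slides is needed.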

  45. HMM formulation

  46. HMM formulation

  47. Emission & Transition probability

  48. Dynamic Programming • In order to find the most likely sequence of categories for a sequence of words, we don’t need to enumerate all possible sequences of categories. • Because of the Markov assumption, if you keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other less likely sequences. • i.e., multiple edges coming into a state, but only keep the value of the most likely path • This is another use of dynamic programming • The algorithm to do this is called the Viterbi algorithm.

  49. The Viterbi algorithm • Assume we’re at state I in the HMM • Obtain • the probability of each previous state H1…Hm • the transition probabilities of H1→I, …, Hm→I • the emission probability for word w at I • Multiply the probabilities for each new path: • P(H1,I) = Score(H1)*P(I|H1)*P(w|I) • … • One of these states (H1…Hm) will give the highest probability • Only keep that highest probability when using I for the next state
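A compact Viterbi sketch following the recurrence on this slide; it uses the same toy transition and emission tables as the brute-force example above (again reconstructed from the slides, so treat the numbers as illustrative rather than as the course's own code).

```python
TRANS = {("<s>", "PRP"): 1.0, ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
         ("MD", "NN"): 0.4, ("MD", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
EMIT = {("he", "PRP"): 1.0, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
        ("race", "NN"): 0.4, ("race", "VB"): 0.6}

def viterbi(words, tagset, trans, emit, start="<s>"):
    """Most likely tag sequence under a bigram HMM (dynamic programming)."""
    best = {start: 1.0}   # best[t]: prob. of the best path ending in tag t
    back = []             # backpointers, one dict per word
    for w in words:
        new_best, new_back = {}, {}
        for t in tagset:
            e = emit.get((w, t), 0.0)
            if e == 0.0:
                continue
            # Keep only the highest-scoring incoming path: Score(H)*P(t|H)*P(w|t)
            h, p = max(((h, ph * trans.get((h, t), 0.0) * e)
                        for h, ph in best.items()), key=lambda x: x[1])
            if p > 0.0:
                new_best[t], new_back[t] = p, h
        best = new_best
        back.append(new_back)
    # Follow the backpointers from the best final tag.
    t = max(best, key=best.get)
    path = [t]
    for bp in reversed(back[1:]):
        t = bp[t]
        path.append(t)
    return list(reversed(path)), max(best.values())

print(viterbi(["he", "will", "race"], ["PRP", "MD", "NN", "VB"], TRANS, EMIT))
# (['PRP', 'MD', 'VB'], about 0.23): the same answer as the brute-force search,
# found without enumerating every tag sequence.
```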
