
Part of Speech Tagging


Presentation Transcript


  1. Part of Speech Tagging John Barnden School of Computer Science University of Birmingham Natural Language Processing 1 2010/11 Semester 2

  2. Overview • What the task is • Brief: the range of approaches • Some needed stuff on probability • The core stochastic (probabilistic) approach • A toy POS tagger

  3. What the Task Is • Assign a part of speech (POS) or similar such classification to each lexical form (and possibly some multi-form sequences treated as units) in an utterance ... • and to punctuation marks and sometimes to affixes/clitics (e.g., ’s): • such things are separated off beforehand by a prior “tokenization” phase. • Usually described as being applied to text or transcribed speech, but could be to speech directly, in principle. • We will assume just ordinary text. • Transcribed speech: might need to cope with things like pause indicators • We will look just at the usual case of choosing one best-looking classification per word, but variant methods can assign more than one (e.g. the most probable few).

  4. The “Output” – Text/Tags • The classification is usually recorded as a “tag” inserted into the text itself, next to each word, punctuation mark, etc. • E.g. [see J&M p.167] • Does that flight serve dinner? • Tagged as follows [using the Penn Treebank tagset: front pages of J&M] • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/. • VBZ = verb, 3rd-person singular present • VB = verb, base form • DT = determiner • NN = mass noun or singular count noun • . = sentence-final punctuation ( . ? !) • If some phrases (e.g. multi-word names) are treated as units, surrounding tags as in HTML or XML style are better: <NNP>Royal Bank of Scotland</NNP>

  5. Why It’s Easy to an Extent but then Difficult • A given form may have several different possible tags in different contexts (e.g. “bank” can be verb or noun). • But: • Can get >90% accuracy (compared to human tagging) just by assigning each word its most common tag (as determined by statistical examination of corpora). • But: • That’s up to 1 in 10 words (~ 1 word/sentence) wrongly tagged! • Difficulties arise in trying for higher accuracy ...

  6. Why It’s Difficult • ... Difficulties arise in trying for higher accuracy ... • A given utterance may have alternative syntactic analyses, but POS tagging typically precedes full syntactic analysis. • An utterance may break normal syntactic rules, giving strange tag sequences. • POSs may depend on each other over substantial distances in an utterance. • The needed info about how POSs pattern together may be tricky to gain from corpora: clever statistical tricks are needed to fill in gaps in what’s known. • Genre specificity: corpora used need to be either general (i.e., mixed) or ideally from the genre in question.

  7. Types of Approach • Rule-Based (including constraint-based) • uses many hand-written rules saying what POS to choose from the possibilities for a word, given info on surrounding POSs • the earliest approach to be developed • Stochastic [what we’ll mainly look at] • often implemented via HMMs – Hidden Markov Models • the (hopefully) most probable sequence of tags for an utterance is worked out by standard probabilistic calculations and graph traversal • usually relies on statistics gathered from manually tagged corpora • Transformation-Based (including the popular “Brill” tagger) • tagging is done by rules, in iterative passes: the previous pass’s tagging can be transformed by rules • but the rules are learned automatically, from manually tagged corpora

  8. Required Basics of Probability • Conditional Probability P(X|Y) • is the probability that X is the case given that Y is the case. E.g.: • P(John has a cold | John is sneezing a lot) = 0.9 • P(John is sneezing a lot | John has a cold) = 0.6 • P(a word is “the” | that word is a Determiner) = 0.27 (say) • P(a word is a Determiner | that word is “the”) = 0.99 (say) • Can estimate P(X|Y) by COUNT(occasions on which X and Y are both the case) / COUNT(occasions on which Y is the case), i.e. the proportion of Y occasions on which X also happens • So P(X|Y) = P(X and Y) / P(Y) • P(X and Y) = P(X|Y) P(Y) [a great thing to remember!]
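
As a minimal illustrative sketch (not part of the original slides), the counting estimate of P(X|Y) can be written directly in Python; the toy observation list and the predicates are invented for illustration:

```python
# Minimal sketch: estimate P(X|Y) as COUNT(X and Y occasions) / COUNT(Y occasions).
# The toy observations and predicates below are invented for illustration.
observations = [
    {"word": "the", "is_determiner": True},
    {"word": "the", "is_determiner": True},
    {"word": "a",   "is_determiner": True},
    {"word": "run", "is_determiner": False},
]

def conditional_probability(obs, x_pred, y_pred):
    """Estimate P(X|Y) from co-occurrence counts."""
    count_y  = sum(1 for o in obs if y_pred(o))
    count_xy = sum(1 for o in obs if x_pred(o) and y_pred(o))
    return count_xy / count_y if count_y else 0.0

# P(a word is "the" | that word is a Determiner) on this toy sample:
print(conditional_probability(observations,
                              lambda o: o["word"] == "the",
                              lambda o: o["is_determiner"]))   # 2/3
```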

  9. Basics of Probability, contd • P(X and Y) = P(X|Y) P(Y) • But so also: P(X and Y) = P(Y|X) P(X) • So: P(X|Y) P(Y) = P(Y|X) P(X) • So: • P(X|Y) = P(Y|X)P(X) / P(Y) [Bayes’ Rule – an even better memory] • Bayes’ Rule [should be Bayes’s Rule!] used all over the place in AI and in statistical NLP. • Treated almost with religious reverence but is actually a trivial corollary of the notion of conditional probability. • P(X) is often called a “prior” probability and P(X|Y) a “posterior” probability, because often Y is some given circumstance that is viewed as modifying how likely it is that X is the case.
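
A small numeric sketch of Bayes’ Rule (the probabilities are invented, loosely echoing the cold/sneezing example above):

```python
# Minimal sketch of Bayes' Rule: P(X|Y) = P(Y|X) * P(X) / P(Y).
# All numbers are invented for illustration.
p_y_given_x = 0.6    # e.g. P(John sneezes a lot | John has a cold)
p_x         = 0.1    # prior P(John has a cold)
p_y         = 0.2    # P(John sneezes a lot), however obtained
p_x_given_y = p_y_given_x * p_x / p_y
print(p_x_given_y)   # ~0.3 = posterior P(cold | sneezing)
```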

  10. Basics of Probability, contd • P(X|Y) and P(Y|X) can have very different values. • E.g., P(DT|an) = P(a word is DT | that word is “an”) could be close to 1, • whereas P(an|DT) could be e.g. 0.03 • But don’t fall into the trap of thinking that they somehow vary in opposite directions. • For fixed P(X) and P(Y) [actually, for fixed ratio, P(X)/P(Y)] : • the higher that P(Y|X) is, the higher P(X|Y) is. • This fact will be a crucial intuition later.

  11. Basics of Probability, contd • Don’t make the mistake of thinking that P(X|Y) P(Y) equals P(X). • Rather: • P(X) = P(X|Y) P(Y) + P(X|¬Y) P(¬Y) • This is simply because • P(X) = P(X and Y) + P(X and ¬Y). • In general, if Y has n mutually exclusive sub-possibilities Y_1 to Y_n, • P(X) = P(X|Y_1) P(Y_1) + ... + P(X|Y_n) P(Y_n)
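
A matching sketch of the total-probability expansion (the same invented numbers as in the Bayes sketch above, so P(X) comes out as the 0.1 prior used there):

```python
# Minimal sketch: P(X) = P(X|Y) P(Y) + P(X|not-Y) P(not-Y).  Invented numbers.
p_y             = 0.2
p_x_given_y     = 0.3
p_x_given_not_y = 0.05
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)
print(p_x)   # ~0.1
```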

  12. “Ignoring the Bayes Denominator” • Suppose you want to find which of P(X_1|Y) and P(X_2|Y) is bigger, but don’t particularly need to know how big it is. • E.g., Y is that Johnny has red spots, X_1 is that Johnny has measles and X_2 is that Johnny has an allergy. You want to find out Johnny’s more likely condition. • It may be that P(Y|X_1) and P(Y|X_2) are easier to get information about. It may even be that we know enough about how medical conditions work to estimate those probabilities without gathering statistics, whereas that is less likely for P(X_1|Y) and P(X_2|Y). • We know that • P(X_1|Y) = P(Y|X_1) P(X_1) / P(Y) and • P(X_2|Y) = P(Y|X_2) P(X_2) / P(Y). • To find out which is the bigger value, P(Y) “cancels out” – it’s irrelevant. • So we just need to find which is the bigger of P(Y|X_1) P(X_1) and P(Y|X_2) P(X_2). • This isn’t just a mathematical trick – it’s simply that we want to know which of P(Y and X_1) and P(Y and X_2) is the bigger. And you might have seen that anyway from first principles!
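
A minimal sketch of the comparison on this slide, assuming invented values for the likelihoods P(Y|X) and the priors P(X); note that P(Y) is never needed:

```python
# Minimal sketch: pick the likelier hypothesis by comparing P(Y|X) * P(X),
# ignoring the shared denominator P(Y).  All numbers are invented.
candidates = {
    "measles": {"p_y_given_x": 0.90, "p_x": 0.01},  # P(red spots | measles), P(measles)
    "allergy": {"p_y_given_x": 0.40, "p_x": 0.10},  # P(red spots | allergy), P(allergy)
}
best = max(candidates,
           key=lambda x: candidates[x]["p_y_given_x"] * candidates[x]["p_x"])
print(best)   # "allergy": 0.40 * 0.10 = 0.04  >  0.90 * 0.01 = 0.009
```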

  13. “Which is the Biggest” • Some mathematical notation for “which is the biggest”: • If there are some values V(x) that depend on x (which varies within some specified set of values) then the value of x that makes V(x) biggest can be expressed as argmax_x(V(x)). • In the example above, we want argmax_x( P(x|Y) ) if it’s understood that x is in the set {X_1, X_2}. This of course generalizes to more diseases, {X_1, X_2, ..., X_n}. • The more/most likely disease given the symptoms.

  14. The Probability Task in POS Tagging • The task is to find • best-tagging = argmax_tagging( P(tagging | lexform-seq) ) • where lexform-seq is some given sequence of lexical forms and tagging is a proposed tagging (tag-sequence) of it, within the set of all possible taggings of that lexical form sequence. • The most likely tagging given the lexical form sequence. • We start the lexical form sequence with one or more start-of-sequence markers Ø, and the tag sequence with that many null tags Ø. • We use Bayes’ Rule to get • best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) / P(lexform-seq) ) • But lexform-seq is fixed, so we can “ignore the denominator,” simplifying to • best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) )

  15. The Task in POS Tagging, contd. • So we’ve replaced the calculation of • best-tagging = argmax_tagging( P(tagging | lexform-seq) ) • by the calculation of • best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) ) • Why on earth is this an improvement?? you ask. • Because ....

  16. We can approximate ... • best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) ) • We can approximate the prior probability P(tagging) – see following slides. • It turns out we can approximate P(lexform-seq | tagging) pretty well by pretending that, at each position in the sequence, the probability of the lexical form being there (given the whole tagging) actually depends only on the tag there; i.e. the probability at that position is just P(lexform | tag), which can be estimated from a corpus by count(lexform & tag) / count(tag). • So P(lexform-seq | tagging) is approximated as the product of the individual P(lexform | tag) values (ignoring P(lexform_0 | tag_0) = P(Ø | Ø) = 1) • i.e. P(lexform_n | tag_n) * P(lexform_{n-1} | tag_{n-1}) * ... * P(lexform_1 | tag_1) • i.e. Π_{i=1}^{n} P(lexform_i | tag_i) • (Π is a product operator, exactly analogous to the summation operator Σ)
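
A minimal sketch of estimating the lexical-form likelihoods P(lexform | tag) from a hand-tagged corpus; the one-sentence toy corpus reuses the tagged example from slide 4:

```python
# Minimal sketch: P(lexform | tag) ~= count(lexform & tag) / count(tag),
# estimated from a (toy, one-sentence) hand-tagged corpus.
from collections import Counter

tagged_corpus = [
    [("does", "VBZ"), ("that", "DT"), ("flight", "NN"),
     ("serve", "VB"), ("dinner", "NN"), ("?", ".")],
]

tag_counts, emit_counts = Counter(), Counter()
for sentence in tagged_corpus:
    for word, tag in sentence:
        tag_counts[tag] += 1
        emit_counts[(word, tag)] += 1

def p_word_given_tag(word, tag):
    return emit_counts[(word, tag)] / tag_counts[tag] if tag_counts[tag] else 0.0

print(p_word_given_tag("dinner", "NN"))   # 0.5: "dinner" is 1 of the 2 NN tokens
```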

  17. Towards P(tagging): Iterative Conditionalization • Suppose there is a set Z of possibilities, and we list them in some order (any order we like) as Z_Ø to Z_m. Then • P(Z_Ø,...,Z_m), i.e. P(Z_Ø and ... and Z_m), = P(Z_m | Z_{m-1}, ..., Z_Ø) * P(Z_{m-1}, ..., Z_Ø). • But we can do the same thing on P(Z_{m-1}, ..., Z_Ø), and ultimately we get: • P(Z_Ø,...,Z_m) = P(Z_m | Z_{m-1}, ..., Z_Ø) * P(Z_{m-1} | Z_{m-2}, ..., Z_Ø) * ... * P(Z_Ø).

  18. Iterative Conditionalization for Tags • One application of the previous slide: consider sentences of length n plus the imaginary start-of-sequence marker Ø. For convenience we’ll include under “sentences” other meaningful lexical-form sequences that might be used in place of sentences. • Suppose tagging = T_Ø,...,T_n is a list of tags, with T_Ø being Ø. • We ask: what is P(T_Ø,...,T_n), i.e. the probability of an arbitrary sentence of length n, plus Ø at the start, having the POSs (etc.) T_Ø,...,T_n? • NB: this is not a question about any specific sentences • Answer (where T_k now means that word k, whatever it is, has tag T_k): • P(T_Ø,...,T_n) = P(T_n | T_{n-1}, ..., T_Ø) * P(T_{n-1} | T_{n-2}, ..., T_Ø) * ... * P(T_1 | T_Ø) * P(T_Ø). • But P(T_Ø) = 1, so • P(T_Ø,...,T_n) = P(T_n | T_{n-1}, ..., T_Ø) * P(T_{n-1} | T_{n-2}, ..., T_Ø) * ... * P(T_1 | T_Ø)

  19. N-gram Approximations • An n-gram is a sequence of n lexical forms. • NLP is much concerned with the statistics of n-grams, e.g. what’s the frequency of occurrence of the 2-gram (bigram) “dog bit”? • Bigrams (and trigrams) are of special interest in the POS tagging area. • The idea extends to sequences of tags rather than lexical forms. • We can apply approximations to conditional probabilities concerning the dependence of tags on each other. • E.g. we can adopt the bigram approximation that • P(T_k | T_{k-1}, ..., T_Ø) = P(T_k | T_{k-1}) • i.e. the probability of a tag occurring is dependent only on the immediately preceding tag, not earlier ones. • The trigram approximation is that • P(T_k | T_{k-1}, T_{k-2}, ..., T_Ø) = P(T_k | T_{k-1}, T_{k-2}) • (We introduce two start-of-sentence markers to make sure everything works right.)

  20. N-gram Approximations, contd • Then under the bigram approximation we get • P(T_Ø,...,T_n) = P(T_n | T_{n-1}) * P(T_{n-1} | T_{n-2}) * ... * P(T_1 | T_Ø) • = Π_{i=1}^{n} P(T_i | T_{i-1}) • So if you had such bigram probabilities about tags then you could approximate the probability of a suggested tagging. • NB: This is a “prior” probability, not dependent on any specific words being tagged. It’s just about the patterns of possible tag sequences, across the whole language. • Can estimate P(T_i | T_{i-1}) = P(tag at i is a tag Y | tag at i-1 is a tag X) from a corpus as • count(tag X occurs & next lex-form has tag Y) / count(tag X occurs). • NB: This estimation ignores the numerical position, i. • Similar observations apply to trigrams. (Exercise: work out the formula that replaces the one at the top of this slide.)
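
A matching sketch for the tag-transition probabilities P(T_i | T_{i-1}); the start-of-sequence marker Ø is written here as the string "<s>", and the toy tag sequence is again the one from slide 4:

```python
# Minimal sketch: P(tag Y | previous tag X) ~=
#   count(X occurs & next lex-form has tag Y) / count(X occurs).
from collections import Counter

tag_sequences = [
    ["VBZ", "DT", "NN", "VB", "NN", "."],    # tags of one toy sentence
]

prev_counts, bigram_counts = Counter(), Counter()
for tags in tag_sequences:
    tags = ["<s>"] + tags                    # prepend the start-of-sequence marker
    for prev, cur in zip(tags, tags[1:]):
        prev_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def p_tag_given_prev(tag, prev):
    return bigram_counts[(prev, tag)] / prev_counts[prev] if prev_counts[prev] else 0.0

print(p_tag_given_prev("NN", "DT"))   # 1.0: every DT in the toy data is followed by NN
```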

  21. Putting It Together • Recall best-tagging = argmax_tagging( P(lexform-seq | tagging) * P(tagging) ) • Then, using our approximations, • best-tagging = argmax_{T_1,...,T_n}( Π_{i=1}^{n} P(lexform_i | T_i) * Π_{i=1}^{n} P(T_i | T_{i-1}) ) • i.e., • best-tagging = argmax_{T_1,...,T_n}( Π_{i=1}^{n} [ P(lexform_i | T_i) * P(T_i | T_{i-1}) ] )
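
As a brute-force sketch of the formula above (before the incremental algorithm of the next slide), one can enumerate every candidate tagging and score it by the product Π_i P(lexform_i | T_i) * P(T_i | T_{i-1}); the tiny probability tables and the ambiguous two-word “sentence” are invented:

```python
# Minimal brute-force sketch: score every tagging by
#   prod_i  P(lexform_i | T_i) * P(T_i | T_{i-1})
# and take the argmax.  All probability values below are invented.
from itertools import product

p_emit  = {("flies", "NN"): 0.1, ("flies", "VBZ"): 0.2,
           ("fast", "RB"): 0.3,  ("fast", "JJ"): 0.2}
p_trans = {("<s>", "NN"): 0.3, ("<s>", "VBZ"): 0.1,
           ("NN", "RB"): 0.2,  ("NN", "JJ"): 0.1,
           ("VBZ", "RB"): 0.4, ("VBZ", "JJ"): 0.1}
possible_tags = {"flies": ["NN", "VBZ"], "fast": ["RB", "JJ"]}

def score(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_emit.get((w, t), 0.0) * p_trans.get((prev, t), 0.0)
        prev = t
    return p

words = ["flies", "fast"]
best = max(product(*(possible_tags[w] for w in words)),
           key=lambda tags: score(words, tags))
print(best)   # ('VBZ', 'RB'):  0.2*0.1 * 0.3*0.4 = 0.0024 beats the other taggings
```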

  22. Towards an Algorithm • best-tagging = argmax_{T_1,...,T_n}( Π_{i=1}^{n} [ P(lexform_i | T_i) * P(T_i | T_{i-1}) ] ) • Notice a sort of “incremental maximality” property: • Let best-tagging be BT_1,...,BT_n. Then, for any k, BT_1,...,BT_k must be the best (i.e., highest Π_{i=1}^{k} value) sub-tagging that ends with BT_k (though not necessarily the best that ends at position k), because otherwise you could replace that sub-tagging with a better one and thus improve best-tagging itself. • You could do that replacement because the tags within it don’t affect the above P values after position k. • The above is key to an incremental algorithm that steps through the sentence finding best sub-taggings at every stage: one for each tag that is possible for the lexform at k. • For each of those tags, record the best sub-tagging ending with it, using the above formula with n = k, together with its probability. • To find the best sub-taggings at the next stage, we just extend the best one for each tag at the current stage: no point extending non-best ones. • Each extension step multiplies the probability by P(lexform_{k+1} | T_{k+1}) * P(T_{k+1} | T_k)
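
A minimal sketch of that incremental computation (a toy Viterbi), keeping, for each candidate tag at each position, the best-scoring sub-tagging that ends with it; the tables are the same invented ones as in the brute-force sketch above:

```python
# Minimal incremental sketch: at each position keep, for every candidate tag,
# the best sub-tagging ending in that tag, then extend only those.
p_emit  = {("flies", "NN"): 0.1, ("flies", "VBZ"): 0.2,
           ("fast", "RB"): 0.3,  ("fast", "JJ"): 0.2}
p_trans = {("<s>", "NN"): 0.3, ("<s>", "VBZ"): 0.1,
           ("NN", "RB"): 0.2,  ("NN", "JJ"): 0.1,
           ("VBZ", "RB"): 0.4, ("VBZ", "JJ"): 0.1}
possible_tags = {"flies": ["NN", "VBZ"], "fast": ["RB", "JJ"]}

def best_tagging(words):
    # best[tag] = (prob of best sub-tagging ending in tag, that sub-tagging)
    best = {"<s>": (1.0, [])}
    for w in words:
        new_best = {}
        for tag in possible_tags[w]:
            # extend the best recorded sub-tagging for each possible predecessor
            prob, prev = max(
                (best[p][0] * p_trans.get((p, tag), 0.0) * p_emit.get((w, tag), 0.0), p)
                for p in best)
            new_best[tag] = (prob, best[prev][1] + [tag])
        best = new_best
    return max(best.values())      # (probability, tagging) of the best full tagging

print(best_tagging(["flies", "fast"]))   # (~0.0024, ['VBZ', 'RB'])
```

Each position extends at most T recorded entries by at most T predecessors, which is where the O(W·T²) bound on a later slide comes from.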

  23. [Trellis diagram, not reproducible in text: lexical forms L1–L4 at positions 1 to 4 (= n), with several candidate tags (T_{1:i}, T_{1:ii}, ..., T_{4:iii}) as nodes at each position. The start marker Ø links to the position-1 tags via transition probabilities P(T_{1:j} | Ø); arcs between successive positions carry P(T_{k:j} | T_{k-1:i}) and each node carries the emission probability P(L_k | T_{k:j}). The diagram marks the best sub-tagging up to position 3 and the overall best-tagging at position 4.]

  24. The Algorithm as Special Breadth-First Search • Can be seen as a breadth-first search with special features: • Each (non-zero-valued) node is connected to all nodes at the next level. • Each node on the last level is a goal node, and all the goal nodes are known in advance. • We have to find the best-valued path. • The value of the path is not found by adding arc costs but by multiplying arc values and also multiplying in some node values. • But if you absorbed the node values into modified arc values and took negated logarithms you could use additions to get a traditional cost. • The need to keep a record of the best-valued path to any node, and its value, is the same as the need in A* to keep a record of the least-cost path to any node together with its cost.
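
A tiny sketch of the negated-logarithm conversion mentioned above: products of probabilities become sums of costs, so a standard least-cost search applies (the numbers reuse the invented toy tables from the earlier sketches):

```python
# Minimal sketch: cost of one arc = -log(transition prob * emission prob);
# multiplying probabilities along a path then becomes adding costs.
import math

def arc_cost(p_transition, p_emission):
    p = p_transition * p_emission
    return math.inf if p == 0.0 else -math.log(p)

# Cost of the toy VBZ, RB path from the earlier sketches:
total = arc_cost(0.1, 0.2) + arc_cost(0.4, 0.3)
print(total)                       # ~6.03
print(-math.log(0.0024))           # same value: -log of the path probability
```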

  25. Its Computational Complexity • O(W·T²) computation steps, where W is the number of positions (= length of the lexical-form sequence) and T is the max number of tags that need to be considered at a position in the given sequence. • T needs to count all tags at a position, unless we can definitely exclude some. • If we didn’t have the “incremental maximality” property above, we might have had to compute all possible tag sequences, of which there are up to T^W

  26. “Viterbi” Algorithm • The algorithm is an example of a so-called “Viterbi” algorithm. • Such algorithms are used in various sectors of NLP. • It’s actually just an application of a widely known technique called Dynamic Programming (see J&M). • A toy version of the algorithm (by Dr Hancox) is available via my slides page.

  27. Hidden Markov Model • The above network is an example of a Hidden Markov Model. • A Markov Model is like a Finite State Network but with probabilities attached to the arcs (transitions). In our case: • The states are tags-at-positions. • The transition probabilities are the P(T_i | T_{i-1}) values. • We don’t really need to label the arcs, but to relate more closely to FSNs, we can regard the arc label as being the relevant tag pair. • A Hidden Markov Model is a Markov Model plus the idea that you can’t observe the states themselves but only derived information (the lexical forms in our case), where that observable information is related by a conditional probability (= “observation likelihood” or “emission probability”) to the state. In our case, for each state: • the lexform at the state’s position in the sequence, and the P(lexform | tag) value.

  28. POS Tagging as HMM by Viterbi • So our POS tagging is being done by converting the task into a standard thing to do with an HMM (find the best hidden path given the observations – the “decoding” task); and we do it by means of a standard (“Viterbi”) algorithm.

  29. Refinements • Can include also an end-of-sequence marker/tag pair Ø/Ø, to take care of constraints on how a sentence can end as well as on how it can start. • Can make the P(lexform-seq | tagging) approximation cleverer, • e.g. by replacing the lexical-form likelihoods (the P(lexform | tag) values) by P(lexform | tag-before, tag, tag-after), and can even include neighbouring lexical forms in the condition. • Contd ...

  30. Refinements • If the hand-tagged corpus doesn’t provide a particular lexform for a particular tag, then P(lexform | tag) is estimated as zero, but it could be wise to assign it a tiny value just in case, unless you know it definitely cannot happen (NB: but can you ever know this, given the possibility of errors in the corpus??). • Similarly, if the hand-tagged corpus doesn’t contain a particular tag bigram or trigram, this may be because it’s impossible, but it’s safer to assume it’s possible but rare. So ....

  31. Coping with missing tag n-grams • If the bigram T_i, T_{i-1} is missing, we can give a tiny value (e.g., 0.0001) to P(T_i | T_{i-1}). Similarly for trigrams. • More sophisticated: use the probability for the next shorter n-gram (even the 1-gram, i.e. the single tag). E.g., • if T_i, T_{i-1} is missing, use P(T_i) instead of P(T_i | T_{i-1}). • if T_i, T_{i-1}, T_{i-2} is missing, use P(T_i | T_{i-1}) instead of P(T_i | T_{i-1}, T_{i-2}). • Yet more sophisticated: combine such estimators in a weighted formula, e.g. for trigrams use • λ P(T_i | T_{i-1}, T_{i-2}) + μ P(T_i | T_{i-1}) + ν P(T_i) • See J&M on how to choose the coefficients λ, μ, and ν.
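
A minimal sketch of the weighted (interpolated) estimator at the bottom of this slide; the component estimates and the coefficient values are invented, and J&M describe how the coefficients would really be chosen:

```python
# Minimal sketch: interpolated trigram estimate
#   lambda*P(T_i | T_{i-1}, T_{i-2}) + mu*P(T_i | T_{i-1}) + nu*P(T_i).
def interpolated(p_trigram, p_bigram, p_unigram, lam=0.6, mu=0.3, nu=0.1):
    assert abs(lam + mu + nu - 1.0) < 1e-9    # weights chosen to sum to 1
    return lam * p_trigram + mu * p_bigram + nu * p_unigram

# Even if the trigram was never seen in the corpus (estimate 0.0),
# the bigram and unigram components keep the estimate non-zero:
print(interpolated(0.0, 0.15, 0.05))   # ~0.05
```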
