
Natural Language Processing


Presentation Transcript


  1. Natural Language Processing Spring 2007 V. “Juggy” Jagannathan

  2. Course Book Foundations of Statistical Natural Language Processing By Christopher Manning & Hinrich Schütze

  3. Chapter 10 Part-of-Speech Tagging March 12, 2007

  4. Tagging used in this chapter Example sentence: The-AT representative-NN put-VBD chairs-NNS on-IN the-AT table-NN. Tagging is simply a limited case of syntactic disambiguation; no attempt is made here to do a complete parse.

  5. Performance of POS tagging systems • Some of the more successful algorithms have achieved 96-97% accuracy! • POS tagging is useful for a variety of applications, such as information extraction, shallow parsing, and question answering, where there is no need to fully comprehend the text but tagging nonetheless supports useful processing.

  6. Information Sources in Tagging • Disambiguation in part-of-speech tagging: • The sequence AT JJ NN (article adjective noun) is more common than, say, AT JJ VBP (article adjective verb). • So, in “a new play,” we should assume play is a noun rather than a verb. • This kind of evidence is referred to as “syntagmatic” (structural) information.

  7. Lexical approach • English allows almost any noun to be used as a verb: • Next you flour the pan, … • I want you to web our annual report… • “flour” and “web” are typically nouns, but here they are used as verbs. • However, simply assigning each word its most frequent part of speech gives about 90% accuracy in tagging (see the sketch below)! • Taggers used in practice combine both sources of information.
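The roughly 90% figure is the accuracy of a baseline that assigns each word the tag it carries most often in training data. A minimal Python sketch of that baseline (function and variable names are illustrative, not from the book):

```python
from collections import Counter, defaultdict

def most_frequent_tag_lexicon(tagged_tokens):
    """Build a word -> most-frequent-tag lookup from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Usage sketch: fall back to NN (a common open-class tag) for unknown words.
lexicon = most_frequent_tag_lexicon(
    [("the", "AT"), ("play", "NN"), ("play", "NN"), ("play", "VB")])
tag = lexicon.get("play", "NN")  # -> "NN", since NN outnumbers VB for "play"
```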

  8. Markov Model Taggers • Limited horizon (t^j refers to the jth tag in the tag set; the assumptions are written out below) • Time invariant (stationary) • For example, if a finite verb has a probability of 0.2 of occurring after a sentence-initial pronoun, this probability does not change as the rest of the sentence (or text) is processed.
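The two assumptions appeared as equations on the original slide; the transcript omits them. In the book's notation they can be reconstructed as:

\[ P(t_{i+1} = t^j \mid t_1, \ldots, t_i) = P(t_{i+1} = t^j \mid t_i) \quad \text{(limited horizon)} \]
\[ P(t_{i+1} = t^j \mid t_i) = P(t_2 = t^j \mid t_1) \quad \text{(time invariance)} \]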

  9. Limited Horizon • Based on the above notation, the probability of a whole tag sequence factors into bigram transition probabilities (the slide's equation is reconstructed below).
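The slide's equation is also missing from the transcript; presumably it showed that, under the limited-horizon assumption, a tag sequence probability decomposes into bigram terms (with t_0 a designated start-of-sentence tag):

\[ P(t_1, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \]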

  10. Estimating probabilities • Use a training set of tagged data to compute the transition probabilities a_{kj} (the maximum likelihood estimator is written out below). • For instance, for “a new play” one would expect • P(NN|JJ) >> P(VBP|JJ) • and in the Brown corpus, P(NN|JJ) ≈ 0.45 and P(VBP|JJ) ≈ 0.0005.
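The estimator itself was an image on the slide; the standard maximum likelihood estimate counts tag bigrams in the training data and normalizes:

\[ a_{kj} = P(t^j \mid t^k) = \frac{C(t^k, t^j)}{C(t^k)} \]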

  11. Using the Hidden Markov Model • In the tagged (training) corpus, the hidden states are the tags associated with each word; as the model moves from one tag to another (NN to JJ to VB, etc.), the observed output is the sequence of words. So the formulation here is: given the observed sequence of words, what is the most probable tag sequence? This is exactly the formulation studied in chapter 9.

  12. Using maximum likelihood estimators • The emission probability is the probability that word w^l is emitted when a particular tag t^j occurs. • To determine the best tagging t_{1,n} for a sentence w_{1,n}, apply Bayes' rule. • Both formulas are reconstructed below.
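Reconstructed in the book's notation, the emission estimate and the Bayes decision rule are:

\[ P(w^l \mid t^j) = \frac{C(w^l : t^j)}{C(t^j)} \]
\[ \hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) \]

The second equality holds because the denominator P(w_{1,n}) is the same for every candidate tag sequence and can be dropped.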

  13. Algorithm used with training data
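The slide's algorithm itself is not in the transcript. A plausible count-and-normalize sketch in Python, matching the estimators above (names are made up for illustration; no smoothing is applied):

```python
from collections import defaultdict

def train_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates of transition and emission probabilities
    from sentences given as lists of (word, tag) pairs."""
    tag_count = defaultdict(int)
    trans_count = defaultdict(int)   # C(t^k, t^j)
    emit_count = defaultdict(int)    # C(w^l : t^j)
    for sentence in tagged_sentences:
        prev = start
        tag_count[prev] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    # Normalize counts into probabilities (unsmoothed relative frequencies).
    trans = {kj: c / tag_count[kj[0]] for kj, c in trans_count.items()}
    emit = {tw: c / tag_count[tw[0]] for tw, c in emit_count.items()}
    return trans, emit
```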

  14. Finding the best state sequence: the Viterbi algorithm (from Chapter 9) • Store the most probable path that leads to a given node. • The steps: initialization, induction (storing a backtrace pointer at each step), and a final backtrace to read off the best tag sequence.
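A compact Python sketch of those steps (initialization, induction with stored backpointers, final backtrace). It assumes trans and emit dictionaries as produced by the training sketch above; the 1e-12 floor is an ad hoc stand-in for real smoothing of unseen events:

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    floor = 1e-12  # ad hoc probability for unseen transitions/emissions
    # Initialization: best log probability of each tag for the first word.
    delta = {t: math.log(trans.get((start, t), floor))
                + math.log(emit.get((t, words[0]), floor)) for t in tags}
    backptrs = []
    # Induction: extend the best path ending in each tag, one word at a time.
    for word in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            prev = max(tags,
                       key=lambda p: delta[p] + math.log(trans.get((p, t), floor)))
            pointers[t] = prev
            new_delta[t] = (delta[prev]
                            + math.log(trans.get((prev, t), floor))
                            + math.log(emit.get((t, word), floor)))
        delta = new_delta
        backptrs.append(pointers)
    # Backtrace: read the best path off the stored pointers, right to left.
    tag = max(delta, key=delta.get)
    path = [tag]
    for pointers in reversed(backptrs):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))
```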

  15. Dealing with unknown words • What to do with words that do not appear in the training data? • Common remedies discussed in the book include using morphological cues (e.g., suffixes and capitalization) to estimate a tag distribution for unknown words.

  16. Trigram Taggers • The model considered so far is a bigram tagger. • A trigram tagger uses more context and can give better results (see the conditioning below). • However, more context is not necessarily a good thing: • the parts of speech occurring before and after a comma, for example, are usually unrelated.
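Concretely, a trigram tagger replaces the bigram conditioning with two tags of history:

\[ P(t_i \mid t_{1,i-1}) \approx P(t_i \mid t_{i-2}, t_{i-1}) \]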

  17. Applying HMMs to POS tagging • When no training data are available, using an HMM becomes tricky. • Dictionary information can be used to seed the probabilities. • If the dictionary does not list a particular tag for a word, e.g., it does not list JJ (adjective) as a part of speech for book, the emission probability P(book|JJ) is set to zero.

  18. Jelinek's method • b_{j,l} is the probability that word w^l (or word class l) is emitted by tag t^j. • The initialization equations are reconstructed below.
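The defining equations were images on the slide; reconstructed from Manning & Schütze, the dictionary-seeded initialization is

\[ b_{j,l} = \frac{b^{*}_{j,l}\, C(w^l)}{\sum_{w^m} b^{*}_{j,m}\, C(w^m)} \qquad b^{*}_{j,l} = \begin{cases} 0 & \text{if } t^j \text{ is not allowed for } w^l \\ 1/T(w^l) & \text{otherwise} \end{cases} \]

where T(w^l) is the number of tags the dictionary allows for w^l. The effect is to spread each word's probability mass uniformly over its dictionary tags, weighted by how often the word occurs.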

  19. Kupiec's method • Words are grouped into equivalence classes (“metawords”) according to the set of tags they allow, and emission parameters are estimated per class rather than per word; one way to write the grouping is shown below.
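In this notation, the equivalence class (metaword) u_L for a tag set L collects all words whose allowed tags are exactly L:

\[ u_L = \{ w^l \mid \text{the set of tags allowed for } w^l \text{ is } L \} \]

All words in u_L then share the same emission parameters, which greatly reduces the number of parameters to estimate.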

  20. Comments • Once the initial probabilities are estimated, training with the forward-backward algorithm can proceed. • If a large tagged training corpus is available, follow the maximum likelihood steps outlined before. • If the training and test data differ in character, the forward-backward algorithm can be used to adapt the model.

  21. Transformation-based learning of tags • As input, the method needs a tagged corpus and a dictionary. • A learning algorithm constructs a ranked list of transformation rules that take the dictionary-based tag assignment to the one shown in the tagged training data.

  22. Transformation-based learning of tags • Three types of rules • Triggered by specific tags • Triggered by specific words • Morphology-based (for unknown words)

  23. Learning Algorithm • Initially tag each word with its most frequent tag. • Iteratively refine the tags to reduce the overall error (defined as the number of tokens with incorrect tags). • Essentially, this process learns an ordered list of rewrite rules (a sketch of the greedy loop follows).
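A minimal sketch of that greedy loop in Python, assuming candidate rules are supplied as functions from a tag sequence to a transformed tag sequence (all names here are hypothetical):

```python
def learn_transformations(current_tags, gold_tags, candidate_rules, min_gain=1):
    """Greedy transformation-based learning: repeatedly adopt the rule that
    most reduces the number of incorrectly tagged tokens."""
    learned = []
    while True:
        errors = sum(c != g for c, g in zip(current_tags, gold_tags))
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            new_tags = rule(current_tags)
            gain = errors - sum(n != g for n, g in zip(new_tags, gold_tags))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        # Stop when no rule improves the tagging enough to be worth keeping.
        if best_rule is None or best_gain < min_gain:
            break
        current_tags = best_rule(current_tags)
        learned.append(best_rule)
    return learned
```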

  24. Finite state transducer • The rule set from the transformation-based tagger can be compiled into a finite state transducer that takes an input stream of words and outputs tags. • This makes tagging very fast.

  25. Applications of tagging • Partial parsing • Information Extraction • Question answering
