
Part of Speech Tagging




Presentation Transcript


  1. Part of Speech Tagging September 9, 2005

  2. Part of Speech • Each word belongs to a word class; the word class of a word is known as the part of speech (POS) of that word. • Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech: • noun, verb, pronoun, preposition, adjective, adverb, conjunction, article • These categories are based on morphological and distributional similarities (not semantic similarities). • Part of speech is also known as: • word class • morphological class • lexical tag CAP6640 Natural Language Systems

  3. Part of Speech (cont.) • A POS tag of a word describes the major and minor word classes of that word. • A POS tag of a word gives a significant amount of information about that word and its neighbours. For example, a possessive pronoun (my, your, her, its) will most likely be followed by a noun, and a personal pronoun (I, you, he, she) will most likely be followed by a verb. • Most words have a single POS tag, but some have more than one (2, 3, 4, …) • For example, book/noun or book/verb • I bought a book. • Please book that flight.

  4. Tag Sets • There are various tag sets to choose from. • The choice of tag set depends on the nature of the application. • We may use a small tag set (more general tags) or • a large tag set (finer tags). • Some widely used part-of-speech tag sets: • Penn Treebank has 45 tags • Brown Corpus has 87 tags • C7 tag set has 146 tags • In a tagged corpus, each word is associated with a tag from the chosen tag set.

  5. English Word Classes • Parts of speech can be divided into two broad categories: • closed class types -- such as prepositions • open class types -- such as nouns, verbs • Closed class words are generally also function words. • Function words play an important role in grammar. • Some function words are: of, it, and, you • Function words are usually very short and occur frequently. • There are four major open classes: • noun, verb, adjective, adverb • a new word may easily enter an open class. • Word classes may vary across natural languages, but all natural languages have at least two word classes: noun and verb.

  6. Nouns • Nouns can be divided into: • proper nouns -- names for specific entities such as India, John, Ali • common nouns • Proper nouns do not take an article, but common nouns may. • Common nouns can be divided into: • count nouns -- they can be singular or plural -- chair/chairs • mass nouns -- they are used when something is conceptualized as a homogeneous group -- snow, salt • Mass nouns cannot take the articles a and an, and they cannot be plural.

  7. Verbs • The verb class includes words referring to actions and processes. • Verbs can be divided into: • main verbs -- open class -- draw, bake • auxiliary verbs -- closed class -- can, should • Auxiliary verbs can be divided into: • copula -- be, have • modal verbs -- may, can, must, should • Verbs have different morphological forms: • non-3rd-person-sg -- eat • 3rd-person-sg -- eats • progressive -- eating • past -- ate • past participle -- eaten

  8. Adjectives • Adjectives describe properties or qualities • for color -- black, white • for age -- young, old • In Turkish, all adjectives can also be used as nouns. • kırmızı kitap -- red book • kırmızıyı -- the red one (ACC)

  9. Adverbs • Adverbs normally modify verbs. • Adverb categories: • locative adverbs -- home, here, downhill • degree adverbs -- very, extremely • manner adverbs -- slowly, delicately • temporal adverbs -- yesterday, Friday • Because of the heterogeneous nature of adverbs, some adverbs such as Friday may be tagged as nouns.

  10. Major Closed Classes • Prepositions -- on, under, over, near, at, from, to, with • Determiners -- a, an, the • Pronouns -- I, you, he, she, who, others • Conjunctions -- and, but, if, when • Particles -- up, down, on, off, in, out • Numerals -- one, two, first, second

  11. Prepositions • Occur before noun phrases • indicate spatial or temporal relations • Examples: • on the table • under the chair • Prepositions are frequent. For example, some frequency counts from a 16-million-word corpus (COBUILD): • of 540,085 • in 331,235 • for 142,421 • to 125,691 • with 124,965 • on 109,129 • at 100,169

  12. Particles • A particle combines with a verb to form a larger unit called a phrasal verb. • go on • turn on • turn off • shut down

  13. Articles • A small closed class • Only three words in the class: a, an, the • Articles mark definiteness (the) or indefiniteness (a, an) • They occur very often. For example, some frequency counts from the same 16-million-word corpus (COBUILD): • the 1,071,676 • a 413,887 • an 59,359 • Almost 10% of the words in this corpus are articles.

  14. Conjunctions • Conjunctions are used to combine or join two phrases, clauses, or sentences. • Coordinating conjunctions -- and, or, but • join two elements of equal status • Example: you and me • Subordinating conjunctions -- that, who • combine a main clause with a subordinate clause • Example: • I thought that you might like milk

  15. Pronouns • Shorthand for referring to some entity or event. • Pronouns can be divided into: • personal -- you, she, I • possessive -- my, your, his • wh-pronouns -- who, what

  16. Tag Sets for English • There are several widely used tag sets for part-of-speech tagging. • The Penn Treebank tag set has 45 tags, e.g.: • IN preposition/subordinating conj. • DT determiner • JJ adjective • NN noun, singular or mass • NNS noun, plural • VB verb, base form • VBD verb, past tense • A sentence from the Brown corpus, tagged with the Penn Treebank tag set: • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  17. Part of Speech Tagging • Part of speech tagging is simply assigning the correct part of speech to each word in an input sentence. • We assume that we have the following: • A set of tags (our tag set) • A dictionary that tells us the possible tags for each word (including all morphological variants) • A text to be tagged • There are different algorithms for tagging: • Rule-Based Tagging – uses hand-written rules • Stochastic Tagging – uses probabilities computed from a training corpus • Transformation-Based Tagging – uses rules learned automatically

  18. How hard is tagging? • Most words in English are unambiguous: they have only a single tag. • But many of the most common words are ambiguous: • can/verb can/auxiliary can/noun • The number of word types in the Brown Corpus: • unambiguous (one tag) 35,340 • ambiguous (2-7 tags) 4,100 • 2 tags 3,760 • 3 tags 264 • 4 tags 61 • 5 tags 12 • 6 tags 2 • 7 tags 1 • While only 11.5% of word types are ambiguous, over 40% of Brown corpus tokens are ambiguous.

  19. Problem Setup • There are M types of POS tags • Tag set: {t1,…,tM} • The vocabulary size is V • Vocabulary set: {w1,…,wV} • We have a word sequence of length n: W = w1,w2,…,wn • We want to find the best sequence of POS tags: T = t1,t2,…,tn

  20. Information sources for tagging All techniques are based on the same observations… • some tag sequences are more probable than others • ART+ADJ+N is more probable than ART+ADJ+VB • Lexical information: knowing the word to be tagged gives a lot of information about the correct tag • “table”: {noun, verb} but not {adj, prep, …} • “rose”: {noun, adj, verb} but not {prep, …}

  21. Rule-Based Part-of-Speech Tagging • First Stage: uses a dictionary to assign each word a list of potential parts of speech. • Second Stage: uses a large list of hand-crafted rules to winnow this list down to a single part of speech for each word. • ENGTWOL is a rule-based tagger: • in the first stage, it uses a two-level lexicon transducer • in the second stage, it uses hand-crafted rules (about 1,100 rules)

  22. Sample rules N-IP rule: A tag N (noun) cannot be followed by a tag IP (interrogative pronoun) ... man who … • man: {N} • who: {RP, IP} --> {RP} relative pronoun ART-V rule: A tag ART (article) cannot be followed by a tag V (verb) ...the book… • the: {ART} • book: {N, V} --> {N}
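As an illustration only (not the actual ENGTWOL implementation), the two-stage scheme with the N-IP and ART-V rules above can be sketched in Python; the toy lexicon and tag names are assumptions made for this example:

```python
# Toy lexicon: each word maps to its set of potential tags (stage one).
LEXICON = {
    "the": {"ART"},
    "man": {"N"},
    "who": {"RP", "IP"},   # relative or interrogative pronoun
    "book": {"N", "V"},
}

def first_stage(words):
    """Assign each word the full set of potential tags from the lexicon."""
    return [set(LEXICON[w]) for w in words]

def apply_constraints(tagsets):
    """Stage two: eliminate tags that violate the hand-written rules."""
    for i in range(1, len(tagsets)):
        # N-IP rule: a noun cannot be followed by an interrogative pronoun.
        if tagsets[i - 1] == {"N"}:
            tagsets[i].discard("IP")
        # ART-V rule: an article cannot be followed by a verb.
        if tagsets[i - 1] == {"ART"}:
            tagsets[i].discard("V")
    return tagsets

tags = apply_constraints(first_stage(["the", "book"]))
```

After the rules fire, "book" keeps only its noun reading, mirroring the ...the book… case above.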

  23. After The First Stage • Example: He had a book. • After the first stage: • he he/pronoun • had have/verb,past have/auxiliary,past • a a/article • book book/noun book/verb

  24. Tagging Rules Rule-1: if (the previous tag is an article) then eliminate all verb tags Rule-2: if (the next tag is not a verb) then eliminate all auxiliary tags

  25. Transformation-based tagging • Due to Eric Brill (1995) • basic idea: • take a non-optimal sequence of tags and • improve it successively by applying a series of well-ordered re-write rules • rule-based • but, rules are learned automatically by training on a pre-tagged corpus

  26. An example 1. Assign to words their most likely tag • P(NN|race) = .98 • P(VB|race) = .02 2. Change some tags by applying transformation rules
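Step 1 (assigning each word its most likely tag) can be sketched as follows; the tiny hand-tagged corpus is invented so that "race" comes out as NN, in the spirit of P(NN|race) = .98 above:

```python
from collections import Counter, defaultdict

# Invented hand-tagged corpus; in practice this would be e.g. the Brown corpus.
tagged_corpus = [("race", "NN"), ("race", "NN"), ("race", "VB"),
                 ("to", "TO"), ("the", "DT")]

# Count how often each (word, tag) pair occurs.
counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def most_likely_tag(word):
    """Return the tag most frequently seen with this word (unigram model)."""
    return counts[word].most_common(1)[0][0]
```

Here `most_likely_tag("race")` returns "NN" (2 of its 3 occurrences); step 2 then repairs the cases where the unigram choice is wrong.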

  27. Types of context • lots of latitude… • can be: • tag-triggered transformation • The preceding/following word is tagged this way • The word two before/after is tagged this way • … • word-triggered transformation • The preceding/following word is this word • … • morphology-triggered transformation • The preceding/following word ends with an s • … • a combination of the above • The preceding word is tagged this way AND the following word is this word

  28. Learning the transformation rules • Input: A corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • Output: A bag of transformation rules • Algorithm: • Instantiates a small set of hand-written templates (generic rules) by comparing the reference corpus to C0 • Change tag a to tag b when… • The preceding/following word is tagged z • The word two before/after is tagged z • One of the 2 preceding/following words is tagged z • One of the 2 preceding words is z • …

  29. Learning the transformation rules (con't) • Run the initial tagger and compile the types of errors • <incorrect tag, desired tag, # of occurrences> • For each error type, instantiate all templates to generate candidate transformations • Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces • Save the transformation that yields the greatest improvement • Stop when no transformation reduces the error rate by more than a predetermined threshold

  30. Example • if the initial tagger mistags 159 words as verbs instead of nouns • create the error triple: <verb, noun, 159> • Suppose template #3 is instantiated as the rule: • Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner. • When this template is applied to the corpus: • it corrects 98 of the 159 errors • but it also creates 18 new errors • Error reduction is 98-18=80

  31. Learning the best transformations • input: • a corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • a bag of unordered transformation rules • output: • an ordering of the best transformation rules

  32. Learning the best transformations (con’t) let:
• E(Ck) = number of words incorrectly tagged in the corpus at iteration k
• v(C) = the corpus obtained after applying rule v to the corpus C
• ε = minimum desired error reduction
for k := 0 step 1 do
  bt := argmin_t E(t(Ck)) // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε then goto finished // bt does not improve the tagging significantly
  Ck+1 := bt(Ck) // apply rule bt to the current corpus
  Tk+1 := bt // bt is kept as the current transformation rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules
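The greedy loop above can be rendered directly in Python as a sketch; a rule is modeled as a function from a tag sequence to a new tag sequence, and the candidate rule set is assumed to be given:

```python
def errors(tags, reference):
    """E(Ck): number of incorrectly tagged words versus the reference."""
    return sum(1 for t, r in zip(tags, reference) if t != r)

def learn_rule_sequence(tags, reference, candidate_rules, epsilon=1):
    """Greedily order candidate rules by how much each reduces the error."""
    ordered = []  # T1, T2, ... in the order they are learned
    while True:
        # bt := the transformation that minimizes the error rate
        best = min(candidate_rules, key=lambda v: errors(v(tags), reference))
        # stop if bt does not improve the tagging significantly
        if errors(tags, reference) - errors(best(tags), reference) < epsilon:
            break
        tags = best(tags)      # Ck+1 := bt(Ck)
        ordered.append(best)   # keep bt as the next transformation rule
    return ordered, tags
```

With a single corrective rule such as "change VB to NN after DT" in the candidate set, the loop selects it once and then terminates, since no further rule reduces the error.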

  33. Strengths of transformation-based tagging • exploits a wider range of lexical and syntactic regularities • can look at a wider context • conditions the tags on preceding/following words, not just preceding tags • can use more context than a bigram or trigram model • transformation rules are easier to understand than matrices of probabilities

  34. How TBL Rules are Applied • Before the rules are applied, the tagger labels every word with its most likely tag. • We get these most likely tags from a tagged corpus. • Example: • He is expected to race tomorrow • he/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN • After selecting the most likely tags, we apply the transformation rules: • Change NN to VB when the previous tag is TO • This rule converts race/NN into race/VB • This may not work for every case: • … according to race (here race should remain NN)

  35. How TBL Rules are Learned • We will assume that we have a tagged corpus. • Brill’s TBL algorithm has three major steps: • Tag the corpus with the most likely tag for each word (unigram model) • Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations • Apply the transformation to the training corpus • These steps are repeated until a stopping criterion is reached. • The result (which will be our tagger) will be: • first tag using the most likely tags • then apply the learned transformations

  36. Transformations • A transformation is selected from a small set of templates. Change tag a to tag b when - The preceding (following) word is tagged z. - The word two before (after) is tagged z. - One of two preceding (following) words is tagged z. - One of three preceding (following) words is tagged z. - The preceding word is tagged z and the following word is tagged w. - The preceding (following) word is tagged z and the word two before (after) is tagged w.
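These templates can be viewed as parameterized rule generators. A hypothetical encoding in Python, where each template, given its trigger tag z, yields a predicate over a position in the tag sequence:

```python
# Template: "the preceding word is tagged z"
def prev_tag_is(z):
    return lambda tags, i: i > 0 and tags[i - 1] == z

# Template: "the following word is tagged z"
def next_tag_is(z):
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == z

def make_rule(a, b, trigger):
    """Instantiate 'change tag a to tag b when the trigger fires'."""
    def rule(tags):
        return [b if t == a and trigger(tags, i) else t
                for i, t in enumerate(tags)]
    return rule

# The rule from slide 34: change NN to VB when the previous tag is TO.
nn_to_vb_after_to = make_rule("NN", "VB", prev_tag_is("TO"))
```

Applying `nn_to_vb_after_to` to ["TO", "NN"] yields ["TO", "VB"], while a sequence without a preceding TO is left unchanged.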

  37. Stochastic POS tagging • Assume that a word’s tag only depends on the previous tags (not following ones) • Use a training set (manually tagged corpus) to: • learn the regularities of tag sequences • learn the possible tags for a word • model this info through a language model (n-gram)

  38. Hidden Markov Model (HMM) Taggers • Goal: maximize P(word|tag) x P(tag|previous n tags) • P(word|tag) -- lexical information • word/lexical likelihood • probability that, given this tag, we have this word • NOT the probability that this word has this tag • modeled through a language model (word-tag matrix) • P(tag|previous n tags) -- syntagmatic information • tag sequence likelihood • probability that this tag follows these previous tags • modeled through a language model (tag-tag matrix)

  39. Tag sequence probability • P(tag|previous n tags) • if we look at the previous (n-1) tags to find the current tag --> n-gram model • trigram model • chooses the most probable tag ti for word wi given: • the previous 2 tags ti-2 & ti-1 and • the current word wi • bigram model • chooses the most probable tag ti for word wi given: • the previous tag ti-1 and • the current word wi • unigram model (just the most likely tag) • chooses the most probable tag ti for word wi given: • the current word wi

  40. Example • “race” can be VB or NN • “Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV” • “People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN” • let’s tag the word “race” in the 1st sentence with a bigram model.

  41. Example (con’t) • assuming previous words have been tagged, we have: “Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow” • P(race|VB) x P(VB|TO) ? • given that we have a VB, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be VB • P(race|NN) x P(NN|TO) ? • given that we have a NN, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be NN

  42. Example (con’t) • From the training corpus, we found that: • P(NN|TO) = .021 // given that the previous tag is TO, 2.1% chance that the current tag is NN • P(VB|TO) = .34 // given that the previous tag is TO, 34% chance that the current tag is VB • P(race|NN) = .00041 // given that we have an NN, 0.041% chance that this word is "race" • P(race|VB) = .00003 // given that we have a VB, 0.003% chance that this word is "race" so: P(VB|TO) x P(race|VB) = .34 x .00003 = .0000102 P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086 so: VB is more probable!
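The arithmetic above can be checked with a few lines of Python, using the probabilities quoted from the training corpus:

```python
# P(tag | previous tag) and P(word | tag), as estimated from the corpus above.
p_tag = {("VB", "TO"): 0.34, ("NN", "TO"): 0.021}
p_word = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

# Bigram score for each candidate tag of "race" after to/TO.
score_vb = p_tag[("VB", "TO")] * p_word[("race", "VB")]   # ≈ 0.0000102
score_nn = p_tag[("NN", "TO")] * p_word[("race", "NN")]   # ≈ 0.0000086

best = "VB" if score_vb > score_nn else "NN"
```

The tag-sequence evidence P(VB|TO) outweighs the lexical preference of "race" for NN, so the bigram model chooses VB here.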

  43. Example (con’t) • and by the way: race is 98% of the time an NN !!! P(VB|race) = 0.02 P(NN|race) = 0.98 !!! • How are the probabilities found? • using a training corpus of hand-tagged text • long & meticulous work done by linguists

  44. HMM Tagging • But HMM tagging tries to find: • the best sequence of tags for a sentence • not just best tag for a single word • goal: maximize the probability of a tag sequence, given a word sequence • i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)

  45. HMM Tagging (con’t) • By Bayes’ law: P(tagSeq|wordSeq) = P(tagSeq) x P(wordSeq|tagSeq) / P(wordSeq) • wordSeq is given… • so P(wordSeq) will be the same for all tagSeq • so we can drop it from the equation

  46. Assumptions in HMM Tagging • words are independent • Markov assumption (approximation using a short history) • ex. with a bigram approximation: • the probability of a word depends only on its tag (emission probability) • the probability of a tag depends only on the previous tag (state transition probability)

  47. The derivation
bestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq)
(t1…tn)* = argmax P(t1,…,tn) x P(w1,…,wn | t1,…,tn)
Assumption 1: independence assumption + chain rule
P(t1,…,tn) x P(w1,…,wn | t1,…,tn) = P(tn|t1,…,tn-1) x P(tn-1|t1,…,tn-2) x P(tn-2|t1,…,tn-3) x … x P(t1) x P(w1|t1,…,tn) x P(w2|t1,…,tn) x P(w3|t1,…,tn) x … x P(wn|t1,…,tn)
Assumption 2: Markov assumption -- only look at a short history (ex. bigram)
= P(tn|tn-1) x P(tn-1|tn-2) x P(tn-2|tn-3) x … x P(t1) x P(w1|t1,…,tn) x P(w2|t1,…,tn) x P(w3|t1,…,tn) x … x P(wn|t1,…,tn)
Assumption 3: a word’s identity depends only on its tag
= P(tn|tn-1) x P(tn-1|tn-2) x P(tn-2|tn-3) x … x P(t1) x P(w1|t1) x P(w2|t2) x P(w3|t3) x … x P(wn|tn)

  48. Emission & Transition probabilities • let • N: number of possible tags (size of the tag set) • V: number of word types (vocabulary) • from a tagged training corpus, we compute the frequency of: • Emission probabilities P(wi|ti) • stored in an N x V matrix • emission[i,j] = probability that tag i is the correct tag for word j • Transition probabilities P(ti|ti-1) • stored in an N x N matrix • transition[i,j] = probability that tag i follows tag j • In practice, these matrices are very sparse • So these models are smoothed to avoid zero probabilities
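Given these two tables, the best tag sequence under the bigram and word-independence assumptions is usually found with the standard Viterbi algorithm (not covered in detail on these slides). A compact sketch, with the matrices represented as nested dictionaries; the probabilities used in the usage note below are invented toy values:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Find the tag sequence maximizing P(tagSeq) x P(wordSeq|tagSeq)
    under the bigram transition and emission assumptions above."""
    # best[t] = (probability, path) of the best tag sequence ending in tag t
    best = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * trans_p[prev][t] * emit_p[t].get(w, 0.0), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda pair: pair[0],
            )
            for t in tags
        }
    # Return the path of the highest-probability final state.
    return max(best.values(), key=lambda pair: pair[0])[1]
```

For example, with a two-tag set {DT, NN} and toy start, transition, and emission probabilities favoring DT-then-NN, `viterbi(["the", "dog"], …)` recovers the sequence ["DT", "NN"].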

  49. Emission probabilities P(wi| ti) • stored in an N x V matrix • emission[i,j] = probability/frequency that tag i is the correct tag for word j

  50. Transition probabilities P(ti|ti-1) • stored in an N x N matrix • transition[i,j] = probability/frequency that tag i follows tag j
