
Word classes and the distribution of words, and Part of Speech tagging




  1. Word classes and the distribution of words, and Part of Speech tagging. Computational linguistics.

  2. Eats shoots and leaves • We move on to finding higher level structure of natural language. Most of this is expressed in terms of categories – categories of words, and categories of sequences of words called phrases.

  3. Classes or categories of words • Roughly: words whose distributions are very similar. Two words are in the same category iff we can substitute one for the other in a sentence and preserve grammaticality. • We will return to this question.

  4. Open and closed classes • Open: nouns, verbs, adjectives. • Closed: prepositions, adverbs, conjunctions. • Open: large classes, and more words can be added to them. • Closed: small classes, and they are resistant to adding new members. A new preposition?

  5. Two points of view What’s real and central in grammar are notions like Noun and Verb (and Noun Phrase and Verb Phrase). Then we find real nouns, like dog and John and Monday. Many of them are good nouns, but some of them are defective; they don’t “do” all the things that they “should do”.

  6. 2nd point of view What’s real are sentences (or corpora): John is leaving Wednesday with his dog. When we look at a language, we find an enormous range of “places” where a given word can appear. (“Places” meaning environments, perhaps meanings). No two words are quite alike, but words do form clusters with regard to their grammatical behavior. For example, ...

  7. The days of the week (Monday…Sunday) share a lot in common. We can simplify our description by generalizing over that set of words. John left __. John left last __. John leaves next __. He leaves on __. You must do it before __. Do it by __. Your horoscope for __. __'s weather forecast. The __ after Christmas. *at __. *to __. *saw __. *We __. *I __.
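
The same idea can be made concrete: collect the contexts ("frames") each word occurs in and compare words by the frames they share. A minimal sketch in Python; the collect_frames function, the one-word window, and the toy corpus are illustrative assumptions, not part of the slides:

```python
from collections import defaultdict

def collect_frames(sentences, window=1):
    """Map each word to the set of (left-context, right-context) frames it occurs in."""
    frames = defaultdict(set)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            left = tuple(tokens[max(0, i - window):i])
            right = tuple(tokens[i + 1:i + 1 + window])
            frames[tokens[i]].add((left, right))
    return frames

# Toy corpus; real distributional work would use something far larger.
corpus = [
    "John left Monday",
    "John left Tuesday",
    "He leaves on Monday",
    "He leaves on Friday",
]
frames = collect_frames(corpus)
# Days of the week end up sharing frames, which is what licenses
# treating them as a single class.
print(frames["monday"] & frames["friday"])
```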

  8. Proper given names Likewise, proper given names (John, Jerry, …). As we form larger and larger classes, there are fewer things that they have in common. How do these J-words (!) differ from other "nouns"? They rarely take articles (the Jim) or relative clauses or adjectives (Mary who bought a book), but they certainly can: the Jim I went to elementary school with, the Bush who made those campaign promises, a fresh and smiling Ralph Nader.

  9. Back to first view • Grammar consists of a set of non-terminal nodes, terminal nodes, a set of context-free expansion rules, and a lexicon, at the least. • Depending on your analysis, also a set of transformations. • Syntax is responsible for the generation of phrase-structures, whose terminal nodes are lexical categories. • Lexical categories are expanded to words of the appropriate category.

  10. Syntax • Non-terminal categories: two correspond to semantic primitives (proposition and term); these are Sentence (S) and Noun Phrase (NP). • Terminals: the categories into which words are put. Perhaps these are universal, perhaps they aren’t. (Some) Linguists tend to think they are; computational linguists tend to think they aren’t. • Non-terminals based on terminal categories. Noun begets Noun Phrase, Adjective begets Adjective Phrase, etc. • Context-free phrase structure rules: Non-terminal node expands to both non-terminals and terminal nodes. • Terminals are expanded to words (“lexical elements”, in the parlance).

  11. [Tree diagram: S → NP INFL VP for the sentence "John might be sleeping", with NP = John (N), INFL = might, and VP = be sleeping.]

  12. Syntactic rules • S → NP + INFL + VP • INFL → { can, could, may, might, will, should, do } • VP → (Adv[not]) VP • VP → V NP NP PP* • VP → VP AdvP[hrase] • VP → V (NP) S: allows for recursive structure: sentences within sentences, of unbounded length.
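
A few of these rules can be written down directly as a context-free grammar. Below is a minimal sketch using NLTK's CFG and chart parser; the simplified rule subset and the toy lexicon (John, might, sleep, ...) are assumptions for illustration, not the full grammar sketched above:

```python
import nltk

# Simplified subset of the rules above, plus a toy lexicon for illustration.
grammar = nltk.CFG.fromstring("""
    S    -> NP INFL VP
    NP   -> N
    VP   -> V | V NP
    INFL -> 'might' | 'will' | 'can'
    N    -> 'John' | 'Mary'
    V    -> 'sleep' | 'see'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John might sleep".split()):
    tree.pretty_print()
```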

  13. S → NP + INFL + VP. S has other expansions in English, such as in infinitives; there, an INFL with to is found, but no tense, no auxiliary verbs, and no dummy do: S → NP + [INFL to] + VP. It is important for John to leave, but not *for John to should leave, *for John should to leave, etc.

  14. NP → Det AdjP N PrepP [Tree diagram: the NP "the former king of England", with Det = the, AP = former (A), N = king, and PP = of England; king (N) is the head of the NP.]

  15. N is head of NP • The semantically central word: A big book is a book. And the one whose form is determined by the governing verb in a case-marking language, and the one that determines the number and gender of any words that agree with the NP.

  16. Categories We have 4 things in mind when we make them: 1. (Lexical categories): Morphological structure 2. Meaning (semantics) 3. External distribution 4. (Phrasal categories): internal distribution ...

  17. Morphology • What suffixes may appear with a given stem: • noun stems: 's, NULL, s • verb stems: ed, s, ing, ed (past tense and past participle) • adjective stems: er, est, ness
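
This morphological criterion can also be turned into a rough category guesser for unseen words. The sketch below follows the suffix groups listed on the slide, but the SUFFIX_HINTS table and guess_by_suffix function are simplified assumptions (real morphological analysis handles spelling changes and returns sets of candidate categories):

```python
SUFFIX_HINTS = [
    # (suffixes, category of the stem) -- following the groups listed above
    (("'s",), "Noun"),
    (("ed", "ing"), "Verb"),
    (("er", "est", "ness"), "Adjective"),
    (("s",), "Noun or Verb"),   # plural noun or 3rd-singular verb: ambiguous
]

def guess_by_suffix(word):
    """Guess a lexical category for an unknown word from its ending alone."""
    for suffixes, category in SUFFIX_HINTS:
        if word.lower().endswith(suffixes):
            return category
    return "Unknown"

print(guess_by_suffix("yawning"))   # Verb
print(guess_by_suffix("smallest"))  # Adjective
print(guess_by_suffix("taggers"))   # Noun or Verb
```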

  18. Meaning • Reference to objects in the world • Reference to n-ary predicates: • unary: tall, sleep • binary: eat (human, food), saw (human, object) • ternary: give (human, human, object)

  19. External distribution Roughly speaking, this means what this word (or phrase) can appear next to (before, after). Nouns appear after articles (= noun determiners, nominal determiners) and after adjectives, and before Prepositional Phrase complements: the dog, my dog, the taste of champagne, the war of the worlds

  20. Internal distribution (phrases) • A "noun phrase" has three parts: a determiner, followed by an adjective, followed by a noun. • Some of these are "optional": that is, we may still call something a noun phrase even if not all 3 are present.

  21. Back to categories for words Noun properties (?English): • Takes articles • Takes preceding adjectives • May appear as subject of a sentence • May appear as object of a preposition • Has singular and plural form; plural is realized as /s/ • Refers to an object or set of objects • May take possessive ‘s • May serve as antecedent to a pronoun

  22. Verb • Has present-tense form (-s in 3rd singular) • Has past-tense form (-ed) • Agrees with its subject noun phrase • Refers to a predicate (1 or more arguments) • Follows the subject immediately • Appears at the beginning of a verb-phrase

  23. Lexical categories in language • One view is that there is a small number of categories, and they can be identified across languages. (I think most people believe that.) • The core criterion for membership is semantic, and the only effective way of identifying across languages is semantic. • All languages have a category of phrases that refer to things (NP), and one that expresses propositions (S).

  24. Nouns and pronouns • Nouns in many languages are inflected for number and case. • Case: Nominative, accusative, genitive, dative, and often others. • Pronouns, but not nouns, in English are inflected for case: nominative, genitive, and accusative (or other).

  25. Pronouns

  26. Penn Treebank noun categories • NN (noun, common, singular or mass): common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... • NNP (noun, proper, singular): Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ... • NNPS (noun, proper, plural): Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ... • NNS (noun, common, plural): undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ...
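
These are the tags that taggers trained on the Penn Treebank emit. As a quick illustration, NLTK's default tagger returns tags from this set (assuming the punkt and averaged_perceptron_tagger resources have been downloaded; the example sentence is made up and the exact tags may vary):

```python
import nltk
# One-time setup (assumes the resources are available):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The Americans gave Yvette two books on Monday.")
print(nltk.pos_tag(tokens))
# Expect something like:
# [('The', 'DT'), ('Americans', 'NNPS'), ('gave', 'VBD'), ('Yvette', 'NNP'),
#  ('two', 'CD'), ('books', 'NNS'), ('on', 'IN'), ('Monday', 'NNP'), ('.', '.')]
```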

  27. Along with nouns… • Determiners: • Articles (a, an, the): definite, indefinite • Possessive pronouns (my, your, his…) • Demonstrative determiners: this, that… • Adjectives • In many languages, agree with the noun that they modify for case and number (but not in English). Spanish: l-a-s mes-a-s pequeñ-a-s 'the tables small-fem-plural' (= 'the small tables')

  28. Adjectives • Absolute (or positive) form: big • Comparative: bigger. Your car is bigger than theirs. • Superlative: biggest. Of these cars, John's car is the biggest.

  29. Quantifiers • Often appear in pre-noun positions, inside the Noun Phrase • Express notions of “some, all, none” • May be pre-noun modifiers, or a full NP (like pronouns): something, anyone, etc. (Are these really two words stuck together?) • Question and relative clause words: who, what, where, when, why, whose, which.

  30. Relative clauses in English: that-Comp, gap in clause [Tree diagram: NP → NP S'; "The thing (that) I saw [e]", with (that) in Comp and the gap [e] inside the embedded S.] The that is optional if the gap is not in subject position. [e] marks the "gap".

  31. Relative clauses in English: wh-phrase [Tree diagram: NP → NP S'; "The ideas which I disagree with [e]", with the wh-phrase which in Comp.]

  32. Relative clauses in English: wh-phrase with pied-piping of P [Tree diagram: NP → NP S'; "The ideas with which I disagree [e]", with the pied-piped PP with which in Comp.]

  33. Relative clause formation can rip out of embedded clauses [Tree diagram: NP → NP S'; "The ideas with which your manager said you disagree [e]", where the gap [e] sits inside the embedded S.]

  34. Verbs • Verbs are words that refer to actions, and which are the essential component of most sentences. • There are non-verbal sentences, but they are relatively infrequent. Most frequent of these: Linking a noun (NP) with an adjective or a location. English uses the copula (to be) for this function.

  35. Verbs • Have an argument structure: typically 1, 2, or 3 nominal arguments. • 1 argument: typically the subject NP. Intransitive verb: John slept/arrived/left/yawned. The door opened. The phone rang. • 0 arguments?

  36. Verb arguments • 2 arguments (transitive): Subject and direct object, usually: Kim shut the door, helped the students, wrote a book. • 3 arguments (ditransitive): Subject, indirect object, direct object: Kim gave Terry a book/a hand/a hard time.

  37. Syntactic/semantic ambiguities • I saw the man with the telescope. • Time flies like an arrow.

  38. [Parse tree: "I saw the man with the telescope", with the PP with the telescope attached inside the object NP (the man has the telescope).]

  39. [Parse tree: "I saw the man with the telescope", with the PP with the telescope attached to the VP (the telescope is the instrument of seeing).]
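
Both attachments can be produced mechanically from a small grammar in which the PP may attach either inside the object NP or to the VP. The grammar and lexicon below are illustrative assumptions, not a grammar from the slides:

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP | VP PP
    NP -> Det N | NP PP | 'I'
    PP -> P NP
    Det -> 'the'
    N  -> 'man' | 'telescope'
    V  -> 'saw'
    P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)   # one parse attaches the PP to the NP, the other to the VP
```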

  40. Part of Speech tagging An attempt to assign categories to words without doing a whole syntactic parse. Getting a whole parse is extremely difficult, and much of the difficulty lies in the constituency, not in the part-of-speech tagging.

  41. High frequency words are the most ambiguous regarding PoS • table • like • I like ice cream • I like things like ice cream • I’ve been there like 100 times. • People like him. • People like him are obnoxious.

  42. Taggers • Start with a lexicon with ranges of PoSs • each word is marked with its range of permitted PoS tags • an OOV (out-of-vocabulary) word is given a PoS based on its morphology, if we're lucky • A mechanism finds the best combination of PoS tags, given the order of the words.
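
A minimal sketch of that architecture: the toy LEXICON, the suffix fallback, and the brute-force enumeration are illustrative assumptions; a real tagger would score the combinations with a statistical model, as in the Markov model on the following slides:

```python
from itertools import product

# Toy lexicon: each word is marked with its range of permitted PoS tags.
LEXICON = {
    "the": {"Det"},
    "design": {"N", "Vpres"},
    "of": {"Prep"},
    "taggers": {"Nplural"},
}

def candidate_tags(word):
    """Look the word up; fall back to a crude morphology-based guess for OOV words."""
    if word in LEXICON:
        return LEXICON[word]
    if word.endswith("s"):
        return {"Nplural", "Vpres"}
    return {"N", "V"}

def all_taggings(words):
    """Enumerate every combination of permitted tags; the tagger's real job
    is to pick the best one, given the order of the words."""
    return list(product(*(candidate_tags(w) for w in words)))

print(all_taggings(["the", "design", "of", "taggers"]))
# Two combinations: ('Det', 'N', 'Prep', 'Nplural') and
# ('Det', 'Vpres', 'Prep', 'Nplural') (order may vary).
```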

  43. The design of taggers is often based on what is known about the lexicon. (The same sentence, with each word's range of possible tags:) • The: Det • design: N, Vpres, Vinf, Vimperative • of: Prep • taggers: Nplural • is: Vpres • often: Adv • based: VpastParticiple, VpastTense • on: Adverb, Preposition • what: WhPronoun, WhDeterminer • is: Vpresent • known: Vpast tense • about: Adverb, Prep • the: Det • lexicon: Noun • .: punctuation

  44. History of PoS tagging • First large-scale system in 1971: TAGGIT (Greene and Rubin): 71 items in the tag set, based on 3,300 hand-written rules, using a window of up to 5 words around the word being disambiguated. But almost all of the rules looked at immediate neighbors.

  45. CLAWS1 • Part of the annotation of the Lancaster-Oslo/Bergen corpus; produced at the University of Lancaster. • Used largely statistical techniques rather than hand-crafted rules, trained on a tagged 200K-word portion of the Brown corpus. • 96-97% accuracy for the top PoS guess. • Used an open (not hidden) Markov model.

  46. Markov model I’m not sure that this is exactly the model that CLAWS used, but it’s in the spirit: p(W[i..n] & PoS[i..n]) =
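
The slide cuts off at the equals sign. A standard way to complete a first-order (visible) Markov model of this kind, which is presumably the spirit intended here, is to factor the joint probability into tag-transition and word-emission terms (with PoS_0 a dummy start tag):

p(W_{1..n}, \mathrm{PoS}_{1..n}) = \prod_{i=1}^{n} p(\mathrm{PoS}_i \mid \mathrm{PoS}_{i-1}) \; p(W_i \mid \mathrm{PoS}_i)

The tag sequence that maximizes this product (found efficiently with dynamic programming, e.g. the Viterbi algorithm) is then the tagger's output.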
