
CS60057 Speech & Natural Language Processing

CS60057 Speech & Natural Language Processing. Autumn 2005, Lecture 4, 28 July 2005. Topic: morphology, the study of the rules that govern the combination of morphemes. Inflection: same word, different syntactic information (run/runs/running, book/books). Derivation: new word, different meaning.



Presentation Transcript


1. CS60057 Speech & Natural Language Processing
Autumn 2005
Lecture 4, 28 July 2005

2. Morphology
• Study of the rules that govern the combination of morphemes.
• Inflection: same word, different syntactic information
  • run/runs/running, book/books
• Derivation: new word, different meaning
  • often a different part of speech, but not always
  • possible/possibly/impossible, happy/happiness
• Compounding: new word, each part is itself a word
  • blackboard, bookshelf
  • lAlakamala, banabAsa

3. Morphology Level: The Mapping
• Formally: A+ → 2^(L, C1, C2, ..., Cn)
• A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
• L is the set of possible lemmas, uniquely identified
• Ci are morphological categories, such as:
  • grammatical number, gender, case
  • person, tense, negation, degree of comparison, voice, aspect, ...
  • tone, politeness, ...
  • part of speech (not quite a morphological category, but ...)
• A, L, and the Ci are obviously language-dependent
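A minimal Python sketch of this mapping: a surface form over A is sent to a set of (lemma, category) analyses, and the result is a set precisely because surface forms can be ambiguous. The ANALYSES table and analyze function below are made-up illustrations, not part of the lecture.

    # Toy illustration of A+ -> 2^(L, C1, ..., Cn): one surface form,
    # a *set* of (lemma, POS, features) analyses.
    ANALYSES = {
        "runs":  {("run", "V", "3sg-pres"), ("run", "N", "pl")},
        "books": {("book", "N", "pl"), ("book", "V", "3sg-pres")},
    }

    def analyze(surface: str) -> set:
        """Return every (lemma, POS, features) analysis of a surface form."""
        return ANALYSES.get(surface, set())

    print(analyze("runs"))
    # e.g. {('run', 'V', '3sg-pres'), ('run', 'N', 'pl')} -- this ambiguity
    # is why the mapping goes to the power set, not to a single tuple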

4. Bengali/Hindi Inflectional Morphology
• Certain languages encode in morphology much of what other languages encode in syntax
• Some of the inflectional suffixes that nouns can take:
  • singular/plural markers
  • gender markers
  • possessive markers
  • case markers (for the different karakas)
• Inflectional suffixes that verbs can take:
  • Hindi: tense, aspect, modality, person, gender, number
  • Bengali: tense, aspect, modality, person
• There is an order among inflectional suffixes (morphotactics):
  • chhelederke
  • baigulokei

5. Bengali/Hindi Derivational Morphology
• Derivational morphology is very rich.

6. English Inflectional Morphology
• Nouns have simple inflectional morphology:
  • plural -- cat / cats
  • possessive -- John / John's
• Verbs have slightly more complex, but still relatively simple, inflectional morphology:
  • past form -- walk / walked
  • past participle form -- walk / walked
  • gerund -- walk / walking
  • singular third person -- walk / walks
• Verbs can be categorized as:
  • main verbs
  • modal verbs -- can, will, should
  • primary verbs -- be, have, do
• Regular and irregular verbs: walk / walked -- go / went

7. Regulars and Irregulars
• Some words misbehave (refuse to follow the rules):
  • mouse/mice, goose/geese, ox/oxen
  • go/went, fly/flew
• The terms regular and irregular will be used to refer to words that follow the rules and those that don't.

8. Regular and Irregular Verbs
• Regulars:
  • walk, walks, walking, walked, walked
• Irregulars:
  • eat, eats, eating, ate, eaten
  • catch, catches, catching, caught, caught
  • cut, cuts, cutting, cut, cut
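The split above suggests the usual implementation pattern: check an exception list for irregulars first, and fall back to the regular rule otherwise. A toy Python sketch; the IRREGULAR_PAST table and past_tense helper are illustrative assumptions, not the lecture's code.

    # Past-tense inflection: irregulars are listed exceptions,
    # everything else takes the regular -ed suffix.
    IRREGULAR_PAST = {"eat": "ate", "catch": "caught", "cut": "cut", "go": "went"}

    def past_tense(verb: str) -> str:
        if verb in IRREGULAR_PAST:      # irregular: look it up
            return IRREGULAR_PAST[verb]
        if verb.endswith("e"):          # bake -> baked
            return verb + "d"
        return verb + "ed"              # walk -> walked

    for v in ("walk", "eat", "catch", "cut"):
        print(v, "->", past_tense(v))   # walked, ate, caught, cut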

9. Derivational Morphology
• Stacking example: renationalizationability
• Quasi-systematicity
• Irregular meaning change
• Changes of word class
• Some English derivational affixes:
  • -ation : transport / transportation
  • -er : kill / killer
  • -ness : fuzzy / fuzziness
  • -al : computation / computational
  • -able : break / breakable
  • -less : help / helpless
  • un- : do / undo
  • re- : try / retry

10. Derivational Examples
• Verb/Adj to Noun [table of examples shown on the slide]

11. Derivational Examples
• Noun/Verb to Adj [table of examples shown on the slide]

12. Compute
• Many derivational paths are possible... Start with compute:
  • compute -> computer -> computerize -> computerization
  • compute -> computation -> computational
  • compute -> computer -> computerize -> computerizable
  • compute -> computee

13. Parts of a Morphological Processor
• For a morphological processor, we need at least the following:
• Lexicon: the list of stems and affixes, together with basic information about them, such as their main categories (noun, verb, adjective, ...) and their sub-categories (regular noun, irregular noun, ...).
• Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word.
• Orthographic rules (spelling rules): rules that model the changes that occur in a word, normally when two morphemes combine.
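A minimal sketch of how these three components might be laid out as data. The entries borrow the toy nominal examples from the slides that follow; the names LEXICON, MORPHOTACTICS, and SPELLING_RULES are illustrative, not from the lecture.

    # 1. Lexicon: stems and affixes with main and sub-categories.
    LEXICON = {
        "fox":   ("noun", "regular"),
        "goose": ("noun", "irregular-singular"),
        "geese": ("noun", "irregular-plural"),
        "-s":    ("affix", "plural"),
    }

    # 2. Morphotactics: which morpheme classes may follow which.
    MORPHOTACTICS = {
        "regular":            ["plural"],  # fox -> fox + -s
        "irregular-singular": [],          # goose takes no -s
        "irregular-plural":   [],          # geese is already plural
    }

    # 3. Orthographic rules: spelling changes when morphemes combine.
    SPELLING_RULES = [
        ("e-insertion", "fox^s# -> foxes"),
    ]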

14. Lexicon
• A lexicon is a repository for words (stems).
• They are grouped according to their main categories:
  • noun, verb, adjective, adverb, ...
• They may also be divided into sub-categories:
  • regular nouns, irregular-singular nouns, irregular-plural nouns, ...
• The simplest way to create a morphological parser would be to put all possible words (together with their inflections) into a lexicon.
• We do not do this because their number is huge (for Turkish it is, in theory, infinite).

15. Morphotactics
• Which morphemes can follow which morphemes.
• Lexicon:
    reg-noun    irreg-pl-noun    irreg-sg-noun    plural
    fox         geese            goose            -s
    cat         sheep            sheep
    dog         mice             mouse
• Simple English nominal inflection (morphotactic rules), reconstructed from the slide's FSA: from start state 0, a reg-noun arc leads to state 1, and state 1 optionally takes the plural (-s) arc to state 2; irreg-sg-noun and irreg-pl-noun arcs lead from 0 straight to state 2, so they take no further -s; states 1 and 2 are final. A code sketch follows.
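A Python sketch of that automaton. The word lists come from the slide; the state numbering and the accepts helper are my own reconstruction.

    LEXICON = {
        "reg-noun":      {"fox", "cat", "dog"},
        "irreg-sg-noun": {"goose", "sheep", "mouse"},
        "irreg-pl-noun": {"geese", "sheep", "mice"},
    }

    TRANSITIONS = {
        (0, "reg-noun"): 1,
        (0, "irreg-sg-noun"): 2,
        (0, "irreg-pl-noun"): 2,
        (1, "plural"): 2,    # only regular nouns take -s
    }
    FINAL = {1, 2}

    def accepts(stem: str, plural_s: bool) -> bool:
        """Accept a stem, optionally followed by the plural suffix -s."""
        for cls, words in LEXICON.items():
            if stem not in words:
                continue
            state = TRANSITIONS.get((0, cls))
            if plural_s:
                state = TRANSITIONS.get((state, "plural"))
            if state in FINAL:
                return True
        return False

    print(accepts("fox", plural_s=True))     # True  (fox + -s)
    print(accepts("goose", plural_s=True))   # False (*gooses)
    print(accepts("geese", plural_s=False))  # True

Note that, like the letter-level FSA on the next slide, this accepts fox + -s without fixing the spelling; the orthographic rules handle that later.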

16. Combine Lexicon and Morphotactics
• [The slide shows the combined letter-level FSA, spelling out f-o-x, c-a-t, d-o-g-s, s-h-e-e-p, g-o-o-s-e / g-e-e-s-e, and m-o-u-s-e / m-i-c-e arc by arc.]
• This only says yes or no; it does not give a lexical representation.
• It also accepts a wrong word (foxs).

17. FSAs and the Lexicon
• This will actually require a kind of FSA: the finite-state transducer (FST).
• We will give a quick overview.
• First we'll capture the morphotactics:
  • the rules governing the ordering of affixes in a language.
• Then we'll add in the actual words.

18. Simple Rules [FSA diagram on slide]

19. Adding the Words [FSA diagram on slide]

20. Derivational Rules [FSA diagram on slide]

21. Parsing/Generation vs. Recognition
• Recognition is usually not quite what we need:
  • usually, if we find some string in the language, we need to find the structure in it (parsing)
  • or we have some structure and we want to produce a surface form (production/generation)
• Example: from "cats" to "cat +N +PL"

22. Why care about morphology?
• 'Stemming' in information retrieval
  • might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks"
• Morphology in machine translation
  • need to know that the Spanish words quiero and quieres are both related to querer 'want'
• Morphology in spell checking
  • need to know that misclam and antiundoggingly are not words, despite being made up of word parts

23. Can't just list all words
• Turkish word: Uygarlastiramadiklarimizdanmissinizcasina
• '(behaving) as if you are among those whom we could not civilize'
  • Uygar 'civilized' +
  • las 'become' +
  • tir 'cause' +
  • ama 'not able' +
  • dik 'past' +
  • lar 'plural' +
  • imiz 'p1pl' +
  • dan 'abl' +
  • mis 'past' +
  • siniz '2pl' +
  • casina 'as if'

24. Finite State Transducers
• The simple story:
  • add another tape
  • add extra symbols to the transitions
• On one tape we read "cats"; on the other we write "cat +N +PL"

25. Transitions
• The arcs: c:c  a:a  t:t  +N:ε  +PL:s
• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s
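A tiny Python sketch of this machine as a linear chain of (input, output) arc pairs, with the empty string standing in for ε. The ARCS list and transduce helper are illustrative, not the lecture's code.

    # The c:c a:a t:t +N:eps +PL:s machine, one arc per tape position.
    ARCS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

    def transduce(lexical: list) -> str:
        """Map a lexical-tape symbol sequence to a surface string."""
        surface = []
        for symbol, (inp, out) in zip(lexical, ARCS):
            if symbol != inp:               # no matching arc: reject
                raise ValueError(f"no arc for {symbol!r}")
            surface.append(out)
        return "".join(surface)

    print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats

By the inversion property on slide 27, reading each pair the other way turns the same arc list into a parser from "cats" back to "cat +N +PL".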

26. Lexical to Intermediate Level [FST diagram on slide]

27. FST Properties
• FSTs are closed under union, inversion, and composition:
• union: the union of two regular relations is also a regular relation.
• inversion: the inversion of an FST simply switches the input and output labels.
  • This means that the same FST can be used for both directions of a morphological processor.
• composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition T1 o T2 maps from I1 to O2.
• We use these properties of FSTs in the creation of the FST for a morphological processor.

28. An FST for Simple English Nominals
• reg-noun, then +N:ε, then +SG:# or +PL:^s#
• irreg-sg-noun, then +N:ε, then +SG:#
• irreg-pl-noun, then +N:ε, then +PL:#
(^ marks a morpheme boundary, # the word boundary.)

29. FST for Stems
• An FST for stems, which maps roots to their root class:
    reg-noun    irreg-pl-noun      irreg-sg-noun
    fox         g o:e o:e s e      goose
    cat         sheep              sheep
    dog         m o:i u:ε s:c e    mouse
• fox stands for f:f o:o x:x
• When these two transducers are composed, we have an FST which maps lexical forms to intermediate forms of words for simple English noun inflection.
• The next thing to handle is to design the FSTs for the orthographic rules, and to combine all these transducers.

30. Multi-Level Multi-Tape Machines
• A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.
• So, to handle spelling we use three tapes: lexical, intermediate, and surface:
    lexical:       d o g +N +PL
    intermediate:  d o g ^ s #
    surface:       d o g s
• We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.
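A sketch of the cascade with two plain Python functions standing in for the two transducer stages; the function names and the string-replace shortcuts are illustrative assumptions, not real FSTs.

    def lexical_to_intermediate(lexical: str) -> str:
        # "dog+N+PL" -> "dog^s#": expand the features into morphemes
        return lexical.replace("+N", "").replace("+PL", "^s") + "#"

    def intermediate_to_surface(intermediate: str) -> str:
        # spelling rules would apply here; then drop boundary symbols
        return intermediate.replace("^", "").replace("#", "")

    # Cascade: the output of one stage is the input of the next.
    print(intermediate_to_surface(lexical_to_intermediate("dog+N+PL")))  # dogs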

31. Lexical to Intermediate FST [diagram on slide]

32. Orthographic Rules
• We need FSTs to map the intermediate level to the surface level.
• For each spelling rule we will have one FST, and these FSTs run in parallel.
• Some English spelling rules:
  • consonant doubling -- 1-letter consonant doubled before -ing/-ed -- beg/begging
  • e deletion -- silent e dropped before -ing and -ed -- make/making
  • e insertion -- e added after s, z, x, ch, sh before s -- watch/watches
  • y replacement -- y changes to ie before -s, and to i before -ed -- try/tries
  • k insertion -- for verbs ending with vowel + c, we add k -- panic/panicked
• We represent these rules using two-level morphology rules:
  • a => b / c __ d : rewrite a as b when it occurs between c and d

33. FST for the E-Insertion Rule
• E-insertion rule: ε => e / {x, s, z} ^ __ s#
• That is: insert an e after a morpheme-final x, s, or z and before the plural s.
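A sketch of this rule as a regular-expression rewrite over intermediate-level strings, standing in for the rule's FST; the e_insertion helper is an illustration, not the lecture's machine.

    import re

    def e_insertion(intermediate: str) -> str:
        # insert e between {x,s,z}^ and s#:  fox^s# -> foxes
        surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)
        # then drop the remaining boundary symbols
        return surface.replace("^", "").replace("#", "")

    print(e_insertion("fox^s#"))  # foxes
    print(e_insertion("dog^s#"))  # dogs  (rule does not fire)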

34. Generating or Parsing with the FST Lexicon and Rules [diagram on slide]

35. Accepting "foxes" [trace diagram on slide]

36. Intersection
• We can intersect all the rule FSTs to create a single FST.
• The intersection algorithm just takes the Cartesian product of states:
  • for each state qi of the first machine and qj of the second machine, we create a new state qij
  • for input symbol a, if the first machine would transition to state qn and the second machine to qm, the new machine transitions to qnm
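A sketch of that product construction over toy deterministic machines encoded as {(state, symbol): next_state} dicts; the encoding and names are my own.

    def intersect(d1: dict, d2: dict) -> dict:
        """Cartesian-product construction for two transition dicts."""
        product = {}
        for (q1, a), n1 in d1.items():
            for (q2, b), n2 in d2.items():
                if a == b:                       # same input symbol
                    product[((q1, q2), a)] = (n1, n2)
        return product

    # Toy machines over the alphabet {a, b}:
    m1 = {(0, "a"): 1, (1, "b"): 1}
    m2 = {(0, "a"): 0, (0, "b"): 0}
    print(intersect(m1, m2))
    # {((0, 0), 'a'): (1, 0), ((1, 0), 'b'): (1, 0)}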

37. Composition
• A cascade can turn out to be somewhat of a pain:
  • it is hard to manage all the tapes
  • it fails to take advantage of the restricting power of the machines
• So it is better to compile the cascade into a single large machine.
• Create a new state (x, y) for every pair of states x ∈ Q1 and y ∈ Q2. The transition function of the composition is defined as follows:
  δ((x, y), i:o) = (v, z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
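The same definition, sketched over transition dicts keyed by (state, input, output) triples; the encoding is illustrative.

    def compose(d1: dict, d2: dict) -> dict:
        """delta((x,y), i:o) = (v,z) iff d1(x, i:c) = v and d2(y, c:o) = z."""
        composed = {}
        for (x, i, c), v in d1.items():
            for (y, c2, o), z in d2.items():
                if c == c2:                  # middle symbol c must match
                    composed[((x, y), i, o)] = (v, z)
        return composed

    # T1 rewrites a -> b, T2 rewrites b -> c; T1 o T2 rewrites a -> c.
    t1 = {(0, "a", "b"): 0}
    t2 = {(0, "b", "c"): 0}
    print(compose(t1, t2))  # {((0, 0), 'a', 'c'): (0, 0)}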

38. Intersect Rule FSTs
• Before: lexical tape → LEXICON-FST → intermediate tape → FST1, ..., FSTn (in parallel) → surface tape
• After: the parallel rule FSTs are intersected into a single machine, FSTR = FST1 ^ ... ^ FSTn

39. Compose Lexicon and Rule FSTs
• Before: lexical tape → LEXICON-FST → intermediate tape → FSTR → surface tape
• After: a single machine, LEXICON-FST o FSTR, maps the lexical tape directly to the surface tape

40. Porter Stemming
• Some applications (e.g., some information retrieval applications) do not need the whole morphological processor.
• They only need the stem of the word.
• A stemming algorithm such as the Porter stemming algorithm is a lexicon-free FST.
• It is just a series of cascaded rewrite rules.
• Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
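The Porter stemmer is widely available off the shelf; for example, NLTK ships an implementation (this assumes the nltk package is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for w in ("caresses", "ponies", "relational", "organization"):
        print(w, "->", stemmer.stem(w))
    # caresses -> caress, ponies -> poni,
    # relational -> relat, organization -> organ

The over-eager stems (poni, relat) show the lexicon-free trade-off: fast, but not always a real word.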
