1 / 32

Morphology and Finite-State Transducers

Morphology and Finite-State Transducers. by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin. Contents. Morphology morphemes, inflection and derivation, allomporphs Morphological Parsing finite-state automata, two-level morphology Finite-State Transducers

dom
Télécharger la présentation

Morphology and Finite-State Transducers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Morphology and Finite-State Transducers by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin

  2. Contents • Morphology • morphemes, inflection and derivation, allomporphs • Morphological Parsing • finite-state automata, two-level morphology • Finite-State Transducers • rules, combination of FSTs, lexicon-free FSTs • Human Morphological Processing • Exercise

  3. Morphology • Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. • e.g. talo + ssa + ni + kin • Two broad classes of morphemes, stems and affixes: • the stem is the ”main morpheme” of the word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin

  4. Affixes • Affixes add ”additional” meanings. • Concatenative morphology uses the following types of affixes: • prefixes, e.g. epä- in epä+olennainen • suffixes, e.g. –ssa in talo+ssa • circumfixes, e.g. German ge- -t in ge+sag+t ([have] said)

  5. Non-concatenative Morphology • In non-concatenative morphology the stem morpheme is split up. The following types of affixes are used: • infixes, e.g. Californian Jurok, sepolah (field), se+ge+polah (fields) • transfixes, e.g. Hebrew, l+a+m+a+d (he studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught) • This type of non-concatenative morphology is called templatic or root-and-pattern morphology.

  6. Inflection and Derivation • There are two broad classes of ways to form words from morphemes: inflection and derivation.

  7. Inflection • Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns. • talo (singular), talo+t (plural) • Inflection is productive. • talo, talo+t vs. auto, auto+t vs. metsä, metsä+t • The meaning of the resulting word is easily predictable.

  8. Derivation • Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. • e.g. järki, järje+st+ää, järje+st+ö, järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys • Not always productive. • järki, järje+st+ää vs. metsä, metsä+st+ää vs. talo, talo+st+aa?

  9. Allomorphs • A group of allomorphs make up one morpheme class. An allomorph is a special variant of a morpheme. • e.g. Finnish illative ending: +<vowel_lengthening>n, +h<vowel>n, +seen, +siin  talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin • e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en

  10. Why Allomorphs? • Phonological constraints • e.g. vowel harmony, talo+ssavs. metsä+ssä • Morphological paradigms • e.g. käsi, käde+n vs.kasi, kasi+n, Swedish leta, leta+de vs. heta, het+te • Irregularities • e.g. cat, cat+s vs. goose, geese • Orthographic constraints, i.e. spelling rules • e.g. cat, cat+s vs. city, citi+es

  11. Morphological Parsing • Parsing means taking an input and producing some sort of structure for it. • Morphological parsing means breaking down a word form into its constituent morphemes. • e.g. talossa  talo +ssa • Mapping of a word form to its baseform is called stemming. • e.g. talossa  talo

  12. Finite-State Morphological Parsing • In order to build a parser we need the following: • a lexicon containing the stems and affixes, • morphotactics, i.e. the model of morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa, • a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s  cities.

  13. q1 q2 q3 Finite-State Automaton for Inflection of English Verbs irreg-past-verb-form reg-verb-stem preterite (-ed) q0 past-participle (-ed) reg-verb-stem progressive (-ing) irreg-verb-stem 3-singular (-s)

  14. q1 q2 q3 Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’ u a n s s t g e l k e a t d q0 e s t e d t a s l g k n i i g s n

  15. Two-Level Morphology • Two-level morphology represents a word as a correspondence between a lexical level, which represents a simple concatenation of morphemes making up a word, and the surface level, which represents the actual spelling of the final word. Lexical s i n g +V +PROG Surface s i n g i n g

  16. Finite-State Transducer • A transducer maps between one set of symbols and another; a finite state transducer does this via a finite automaton. • Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. ={a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. ={a:a, b:b, a:c, a:, :, ...} • In two-level morphology, we call pairs like a:adefault pairs, and refer to them by a single symbol a. • An FST can be seen as a recognizer, generator, translator or a set relator.

  17. q3 Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’ n g i:u +V: n s i:a g +V: s t +PSTPCP: e +V: +PRET: l k +PRET:e a t :d q0 e s t +PSTPCP:e :d t a s l k +V: :g +PROG:i :n i g n +3SG:s

  18. Examples

  19. Useful FST Operations • Inversion: Switch input and output labels. • e.g. (T)={a:b, c:d}  (inv(T))={b:a, d:c} • Intersection: Only sequences of pairs accepted by both transducerT1and transducerT2 are accepted by transducer T1^T2. • Composition: The output of transducer T1 serves as input to T2. This is marked as T1ºT2 or T2(T1).

  20. Spelling Rules and FSTs

  21. Three levels • Add an intermediate level between the lexical and surface levels Lexical k i s s +V +3SG Intermediate k i s s ^ s # Surface k i s s e s

  22. FST for the E-insertion Rule q5 ^: other other z, s, x z, s, x # ^: s z, s, x ^: :e q0 q1 q2 q3 s q4 z, x #, other #, other #

  23. Lexical k i s s +V +3SG Intermediate k i s s ^ s # Surface k i s s e s Combination of FSTs (1) Lexicon-FST ... Rule1-FST RuleN-FST

  24. Lexical k i s s +V +3SG Surface k i s s e s Combination of FSTs (2) Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect

  25. Lexical k i s s +V +3SG Surface k i s s e s Combination of FSTs (3) Compose Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect

  26. Intersection and Composition • For each state qi in transducer T1 and state qj in transducer T2, create a new state qij. • Intersection: For any pair a:b, if T1 transitions from qi to qn, and T2 transitions from qj to qm, T1^T2 transitions from qij to qnm. • Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1ºT2 transitions from qij to qnm with the pair a:c.

  27. Lexicon-Free FSTs • Used in information-retrieval • E.g. the Porter algorithm, which is based on a series of simple cascaded rewrite rules: • ATIONAL  ATE (relational  relate) • ING   if stem contains vowel (motoring  motor) • Errors occur: • organization  organ, doing  doe, university  universe

  28. Human Morphological Processing (1) • How are multi-morphemic words represented in the minds of human speakers? • full-listing hypothesis vs. minimum redundancy hypothesis • Experiments: • Stanners et al. 1979: a word is recognized faster if it has been seen before (priming): lifting  lift, burned  burn, selective / select, i.e. different representations for inflection and derivation. • Marsen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government  govern, department / depart

  29. Human Morphological Processing (2) • Speech errors: Speakers mix up the order of words... • e.g. if you break it, it’ll drop • ... and also attach affixes to the wrong stems: • e.g. it’s not only we who have screw looses (for ”screws loose”) • e.g. easy enoughly (for ”easily enough”)

  30. Excercise (1/3) • Your task is to create a finite-state transducer that can analyze thefollowing Finnish word forms:

  31. Exercise (2/3) • The morphological tags have the following meaning: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1stperson plural. • Take a look at Fig 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create threeseparate finite-state transducers that you finally combine into one: • a)Create a transducer that operates between the intermediate and surfacelevel. This transducer handles the vowel lengthening that is necessary forthe illative form: talo +ILL  talo|on vs. metsä +ILL  metsä|än.

  32. Excercise (3/3) • b)Create a transducer that operates between the intermediate and surfacelevel. This transducer handles the deletion of nin front of a possessiveending: talo + mme  talo|mme vs.talo|on + mme  talo|o|mme. • c)Create a transducer that operates between the lexical and theintermediate level. This transducer maps morphological tags onto endings. • d)Combine all the transducers into one. • Present your transducers as graphs or tables (cf. Fig. 3.15 in Jurafsky & Martin)

More Related