Morphology and Finite-State Transducers

Morphology and Finite-State Transducers by Mathias Creutz 31 October 2001 Chapter 3, Jurafsky & Martin

Contents • Morphology • morphemes, inflection and derivation, allomporphs • Morphological Parsing • finite-state automata, two-level morphology • Finite-State Transducers • rules, combination of FSTs, lexicon-free FSTs • Human Morphological Processing • Exercise

Morphology • Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. • e.g. talo + ssa + ni + kin • Two broad classes of morphemes, stems and affixes: • the stem is the ”main morpheme” of the word, supplying the main meaning, e.g. talo in talo+ssa+ni+kin

Affixes • Affixes add ”additional” meanings. • Concatenative morphology uses the following types of affixes: • prefixes, e.g. epä- in epä+olennainen • suffixes, e.g. –ssa in talo+ssa • circumfixes, e.g. German ge- -t in ge+sag+t ([have] said)

Non-concatenative Morphology • In non-concatenative morphology the stem morpheme is split up. The following types of affixes are used: • infixes, e.g. Californian Jurok, sepolah (field), se+ge+polah (fields) • transfixes, e.g. Hebrew, l+a+m+a+d (he studied), l+i+m+e+d (he taught), l+u+m+a+d (he was taught) • This type of non-concatenative morphology is called templatic or root-and-pattern morphology.

Inflection and Derivation • There are two broad classes of ways to form words from morphemes: inflection and derivation.

Inflection • Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function, e.g. plural of nouns. • talo (singular), talo+t (plural) • Inflection is productive. • talo, talo+t vs. auto, auto+t vs. metsä, metsä+t • The meaning of the resulting word is easily predictable.

Derivation • Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. • e.g. järki, järje+st+ää, järje+st+ö, järje+st+ell+ä, järje+st+el+mä, järje+st+el+mä+lli+nen, järje+st+el+mä+lli+syys • Not always productive. • järki, järje+st+ää vs. metsä, metsä+st+ää vs. talo, talo+st+aa?

Allomorphs • A group of allomorphs make up one morpheme class. An allomorph is a special variant of a morpheme. • e.g. Finnish illative ending: +<vowel_lengthening>n, +h<vowel>n, +seen, +siin  talo+on, metsä+än, talo+i+hin, huonee+seen, huone+i+siin • e.g. Finnish stem variation: käsi, käde+n, kät+tä, käte+en

Why Allomorphs? • Phonological constraints • e.g. vowel harmony, talo+ssavs. metsä+ssä • Morphological paradigms • e.g. käsi, käde+n vs.kasi, kasi+n, Swedish leta, leta+de vs. heta, het+te • Irregularities • e.g. cat, cat+s vs. goose, geese • Orthographic constraints, i.e. spelling rules • e.g. cat, cat+s vs. city, citi+es

Morphological Parsing • Parsing means taking an input and producing some sort of structure for it. • Morphological parsing means breaking down a word form into its constituent morphemes. • e.g. talossa  talo +ssa • Mapping of a word form to its baseform is called stemming. • e.g. talossa  talo

Finite-State Morphological Parsing • In order to build a parser we need the following: • a lexicon containing the stems and affixes, • morphotactics, i.e. the model of morpheme ordering, e.g. talo+ssa+ni instead of talo+ni+ssa, • a set of rules (orthographic, etc.), i.e. the model of changes that occur in a word, usually when two morphemes combine, e.g. city + s  cities.

q1 q2 q3 Finite-State Automaton for Inflection of English Verbs irreg-past-verb-form reg-verb-stem preterite (-ed) q0 past-participle (-ed) reg-verb-stem progressive (-ing) irreg-verb-stem 3-singular (-s)

q1 q2 q3 Finite-State Automaton for Inflection of the Verbs ’talk’, ’test’ and ’sing’ u a n s s t g e l k e a t d q0 e s t e d t a s l g k n i i g s n

Two-Level Morphology • Two-level morphology represents a word as a correspondence between a lexical level, which represents a simple concatenation of morphemes making up a word, and the surface level, which represents the actual spelling of the final word. Lexical s i n g +V +PROG Surface s i n g i n g

Finite-State Transducer • A transducer maps between one set of symbols and another; a finite state transducer does this via a finite automaton. • Where an FSA accepts a language stated over a finite alphabet of single symbols, e.g. ={a, b, c, ...}, an FST accepts a language stated over pairs of symbols, e.g. ={a:a, b:b, a:c, a:, :, ...} • In two-level morphology, we call pairs like a:adefault pairs, and refer to them by a single symbol a. • An FST can be seen as a recognizer, generator, translator or a set relator.

q3 Finite-State Transducer for Inflection of the Verbs ’talk’, ’test’ and ’sing’ n g i:u +V: n s i:a g +V: s t +PSTPCP: e +V: +PRET: l k +PRET:e a t :d q0 e s t +PSTPCP:e :d t a s l k +V: :g +PROG:i :n i g n +3SG:s

Examples

Useful FST Operations • Inversion: Switch input and output labels. • e.g. (T)={a:b, c:d}  (inv(T))={b:a, d:c} • Intersection: Only sequences of pairs accepted by both transducerT1and transducerT2 are accepted by transducer T1^T2. • Composition: The output of transducer T1 serves as input to T2. This is marked as T1ºT2 or T2(T1).

Spelling Rules and FSTs

Three levels • Add an intermediate level between the lexical and surface levels Lexical k i s s +V +3SG Intermediate k i s s ^ s # Surface k i s s e s

FST for the E-insertion Rule q5 ^: other other z, s, x z, s, x # ^: s z, s, x ^: :e q0 q1 q2 q3 s q4 z, x #, other #, other #

Lexical k i s s +V +3SG Intermediate k i s s ^ s # Surface k i s s e s Combination of FSTs (1) Lexicon-FST ... Rule1-FST RuleN-FST

Lexical k i s s +V +3SG Surface k i s s e s Combination of FSTs (2) Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect

Lexical k i s s +V +3SG Surface k i s s e s Combination of FSTs (3) Compose Lexicon-FST Intermediate k i s s ^ s # ... Rule1-FST RuleN-FST Intersect

Intersection and Composition • For each state qi in transducer T1 and state qj in transducer T2, create a new state qij. • Intersection: For any pair a:b, if T1 transitions from qi to qn, and T2 transitions from qj to qm, T1^T2 transitions from qij to qnm. • Composition: If T1 transitions from qi to qn with the pair a:b, and T2 transitions from qj to qm with the pair b:c, then T1ºT2 transitions from qij to qnm with the pair a:c.

Lexicon-Free FSTs • Used in information-retrieval • E.g. the Porter algorithm, which is based on a series of simple cascaded rewrite rules: • ATIONAL  ATE (relational  relate) • ING   if stem contains vowel (motoring  motor) • Errors occur: • organization  organ, doing  doe, university  universe

Human Morphological Processing (1) • How are multi-morphemic words represented in the minds of human speakers? • full-listing hypothesis vs. minimum redundancy hypothesis • Experiments: • Stanners et al. 1979: a word is recognized faster if it has been seen before (priming): lifting  lift, burned  burn, selective / select, i.e. different representations for inflection and derivation. • Marsen-Wilson et al. 1994: spoken derived words can prime their stems, but only if their meaning is close: government  govern, department / depart

Human Morphological Processing (2) • Speech errors: Speakers mix up the order of words... • e.g. if you break it, it’ll drop • ... and also attach affixes to the wrong stems: • e.g. it’s not only we who have screw looses (for ”screws loose”) • e.g. easy enoughly (for ”easily enough”)

Excercise (1/3) • Your task is to create a finite-state transducer that can analyze thefollowing Finnish word forms:

Exercise (2/3) • The morphological tags have the following meaning: +NOM = nominative; +ILL = illative; +POS1PL = possessive, 1stperson plural. • Take a look at Fig 3.16, 3.17 and 3.18 in Jurafsky & Martin. Create threeseparate finite-state transducers that you finally combine into one: • a)Create a transducer that operates between the intermediate and surfacelevel. This transducer handles the vowel lengthening that is necessary forthe illative form: talo +ILL  talo|on vs. metsä +ILL  metsä|än.

Excercise (3/3) • b)Create a transducer that operates between the intermediate and surfacelevel. This transducer handles the deletion of nin front of a possessiveending: talo + mme  talo|mme vs.talo|on + mme  talo|o|mme. • c)Create a transducer that operates between the lexical and theintermediate level. This transducer maps morphological tags onto endings. • d)Combine all the transducers into one. • Present your transducers as graphs or tables (cf. Fig. 3.15 in Jurafsky & Martin)

Morphology and Finite-State Transducers