
Towards unsupervised induction of morphophonological rules


Presentation Transcript


  1. Towards unsupervised induction of morphophonological rules Erwin Chan, University of Pennsylvania Morphochallenge workshop, 19 Sept 2007

  2. Goals of unsupervised morphology induction • 1. Provide an analysis of the input data • 2. Provide an analyzer for unseen data • Key task: generalize the analysis of the input data by inducing phonological characteristics

  3. Example: inducing phonology (English plural nouns) • 1. Input corpus: processes witnesses matches hatches maids ferns mates • 2. Induce segmentation: process.es witness.es match.es hatch.es maid.s fern.s mate.s • 3. Induce phonology: -es when the stem ends in ch or sh; -s after other characters • 4. Apply to novel words: bench.es fate.s foe.s wish.es
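A minimal sketch of this walkthrough in Python (toy code, not the paper's implementation; the segmentation heuristic and the ch/sh condition are taken from the slide, the rest is my assumption):

```python
# Toy sketch of slide 3: segment, state the induced rule, apply to novel words.

corpus = ["processes", "witnesses", "matches", "hatches",
          "maids", "ferns", "mates"]

# 2. Induce segmentation: strip -es where the remainder ends in a
#    sibilant-like cluster, otherwise strip -s.
def segment(word):
    if word.endswith("es") and word[:-2].endswith(("ch", "sh", "ss")):
        return word[:-2], "es"
    return word[:-1], "s"

print([".".join(segment(w)) for w in corpus])
# ['process.es', 'witness.es', 'match.es', 'hatch.es', 'maid.s', 'fern.s', 'mate.s']

# 3. Induced phonology, as stated on the slide: -es after ch or sh, -s otherwise.
def pluralize(stem):
    return stem + ("es" if stem.endswith(("ch", "sh")) else "s")

# 4. Apply to novel words.
print([pluralize(w) for w in ["bench", "fate", "foe", "wish"]])
# ['benches', 'fates', 'foes', 'wishes']
```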

  4. Base-and-transforms model of morphological paradigms • Apply transforms to base forms to generate inflections • [Diagram: Lexemes 1-3, each with a single base form (base 1, base 2, base 3); transforms t1-t5 apply to each base to generate its inflected forms]
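As an illustration only (hypothetical lexeme and transform names; only pure suffixation is handled), the model can be pictured as one base form per lexeme plus a shared inventory of transforms:

```python
# Structural sketch of the base-and-transforms model (hypothetical data).

transforms = {            # transform name -> (A, B); "$" means "append B"
    "t1": ("$", "s"),
    "t2": ("$", "ing"),
    "t3": ("$", "ed"),
}
bases = {"lexeme1": "help", "lexeme2": "walk"}   # one base form per lexeme

def apply_suffix_transform(base, ab):
    a, b = ab
    assert a == "$", "this sketch only handles pure suffixation"
    return base + b

# Each lexeme's inflections are generated from its single base form.
paradigms = {lex: {t: apply_suffix_transform(base, ab)
                   for t, ab in transforms.items()}
             for lex, base in bases.items()}
print(paradigms["lexeme1"])   # {'t1': 'helps', 't2': 'helping', 't3': 'helped'}
```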

  5. Base forms • Base form serves as lexical entry for all inflections of a lexeme e.g. base of {help, helps, helping, helped} is help • Same fine-grained POS type for all lexemes e.g. “nominative singular” for all nouns

  6. Transforms • A transform generates an inflected form from a base • Format: ( A, B ), where A and B are simple regular expressions • A: the characters in the base form to be replaced • B: the replacement characters in the inflected form

  7. Transform examples • eat → eating: ( $, ing ) • time → times: ( $, s ) • time → timing: ( e, ing ) • hang → hung: ( *a*, *u* ) (non-concatenative)
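Below is one possible reading of the ( A, B ) format on slides 6-7 (my interpretation, not the author's code), covering the suffixation, stem-final replacement, and non-concatenative cases in the table above:

```python
# Sketch of applying (A, B) transforms. Interpretation is mine:
# "$" = end of the base (pure suffixation), "*x*" = replace x anywhere
# inside the base, otherwise A is matched at the end of the base and
# replaced by B.

def apply_transform(base, a, b):
    if a == "$":
        return base + b
    if a.startswith("*") and a.endswith("*"):
        return base.replace(a.strip("*"), b.strip("*"))
    if base.endswith(a):
        return base[:-len(a)] + b
    return None   # transform does not apply to this base

examples = [("eat", "$", "ing"), ("time", "$", "s"),
            ("time", "e", "ing"), ("hang", "*a*", "*u*")]
print([apply_transform(w, a, b) for w, a, b in examples])
# ['eating', 'times', 'timing', 'hung']
```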

  8. Comparison to phonological rules • Standard rewrite rule: A → B / C _ D • 1. A → B: rewrite operation • 2. C _ D: phonological context of application • A transform is an ungeneralized rule: A → B / { set of base forms } • Future work: induce phonological rules by learning generalized phonological properties of base forms

  9. Comparison with the stem-suffix model • Stem-suffix: saves = save + s, saving = sav + ing; drawback: multiple lexical representations for one lexeme • Base-transform: saves = save + ( $, s ), saving = save + ( e, ing ); a single lexical representation (the base)
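A tiny illustration with toy data: the stem-suffix analysis ends up with two stems for one lexeme, while the base-transform analysis keeps a single lexical entry:

```python
# Stem+suffix: the same lexeme ends up with two stems.
stem_suffix = {"saves": ("save", "s"), "saving": ("sav", "ing")}
print({stem for stem, _ in stem_suffix.values()})     # {'save', 'sav'}

# Base+transform: one lexical entry, with differences pushed into the transform.
base_transform = {"saves": ("save", ("$", "s")),
                  "saving": ("save", ("e", "ing"))}
print({base for base, _ in base_transform.values()})  # {'save'}
```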

  10. Limitations of model • Simple morphotactic structure: • assumes one suffix • a word is either a base form, or inflected from a base form • Does not account for: • agglutination • compounds • prefixing • irregulars, suppletion

  11. Distribution of morphological forms • What information is available in corpora for learning? • Is there structure within the distribution of morphological forms that a learner can exploit? • Examine annotated corpora for several languages

  12. Spanish newswire verbs • [Plot: log(frequency) by lemma and inflection, showing sparse data]

  13. Distribution of inflectional categories • [Plot: number of word types per inflection (Slovene, 2.5M-word corpus); the distribution is roughly Zipfian]

  14. Most frequent inflection (in types) often matches intuitions of what inflection a base form should be • Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg • Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act • Spanish: A.Sg, N.Sg, V.Inf • Takeaway: high frequency of the base form
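A sketch of the counting behind slides 13-14, assuming a hypothetical list of (word, inflectional tag) pairs from an annotated corpus; the most type-frequent category is then a natural candidate for the base inflection:

```python
from collections import defaultdict

def types_per_inflection(tagged_words):
    """Count distinct word types per inflectional category, highest first."""
    types = defaultdict(set)
    for word, tag in tagged_words:
        types[tag].add(word)
    return sorted(((tag, len(ws)) for tag, ws in types.items()),
                  key=lambda x: -x[1])

# Toy input; the real input would be, e.g., the annotated Slovene 2.5M-word corpus.
tagged_words = [("hiša", "N.Nom.Sg"), ("hiše", "N.Gen.Sg"),
                ("miza", "N.Nom.Sg"), ("mize", "N.Gen.Sg"),
                ("okno", "N.Nom.Sg")]
ranked = types_per_inflection(tagged_words)
base_category = ranked[0][0]    # highest type frequency ~ base inflection
print(ranked, base_category)
```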

  15. Goals of induction algorithm • Select words from the corpus to be base forms • Formulate transforms • Technique: take advantage of the high type frequency of the base inflectional category

  16. Start state and end state • Start state: Transforms = {}; all words unmodeled • End state: Transforms = {($,s), ($,’s), …}; words assigned as base forms or inflected forms, with the rest unmodeled
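One way to picture the learner's state on this slide (field names are my assumption, not the actual data structures):

```python
# Hypothetical representation of the start and end states of the learner.
corpus_words = {"save", "saves", "save's", "walk", "walks", "book", "books"}

start_state = {
    "transforms": [],                 # no transforms yet
    "bases": set(),
    "inflected": {},                  # inflected word -> (base, transform)
    "unmodeled": set(corpus_words),   # everything starts out unmodeled
}

end_state = {
    "transforms": [("$", "s"), ("$", "'s")],
    "bases": {"save", "walk", "book"},
    "inflected": {"saves": ("save", ("$", "s")),
                  "save's": ("save", ("$", "'s")),
                  "walks": ("walk", ("$", "s")),
                  "books": ("book", ("$", "s"))},
    "unmodeled": set(),               # in practice some residue remains here
}
```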

  17. Greedy algorithm • At each iteration: • construct potential transforms • add the transform(s) that account for the most data
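A greedy-loop sketch under my own simplifying assumptions (pure suffixation only, candidate suffixes of up to three characters); the real system also constructs transforms with stem changes and chooses their direction as on the following slides:

```python
from collections import Counter

def learn_transforms(words, n_iters=5):
    """Greedy sketch: at each iteration, build candidate suffix transforms
    from word pairs sharing a potential base, then keep the one that
    accounts for the most pairs."""
    words = set(words)
    explained = set()          # inflected forms already accounted for
    transforms = []
    for _ in range(n_iters):
        counts = Counter()
        for w in words - explained:
            for i in range(1, 4):                  # candidate suffix lengths
                stem, suffix = w[:-i], w[-i:]
                if stem in words:
                    counts[("$", suffix)] += 1     # candidate: base + suffix
        if not counts:
            break
        best, _n = counts.most_common(1)[0]        # transform covering most data
        transforms.append(best)
        suffix = best[1]
        explained |= {w for w in words
                      if w.endswith(suffix) and w[:-len(suffix)] in words}
    return transforms

print(learn_transforms(["save", "saves", "saved", "walk", "walks", "walked"]))
# e.g. [('$', 's'), ('$', 'ed'), ...]
```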

  18. Sources of words for a transform • [Diagram: a new transform's (base, inflected) word pairs are drawn from the current grammar's base forms, inflected forms, and unmodeled words]

  19. WSJ: Most freq. suffixes (1st iteration)

  20. WSJ: potential transforms (1st iteration)

  21. Choose the direction of a transform • Table for ( $, s ): base greater in 3750 cases, inflected greater in 817 • Therefore choose ( $, s ) rather than ( s, $ )
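A sketch of one way to make this direction choice from corpus counts (my reading of the slide's table: across (word, word+s) pairs, count which member is more frequent):

```python
# Hypothetical direction test for the candidate pair ($, s) vs (s, $):
# for each (shorter, shorter+s) word pair, check which form has the higher
# token frequency; if the shorter form usually wins, treat it as the base.

def choose_direction(freq, suffix="s"):
    base_greater = inflected_greater = 0
    for w, f in freq.items():
        if w + suffix in freq:
            if f >= freq[w + suffix]:
                base_greater += 1
            else:
                inflected_greater += 1
    # On WSJ the slide reports base greater: 3750, inflected greater: 817,
    # so the transform is oriented as ($, s) rather than (s, $).
    return ("$", suffix) if base_greater >= inflected_greater else (suffix, "$")

toy_freq = {"report": 40, "reports": 12, "bank": 55, "banks": 30,
            "say": 90, "says": 70}
print(choose_direction(toy_freq))   # ('$', 's')
```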

  22. WSJ: sequence of transforms added

  23. Morphochallenge English data • The high number of word types (~250,000) leads to spurious transforms • ( $, a ): (music, musica) (naam, naama) (nucci, nuccia) (retin, retina) (mash, masha) (gab, gaba) • ( $, o ): (rutili, rutilio) (lazar, lazaro) (vern, verno) (berk, berko) (rikky, rikkyo) (economic, economico)

  24. Summary • Base-and-transforms model of morphological paradigms • First step towards learning morphophonological rules • More linguistically satisfying than stem-and-suffix • Algorithm: • learn inventory of base forms • learn transforms (base-specific rules) • Exploits high freq. of base inflectional category

  25. More slides available… • Longer version of this presentation • base forms simplify POS induction • Different system: transforms in parallel • Slovene, Spanish
