
Towards unsupervised induction of morphophonological rules


Presentation Transcript


  1. Towards unsupervised induction of morphophonological rules Erwin Chan, University of Pennsylvania Morphochallenge workshop, 19 Sept 2007

  2. Goals of unsupervised morphology induction • 1. Provide an analysis of the input data • 2. Provide an analyzer for unseen data • Key task: generalize the analysis of the input data by inducing phonological characteristics

  3. Example: inducing phonology (English plural nouns) • 1. Input corpus: processes witnesses matches hatches maids ferns mates • 2. Induce segmentation: process.es witness.es match.es hatch.es maid.s fern.s mate.s • 3. Induce phonology: -es when the stem ends in ch or sh; -s after other characters • 4. Apply to novel words: bench.es fate.s foe.s wish.es
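A minimal sketch of this walkthrough in Python (toy code, not the paper's implementation; the segmentation heuristic and the ch/sh condition are taken from the slide, the rest is my assumption):

```python
# Toy sketch of slide 3: segment, state the induced rule, apply to novel words.

corpus = ["processes", "witnesses", "matches", "hatches",
          "maids", "ferns", "mates"]

# 2. Induce segmentation: strip -es where the remainder ends in a
#    sibilant-like cluster, otherwise strip -s.
def segment(word):
    if word.endswith("es") and word[:-2].endswith(("ch", "sh", "ss")):
        return word[:-2], "es"
    return word[:-1], "s"

print([".".join(segment(w)) for w in corpus])
# ['process.es', 'witness.es', 'match.es', 'hatch.es', 'maid.s', 'fern.s', 'mate.s']

# 3. Induced phonology, as stated on the slide: -es after ch or sh, -s otherwise.
def pluralize(stem):
    return stem + ("es" if stem.endswith(("ch", "sh")) else "s")

# 4. Apply to novel words.
print([pluralize(w) for w in ["bench", "fate", "foe", "wish"]])
# ['benches', 'fates', 'foes', 'wishes']
```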

  4. Base-and-transforms model of morphological paradigms • Apply transforms to base forms to generate inflections • [Diagram: Lexemes 1-3, each with a single base form (base 1, base 2, base 3); transforms t1-t5 apply to each base to generate its inflected forms]
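As an illustration only (hypothetical lexeme and transform names; only pure suffixation is handled), the model can be pictured as one base form per lexeme plus a shared inventory of transforms:

```python
# Structural sketch of the base-and-transforms model (hypothetical data).

transforms = {            # transform name -> (A, B); "$" means "append B"
    "t1": ("$", "s"),
    "t2": ("$", "ing"),
    "t3": ("$", "ed"),
}
bases = {"lexeme1": "help", "lexeme2": "walk"}   # one base form per lexeme

def apply_suffix_transform(base, ab):
    a, b = ab
    assert a == "$", "this sketch only handles pure suffixation"
    return base + b

# Each lexeme's inflections are generated from its single base form.
paradigms = {lex: {t: apply_suffix_transform(base, ab)
                   for t, ab in transforms.items()}
             for lex, base in bases.items()}
print(paradigms["lexeme1"])   # {'t1': 'helps', 't2': 'helping', 't3': 'helped'}
```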

  5. Base forms • Base form serves as lexical entry for all inflections of a lexeme e.g. base of {help, helps, helping, helped} is help • Same fine-grained POS type for all lexemes e.g. “nominative singular” for all nouns

  6. Transforms • A transform generates an inflected form from a base • Format: ( A, B ), where A and B are simple regular expressions • A: the characters in the base form to be replaced • B: the replacement characters in the inflected form

  7. Transform examples • eat → eating: ( $, ing ) • time → times: ( $, s ) • time → timing: ( e, ing ) • hang → hung: ( *a*, *u* ) (non-concatenative)
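Below is one possible reading of the ( A, B ) format on slides 6-7 (my interpretation, not the author's code), covering the suffixation, stem-final replacement, and non-concatenative cases in the table above:

```python
# Sketch of applying (A, B) transforms. Interpretation is mine:
# "$" = end of the base (pure suffixation), "*x*" = replace x anywhere
# inside the base, otherwise A is matched at the end of the base and
# replaced by B.

def apply_transform(base, a, b):
    if a == "$":
        return base + b
    if a.startswith("*") and a.endswith("*"):
        return base.replace(a.strip("*"), b.strip("*"))
    if base.endswith(a):
        return base[:-len(a)] + b
    return None   # transform does not apply to this base

examples = [("eat", "$", "ing"), ("time", "$", "s"),
            ("time", "e", "ing"), ("hang", "*a*", "*u*")]
print([apply_transform(w, a, b) for w, a, b in examples])
# ['eating', 'times', 'timing', 'hung']
```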

  8. Comparison to phonological rules • Standard rewrite rule: A → B / C _ D • 1. A → B: rewrite operation • 2. C _ D: phonological context of application • A transform is an ungeneralized rule: A → B / { set of base forms } • Future work: induce phonological rules by learning generalized phonological properties of base forms

  9. Comparison with the stem-suffix model • Stem-suffix: saves = save + s, saving = sav + ing; drawback: multiple lexical representations for one lexeme • Base-transform: saves = save + ( $, s ), saving = save + ( e, ing ); a single lexical representation (the base)
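A tiny illustration with toy data: the stem-suffix analysis ends up with two stems for one lexeme, while the base-transform analysis keeps a single lexical entry:

```python
# Stem+suffix: the same lexeme ends up with two stems.
stem_suffix = {"saves": ("save", "s"), "saving": ("sav", "ing")}
print({stem for stem, _ in stem_suffix.values()})     # {'save', 'sav'}

# Base+transform: one lexical entry, with differences pushed into the transform.
base_transform = {"saves": ("save", ("$", "s")),
                  "saving": ("save", ("e", "ing"))}
print({base for base, _ in base_transform.values()})  # {'save'}
```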

  10. Limitations of model • Simple morphotactic structure: • assumes one suffix • a word is either a base form, or inflected from a base form • Does not account for: • agglutination • compounds • prefixing • irregulars, suppletion

  11. Distribution of morphological forms • What information is available in corpora for learning? • Is there structure within the distribution of morphological forms that a learner can exploit? • Examine annotated corpora for several languages

  12. Spanish newswire verbs • [Plot: log(frequency) by lemma and inflection, showing sparse data]

  13. Distribution of inflectional categories • [Plot: number of word types per inflection (Slovene, 2.5M-word corpus); the distribution is roughly Zipfian]

  14. Most frequent inflection (in types) often matches intuitions of what inflection a base form should be • Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg • Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act • Spanish: A.Sg, N.Sg, V.Inf • Takeaway: high frequency of the base form
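A sketch of the counting behind slides 13-14, assuming a hypothetical list of (word, inflectional tag) pairs from an annotated corpus; the most type-frequent category is then a natural candidate for the base inflection:

```python
from collections import defaultdict

def types_per_inflection(tagged_words):
    """Count distinct word types per inflectional category, highest first."""
    types = defaultdict(set)
    for word, tag in tagged_words:
        types[tag].add(word)
    return sorted(((tag, len(ws)) for tag, ws in types.items()),
                  key=lambda x: -x[1])

# Toy input; the real input would be, e.g., the annotated Slovene 2.5M-word corpus.
tagged_words = [("hiša", "N.Nom.Sg"), ("hiše", "N.Gen.Sg"),
                ("miza", "N.Nom.Sg"), ("mize", "N.Gen.Sg"),
                ("okno", "N.Nom.Sg")]
ranked = types_per_inflection(tagged_words)
base_category = ranked[0][0]    # highest type frequency ~ base inflection
print(ranked, base_category)
```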

  15. Goals of induction algorithm • Select words from the corpus to be base forms • Formulate transforms • Technique: take advantage of the high type frequency of the base inflectional category

  16. Start state and end state • Start state: Transforms = {}; all words unmodeled • End state: Transforms = {($,s), ($,’s), …}; words assigned as base forms or inflected forms, with the rest unmodeled
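One way to picture the learner's state on this slide (field names are my assumption, not the actual data structures):

```python
# Hypothetical representation of the start and end states of the learner.
corpus_words = {"save", "saves", "save's", "walk", "walks", "book", "books"}

start_state = {
    "transforms": [],                 # no transforms yet
    "bases": set(),
    "inflected": {},                  # inflected word -> (base, transform)
    "unmodeled": set(corpus_words),   # everything starts out unmodeled
}

end_state = {
    "transforms": [("$", "s"), ("$", "'s")],
    "bases": {"save", "walk", "book"},
    "inflected": {"saves": ("save", ("$", "s")),
                  "save's": ("save", ("$", "'s")),
                  "walks": ("walk", ("$", "s")),
                  "books": ("book", ("$", "s"))},
    "unmodeled": set(),               # in practice some residue remains here
}
```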

  17. Greedy algorithm • At each iteration: • construct potential transforms • add the transform(s) that account for the most data
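A greedy-loop sketch under my own simplifying assumptions (pure suffixation only, candidate suffixes of up to three characters); the real system also constructs transforms with stem changes and chooses their direction as on the following slides:

```python
from collections import Counter

def learn_transforms(words, n_iters=5):
    """Greedy sketch: at each iteration, build candidate suffix transforms
    from word pairs sharing a potential base, then keep the one that
    accounts for the most pairs."""
    words = set(words)
    explained = set()          # inflected forms already accounted for
    transforms = []
    for _ in range(n_iters):
        counts = Counter()
        for w in words - explained:
            for i in range(1, 4):                  # candidate suffix lengths
                stem, suffix = w[:-i], w[-i:]
                if stem in words:
                    counts[("$", suffix)] += 1     # candidate: base + suffix
        if not counts:
            break
        best, _n = counts.most_common(1)[0]        # transform covering most data
        transforms.append(best)
        suffix = best[1]
        explained |= {w for w in words
                      if w.endswith(suffix) and w[:-len(suffix)] in words}
    return transforms

print(learn_transforms(["save", "saves", "saved", "walk", "walks", "walked"]))
# e.g. [('$', 's'), ('$', 'ed'), ...]
```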

  18. Sources of words for a transform • [Diagram: a new transform's (base, inflected) word pairs are drawn from the current grammar's base forms, inflected forms, and unmodeled words]

  19. WSJ: Most freq. suffixes (1st iteration)

  20. WSJ: potential transforms (1st iteration)

  21. Choose the direction of a transform • Table for ( $, s ): base greater in 3750 cases, inflected greater in 817 • Therefore choose ( $, s ) rather than ( s, $ )
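A sketch of one way to make this direction choice from corpus counts (my reading of the slide's table: across (word, word+s) pairs, count which member is more frequent):

```python
# Hypothetical direction test for the candidate pair ($, s) vs (s, $):
# for each (shorter, shorter+s) word pair, check which form has the higher
# token frequency; if the shorter form usually wins, treat it as the base.

def choose_direction(freq, suffix="s"):
    base_greater = inflected_greater = 0
    for w, f in freq.items():
        if w + suffix in freq:
            if f >= freq[w + suffix]:
                base_greater += 1
            else:
                inflected_greater += 1
    # On WSJ the slide reports base greater: 3750, inflected greater: 817,
    # so the transform is oriented as ($, s) rather than (s, $).
    return ("$", suffix) if base_greater >= inflected_greater else (suffix, "$")

toy_freq = {"report": 40, "reports": 12, "bank": 55, "banks": 30,
            "say": 90, "says": 70}
print(choose_direction(toy_freq))   # ('$', 's')
```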

  22. WSJ: sequence of transforms added

  23. Morphochallenge English data • The high number of word types (~250,000) leads to spurious transforms • ( $, a ): (music, musica) (naam, naama) (nucci, nuccia) (retin, retina) (mash, masha) (gab, gaba) • ( $, o ): (rutili, rutilio) (lazar, lazaro) (vern, verno) (berk, berko) (rikky, rikkyo) (economic, economico)

  24. Summary • Base-and-transforms model of morphological paradigms • First step towards learning morphophonological rules • More linguistically satisfying than stem-and-suffix • Algorithm: • learn inventory of base forms • learn transforms (base-specific rules) • Exploits high freq. of base inflectional category

  25. More slides available… • Longer version of this presentation • base forms simplify POS induction • Different system: transforms in parallel • Slovene, Spanish
