
Induction of a Simple Morphology for Highly-Inflecting Languages



Presentation Transcript


  1. nyky + ratkaisu + i + sta + mme • kahvi + n + juo + ja + lle + kin • tietä + isi + mme + kö + hän • open + mind + ed + ness • un + believ + able
  Induction of a Simple Morphology for Highly-Inflecting Languages
  {Mathias.Creutz, Krista.Lagus}@hut.fi
  Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004. Barcelona, 26 July 2004

  2. Goals and challenges
  • Learn representations of
    • the smallest meaningful units of language (morphemes)
    • and their interaction
  • in an unsupervised manner from raw text
  • making assumptions that are as general and language-independent as possible.
  • Evaluate
    • against a given gold-standard morphological analysis of word forms
      • first step: learn and evaluate a morpheme segmentation of word forms
    • integrated in NLP applications (speech recognition)
  Mathias Creutz

  3. Focus: Agglutinative morphology
  • Finnish words often consist of lengthy sequences of morphemes (stems, suffixes, and prefixes):
    • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
    • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
    • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
  • Huge number of different possible word forms
  • Important to know the inner structure of words in NLP
  • The number of morphemes per word varies greatly

  4. Learning from data
  1. MDL model (Creutz & Lagus, 2002), inspired by work of, e.g., J. Goldsmith
  • ”Invent” a set of distinct strings = morphs
  • Aim at the most concise representation possible
  • Pick morphs from the lexicon and place them in a sequence
  [Figure: a morph lexicon (α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a) and the corpus / word list generated from it: α β γ δ β ε θ γ ζ = tä ssä pala peli ssä on tuhat pala a]
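The two-part cost on this slide can be sketched as follows. This is a minimal illustration, not the exact cost of Creutz & Lagus (2002): it assumes a flat per-character code for the lexicon and maximum-likelihood token coding for the corpus.

```python
import math
from collections import Counter

def mdl_cost(segmented_corpus, bits_per_char=5.0):
    """Simplified two-part MDL cost: lexicon bits + corpus bits.

    segmented_corpus: list of words, each a list of morph strings.
    bits_per_char is an assumed flat per-character code length.
    """
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost: spell out each distinct morph letter by letter.
    lexicon_bits = sum(len(m) * bits_per_char for m in counts)
    # Corpus cost: code each token with its maximum-likelihood probability.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits
```

Even this toy cost prefers segmentations that reuse morphs: splitting kahvissa, talossa, autossa into stem + ssa is cheaper than storing the three full words.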

  5. 2. Probabilistic formulation (Creutz, 2003), inspired by work of, e.g., M. R. Brent and M. G. Snover
  • ”Invent” a set of distinct strings = morphs
  • Length prior and frequency prior on the morph lexicon
  • Pick morphs from the lexicon and place them in a sequence
  [Figure: the same morph lexicon and corpus / word list diagram as on the previous slide, now annotated with a length prior and a frequency prior on the lexicon]
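The length prior can be illustrated with a toy geometric spelling model: each letter is drawn from an assumed uniform alphabet, so longer morphs are exponentially less likely. The alphabet size and end-of-morph probability here are invented parameters, not the priors of Creutz (2003).

```python
import math

def spelling_log_prob(morph, p_char=1 / 27, p_end=0.1):
    """Log-probability of a morph's spelling under a toy length prior:
    each letter uniform over an assumed 27-symbol alphabet, with a
    geometric stopping probability p_end after each letter."""
    return len(morph) * math.log2(p_char) + math.log2(p_end)
```

Under such a prior a short suffix like ssa is far more probable as a lexicon entry than a long unsegmented word like keskustelussa.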

  6. Reflections on solutions 1 and 2
  • ”Dumb” text compression algorithms
  • Common substrings of words appear as one segment, even when there is compositional structure, e.g.:
    • keskustelussa (keskustel + u + ssa; ”discuss+ion in”)
    • biggest (bigg + est)
  • Rare substrings of words are split, even when there is no compositional structure, e.g.:
    • a + den + auer (Adenauer; German politician)
    • in + s + an + e (in + sane)
  • Too weak structural constraints, e.g., suffixes recognized at the beginning of words:
    • s + can (scan)

  7. 3. Category-learning probabilistic model
  • Word structure captured by a regular expression:
    • word = ( prefix* stem suffix* )+
  • Morph sequences (words) are generated by a Hidden Markov model:
  [Figure: HMM over the sequence # nyky ratkaisu i sta mme #, with transition probs such as p(STM | PRE) and p(SUF | SUF), and emission probs such as p(’nyky’ | PRE) and p(’mme’ | SUF)]
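Scoring a tagged morph sequence under such an HMM might look like the sketch below: the word score is the product of transition probabilities p(cat_i | cat_{i-1}) and emission probabilities p(morph_i | cat_i), with '#' marking the word boundary. The trans and emit dictionaries are hypothetical toy tables, not trained model parameters.

```python
import math

def sequence_log_prob(tagged_morphs, trans, emit):
    """Log2-probability of a tagged morph sequence under a first-order
    HMM: transitions between categories, emissions of morphs from
    categories. '#' is the word-boundary tag."""
    logp = 0.0
    prev = "#"
    for morph, cat in tagged_morphs:
        logp += math.log2(trans[(prev, cat)]) + math.log2(emit[(morph, cat)])
        prev = cat
    # Close the word with a transition back to the boundary tag.
    logp += math.log2(trans[(prev, "#")])
    return logp
```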

  8. Category algorithm
  1. Start with an existing baseline morph segmentation (Creutz, 2003): nyky + rat + kaisu + ista + mme
  2. Initialize category membership probs for each morph, e.g., p(PRE | ’nyky’). Assume asymmetries between the categories.

  9. Initialization of category membership probs
  • Introduce a noise category for cases where none of the proper classes is likely.
  • Distribute the remaining probability mass proportionally among the proper categories.
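The proportional redistribution described above can be sketched as follows. The scores passed in stand for whatever per-category measures the model derives for a morph; those scoring functions are not reproduced here.

```python
def init_category_probs(scores, noise_prob):
    """Assign noise_prob to the noise category, then split the
    remaining probability mass among the proper categories (PRE,
    STM, SUF) in proportion to their raw scores."""
    total = sum(scores.values())
    probs = {cat: (1.0 - noise_prob) * s / total for cat, s in scores.items()}
    probs["NOISE"] = noise_prob
    return probs
```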

  10. Category algorithm (continued)
  1. Start with an existing baseline morph segmentation: nyky + rat + kaisu + ista + mme
  2. Initialize category membership probs for each morph.
  3. Tag morphs as prefix, stem, suffix, or noise. Then run EM on taggings: nyky + rat + kaisu + ista + mme
  4. Split morphs that consist of other known morphs. Then EM: nyky + rat + kaisu + i + sta + mme
  5. Join noise morphs with their neighbours. Then EM: nyky + ratkaisu + i + sta + mme
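Step 4 above can be sketched with a small dynamic program: a morph is split if it can be written as a concatenation of other morphs already in the lexicon (so ista becomes i + sta once i and sta are known). The greedy shortest-first choice here is a simplification; the paper's model chooses among candidate splits probabilistically.

```python
def split_into_known(morph, lexicon):
    """Return a split of `morph` into other known morphs, or None if
    no such split exists. best[i] holds a split of morph[:i]."""
    n = len(morph)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = morph[j:i]
            if best[j] is not None and piece in lexicon and piece != morph:
                best[i] = best[j] + [piece]
                break
    return best[n]
```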

  11. Experiments
  • Algorithms
    • Baseline model (Bayesian formulation)
    • Category-Learning model
    • Goldsmith’s ”Linguistica” (MDL formulation)
  • Data
    • Finnish data sets (CSC + STT): 10 000 words, 50 000 words, 250 000 words, 16 million words
    • English data sets (Brown corpus): 10 000 words, 50 000 words, 250 000 words
  [Figure: Linguistica-style signature with stems believ, hop, liv, mov, us and suffixes e, ed, es, ing]

  12. ”Gold standard” used in evaluation
  • Morpheme segmentation obtained for Finnish and English words
    • by processing the output of two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.)
  • Some ”fuzzy morpheme boundaries” allowed
    • mainly stem-final alternation, considered as a seam or joint allowed to belong to the stem or suffix, e.g.:
      • Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)
      • invite + s or invit + es; invite or invit + e (cf. invit + ing)
  • Compute precision and recall of correctly discovered morpheme boundaries
  Mathias Creutz
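The boundary-based evaluation can be sketched per word as follows: treat each segmentation as the set of character offsets where it places a cut, then compare proposed cuts against gold cuts. This is a minimal single-word version; the fuzzy-boundary alternatives described above are not handled.

```python
def boundary_prf(proposed, gold):
    """Precision, recall, and F-measure of morpheme boundaries for one
    word. Each argument is a list of morphs; boundaries are the
    character offsets between consecutive morphs."""
    def boundaries(morphs):
        cuts, pos = set(), 0
        for m in morphs[:-1]:
            pos += len(m)
            cuts.add(pos)
        return cuts
    p, g = boundaries(proposed), boundaries(gold)
    hits = len(p & g)
    prec = hits / len(p) if p else 1.0
    rec = hits / len(g) if g else 1.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

For example, proposing open + minded against gold open + mind + ed finds one of two gold boundaries and proposes no spurious ones, giving precision 1.0 and recall 0.5.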

  13. Results (evaluated against the gold standard)
  [Figure: precision and recall of discovered morpheme boundaries for the Baseline, Categories, and Linguistica algorithms on the 10k, 250k, and 16M data sets]

  14. Discussion
  • The Category algorithm
    • overcomes many of the shortcomings of the Baseline algorithm
      • excessive or too little segmentation
      • suffixes at the beginning of words
    • generalizes more than Linguistica, e.g.,
      • allus + ion + s (Categories) vs. allusions (Linguistica)
      • Dem + i (Categories) vs. Demi (Linguistica)
    • invents its own solutions
      • aihe + e + sta vs. aihe + i + sta (”about [the] topic/-s”)
      • phrase, phrase + s, phrase + d

  15. Future directions
  • The Category algorithm could be expressed more elegantly
    • not as a post-processing procedure making use of a baseline segmentation
  • Segmentation into morphs is useful
    • e.g., n-gram language modeling in speech recognition
  • Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful
    • e.g., information retrieval (?)

  16. Public demo
  • A demo of the baseline and category-learning algorithms is available on the Internet at http://www.cis.hut.fi/projects/morpho/.
  • Test it on your own Finnish or English input!

  17. [Figure: flowchart of the baseline search for the optimal segmentation of the words in a corpus: randomly shuffle the words; apply recursive binary splitting to each word (e.g., opening, openminded, reopened, conferences yield morphs such as open, mind, re, ed); check convergence of the description length; if converged, done, else repeat]
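The recursive binary splitting in the flowchart can be sketched for a single word as follows: try every split point, recurse on both halves, and keep whichever segmentation a given cost function scores lowest. The per-segmentation cost passed in here is a stand-in; the real algorithm re-evaluates a global, corpus-wide description length after each change.

```python
def recursive_split(word, cost):
    """Return the segmentation of `word` (a list of morphs) with the
    lowest cost, found by exhaustive recursive binary splitting.
    `cost` maps a list of morphs to a number."""
    best = [word]
    best_cost = cost(best)
    for i in range(1, len(word)):
        candidate = (recursive_split(word[:i], cost)
                     + recursive_split(word[i:], cost))
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best
```

With a toy cost that makes known morphs cheap, the search recovers re + open + ed for reopened.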
