Mathias Creutz and Krista Lagus Helsinki University of Technology (HUT)

Morfessor in the Morpho Challenge. Mathias Creutz and Krista Lagus, Helsinki University of Technology (HUT), Adaptive Informatics Research Centre. Morpho Challenge in Pascal Challenges Workshop, Venice, 12 April 2006.

Challenge for NLP: too many words • E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes): • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also) • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our) • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed) • Huge number of word forms, few examples of each • By splitting we get fewer basic units, each with more examples • Important to know the inner structure of words [Figure source: Creutz & Lagus, 2005 tech. rep.]

Solution approaches • Hand-made morphological analyzers (e.g., based on Koskenniemi’s TWOL = two-level morphology) • accurate • labour-intensive construction, commercial, coverage, updating when languages change, addition of new languages • Data-driven methods, preferably minimally supervised (e.g., John Goldsmith’s Linguistica) • adaptive, language-independent • lower accuracy • many existing algorithms assume few morphemes per word, unsuitable for compounds and multiple affixes

Goal: segmentation • Learn (Morfessor) representations of • the smallest individually meaningful units of language (morphemes) • and their interaction • in an unsupervised and data-driven manner from raw text • making as general and as language-independent assumptions as possible. • Evaluate • against a gold-standard morphological analysis of word forms (Hutmegs) • integrated in NLP applications (e.g. speech recognition)

Further challenges in morphology learning • Beyond segmentation: allomorphy (“foot – feet, goose – geese”) • Detection of semantic similarity (“sing – sings – singe – singed”) • Learning of paradigms (e.g., John Goldsmith’s Linguistica) [Figure: example paradigm with stems believ, hop, liv, mov, us and suffixes e, ed, es, ing]

Linguistic evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard) • Hutmegs contains gold standard segmentations obtained by processing the morphological analyses of FINTWOL and CELEX • 1.4 million Finnish word forms (FINTWOL, from Lingsoft Inc.) • Input: ahvenfileerullia (perch filet rolls) • FINTWOL: ahven#filee#rulla N PTV PL • Hutmegs: ahven + filee + rull + i + a • 120 000 English word forms (CELEX, from LDC) • Input: housewives • CELEX: house wife, NNx, P • Hutmegs: house + wive + s • Publicly available, see M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English.

Morfessor models in the Challenge • M1: Morfessor Baseline (2002) • Program code available since 2002 • Provided as a baseline model for the Morpho Challenge • Improves speech recognition; experiments since 2003 • No model of morphotactics • M2: Morfessor Categories-ML (2004) • Category-based modeling (HMM) of morphotactics • No speech recognition experiments before this challenge • No public software yet • M3: Morfessor Categories-MAP (2005) • More elegant mathematically

Avoiding overlearning by controlling model complexity • When using powerful machine learning methods, overlearning is always a problem • Occam’s razor: given two equally accurate theories, choose the one that is less complex • We have used: • Heuristic control affecting the size of the lexicon (M2) • Deriving a cost function that incorporates a measure of model size, using • MDL (Minimum Description Length) (M1) • MAP learning (Maximum A Posteriori) (M3)

Morfessor Baseline (M1) • Originally called the “Recursive MDL method” • Optimizes roughly: $P(M \mid \text{corpus}) \propto P(M)\,P(\text{corpus} \mid M)$, where $M = (\text{lexicon}, \text{grammar})$, and therefore the right-hand side equals $P(\text{lexicon})\,P(\text{corpus} \mid \text{lexicon}) = \prod_{\text{letters } \alpha} P(\alpha) \cdot \prod_{\text{morphs } \mu} P(\mu)$ • + MDL-based cost function optimizes size of the model • - Morph contextual information not utilized • undersegmentation of frequent strings (“forthepurposeof”) • oversegmentation of rare strings (“in + s + an + e”) • syntactic / morphotactic violations (“s + can”)
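The link between this probabilistic objective and an MDL cost can be made explicit by taking negative logarithms. This is a standard identity rather than something stated on the slide; the code-length notation $L(x) = -\log P(x)$ is introduced here only for illustration.

```latex
\begin{align*}
\hat{M} &= \arg\max_{M} \; P(M)\,P(\text{corpus} \mid M) \\
        &= \arg\min_{M} \; \bigl[ -\log P(M) - \log P(\text{corpus} \mid M) \bigr] \\
        &= \arg\min_{M} \; \bigl[ L(M) + L(\text{corpus} \mid M) \bigr]
\end{align*}
```

Read this way, a larger lexicon raises $L(M)$ while a lexicon that fits the corpus better lowers $L(\text{corpus} \mid M)$, which is the trade-off that keeps overlearning in check.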

Search for the optimal model (M1) • [Flowchart: Randomly shuffle words → Recursive binary splitting → Convergence of model prob.? (yes: Done; no: repeat)] • Example words: opening, openminded, reopened, conferences • Intermediate splits: reopen, minded • Resulting morphs: mind, open, re, ed
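The search loop on this slide (shuffle the words, split each one recursively in two, stop once the model cost no longer improves) can be sketched as follows. This is a minimal illustration, not the Morfessor implementation: the two-part cost in `description_length`, the flat `BITS_PER_LETTER` constant, and the helper names `bump`, `split_word` and `train` are all assumptions made for the example.

```python
import math
import random
from collections import Counter

BITS_PER_LETTER = 5.0  # assumed flat cost of spelling one letter in the lexicon


def description_length(counts):
    """Two-part cost in bits: spelling the lexicon plus coding the corpus."""
    total = sum(counts.values())
    lexicon_bits = sum((len(m) + 1) * BITS_PER_LETTER for m in counts)
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def bump(counts, morph, delta):
    """Change a morph count and drop the morph from the lexicon at zero."""
    counts[morph] += delta
    if counts[morph] <= 0:
        del counts[morph]


def split_word(word, counts):
    """Recursive binary splitting: keep the word whole or take the cheapest split."""
    bump(counts, word, 1)
    best_cost, best_split = description_length(counts), None
    bump(counts, word, -1)
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        bump(counts, left, 1)
        bump(counts, right, 1)
        cost = description_length(counts)
        bump(counts, right, -1)
        bump(counts, left, -1)
        if cost < best_cost:
            best_cost, best_split = cost, (left, right)
    if best_split is None:
        bump(counts, word, 1)  # no split helps: store the word as one morph
        return [word]
    left, right = best_split
    return split_word(left, counts) + split_word(right, counts)


def train(words, max_epochs=10, seed=0):
    """Shuffle, re-split every word, and repeat until the cost stops falling."""
    rng = random.Random(seed)
    words = sorted(set(words))
    counts, segmentation = Counter(), {}
    previous_cost = float("inf")
    for _ in range(max_epochs):
        rng.shuffle(words)
        for word in words:
            for morph in segmentation.get(word, []):
                bump(counts, morph, -1)  # remove the old analysis first
            segmentation[word] = split_word(word, counts)
        cost = description_length(counts)
        if previous_cost - cost < 1e-9:  # convergence of the model cost
            break
        previous_cost = cost
    return segmentation, counts
```

Calling `train(["opening", "openminded", "reopened", "conferences"])` returns a segmentation per word and the morph counts. With so few words the resulting splits are not expected to be linguistically meaningful; the point is only the control flow.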

Challenge Results: Comparison to gold standard splitting (F-measures) [Figure; legend: Winners; Morfessor Baseline: M1]
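An F-measure for a comparison against gold standard splitting can be sketched as below, assuming it is computed over morph boundary positions within each word. The exact scoring protocol of the challenge is not given in these slides, so the function names and the toy proposed segmentations are illustrative; the gold segmentations are the two Hutmegs examples quoted earlier.

```python
def boundaries(morphs):
    """Character offsets at which a morph boundary falls inside the word."""
    cuts, position = set(), 0
    for morph in morphs[:-1]:
        position += len(morph)
        cuts.add(position)
    return cuts


def boundary_f_measure(proposed, gold):
    """Precision, recall and F-measure of proposed boundaries against gold."""
    hits = n_proposed = n_gold = 0
    for word, gold_morphs in gold.items():
        p = boundaries(proposed[word])
        g = boundaries(gold_morphs)
        hits += len(p & g)
        n_proposed += len(p)
        n_gold += len(g)
    precision = hits / n_proposed if n_proposed else 0.0
    recall = hits / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


GOLD = {
    "housewives": ["house", "wive", "s"],
    "ahvenfileerullia": ["ahven", "filee", "rull", "i", "a"],
}
# Hypothetical system output that finds the compound boundaries only.
PROPOSED = {
    "housewives": ["house", "wives"],
    "ahvenfileerullia": ["ahven", "filee", "rullia"],
}
precision, recall, f_measure = boundary_f_measure(PROPOSED, GOLD)
```

Here all three proposed boundaries are correct but only three of the six gold boundaries are found, so precision is 1.0, recall is 0.5 and the F-measure is 2/3.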

Morfessor Categories – ML & MAP (M2, M3) • Lexicon / Grammar dualism • Word structure captured by a regular expression: word = ( prefix* stem suffix* )+ • Morph sequences (words) are generated by a Hidden Markov model (HMM) [Figure: HMM generating # over simpl ific ation s #; transition probs, e.g. P(STM | PRE), P(SUF | SUF); emission probs, e.g. P(’over’ | PRE), P(’s’ | SUF)] • Lexicon: morpheme properties and contextual properties • Morph segmentation is initialized using M1
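The two ingredients on this slide, the regular expression over morph categories and the HMM that scores a tagged morph sequence, can be sketched together. All probability values below are made up for illustration, and the category assignment for "over simpl ific ation s" is an assumption; only the pattern ( prefix* stem suffix* )+ and the kinds of probabilities involved come from the slide.

```python
import math
import re

# word = ( prefix* stem suffix* )+ written over space-terminated category tags
WORD_PATTERN = re.compile(r"^(?:(?:PRE )*STM (?:SUF )*)+$")


def is_valid(tags):
    """Check a category sequence against the word-structure pattern."""
    return bool(WORD_PATTERN.match("".join(tag + " " for tag in tags)))


def log_prob(morphs, tags, transition, emission):
    """Log-probability of a tagged morph sequence under a first-order HMM."""
    path = ["#"] + list(tags) + ["#"]  # '#' marks the word boundary
    total = 0.0
    for previous, current in zip(path, path[1:]):
        total += math.log(transition[(previous, current)])
    for morph, tag in zip(morphs, tags):
        total += math.log(emission[(morph, tag)])
    return total


# Hypothetical parameter values, keyed as (previous, current) and (morph, tag).
TRANSITION = {
    ("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.5,
    ("SUF", "SUF"): 0.3, ("SUF", "#"): 0.7,
}
EMISSION = {
    ("over", "PRE"): 0.05, ("simpl", "STM"): 0.001, ("ific", "SUF"): 0.01,
    ("ation", "SUF"): 0.03, ("s", "SUF"): 0.2,
}

MORPHS = ["over", "simpl", "ific", "ation", "s"]
TAGS = ["PRE", "STM", "SUF", "SUF", "SUF"]
score = log_prob(MORPHS, TAGS, TRANSITION, EMISSION)
```

A sequence such as `["SUF", "STM"]` is rejected by `is_valid`, which is the kind of morphotactic violation ("s + can") that the Baseline model cannot rule out.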

Morph lexicon (M2, M3) • Form: string, length • Morph distributional features: frequency, right perplexity, left perplexity
Morph    Length   Frequency   Right perplexity   Left perplexity
over     4        14029       136                1
s        1        17259       1                  4618
simpl    5        41          4                  1
...
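The distributional features stored in the morph lexicon can be collected from a segmented corpus roughly as follows. The sketch assumes that perplexity means the exponential of the entropy of the neighbouring-morph distribution and that the word boundary `#` counts as a neighbour; neither detail is spelled out on the slide, and the toy corpus is invented.

```python
import math
from collections import Counter, defaultdict


def perplexity(neighbour_counts):
    """exp(entropy) of the empirical distribution over neighbouring morphs."""
    total = sum(neighbour_counts.values())
    entropy = -sum(
        (c / total) * math.log(c / total) for c in neighbour_counts.values()
    )
    return math.exp(entropy)


def morph_features(segmented_words):
    """Length, frequency and left/right perplexity for every morph."""
    frequency = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for morphs in segmented_words:
        padded = ["#"] + list(morphs) + ["#"]
        for previous, current, following in zip(padded, padded[1:], padded[2:]):
            frequency[current] += 1
            left[current][previous] += 1
            right[current][following] += 1
    return {
        morph: {
            "length": len(morph),
            "frequency": frequency[morph],
            "left_perplexity": perplexity(left[morph]),
            "right_perplexity": perplexity(right[morph]),
        }
        for morph in frequency
    }


TOY_CORPUS = [
    ["over", "simpl", "ific", "ation", "s"],
    ["over", "look"],
    ["over", "come", "s"],
    ["look", "s"],
]
features = morph_features(TOY_CORPUS)
```

In this toy corpus "over" is always word-initial and followed by three different morphs, so its left perplexity is 1 and its right perplexity is 3, while "s" shows the mirror image. That is the same asymmetry as in the lexicon entries for "over" and "s" above.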

How morph distributional features affect morph categories • Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’) • Assume asymmetries between the categories:

How distributional features affect categories (2) • Distribute remaining probability mass proportionally, • e.g., • There is an additional non-morpheme category for cases where none of the proper classes is likely:
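One way to read the two bullets above is sketched here. The formulas from the slide did not survive, so everything in this block is an assumption: the three "likeness" scores in [0, 1] are hypothetical inputs, the non-morpheme probability is taken to be the chance that none of the proper classes applies, and the remaining mass is split in proportion to the scores.

```python
def category_priors(prefix_like, stem_like, suffix_like):
    """Turn three likeness scores in [0, 1] into a prior over four categories."""
    scores = {"PRE": prefix_like, "STM": stem_like, "SUF": suffix_like}
    # Non-morpheme: probable only when no proper class is likely (assumed form).
    non_morpheme = 1.0
    for score in scores.values():
        non_morpheme *= 1.0 - score
    remaining = 1.0 - non_morpheme
    total = sum(scores.values())
    priors = {"NON": non_morpheme}
    for category, score in scores.items():
        # Distribute the remaining mass proportionally to the scores.
        priors[category] = remaining * score / total if total > 0 else 0.0
    return priors


# Hypothetical scores for a morph that looks strongly prefix-like.
priors_over = category_priors(prefix_like=0.9, stem_like=0.2, suffix_like=0.05)
```

With these made-up scores most of the mass goes to PRE, and a morph scoring zero on all three classes would be assigned entirely to the non-morpheme category.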

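MAP vs. ML optimization • Morfessor Categories-ML (M2): maximum likelihood, $\arg\max_{\text{Lexicon}} P(\text{Corpus} \mid \text{Lexicon})$; control lexicon size heuristically • Morfessor Categories-MAP (M3): maximum a posteriori, $\arg\max_{\text{Lexicon}} P(\text{Lexicon})\,P(\text{Corpus} \mid \text{Lexicon})$ [Figure: the morph lexicon entries for over, s, simpl and the HMM over # over simpl ific ation s #, repeated from the earlier slides]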

Hierarchical structures in lexicon (M3) • Maintain the hierarchy of splittings for each word • Ability to code efficiently also common substrings which are not morphemes (e.g. syllables in foreign names) • Bracketed output [Figure: tree for “straightforwardness”: straightforwardness → straightforward + ness; straightforward → straight + forward; forward → for + ward; legend: Suffix, Stem, Non-morpheme]
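The hierarchy of splittings and the bracketed output can be illustrated with a small lookup structure. The splits are the ones shown for "straightforwardness"; storing them as a dictionary of binary splits and the function names `bracketed` and `leaves` are assumptions made for this sketch, and category labels are left out.

```python
# Each entry maps a string to its two immediate parts; strings without an
# entry are treated as unsplit leaves.
LEXICON = {
    "straightforwardness": ("straightforward", "ness"),
    "straightforward": ("straight", "forward"),
    "forward": ("for", "ward"),
}


def bracketed(morph, lexicon):
    """Render the full hierarchy of splittings as a bracketed string."""
    if morph not in lexicon:
        return morph
    left, right = lexicon[morph]
    return "[" + bracketed(left, lexicon) + " " + bracketed(right, lexicon) + "]"


def leaves(morph, lexicon):
    """Expand a string down to the finest parts stored in the lexicon."""
    if morph not in lexicon:
        return [morph]
    left, right = lexicon[morph]
    return leaves(left, lexicon) + leaves(right, lexicon)


print(bracketed("straightforwardness", LEXICON))  # [[straight [for ward]] ness]
```

Because "forward" is stored once and reused, any other word containing it shares the same sub-structure, which is what makes common substrings cheap to code.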

Example segmentations M3

Challenge Results: Comparison to gold standard splitting (F-measures) [Figure; legend: Winner; Committees; Morfessor Categories models: M2 and M3; Morfessor Baseline: M1]

Morfessor results: closer look M3

Speech recognition results: Finnish [Figure; legend: Morfessors: M1, M2, M3; Committees; Competitors]

Speech recognition results: Turkish [Figure; legend: Morfessors: M1, M2, M3; Committees]

A reason for differences? [Figure source: Creutz & Lagus, 2005 tech. rep.]

Discussion • This was the first time our Category methods were evaluated in speech recognition, with nice results! • The comparison between the Morfessors and the other challenge participants is not quite fair • Possibilities to extend the M3 model • add word contextual features for “meaning” • more fine-grained categories • beyond concatenative phenomena (e.g., goose – geese) • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)

Questions for the Morpho Challenge • How language-general in fact are the methods? • Norwegian, French, German, Arabic, ... • Did we, or can we succeed in inducing ”basic units of meaning”? • Evaluation in other NLP problems: MT, IR, QA, TE, ... • Application of morphs to non-NLP problems? Machine vision, image analysis, video analysis ... • Will there be another Morpho Challenge?

See you in another challenge! • best wishes, Krista (and Sade)

My notes • describe our own methods briefly • discuss the differences between our own methods in relation to their properties • be humble, point out why the comparison is unfair (earlier experience; our own data; and speech recognition is a familiar application too, so that group's earlier research may have indirectly influenced our method development) + discussion of the differences between our methods? + example segmentations from all of ours? + discussion material from Mikko's paper and from our paper + the first results figure is currently confusing + the colors differ from those in the other figures: change the colors and duplicate the figure, make the winner stand out better

Discussion • Possibility to extend the model • rudimentary features used for “meaning” • more fine-grained categories • beyond concatenative phenomena (e.g., goose – geese) • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful) • Already now useful in applications • automatic speech recognition (Finnish, Turkish)

Competition 1: Comparing Morfessors

Overview of methods • Machine learning methodology • Features used: information on • morph contexts (Bernard, Morfessor) • word contexts (Bordag)

Morpho project page (Krista Lagus): http://www.cis.hut.fi/projects/morpho/

http://www.cis.hut.fi/projects/morpho/ Demo 6

Demo 7
