Unsupervised Segmentation of Words into Morphemes Morpho Challenge Workshop 2006

Mikko Kurimo, Mathias Creutz, Krista Lagus

Opening – Welcomes • Welcome to the Morpho Challenge workshop, everybody! • challenge participants • workshop speakers • other PASCAL researchers • others interested in the topic

Motivation • To design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. • Get basic vocabulary units suitable for different tasks: • Speech and text understanding • Machine translation • Information retrieval • Statistical language modelling • Rule-based systems can split simple cases such as read + ing, but have difficulties with complicated words and languages

Workshop 12 April, final timetable • 0900 Opening • 0910 Introduction and evaluation report • 0950 Invited talk by Richard Sproat • 1050 Break • 1120 Morfessor baseline by Krista Lagus • 1150 Competitors' presentations • 1230 Lunch • 1400 Competitors' presentations (contd.) • 1500 Discussion • 1530 Conclusion

Morning session • 09:10 Mikko Kurimo • Introduction and Evaluation report • 09:50 Prof. Richard Sproat (Invited Talk) • University of Illinois at Urbana-Champaign • "Computational Morphology and its Implications for the Theoretical Morphology" • 10:50 – 11:20 Coffee break

Noon session • 11:20 Krista Lagus: "Morfessor in MorphoChallenge" • 11:50 Delphine Bernhard: "Morphological segmentation for the automatic acquisition of semantic relationships in the context of MorphoChallenge 2005" • 12:10 Stefan Bordag: "Two-step approach to unsupervised morpheme segmentation" • 12:30 – 14:00 Lunch

Afternoon session • 14:00 Lars Johnsen: • "Learning morphology on tokens" • 14:20 Samarth Keshava and Emily Pitler: • "Reports - Quick and Simple Unsupervised Learning of Morphemes" • 14:40 Eric Atwell (Mikko Kurimo): • "Combinatory Hybrid Elementary Analysis of Text" • 15:00 Discussion • 15:30 Conclusion

Discussion topics for afternoon • New ways to evaluate the obtained units? • New evaluation languages: German, Norwegian, French, Estonian, Arabic, ...? • Other application evaluations: SLU, IR, MT, ...? • New organizer partners? • MorphoChallenge2? • Journal special issue? • 2nd Morpho Challenge workshop? • ?

Opening - Thanks • Thanks to all who made Morpho Challenge possible! • PASCAL network, coordinators, challenge program organizers • Morpho Challenge organizing committee • Morpho Challenge program committee • Morpho Challenge participants • Morpho Challenge evaluation team • Challenge workshop organizers

Let’s start. It is my pleasure to welcome the first speaker, who is...

Morpho Challenge – Introduction and evaluation report Mikko Kurimo, Mathias Creutz, Matti Varjokallio (Helsinki, FI) Ebru Arisoy, Murat Saraclar (Istanbul, TR)

Contents • Motivation • Call for participation • Rules • Datasets • Participants • Results of competition 1, word segmentation • Results of competition 2, language modeling • Conclusion

Motivation • To design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. • Get basic vocabulary units suitable for different tasks: • Speech and text understanding • Machine translation • Information retrieval • Statistical language modelling

Motivation • The scientific goals of this challenge are: • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

Call for participation • Part of the EU Network of Excellence PASCAL’s Challenge Program • Participation is open to all and free of charge • Word sets are provided for three languages: Finnish, English, and Turkish • Implement an unsupervised algorithm that segments the words of each language! • No language-specific tweaking parameters, please • Write a paper that describes your algorithm

Rules • Segmented words are submitted to the organizers • Two different evaluations are made • Competition 1: Comparison to a linguistic morpheme segmentation "gold standard" • Competition 2: Speech recognition experiments, where statistical n-gram language models use the morphemes instead of entire words.

Datasets • Word lists are downloadable from our home page • Each word in the list is preceded by its frequency • Sizes are given as word types / word tokens • Finnish: newspapers, books, newswires: 1.6M/32M • Turkish: web, newspapers, sports news: 0.6M/17M • English: Gutenberg, Gigaword, Brown: 170k/24M • Small gold standard sample in each language
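A minimal sketch of reading such a word list, assuming each line holds an integer frequency followed by whitespace and the word. The exact file layout and the function name `parse_word_list` are illustrative assumptions, not taken from the challenge documentation.

```python
def parse_word_list(lines):
    """Map each word to its corpus frequency.

    Assumes one entry per line in the form "<frequency> <word>";
    blank lines are skipped.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        freq, word = line.split(None, 1)  # split on the first whitespace run
        counts[word] = int(freq)
    return counts


# Hypothetical sample entries, for illustration only.
sample = ["12 talo", "3 talossa", ""]
counts = parse_word_list(sample)
n_types = len(counts)            # number of distinct words
n_tokens = sum(counts.values())  # total occurrences in the corpus
```

The number of entries gives the count of distinct words and the sum of the frequencies gives the corpus size, which is how a pair of figures such as 170k/24M can be read off a list of this kind.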

Participants • A1 Choudri and Dang, Univ. Leeds, UK • A2 a,b, Bernhard, TIMC-IMAG, F • A3 'A.A.', Ahmad and Allendes, Univ. Leeds, UK • A4 'comb', 'lsv', Bordag, Univ. Leipzig, D • A5 Rehman and Hussain, Univ. Leeds, UK • A6 'RePortS', Pitler and Keshava, Yale Univ., USA • A7 Bonnier, Univ. Leeds, UK • A8 Kitching and Malleson, Univ. Leeds, UK • A9 'Pacman', Manley and Williamson, Univ. Leeds, UK • A10 Johnsen, Univ. Bergen, NO • A11 'Swordfish', Jordan, Healy and Keselj, Dalhousie Univ., CA • A12 'Cheat', Atwell and Roberts, Univ. Leeds, UK • M1-3 Morfessor, Categories-ML, MAP, Helsinki Univ. Tech, FI

Competition 1: Word segmentation • Proposed segmentations of two sample words: boule_vard, cup_bearer_s' • Gold standard: boulevard, cup_bear_er_s_' • Counting morpheme boundaries: 2 correct hits (H), 1 insertion (I), 2 deletions (D) • Precision = H / (H + I) = 2 / (2 + 1) = 0.67 • Recall = H / (H + D) = 2 / (2 + 2) = 0.50 • F-Measure = harmonic mean of precision and recall = 2H / (2H + I + D) = 4 / (4 + 1 + 2) = 0.57 • A secret (random) 10% subset of the words is evaluated • Morfessor Baseline: 54.2% FI, 51.3% TR, 66.0% EN
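The boundary counting on this slide can be sketched as follows. This is an illustrative reimplementation, not the organizers' evaluation script: the function names and the use of "_" as the boundary marker are assumptions taken from the notation of the example.

```python
def boundaries(segmented, sep="_"):
    """Return the set of letter offsets at which a boundary is placed."""
    positions, offset = set(), 0
    for ch in segmented:
        if ch == sep:
            positions.add(offset)  # boundary before the letter at this offset
        else:
            offset += 1
    return positions


def score(proposed, gold):
    """Count hits, insertions and deletions over paired segmentations."""
    hits = insertions = deletions = 0
    for p, g in zip(proposed, gold):
        bp, bg = boundaries(p), boundaries(g)
        hits += len(bp & bg)        # boundary present in both
        insertions += len(bp - bg)  # proposed but not in the gold standard
        deletions += len(bg - bp)   # in the gold standard but missed
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    total = 2 * hits + insertions + deletions
    f_measure = 2 * hits / total if total else 0.0
    return hits, insertions, deletions, precision, recall, f_measure


h, i, d, p, r, f = score(["boule_vard", "cup_bearer_s'"],
                         ["boulevard", "cup_bear_er_s_'"])
print(h, i, d, round(p, 2), round(r, 2), round(f, 2))
```

Run on the two sample words, this reproduces the counts on the slide: 2 hits, 1 insertion, 2 deletions, giving precision 0.67, recall 0.50 and F-measure 0.57.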

Results: F-measure in Finnish data

F-measure with reference algorithms

F-measure in Turkish data

F-measure in English data

F-measure, the 3 languages task

...with reference algorithms

Competition 2: Language modeling • A statistical N-gram LM is trained on the obtained morphemes using a large text corpus • Growing N-gram model for Finnish by HUT tools • 4-gram model for Turkish using SRILM • Free lexicon size (40,000 – 700,000) • ~10M N-grams (Finnish) or 50-70 MB (Turkish)

Evaluation by speech recognition • Realistic benchmark application: Continuous reading of large-vocabulary texts (books and news) • Letter error rate LER% = (substitutions + insertions + deletions) / letters • Baseline systems using LMs of Morfessor's segments • Finnish recognizer made at HUT (HUT tools): speaker-dependent, running speed 10-15 xRT, baseline 1.31% LER • Turkish recognizer made at Bogazici Univ. (HTK and AT&T tools): speaker-independent, running speed 2-3 xRT, baseline 13.7% LER
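The letter error rate above can be sketched with a standard Levenshtein alignment between the reference text and the recognizer output. This is a minimal illustration under assumptions: the function names are hypothetical, the denominator is taken to be the number of characters in the reference, and spaces are counted like any other character, which the slide does not specify.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn ref into hyp (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def letter_error_rate(ref, hyp):
    """LER in percent, relative to the length of the reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one letter dropped from a seven-letter word.
ler = letter_error_rate("talossa", "talosa")
```

Measuring errors at the letter level rather than the word level keeps the score comparable across different segmentations, since the recognized units differ from one submitted algorithm to the next.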

Speech recognition letter error rate (LER)

LER for reference algorithms

LER for grammatical rules and words, too

Update for Turkish results NEW

Conclusion • The scientific goals of this challenge are: • To learn of the phenomena underlying word construction in natural languages • To discover approaches suitable for a wide range of languages • To advance machine learning methodology

Conclusion • 14 different unsupervised segmentation algorithms • 12 participating research groups • Evaluations for 3 languages • Full report and papers in the proceedings • Website: http://www.cis.hut.fi/morphochallenge2005

Acknowledgments • Text and speech data providers in all languages! • Finnish and Turkish evaluation teams • Funding from PASCAL, Finnish Academy, Lang. Tech. Grad school, HUT, and Bogazici Univ. • LM and ASR tools from HUT, SRI, and AT&T • Competition participants!

The second speaker today: Professor Richard Sproat, University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"

Richard Sproat • Professor of Linguistics and Electrical and Computer Engineering at the University of Illinois and head of the Computational Linguistics Lab at the Beckman Institute. • Received his Ph.D. from MIT in 1985 and has since also worked at AT&T Bell Labs. • A well-known expert in language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, writing systems, and text-to-scene conversion.
