LingBrowser is an active tool for linguistic exploration on real Turkish text, showcasing NLP resources. It includes morphological and semantic features, with ongoing work on multi-word constructs and syntactic relations. The prototype utilizes finite state transducers for lexical and surface morpheme analysis, pronunciation representation, and more.
Using Finite State Technology in a Tool for Linguistic Exploration Kemal Oflazer, Mehmet Erbaş, Müge Erdoğmuş Sabancı University Istanbul, Turkey
Background and Motivation • LingBrowser is an active and interactive tool for students of (introductory) linguistics to explore linguistic information on real text as opposed to canned examples. • A showcase for the natural language processing resources and technology (for Turkish in this case) • A testbed for the use of NLP in (native and foreign) language learning.
Background and Motivation • Joint work with UC Berkeley, recently funded by US and Turkish NSF as a 3-year joint project, as a follow-up project to: • TELL – Turkish Electronic Living Lexicon (US NSF) • A Unified Electronic Lexicon Of Turkish (US and Turkish NSF)
Turkish • Agglutinative morphology with many morphophonological processes • e.g., vowel harmony • pronunciation (phoneme selection/stress position) is a function of morphological structure and function, and lexical semantics • lots of derivational processes • semi/non-lexicalized collocations • free constituent order
LingBrowser Functionality (Current Prototype) • Access to linguistic information in arbitrary Turkish Web content and text • Lexical • Phonological • phonemes, syllables, stress position • Morphological • lexical and surface morpheme structure, morphological features encoded • Semantic • dictionary access • WordNet access • root word translation
LingBrowser Functionality (On-going Work and Future) • Access to linguistic information in arbitrary Turkish Web content and text • Multi-word constructs • Named-entity identification • Surface syntax • NP extraction and structure display • Surface syntactic relations • Lexical Translation/Paraphrasing • Phrasal translation
LingBrowser Prototype • Morphological Analysis
LingBrowser Prototype • Surface Morpheme Structure
LingBrowser Prototype • Lexical Morpheme Structure
LingBrowser Prototype • Aligned Lexical Surface Structure
LingBrowser Prototype • Pronunciation Representation (SAMPA) • Interleaved • Parallel
LingBrowser Prototype • WordNet Lookups (via aligned Turkish and English Wordnets) • English translations/glosses of the root word • Turkish Synonyms
LingBrowser Prototype • Word Concordances • Morphological Concordance • All forms with the selected root / POS combination are listed in context • one can see possible objects of a verb regardless of the inflected/derived form it appears in • Much more meaningful for languages like Turkish, Finnish, etc.
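A morphological concordance of this kind can be sketched in a few lines. This is only an illustration, not the LingBrowser implementation: the token list, the `(surface, root, POS)` tuple layout and the function name `morphological_concordance` are assumptions made for the example, and in the real tool the root and POS would come from the morphological analyzer.

```python
# Hypothetical pre-analyzed tokens: (surface form, root, part of speech).
# These analyses are hand-written for illustration only.
TOKENS = [
    ("kitabı", "kitap", "Noun"),
    ("okudum", "oku", "Verb"),
    ("gazete", "gazete", "Noun"),
    ("okuyacaklar", "oku", "Verb"),
    ("ve", "ve", "Conj"),
    ("mektubu", "mektup", "Noun"),
    ("okutmadı", "oku", "Verb"),
]


def morphological_concordance(tokens, root, pos, window=1):
    """Return (left context, hit, right context) for each token with this root/POS."""
    hits = []
    for i, (surface, token_root, token_pos) in enumerate(tokens):
        if token_root == root and token_pos == pos:
            left = [t[0] for t in tokens[max(0, i - window):i]]
            right = [t[0] for t in tokens[i + 1:i + 1 + window]]
            hits.append((left, surface, right))
    return hits


if __name__ == "__main__":
    for left, hit, right in morphological_concordance(TOKENS, "oku", "Verb"):
        print(" ".join(left), f"[{hit}]", " ".join(right))
```

Matching on the root rather than the surface string is what lets okudum, okuyacaklar and okutmadı appear in the same listing, so the objects of the verb can be compared across inflected and derived forms.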
(Prototype) Implementation • LingBrowser (indirectly) employs almost all the finite state language resources we have built over the last 10 years • All built using Xerox xfst, lexc and twolc • Indirectly via a database interface
Finite State Transducers Employed • Total of 750 xfst regular expressions + 100K root words (mostly proper names) over about 50 files • [Diagram: network of transducers built over the surface form, with components: Stress Computation Transducer, Syllabification Transducer, Exceptional Phonology Transducer, SAMPA Mapping Transducer, Filters, Modified Inverse Two-Level Rule Transducer, Filters (Duplicating), Lexicon and Morphotactic Constraints Transducer, Two-Level Rule Transducer]
Finite State Transducers Employed • Two-Level Morphological Analyzer: 1M states, 1.6M transitions • Surface form evinde → feature forms • ev+Noun+A3sg+P3sg+Loc • ev+Noun+A3sg+P2sg+Loc
Finite State Transducers Employed • Lexical Morphemes Transducer: ~400K states, 1M transitions • Surface form evinde → lexical morpheme sequences • ev+sH+ndA • ev+Hn+DA
Finite State Transducers Employed • Surface Morphemes Transducer: ~560K states, 1.4M transitions • Surface form evinde → surface morpheme sequences • ev+i+nde • ev+in+de
Finite State Transducers Employed • Pronunciation Lexicon Transducer: ~6.5M states, 8.5M transitions • Surface form evinde → pronunciation e – v i n – "d e
Finite State Transducers Employed • Aligned pairs transducer • Input is the surface form • Output is a representation of the aligned lexical-surface feasible pairs; e.g., for evinde we want to produce • ev+Hn+DA aligned with ev0in0de • ev+sH+nDA aligned with ev00i0nde
Aligned-pairs Transducer • We use a modified version of the two-level rule transducer • Feasible pair a:b is replaced with "a-b":b • A rule like a:b => LC _ RC is rewritten as "a-b":b => LC' _ RC', where the contexts are expressed in terms of the new feasible pairs • Let's call this the AlignedTwoLevelTransducer
Aligned-pairs Transducer • A MapToPairs transducer maps each lexical symbol in the original grammar to the representations of the feasible pairs in the original grammar in which it is the lexical side • e.g., if we have A:a, A:e and A:0 as three feasible pairs with A on the lexical side, • then MapToPairs maps A to "A-a", "A-e" and "A-0"
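The MapToPairs mapping can be pictured as a plain lookup table. The sketch below is not xfst code and not the actual grammar: the pair list and the helper names `pair_symbol` and `build_map_to_pairs` are assumptions for illustration, with "0" standing for the empty surface symbol as on the slides.

```python
from collections import defaultdict

# Illustrative feasible pairs (lexical, surface); "0" is the empty surface symbol.
# The real grammar has many more pairs; these are assumed for the example.
FEASIBLE_PAIRS = [("A", "a"), ("A", "e"), ("A", "0"), ("e", "e"), ("v", "v")]


def pair_symbol(lexical, surface):
    """Name of the single symbol that represents the feasible pair lexical:surface."""
    return f"{lexical}-{surface}"


def build_map_to_pairs(pairs):
    """Map each lexical symbol to every pair symbol in which it is the lexical side."""
    table = defaultdict(list)
    for lexical, surface in pairs:
        table[lexical].append(pair_symbol(lexical, surface))
    return dict(table)


MAP_TO_PAIRS = build_map_to_pairs(FEASIBLE_PAIRS)

if __name__ == "__main__":
    print(MAP_TO_PAIRS["A"])
```

In a finite state setting the same table would be a one-state transducer that relates each lexical symbol to the union of its pair symbols; the dictionary form only shows the relation itself.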
Aligned-pairs Transducer • [Diagram: the Lexicon Transducer, with feature symbols on the upper side and lexical symbols on the lower side]
Aligned-pairs Transducer • Extract the lower side of the Lexicon Transducer (Lexicon Transducer.l) • The new transducer accepts all lexical symbol sequences allowed by the morphotactic constraints.
Aligned-pairs Transducer • Apply MapToPairs to Lexicon Transducer.l (lexical symbols below, feasible-pair symbols above) • This transducer maps lexical symbol sequences to valid possible feasible pair sequences.
Aligned-pairs Transducer • Extract the upper side to obtain the feasible-pair sequence recognizer • This transducer accepts all potentially valid feasible pair sequences.
Aligned-pairs Transducer • Combine the recognizer with the AlignedTwoLevelTransducer (surface symbols below) to obtain the feasible-pair sequence transducer • This transducer maps surface forms to feasible pair sequences subject to morphographemic and morphotactic constraints.
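The effect of this construction can be imitated by brute force on a toy example. The sketch below is not a finite state implementation and carries strong assumptions: the two lexical forms stand in for the lower side of the lexicon, the pair inventory is invented, and a simple surface-string match replaces the context constraints that the real two-level rules would enforce.

```python
from itertools import product

# Assumed toy data, not the actual grammar.
LEXICAL_FORMS = ["ev+Hn+DA", "ev+sH+nDA"]  # stands in for Lexicon Transducer.l
FEASIBLE_PAIRS = {  # lexical symbol -> possible surface symbols ("0" = empty)
    "e": ["e"], "v": ["v"], "n": ["n"],
    "+": ["0"],
    "H": ["i", "u", "0"],
    "s": ["s", "0"],
    "D": ["d", "t"],
    "A": ["a", "e"],
}


def pair_sequences(lexical):
    """MapToPairs step: every sequence of pair symbols for one lexical string."""
    options = [[f"{c}-{s}" for s in FEASIBLE_PAIRS[c]] for c in lexical]
    return product(*options)


def surface_of(pairs):
    """Surface side of a pair sequence, with the empty symbol 0 dropped."""
    return "".join(p.split("-", 1)[1] for p in pairs).replace("0", "")


def align(surface):
    """Pair sequences over the toy lexicon whose surface side equals `surface`."""
    return [seq for lex in LEXICAL_FORMS
            for seq in pair_sequences(lex) if surface_of(seq) == surface]


if __name__ == "__main__":
    for seq in align("evinde"):
        print(" ".join(seq))
```

Enumerating every candidate is only workable for a toy lexicon; composing the transducers once, as the slides describe, gives the same mapping without enumeration and with the rule contexts checked.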
Aligned-pairs Transducer • [Diagram: surface form evinde enters the AlignedTwoLevelTransducer and the feasible-pair sequence transducer, producing two alignments] • ev+Hn+DA over ev0in0de • ev+sH+nDA over ev00i0nde
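For display purposes, an aligned pair of strings like the ones above can be laid out symbol by symbol. This is a minimal sketch under the assumption that both strings are already the same length and that "0" marks an empty surface symbol; the function names are invented for the example.

```python
def render_alignment(lexical, surface):
    """Return the three-row display: lexical row, bars, surface row."""
    if len(lexical) != len(surface):
        raise ValueError("aligned strings must have equal length")
    return "\n".join([lexical, "|" * len(lexical), surface])


def aligned_pairs(lexical, surface):
    """List of (lexical symbol, surface symbol) pairs, position by position."""
    if len(lexical) != len(surface):
        raise ValueError("aligned strings must have equal length")
    return list(zip(lexical, surface))


def surface_word(surface):
    """Drop the empty symbols to recover the written word."""
    return surface.replace("0", "")


if __name__ == "__main__":
    print(render_alignment("ev+sH+nDA", "ev00i0nde"))
    print(surface_word("ev00i0nde"))
```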
Implementation • Other resources used • Turkish WordNet aligned with the English WordNet • Current prototype was implemented in 4 months as a senior project, on MS .NET platform • Now being ported to Java
Implementation • Text is annotated in the background with multiple threads • All text items are reverse indexed on relevant features (morphemes, features, syllables, phonemes, etc.) for fast search, e.g., • Find all bi-syllabic words with an open syllable ending in “a” • Find all words with bi-syllabic roots with a long root final vowel • Find all finite verbs in future tense with 3rd person plural agreement • Find all words using the lexical morpheme “+sHz” • Find all words in which lexical “+sH” is aligned with surface “00u” • Find all words with syllables with multiconsonant codas
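Reverse indexing on features can be sketched as an inverted index from feature to words, with a query answered by intersecting the matching sets. The annotations, the `key:value` feature strings and the function names below are assumptions for illustration; the actual prototype stores its annotations behind a database interface.

```python
from collections import defaultdict

# Hypothetical annotations; real values would come from the transducers.
ANNOTATED = {
    "evinde": {"root:ev", "morph:+DA", "syllables:3"},
    "evsiz": {"root:ev", "morph:+sHz", "syllables:2"},
    "susuz": {"root:su", "morph:+sHz", "syllables:2"},
}


def build_index(annotated):
    """Inverted index: feature -> set of words carrying that feature."""
    index = defaultdict(set)
    for word, features in annotated.items():
        for feature in features:
            index[feature].add(word)
    return index


def search(index, *features):
    """Words carrying all the given features (intersection of the posting sets)."""
    postings = [index.get(f, set()) for f in features]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    idx = build_index(ANNOTATED)
    print(sorted(search(idx, "morph:+sHz")))
    print(sorted(search(idx, "morph:+sHz", "root:ev")))
```

Building the index once while the text is annotated keeps each query down to a few set lookups, which is the point of indexing on features rather than scanning the text per query.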
Future Functionality • Lexical paraphrasing • evimizdekiler (those things) in our house • Gets nasty when multiple derivations are present • Finlandiyalılaştıramadıklarımızdanmışsınızcasına (behaving) as one of those who we could not convert into a Finn(ish citizen) • Tree transducers • Extensive explanatory feedback • Morphographemics (why is lexical s deleted?) • show triggering contexts in addition to the rule • Pronunciation (why is this syllable stressed?) • show exceptional stress morphemes and explain their interaction
Future Functionality • Drills • Generate surface form from lexical form • Segment into surface morphemes • Identify morphosyntactic features encoded by morphemes • Generate surface form from a set of features
Future Functionality • Surface syntactic relations • Sample text: Eski Mısır kültüründe, çocuğa akıllı küçük denilmekteydi. Küçük yetişkin deyimi geleneksel toplumların çocuğu yetişkin yaşamına teşvik eden işleriyle kabul gördü. Ortaçağ'da ise, Avrupa'da çocuklara küçük hayvanlar denildi. Sanayileşme bu kültürel ayırımı hayata geçirerek çocuğu yetişkin yaşamından kopardı. Çocukluğu yetişkinlikten ayrı bir döneme indirgemek, çocukların geleceğe uyumlarını güçleştirecektir. Kaldı ki, bilgi toplumunda öylesi bir soyutlamanın, yani çocukluğun yetişkinlikten ayrı tutulmasının, imkansız denecek hale geldiği ise, açık bir gerçektir... • Relation displayed on the sample text: sanayi+Noun+..^DB+Verb+Become..^DB+Noun+Inf+..+Nom is the Subject of kop+Verb^DB+Verb+Caus+Past+A3sg
Planned Deployment • We expect to have a version to be tested in Sharon Inkelas’ Linguistics course at Berkeley, by Fall 2006.
Summary • LingBrowser is an active and interactive tool for linguistic exploration on real (Turkish) text • Query • Search • See explanations • Extensive use of finite state language resources • Being extended to include additional functionality.