This seminar presentation discusses the morphological analysis of Spanish strings and the generation of strings from morphological descriptions. It covers the formalization, discovery, and implementation of Spanish morphological rules, focusing on verbs, nouns, and adjectives. The use of Xerox Finite-State Tools and evaluation criteria are explained, along with testing methods and coverage assessments. Possible improvements and references to related works are also provided.
ACL 4 NCLT Seminar Presentation, 7th June 2006 John Tinsley Morphological Analysis of Spanish Using Finite-State Transducers
Introduction • What is this project about? • Provide morphological information on Spanish strings • Generate strings from morphological descriptions • What were my aims? • Robust, fast application, easily integrated into other systems • 80% token coverage on unrestricted text • 100% coverage of Spanish morphology
Design Methodology • Formalisation • Discovery of Spanish morphological rules • Implementation • Coding of morphological model with Xerox Finite-State Tools • Evaluation • Check for accuracy & well-formedness • Assess language coverage
Spanish Morphology - Verbs • Inflected for person, tense/mood, number • Regular verbs • 3 regular conjugations identified by infinitive endings • ‘-ar’, ‘-er’, and ‘-ir’ • Irregular verbs • 66 distinct irregularities • Varying degrees of irregularity
Spanish Morphology - Nouns • Inflected for number, gender • 7 types of noun • Feminine, masculine, neutral, derivative, profession, number invariant, proper • Irregularities • All arise via pluralisation • Accentuation, character alterations
Spanish Morphology - Adjectives • Inflected for number, gender • 4 types of adjective • Neutral, derivative, profession, irregular • Adverbs derived from adjectives by addition of suffix ‘mente’
Xerox Finite-State Tools - lexc • Lexicon compiler • Compiles ‘continuation classes’ into lexical transducers
Xerox Finite-State Tools - xfst • Xerox finite-state tool • Compiles regular expressions into networks • Regular expression replace rules [ String -> Replacement || left-context _ right-context ]
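The replace-rule notation can be mimicked outside xfst for quick experiments. What follows is a minimal Python sketch, not xfst itself: `replace_rule` and its parameters are hypothetical names, contexts are treated as literal strings, and regular-expression lookbehind and lookahead stand in for the left and right contexts. The sample string `luz^NZ` is an assumed intermediate form used only for illustration.

```python
import re

def replace_rule(string, target, replacement, left="", right=""):
    """Approximate [ target -> replacement || left _ right ] on a plain string.

    Contexts are literal strings here; real xfst contexts can be any
    regular expression, and the rule compiles to a transducer.
    """
    pattern = re.escape(target)
    if left:
        pattern = "(?<=" + re.escape(left) + ")" + pattern
    if right:
        pattern = pattern + "(?=" + re.escape(right) + ")"
    # A function replacement avoids backslash interpretation in `replacement`.
    return re.sub(pattern, lambda match: replacement, string)

print(replace_rule("luz^NZ", "z", "c", right="^NZ"))  # z sits before the trigger
print(replace_rule("luz", "z", "c", right="^NZ"))     # no trigger, so no change
```

Unlike this string rewrite, an xfst rule is itself a network that can be composed with the lexicon transducer and run in both directions.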
Xerox Finite-State Tool - example • conocer - ‘to know’ • 1st person, pres. ind. ‘conozco’ • Lexical transducer mappings • conoc:conoc • er+Verb:ε • +PresInd:^PresInd • +1P+Sg:o
Xerox Finite-State Tool - example cont… • Composed replace rule [ c -> {zc} || _ ^PresInd ] • Triggered by the ^PresInd tag • Makes the required changes, then removes the trigger
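The conocer example can be traced end to end in a few lines. This is a hedged Python sketch of the two steps on the slides (lexical mapping, then the composed replace rule), not the xfst implementation; the function names are hypothetical, and only the single form shown on the slide is covered.

```python
import re

# Upper-side : lower-side mappings exactly as listed on the slide.
MAPPINGS = [
    ("conoc", "conoc"),
    ("er+Verb", ""),          # maps to epsilon (the empty string)
    ("+PresInd", "^PresInd"),
    ("+1P+Sg", "o"),
]

def to_intermediate(lexical):
    """Walk the mappings in order and concatenate the lower-side strings."""
    rest, out = lexical, []
    for upper, lower in MAPPINGS:
        if not rest.startswith(upper):
            raise ValueError("no path for: " + lexical)
        out.append(lower)
        rest = rest[len(upper):]
    if rest:
        raise ValueError("unconsumed input: " + rest)
    return "".join(out)

def apply_rule(intermediate):
    """[ c -> {zc} || _ ^PresInd ], followed by removal of the trigger."""
    changed = re.sub(r"c(?=\^PresInd)", "zc", intermediate)
    return changed.replace("^PresInd", "")

def generate(lexical):
    return apply_rule(to_intermediate(lexical))

print(to_intermediate("conocer+Verb+PresInd+1P+Sg"))  # intermediate form
print(generate("conocer+Verb+PresInd+1P+Sg"))         # surface form
```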
Verb Lexicon • Coded in lexc • Model has 3 regular paths • 66 varieties of irregularity • e.g. poder ‘to be able to’
LEXICON Irreg43
0:^UE^VSoue^PRET1^FR ErV ;
[o -> {ue} || _ Consonant^<4 [%^UE ?* [[%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl]]]]
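To see how the o -> ue rule for poder is gated by its context, the following Python sketch approximates it with a lookahead. Several things here are assumptions rather than facts from the slide: the consonant inventory, the exact shape of the intermediate strings (stem, then the Irreg43 triggers, then tense and person tags in the style of the regular verb lexicon), and the name `diphthongise`.

```python
import re

# Assumed consonant inventory; the slide does not define `Consonant`.
CONSONANT = "[bcdfghjklmnñpqrstvwxyz]"

# o -> ue when followed by fewer than 4 consonants, the ^UE trigger, a present
# indicative or subjunctive tag, and one of the four listed person/number tags.
STEM_CHANGE = re.compile(
    "o(?=" + CONSONANT + r"{0,3}\^UE"
    r".*\^(?:PresInd|PresSubj)"
    r".*\^(?:1PSg|2PSg|3PSg|3PPl))"
)

def diphthongise(intermediate):
    """Apply the stem change; trigger tags are left in place for later rules."""
    return STEM_CHANGE.sub("ue", intermediate)

# Assumed intermediate strings for 'puedo' and 'podemos'.
print(diphthongise("pod^UE^VSoue^PRET1^FR^PresIndo^1PSg"))
print(diphthongise("pod^UE^VSoue^PRET1^FR^PresIndemos^1PPl"))
```

The first-person plural is deliberately absent from the person list, so the stem vowel of the second string is left alone.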
Noun Lexicon
LEXICON NounFem ! Feminine Nouns
!STEM !CONT. CLASS ! GLOSS
acción fIsNounEs ; ! action
LEXICON fIsNounEs ! feminine pluralised with 'es'
+Noun:0 fNounPluralES ;
LEXICON fNounPluralES
+Sg+Fem:0 # ;
+Pl+Fem:^NZ^NOes # ;
[z -> c || _ %^NZ]
[ó -> o || _ ?^<5 %^NO ]
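The two noun rules can be tried out on symbol sequences. The sketch below is a Python approximation under stated assumptions: trigger tags are written in upper case so they can be split off as single symbols, the helper names are hypothetical, and `luz` is an extra stem added here for illustration (only `acción` appears on the slide).

```python
import re
import unicodedata

O_ACUTE = "\u00f3"  # precomposed 'ó'
# One symbol is either a trigger tag such as ^NZ or a single character.
SYMBOL = re.compile(r"\^[A-Z]+|.")

def symbols(text):
    """Split a string into symbols, treating each trigger tag as one unit."""
    return SYMBOL.findall(unicodedata.normalize("NFC", text))

def pluralise(stem):
    """Append the +Pl+Fem lower side, apply both rules, drop the triggers."""
    syms = symbols(stem + "^NZ^NOes")
    out = []
    for i, sym in enumerate(syms):
        following = syms[i + 1:]
        if sym == "z" and following[:1] == ["^NZ"]:
            sym = "c"      # [z -> c || _ %^NZ]
        elif sym == O_ACUTE and "^NO" in following[:5]:
            sym = "o"      # [ó -> o || _ ?^<5 %^NO]: fewer than 5 symbols between
        out.append(sym)
    return "".join(s for s in out if not s.startswith("^"))

print(pluralise("acci" + O_ACUTE + "n"))
print(pluralise("luz"))
```

Working on symbols rather than characters matters for the `?^<5` bound, because xfst counts a multicharacter tag such as `^NZ` as one symbol.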
Adjective Lexicon • Same process as noun lexicon • Uses the same replace rules • One exception for adverbs
LEXICON nIsAdjS
+Adj:0 nAdjPluralS ;
+Adj|+Adv:^AAOmente # ;
[o -> a || _ %^NAO %^AAO {mente}]
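The adverb exception can be illustrated with a short Python sketch. It assumes an intermediate string in which the `^NAO` trigger directly follows the stem; the slide does not show where `^NAO` is introduced, so that placement, the stem `claro`, and the name `derive_adverb` are all assumptions.

```python
import re

def derive_adverb(intermediate):
    """[o -> a || _ %^NAO %^AAO {mente}], then strip the two trigger tags."""
    changed = re.sub(r"o(?=\^NAO\^AAOmente)", "a", intermediate)
    return changed.replace("^NAO", "").replace("^AAO", "")

print(derive_adverb("claro^NAO^AAOmente"))  # feminine stem plus 'mente'
print(derive_adverb("claro^NAO"))           # no adverb trigger, stem unchanged
```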
Other Transducers • Overgeneration Filter • llover ‘to rain’ • Capitalisation • Trigger Remover • Execution script
~[ $[{llov} ?* [[%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
[ a (->) A || .#. _ ]
[ %^IE -> 0 ]
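The three auxiliary transducers can each be approximated by a small function. This is a Python sketch under assumptions, not the xfst networks: analyses are assumed to look like `llover+Verb+PresInd+3P+Sg`, the function names are hypothetical, and capitalisation is shown only for the letter `a` that appears in the rule.

```python
import re

# ~[ $[ {llov} ?* [ [+1P | +2P] [+Sg | +Pl] | [+3P +Pl] ] ] ]
# Reject any analysis of llover that is not third person singular.
BLOCKED = re.compile(r"llov.*(?:\+(?:1P|2P)\+(?:Sg|Pl)|\+3P\+Pl)")

def passes_filter(analysis):
    return BLOCKED.search(analysis) is None

def capitalisation_variants(surface):
    """[ a (->) A || .#. _ ] is optional, so both spellings are accepted."""
    if surface.startswith("a"):
        return {surface, "A" + surface[1:]}
    return {surface}

def remove_trigger(intermediate):
    """[ %^IE -> 0 ] deletes the trigger once the rules have fired."""
    return intermediate.replace("^IE", "")

print(passes_filter("llover+Verb+PresInd+3P+Sg"))
print(passes_filter("llover+Verb+PresInd+1P+Sg"))
print(sorted(capitalisation_variants("abordo")))
```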
Testing • Accuracy • Maintaining integrity of existing rules • Projection • Subtraction • Well-formedness • Ensuring tag order
Assessing Coverage • Aim – 80% on unrestricted text • Statistical predictions (Crystal 1997) • Corpus compilation and processing • Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/) • Phase 1 – augmentation • Phase 2 – 81% coverage • Final assessment – 84.15% coverage
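Token coverage of the kind reported here is the share of corpus tokens that receive at least one analysis. A minimal Python sketch of that calculation follows; the analyser is a toy dictionary stand-in with hypothetical entries, not the transducer, and the lower-casing step is an assumption about preprocessing.

```python
def token_coverage(tokens, analyse):
    """Percentage of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return 100.0 * covered / len(tokens)

# Toy stand-in for the transducer lookup (hypothetical entries).
TOY_ANALYSES = {
    "conozco": ["conocer+Verb+PresInd+1P+Sg"],
    "acciones": ["acción+Noun+Pl+Fem"],
}

def toy_analyse(token):
    return TOY_ANALYSES.get(token.lower(), [])

sample = ["conozco", "acciones", "xyz", "Conozco"]
print(token_coverage(sample, toy_analyse))
```

Counting tokens rather than distinct word types means frequent words weigh more heavily, which is why lexicon augmentation with common unanalysed forms raises the figure quickly.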
Further Details • Generates approx. 44,000 unique morphological descriptions • Evaluation corpus – 1.26 analyses per input token on average
Possible improvements • Increase coverage • lexicon augmentation • Disambiguation using POS tagger • More derivational morphology • Deal with different dialects of Spanish
References
• (Beesley & Karttunen 2003) Beesley, K. and Karttunen, L. Finite State Morphology. CSLI Publications, United States, 2003.
• (Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición. Editorial Claret, Barcelona, 2005.
• (Crystal 1997) Crystal, D. The Cambridge Encyclopedia of Language (2nd ed.). Cambridge University Press, 1997.
• (Europarl) Europarl Parallel Corpus. http://people.csail.mit.edu/koehn/publications/europarl/ (last accessed 19/05/2006).
• (Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.
• (Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.
• (Real Academia Española) http://www.rae.es/ (last accessed 25/05/2006).
Conclusions Demonstration
LEXICON ArVerbs
!STEM !CONT. CLASS !GLOSS
abord ArV ; !to approach
LEXICON ArV
ar+Verb:0 ArConj ;
LEXICON ArConj
!TAGS !CONT.CLASS
+PresInd:^PresInd ArPresInd ;
+PretInd:^PretInd ArPretInd ;
LEXICON ArPresInd ! Present Indicative
+1P+Sg:o^1PSg #;
+2P+Sg:as^2PSg #;
+3P+Sg:a^3PSg #;
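Continuation classes amount to a graph walk: each entry contributes an upper:lower pair and names the lexicon to visit next, until `#` ends the word. The Python sketch below enumerates the paths through the lexicons shown above. It is an illustration rather than what lexc does internally (lexc compiles the lexicons into a transducer); the data layout and function names are hypothetical, and `ArPretInd` is left empty because its entries are not shown.

```python
# Each lexicon maps to a list of (upper, lower, continuation) entries.
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [
        ("+PresInd", "^PresInd", "ArPresInd"),
        ("+PretInd", "^PretInd", "ArPretInd"),
    ],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
    "ArPretInd": [],  # entries not shown in the source
}

# Trigger tags are listed explicitly because they are multicharacter symbols.
TRIGGERS = ["^PresInd", "^PretInd", "^1PSg", "^2PSg", "^3PSg"]

def paths(lexicon, upper="", lower=""):
    """Yield (lexical, intermediate) pairs for every complete path."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, upper + up, lower + low)

def strip_triggers(intermediate):
    for tag in TRIGGERS:
        intermediate = intermediate.replace(tag, "")
    return intermediate

for lexical, intermediate in paths("ArVerbs"):
    print(lexical, "->", strip_triggers(intermediate))
```

For a regular verb no replace rule fires, so removing the triggers from the intermediate string already gives the surface form.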