
The Syntax-Morphology Interface and Natural Language Processing

Veronika Vincze, University of Szeged, Hungary (vinczev@inf.u-szeged.hu). Thematic Training Course on Processing Morphologically Rich Languages, 11-15 April 2011.


Presentation Transcript


  1. The Syntax-Morphology Interface and Natural Language Processing Veronika Vincze University of Szeged Hungary vinczev@inf.u-szeged.hu Thematic Training Course on Processing Morphologically Rich Languages 11-15 April 2011

  2. Outline
     • Introduction
     • Syntax vs. morphology from a linguistic viewpoint
     • Morphological coding systems in Hungarian
     • Morphosyntactic information in Hungarian corpora
     • Language-specific morphosyntactic problems
     • Effects on IE, NER and MT

  3. Syntax vs. morphology
     • Typological differences among languages
       • Agglutinative languages: the role of morphology is stronger (a lot of information in morphemes)
       • Isolating languages: the role of syntax is stronger (fewer morphemes, more constructions)
     • Focus on Hungarian (agglutinative) and English (fusional/isolating)

  4. Basic Hungarian syntax
     • A lot of information is encoded in morphemes
     • No fixed word order
     • Information structure is reflected in word order (theme-rheme, old-new)
     Péter szereti Marit.
     Peter love-3SgObj Mary-ACC
     ‘Peter loves Mary.’
     Péter Marit szereti. ‘It is Mary who Peter loves.’
     Marit szereti Péter. ‘It is Mary who Peter loves.’
     Marit Péter szereti. ‘It is Peter who loves Mary.’
     Szereti Péter Marit. ‘Peter LOVES Mary (and not hates).’
     Szereti Marit Péter. ‘Peter LOVES Mary (and not hates).’

  5. Morphosyntactic features of Hungarian
     • Nominal declension (nouns, adjectives, numerals)
     • Verbal conjugation
     • Several hundred word forms for each lemma
     • Grammatical relations are encoded primarily by morphemes -> morpho + syntactic

  6. Nominal suffixes
     A stem can be extended by:
     • Derivational suffixes
     • Plural
     • Possessive
     • Case suffixes
     hat-ás-a-i-nak ‘to its effects’
     stem-DERIV.SUFF-POSS-POSS.PL-DAT
     egész-ség-ed-re ‘cheers’
     stem-DERIV.SUFF-POSS.Sg2-SUB
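
The morpheme-by-morpheme glosses on this slide can be paired up mechanically. The following is a minimal sketch, assuming that the segmented form and its gloss both use hyphens as separators, as they do here; the function name `align_gloss` is illustrative and not part of any tool mentioned in the course.

```python
def align_gloss(segmented, gloss):
    """Pair each morpheme of a hyphen-segmented word with its gloss label."""
    morphs = segmented.split("-")
    labels = gloss.split("-")
    if len(morphs) != len(labels):
        # A mismatch means the segmentation and the gloss do not line up.
        raise ValueError("segmentation and gloss have different lengths")
    return list(zip(morphs, labels))


# Example taken from the slide: 'to its effects'
pairs = align_gloss("hat-ás-a-i-nak", "stem-DERIV.SUFF-POSS-POSS.PL-DAT")
for morph, label in pairs:
    print(f"{morph:>4}  {label}")
```

The case suffix (-nak, DAT) comes out as the last pair, in line with the observation that case suffixes sit at the right end of the word form.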

  7. Case suffixes in Hungarian
     • ~20 cases ("rare" cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül…))
     • Always at the right end of the word form
     • Grammatical relations are encoded:
       • Arguments of the verb
       • Adjuncts (temporal and locative adverbials)

  8. …and in English
     Pisti szerdánként edzésre jár.
     Steve Wednesday-DIST-TEMP training-SUB go-3Sg
     ‘Each Wednesday Steve goes to training.’
     szerdánként – each Wednesday
     edzésre – to training

  9. Pisti bort iszik.
     Steve wine-ACC drink-3Sg
     ‘Steve is drinking wine.’
     Pisti-NOM – Steve – subject
     bort – wine – object

  10. Possessive in Hungarian
      A fiú kutyája
      The boy dog-POSS
      ‘The boy’s dog’
      A(z ő) kutyája
      The (he) dog-POSS
      ‘His dog’
      • Possessor in nominative
      • Possessed with a possessive marker
      A fiúnak a kutyája
      The boy-DAT the dog-POSS
      • Possessor in dative
      • Possessed with a possessive marker

  11. …and in English
      The boy’s dog
      His dog
      • Possessor with a possessive marker (pronoun)
      • Possessed with no marker
      The dog of the boy
      • Possessive relation is marked by a preposition

  12. Hungarian vs. English - nouns
      • Number of word forms: several hundred (HU) vs. 2-3 (EN)
      • Means to express grammatical relations:
        • Suffixes (HU)
        • Preposition, fixed position (word order), suffix, determiner (EN)
      • Methods for morphological parsing are very different for Hungarian and English

  13. Verbal suffixes
      A stem can be extended by:
      • Derivational suffixes
      • Mood markers
      • Tense markers
      • Person/number suffixes
      • Objective markers
      vág-at-ná-k
      cut-CAUS-COND-3PlObj
      ‘they would have it cut’

  14. Mood and tense in Hungarian
      • Mood:
        • Indicative: default (not marked)
        • Conditional: suffixes (present) – analytic form (past)
        • Imperative: suffixes
      • Tense:
        • Present: default (not marked)
        • Past: suffixes
        • Future: analytic (auxiliary fog)

  15. …and in English
      • Mood:
        • Indicative: default (not marked)
        • Conditional: past tense forms + analytic forms (auxiliary would)
        • Imperative: auxiliaries + grammatical structure
      • Tense:
        • Present: default (not marked)
        • Past: suffix / irregular forms (suppletives or ablaut (vowel change))
        • Future: analytic (auxiliary will)

  16. Person & Number
      Hungarian: suffixes
      fut-ok, fut-sz, fut, fut-unk, fut-tok, fut-nak
      3Sg is the default (not marked!)
      English: 3Sg + pronouns / obligatory subject
      I run, you run, he runs, we run, you run, they run
      3Sg marked!
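
The contrast on this slide can be sketched in a few lines of Python. The suffix table is copied from the slide's paradigm for fut ‘run’; the function and table names are illustrative. The sketch deliberately ignores vowel harmony and stem alternations, so it should only be trusted for this one verb.

```python
# Present-tense suffixes for "fut" ('run'), as listed on the slide.
# 3Sg is the unmarked default in Hungarian, hence the empty suffix.
HU_SUFFIXES = {
    ("1", "Sg"): "ok", ("2", "Sg"): "sz", ("3", "Sg"): "",
    ("1", "Pl"): "unk", ("2", "Pl"): "tok", ("3", "Pl"): "nak",
}

# English needs an overt subject; only 3Sg is marked on the verb.
EN_PRONOUNS = {
    ("1", "Sg"): "I", ("2", "Sg"): "you", ("3", "Sg"): "he",
    ("1", "Pl"): "we", ("2", "Pl"): "you", ("3", "Pl"): "they",
}


def hungarian_form(stem, person, number):
    """Person and number are carried by the suffix alone."""
    return stem + HU_SUFFIXES[(person, number)]


def english_form(verb, person, number):
    """Person and number are carried mostly by the obligatory pronoun."""
    suffix = "s" if (person, number) == ("3", "Sg") else ""
    return f"{EN_PRONOUNS[(person, number)]} {verb}{suffix}"


for key in HU_SUFFIXES:
    print(hungarian_form("fut", *key), "|", english_form("run", *key))
```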

  17. Derivational suffixes in Hungarian
      Possibility/permission: fut-hat-ok
      run-MOD-1Sg ‘I may run’
      Reflexive: mos-akod-unk
      wash-REFL-1Pl ‘we wash ourselves’
      Frequentative: üt-öget-sz
      hit-FREQ-2Sg ‘you hit something repeatedly’
      Causative: csinál-tat-nak
      do-CAUS-3Pl ‘they have something done’

  18. …and in English
      • Possibility/permission: auxiliaries
      • Reflexive: pronominal objects
      • Frequentative: adverb
      • Causative: construction

  19. Hungarian vs. English - verbs
      • Number of word forms: several hundred (HU) vs. 4-5 (EN)
      • Means to express grammatical relations:
        • Suffixes + auxiliaries (HU)
        • Auxiliaries + reflexive pronouns + constructions (EN)
      • A lot of syntactic information is encoded in Hungarian morphemes

  21. Morphosyntactic coding systems
      • Language independent (?)
      • Language dependent
      • (Dis)advantages:
        • comparability
        • considering language-specific features
        • complexity
      • Different information is necessary for each language

  22. Hungarian coding systems
      • HUMOR
        • recall Thursday Session 1
        • in the Hungarian National Corpus
      • MSD
        • In Szeged Treebank
        • Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc
      • KR
        • No database
        • Parser and POS-tagger available at: http://mokk.bme.hu/resources/hunmorph/index_html http://code.google.com/p/hunpos/

  23. MSD
      • Morphosyntactic Description
      • International coding system:
        • English
        • Romanian
        • Slovenian
        • Czech
        • Bulgarian
        • Estonian
        • Hungarian

  24. MSD - 2
      • Positional codes
      • A given position encodes a given type of information
        • Position 0: part-of-speech
        • Position 1: (sub)type within POS
        • Further positions: other grammatical information (person, number, case, etc.)
      • Irrelevant positions are marked with a hyphen (-)
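
The positional scheme described on this slide can be illustrated with a small decoder. This is a minimal sketch, not the MSD specification: it names only positions 0 and 1, which the slide defines, and leaves all further positions as numbered slots because their meaning depends on the part of speech. The function name `decode_msd` is illustrative; the example codes (Nc-sn---p1, Vmip1p---n) are the ones that appear on later slides of this deck.

```python
def decode_msd(code):
    """Split a positional MSD code into POS, subtype and filled positions."""
    if not code:
        raise ValueError("empty MSD code")
    decoded = {
        "pos": code[0],  # position 0: part-of-speech
        # position 1: (sub)type within the POS; a hyphen means "not relevant"
        "subtype": code[1] if len(code) > 1 and code[1] != "-" else None,
    }
    # Further positions: keep only those that carry a value, keyed by index.
    decoded["features"] = {
        index: value
        for index, value in enumerate(code)
        if index >= 2 and value != "-"
    }
    return decoded


print(decode_msd("Nc-sn---p1"))
print(decode_msd("Vmip1p---n"))
```

Mapping the numbered slots to attribute names (number, case, person and so on) would require a per-POS table taken from the MSD documentation, which this sketch does not attempt to reproduce.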

  25. KR
      • Created for Hungarian
      • Hierarchical attribute-value matrices
      • Default values (3Sg, singular…)
      • Derivational information is encoded
      • Compounds are also segmented

  26. MSD vs. KR
      • Differences between the two systems:
        • derivation
        • compounds
      • Harmonization efforts aim at a morphological parser (magyarlanc) whose output is in total harmony with the Szeged Treebank (Farkas et al. 2010)

  27. Nouns in MSD

  28. Verbs in MSD

  29. Morphosyntactically annotated Hungarian corpora
      • Hungarian National Corpus
        • 100-million-word balanced reference corpus of present-day Hungarian
        • Word forms automatically annotated for stem, part of speech and inflectional information
        • http://corpus.nytud.hu/mnsz/index_eng.html
      • Szeged Treebank
        • 1 million words, 82K sentences
        • Manually annotated for lemma and POS tags
        • Constituency and dependency trees
        • http://www.inf.u-szeged.hu/rgai/nlp

  30. Szeged Treebank
      • Manually annotated treebank for Hungarian
      • Covers various linguistic styles
        • literature, newspapers, laws, student essays, computer books, etc.
        • multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian
      • Available free of charge for research
      • Developed by
        • University of Szeged, HLT group
        • MorphoLogic Ltd.
        • Academy of Sciences, Research Institute for Linguistics

  31. Szeged Treebank 2.
      • TEI XML format
      • Manually annotated
        • sentence split & word segmentation
        • morphological analysis
        • PTB-style syntactic structure
        • Verb argument structure
      • converted / extended to Dependency Grammar format manually

  32. Szeged Treebank 3.
      • Several versions
        • Constituency and dependency versions
        • Old MSD codes
        • New (harmonized) MSD codes
      • (Dependency) parser under development
      • Being extended with folklore texts

  33. Dependency vs. constituency
      • Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees
      • Constituency grammars are said to be good for languages with fixed word order
      • Syntactic relations are determined
        • by the position in the tree (constituency grammar)
        • by dependency relations (labeled edges) (dependency grammar)

  34. Constituency trees in SzT2.0
      • Based on generative syntax (É. Kiss et al. 1999)
      • Syntactic features of Hungarian are also considered (i.e. not hardcore Chomskyan trees)
      • Verb-argument relations are encoded by labels
      • Very detailed information: a different grammatical role for each case suffix
      • Semantic information can also be found (temporal and locative adverbials)

  35. Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
      ‘Aggie received all of her relatives the day before yesterday.’

  37. Dependency trees in Szeged Dependency Treebank
      • Based on SzT2.0
      • Automatic conversion and manual correction
      • Word forms are the nodes of the tree
      • Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
      • Semantic information kept
      • Sentences without 3Sg copula are distinctively marked
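
A labeled dependency tree with word forms as nodes can be sketched as follows. The sentence Pisti bort iszik and the SUBJ and OBJ labels come from the slides; the ROOT label, the class and function names, and the reordered variant of the sentence are assumptions made for illustration, and the layout is not the actual file format of the Szeged Dependency Treebank.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str             # the word form is the node
    head: Optional[int]   # index of the governing token; None for the root
    label: str            # relation to the head (ROOT is an assumed label)


def roles(tokens):
    """Map each relation label to the word form attached to the root verb."""
    root = next(i for i, tok in enumerate(tokens) if tok.head is None)
    return {tok.label: tok.form for tok in tokens if tok.head == root}


# 'Steve is drinking wine.' in two word orders (the second is an assumed variant).
order_a = [Token("Pisti", 2, "SUBJ"), Token("bort", 2, "OBJ"), Token("iszik", None, "ROOT")]
order_b = [Token("bort", 1, "OBJ"), Token("iszik", None, "ROOT"), Token("Pisti", 1, "SUBJ")]

print(roles(order_a))
print(roles(order_b))
```

Both orders yield the same role assignment, because the relations live on the labeled edges and not on positions in the sentence.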

  38. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.

  39. Virtual nodes
      • No overt copula in present tense 3Sg
      • Only the subject and the predicative noun/adjective are manifest
      • No syntactic structure in SzT (grammatical roles are not marked)
      • Virtual nodes in SzDT

  40. I like to go to school because it is good to be at school though not always.

  41. Szeged Treebank vs. Szeged Dependency Treebank
      • Labeled relations in both cases -> the contrast is not so sharp
      • Virtual nodes in SzDT -> grammatical structure is marked for every sentence (IE, MT)
      • No word order constraints in SzDT
      • Word forms are marked
      • Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))

  42. Language-specific morphosyntactic problems
      • Morphology vs. syntax:
        • Pseudo-subjects
        • Pseudo-objects
        • Pseudo-datives
      • Morphological analysis of unknown words
      • Lemmatization of named entities

  43. Pseudo-subjects
      • A noun in the nominative is not always the subject of the sentence -> special attention is required when parsing
      • Possessor:
        a kisfiú labdája
        the boy ball-3SgPOSS
        ‘the boy’s ball’
      • Predicative noun:
        István juhász maradt.
        Stephen shepherd remain-PAST
        ‘Stephen remained a shepherd.’
      • Object:
        A kutyám kergeti a macska.
        The dog-POSS chase-3SgObj the cat
        ‘The cat is chasing my dog.’ (garden path sentence)
        A fiam szereti a lányod.
        The son-1SgPOSS love-3SgObj the daughter-2SgPOSS
        ‘My son loves your daughter’ or ‘Your daughter loves my son’

  44. Solutions
      • Possessor:
        • SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
        • SzDT: ATT relation
      • Predicative noun: PRED relation
        • Virtual node in SzDT
      • Object: OBJ relation
        • Sometimes contextual information is needed even for humans…

  45. Pseudo-objects
      Adverbials with an apparently accusative ending:
      Futottam egy jót.
      Run-PAST-1Sg a good-ACC
      ‘I have had a good run.’
      Nagyot aludtam.
      Big-ACC sleep-PAST-1Sg
      ‘I have slept a lot.’
      Intransitive verbs -> cannot be an object -> MODE relation
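
The reasoning on this slide (an accusative-marked dependent of an intransitive verb cannot be an object, so it receives the MODE relation) can be written as a one-rule labeler. This is a minimal sketch under a strong assumption: the set of intransitive lemmas below is a hypothetical toy list covering only the verbs in this deck's examples, not a resource used in the treebank.

```python
# Hypothetical toy lexicon: lemmas of the intransitive verbs in the slide's
# examples (futottam -> fut 'run', aludtam -> alszik 'sleep').
INTRANSITIVE_LEMMAS = {"fut", "alszik"}


def accusative_relation(verb_lemma):
    """Choose the relation label for an accusative-marked dependent."""
    if verb_lemma in INTRANSITIVE_LEMMAS:
        # Intransitive verb: the accusative form is an adverbial, not an object.
        return "MODE"
    return "OBJ"


print(accusative_relation("fut"))     # Futottam egy jót.
print(accusative_relation("alszik"))  # Nagyot aludtam.
print(accusative_relation("iszik"))   # Pisti bort iszik.
```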

  46. Pseudo-datives
      Not all (semantic) subjects are in the nominative:
      • Dative subject:
        Sándornak kell elrendeznie az ügyeket.
        Alexander-DAT must arrange-INF-3Sg the issue-PL-ACC
        ‘Alexander has to arrange the issues.’
      • DAT in both corpora
      • Certain auxiliaries with dative subjects (exceptions)
      • Dative-nominative parallelism in possessive as well

  47. Unknown words
      Unknown words can be:
      • Compounds
      • Named entities
      • Derivations
      Examples: fémkapunk, félmillió, csokinyúl, NATO-hoz
      Methods for analysis (Zsibrita et al. 2010):
      • Segmentation into two or more analyzable parts
      • Expert rules to filter impossible combinations (*V+N)
      • Analysis of the last part goes to the whole word
      • Substitution for hyphenated words (pre-defined patterns for each morphological class)
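
The first three steps listed above (segmentation, rule-based filtering, and passing the analysis of the last part to the whole word) can be sketched as follows. This is a simplified illustration and not the implementation of Zsibrita et al. (2010): the lexicon is a hypothetical toy dictionary, the coarse category names are a simplification of real morphological codes, only two-part segmentations are tried, and the substitution step for hyphenated words is left out. The rules encode the combinations shown on the example slides of this deck (NUM + NUM, N + N and N-nonNOM + V are allowed; non-NUM + NUM and N-NOM + V are not).

```python
# Hypothetical toy lexicon: known word forms and their coarse categories.
LEXICON = {
    "fél": {"NUM"},
    "millió": {"NUM"},
    "fém": {"N-NOM"},
    "kapunk": {"N-NOM", "V"},   # kapu+nk (noun) or kap+unk (verb)
    "csoki": {"N-NOM"},
    "nyúl": {"N-NOM", "V"},
}


def allowed(first, last):
    """Expert rules: may a part of category `first` precede one of category `last`?"""
    if last == "NUM":
        return first == "NUM"          # NUM + NUM, but *non-NUM + NUM
    if last == "V":
        return first == "N-nonNOM"     # N-nonNOM + V, but *N-NOM + V
    if last.startswith("N-"):
        return first.startswith("N-")  # N + N
    return False


def analyse_unknown(word):
    """Return (first part, last part, category of the whole word) triples."""
    analyses = set()
    for cut in range(1, len(word)):
        first, last = word[:cut], word[cut:]
        if first in LEXICON and last in LEXICON:
            for first_cat in LEXICON[first]:
                for last_cat in LEXICON[last]:
                    if allowed(first_cat, last_cat):
                        # The analysis of the last part goes to the whole word.
                        analyses.add((first, last, last_cat))
    return sorted(analyses)


for unknown in ("félmillió", "fémkapunk", "csokinyúl"):
    print(unknown, analyse_unknown(unknown))
```

For fémkapunk and csokinyúl the verbal reading of the last part is filtered out by the *N-NOM + V rule, so only the nominal analysis survives.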

  48. félmillió
      fél+millió  Mc-snl
      Expert rules:
      NUM + NUM
      * non-NUM + NUM

  49. fémkapunk
      fém+kap+unk  Vmip1p---n
      fém+kapu+nk  Nc-sn---p1
      Expert rules:
      N + N
      N-nonNOM + V
      * N-NOM + V

  50. csokinyúl
      csoki+nyúl  Vmip3s---n  Nc-sn
      cso+kinyúl (?)  Vmip3s---n
      Expert rules:
      N + N
      N-nonNOM + V
      * N-NOM + V
