
Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Prague Dependency Treebank(s) Workshop at LSA2011, Part II. Jan Hajič, Zdeňka Urešová, Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. Part II - Syntax and Semantics.


Presentation Transcript


  1. Prague Dependency Treebank(s) Workshop at LSA2011, Part II • Jan Hajič, Zdeňka Urešová • Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

  2. Part II - Syntax and Semantics • Tectogrammatical representation • Valency lexicon • Languages • Czech, Arabic and English • Technical issues • Annotation scheme and format • Tools for annotation • Applications • Summary, pointers, conclusion LSA 2011 Prague Dependency Treebanks II

  3. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

  4. Layer 3 (t-layer): Tectogrammatical • Underlying (deep) syntax • 4 sublayers (integrated): • dependency structure, (detailed) functors • valency annotation • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ... • Total • 39 attributes (vs. 5 at m-layer, 2 at a-layer)

  5. Analytical vs. Tectogrammatical (TR: sublayer 1 only shown) • [Figure: analytical and tectogrammatical trees side by side; callouts: Underlying verb + tense; Deep function; Elided Actor in; Another ellipsis...; Prepositions out]

  6. Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: • dependency structure, (detailed) functors • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ...

  7. Tectogrammatical Functors • “Actants”: ACT, PAT, EFF, ADDR, ORIG • modify: verbs, nouns, adjectives • cannot repeat in a clause, usually obligatory • Free modifications (~ 50), semantically defined • can repeat; optional, sometimes obligatory • Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ... • Special • Coordination, Rhematizers, Foreign phrases, ... • [Side labels on the slide: semantic, syntactic]

  8. Tectogrammatical Example • Analytical verb form: • (he) allowed would-be to-be enrolled • směl by být zapsán • Callout: Collapsed • Additional attributes (grammatemes): conditional + “allow”

  9. Tectogrammatical Example • Passive construction (action) • (The) book has-been translated [by Mr. X] • Kniha byla přeložena • Callouts: Disappeared; Added

  10. Tectogrammatical Example • Object • (he) gave him a-book • dal mu knihu • Obj goes into ACT, PAT, ADDR, EFF or ORIG, based on the governor’s valency frame

  11. Tectogrammatical Example • Incomplete phrases • Peter works well, but Paul badly • Petr pracuje dobře, ale Pavel špatně • Callout: Added

  12. Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: • dependency structure, (detailed) functors • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ...

  13. Deep Word Order, Topic/Focus • Analytical dep. tree: [figure] • Example: • Baker bakes rolls. vs. Baker_IC bakes rolls.

  14. Deep Word Order, Topic/Focus • Deep word order: • from “old” information to the “new” one (left-to-right) at every level (head included) • projectivity by definition (almost...) • i.e., partial level-based order -> total d.w.o. • Topic/focus/contrastive topic • attribute of every node (t, f, c) • restricted by d.w.o. and other constraints
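The “projectivity by definition” remark on the slide above can be made concrete with a small check. This is only a sketch under stated assumptions: the tree is encoded as a list of head indices in deep word order (a hypothetical encoding chosen for illustration, not the PDT data format), and a tree counts as projective when every node lying between a head and its dependent is itself a descendant of that head.

```python
def is_projective(parents):
    """parents[i] is the index of node i's head, or -1 for the root.
    Nodes are indexed by their position in the (deep) word order.
    Assumes the input really is a tree (no cycles)."""
    def is_descendant(node, ancestor):
        # Walk up from node to the root; True if ancestor is on the path.
        while node != -1:
            if node == ancestor:
                return True
            node = parents[node]
        return False

    for dep, head in enumerate(parents):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every node strictly between head and dependent must hang below head.
        for between in range(lo + 1, hi):
            if not is_descendant(between, head):
                return False
    return True

# "Baker bakes rolls": both nouns depend on the verb at position 1.
simple = [1, -1, 1]
# A crossing configuration: node 0 depends on node 2, node 1 on the root 3.
crossing = [2, 3, 3, -1]
```

Here `is_projective(simple)` holds, while `crossing` fails because node 1 sits between node 0 and its head without being below that head.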

  15. Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: • dependency structure, (detailed) functors • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ...

  16. Coreference • Grammatical • relative clauses • which, who • Peter and Paul, who ... • control • infinitival constructions • John promised to go ... • reflexive pronouns • {him,her,them}self(-ves) • Mary saw herself in ... • [Tree: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]

  17. Coreference • Textual • Ex.: Peter moved to Iowa after he finished his PhD.

  18. Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: • dependency structure, (detailed) functors • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ...

  19. Grammatemes • Detailed functors (subfunctors) • only for some functors: • TWHEN: before/after • LOC: next-to, behind, in-front-of, ... • also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT • Lexical (underlying) • number (SG/PL), tense, modality, degree of comparison, ... • strictly only where necessary (agreement!)

  20. Example - simplified view • Se zuby jsem měl v minulosti jen problémy. • With teeth I-have had in the-past only problems.

  21. Fully Annotated Sentence • The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.

  22. Arabic Example: Tectogrammatics • In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

  23. English PDT-style Annotation • Morphology and Syntax • By conversion • Tectogrammatical annotation • Guidelines (English TR: by S. Cinková) • Pre-annotation • Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al. • Valency • From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)

  24. Example - English TR • Words • Dependencies • Sem. function • Valency (predicates) • Coref (BBN) • Named Entities (BBN)

  25. Valency in the PDT • Valency: specific ability of a word to combine itself with other units of meaning • [Figure: dát (give) with Eva (ACT), dar (gift, PAT), matka (mother, ADDR), neděle (Sunday, TWHEN); pršet (rain) and plakat (cry) with ---, zítra (tomorrow), noc (night), Adam and the labels TWHEN, ACT, TWHEN; callouts: Modifies anything; Specific behavior]

  26. Valency - Basic Principles • inner participants vs. free modifications (arguments vs. adjuncts) • obligatory vs. optional modifications (the dialogue test)

  27. Inner Participant … Free Modification • Inner participants (5): ACT(or), PAT(ient), ADDR(essee), EFF(ect), ORIG(in) • each occurs just with particular verbs • each modifies the verb only once (in a clause) • Free modifications (70): Location (LOC, DIR1, …), Time (TWHEN, TTILL, …), Manner, Intention, … • can modify in principle any verb • can be repeated (within the same clause)

  28. Inner Participants • syntactic criteria: Actor and Patient • semantic criteria for other inner participants (if a verb has more than two arguments) • Argument shifting [diagram with Actor, Patient, Addressee, Origin, Effect] • Petr has dug a hole. Semantic Effect (as a cognitive role) shifted to the position of Patient. • The teacher asked a pupil. Semantic Addressee shifted to the position of Patient.

  29. Obligatory … Optional: The Dialogue Test • Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know. • A: John left. B: From where? A: *I don't know. • A: John left. B: To where? A: I don't know. • „from where“ → obligatory modification • „to where“ → optional modification

  30. Valency frame • Contents: • functor • obligatoriness • surface form • Structure: one meaning of the word, one valency frame • word: leave • meaning 1: sb left sth (frame1: ACT PAT) • meaning 2: sb left from somewhere (frame2: ACT DIR1)
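The frame structure on the slide above (functor, obligatoriness, surface form; one frame per meaning) maps naturally onto a small record type. The sketch below is illustrative only: the class and field names are hypothetical, and the obligatoriness flags and empty surface forms for “leave” are placeholders, since the slide does not give them.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Slot:
    functor: str             # e.g. "ACT", "PAT", "DIR1"
    obligatory: bool = True  # placeholder: the slide gives no flags for "leave"
    form: str = ""           # surface form; empty where the slide gives none

@dataclass
class Frame:
    gloss: str
    slots: list = field(default_factory=list)

    def functors(self):
        return [s.functor for s in self.slots]

@dataclass
class Entry:
    word: str
    frames: list = field(default_factory=list)

    def frames_with(self, functor):
        """Return the frames that contain a slot with the given functor."""
        return [f for f in self.frames if functor in f.functors()]

# The two meanings of "leave" from the slide, one frame each.
leave = Entry("leave", [
    Frame("sb left sth", [Slot("ACT"), Slot("PAT")]),
    Frame("sb left from somewhere", [Slot("ACT"), Slot("DIR1")]),
])
```

Looking up `leave.frames_with("DIR1")` then selects the “sb left from somewhere” frame, which is the kind of link an annotated corpus occurrence would carry to its lexicon entry.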

  31. Valency lexicon: PDT-VALLEX • 8500 verb senses / valency frames • 9000 noun senses / valency frames • some adjectives and adverbs • PDT-VALLEX Entry • verb: dosáhnout • meaning 1: to reach sth • meaning 2: to get sb to do sth • meaning 3: … • meaning 4: …

  32. The PDT-VALLEX editor • [Screenshot; senses shown: ‘lay down’, resign, win, ask]

  33. Valency Lexicon and TrEd • [Screenshot; entry shown: to write sth (about sth)]

  34. Corpus <-> Valency Lexicon • Corpus: occurrences of „uzavřít“ (to close): Sentence 15345; Sentence 51042; Sentence 2035 • Lexicon: ENTRY: uzavřít • vf1: ACT(.1) CPHR({smlouva}.4), ex.: u. dohodu (close a contract) • vf2: ACT(.1) PAT(.4), ex.: u. pokoj (close a room, house)

  35. Valency and Text Generation • Tectogrammatical Representation • has all the information to (re)generate the surface form of the sentence: • in a “generalized” form • non-redundant (almost... but for generation, it is o.k.) • ...except the links to a-layer, however • links used only for training [statistical models for] parsing/generation modules • not present when e.g. doing text planning, translation, ... • valency dictionary: form of “learned” knowledge

  36. Valency and Text Generation • Using valency for... • ...getting the correct (lemma, tag) of verb arguments • Example: • VALLEX entry: starat (se) ACT(.1) PAT(o.[.4]) • Tectogrammatical tree: starat_se (PRED, “to take care of”), Martin (ACT), tygr (PAT, “tiger”) • Generated (lemma, tag) pairs: starat V..............; se ...............; Martin ....1..........; o ...............; tygr ....4.......... • Result: Martin se stará o tygry. “Martin takes care of tigers.”
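The step from the frame `ACT(.1) PAT(o.[.4])` to the (lemma, tag) pairs on the slide above can be sketched in a few lines. This is a simplified illustration, not the actual generation module: `parse_form`, `tag_pattern` and `realize` are hypothetical helpers, only the two surface-form shapes seen on the slide are handled, and the case digit is assumed to sit in the fifth position of the 15-character tag pattern, as in the slide's own patterns.

```python
import re

def parse_form(form):
    """Parse a simplified surface-form spec such as ".1" or "o.[.4]".
    Returns (preposition or None, case digit). Real lexicon entries
    are richer; anything else raises ValueError."""
    m = re.fullmatch(r"(?:(\w+)\.\[)?\.(\d)\]?", form)
    if m is None:
        raise ValueError(f"unsupported form: {form!r}")
    return m.group(1), m.group(2)

def tag_pattern(case):
    """15-position tag pattern with only the case (5th position) fixed."""
    return "...." + case + "." * 10

def realize(lemma, form):
    """Return the (lemma, tag pattern) pairs needed for one argument."""
    prep, case = parse_form(form)
    pairs = [(prep, "." * 15)] if prep else []
    pairs.append((lemma, tag_pattern(case)))
    return pairs

frame = {"ACT": ".1", "PAT": "o.[.4]"}   # starat (se), as on the slide
args = {"ACT": "Martin", "PAT": "tygr"}  # lemmas from the t-tree
plan = {f: realize(args[f], frame[f]) for f in frame}
```

With these inputs, `plan["PAT"]` contains the preposition `o` with an unconstrained pattern followed by `tygr` with case 4, matching the pairs listed on the slide; a morphological generator would then turn them into “o tygry”.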

  37. The Annotation Process • 4 sublayers • work on structure first, rest in parallel • Structure • automatic preprocessing - programmed conversion from analytical layer annotation • Grammatemes • mostly automatically (based on lower layers’ annotation), manual checking, corrections • Cross-sublayer/cross-layer checking • partly automatic, then manual

  38. The Annotation Process Scheme • [Figure]

  39. Tectogrammatical Annotation Tools • Manual annotation • 4 groups of annotators ~ 4 sublayers • Special graphical tool (TrEd) • Customizable graphical tree editor • Preprocessing • Data from analytical layer, preprocessed • Online dependency function preassignment

  40. The Annotation Scheme • XML + principles of linear- and tree-based standoff annotation → PML (Prague Markup Language) • Layer schemes (Relax NG) • PDT/PADT: t(ecto), a(nalytic), m(orphology), … • English: + phrase-based (p-layer)

  41. PML/XML Annotation Layers • Strictly top-down links • w+m+a can be easily “knitted” • API for cross-layer access (programming) • PML Schema / Relax NG • [z and audio layers: used for spoken data (audio as layer “-1”)] • [Figure labels: LFG analogy: f-struct, Φ, c-struct; z-layer: BYL BYS ČELO LESA …; audio]

  42. The Prague Markup Language Example • m-layer data, linked to w-layer (callout on dest.rf: Pointer to w-layer):

      <m id="m-tr/_12941_01_00013.fs-s1w4">
        <src.rf>manual</src.rf>
        <w>
          <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
          <trans>basic</trans>
        </w>
        <form>pocházela</form>
        <lemma>pocházet_:T</lemma>
        <tag>VpQW---XR-AA---</tag>
      </m>
      <m id="m-tr/_12941_01_00013.fs-s1w5"> ...
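To show how the standoff pointer in the m-layer example above can be followed programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It reads only the single `<m>` element shown on the slide; the function name `read_m_node` and the returned dictionary keys are hypothetical, and real PML files wrap such elements in a larger document (possibly with an XML namespace), which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The <m> element from the slide, copied verbatim.
SNIPPET = """
<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
"""

def read_m_node(xml_text):
    m = ET.fromstring(xml_text)
    # Locate the pointer by comparing tag names directly.
    w = m.find("w")
    dest = next(child.text for child in w if child.tag == "dest.rf")
    # The reference has the shape "<file alias>#<node id>".
    alias, _, w_id = dest.partition("#")
    return {
        "id": m.get("id"),
        "form": m.findtext("form"),
        "lemma": m.findtext("lemma"),
        "tag": m.findtext("tag"),
        "w_file": alias,
        "w_id": w_id,
    }

node = read_m_node(SNIPPET)
```

The part of `dest.rf` before `#` names the referenced file (here the alias `w`), and the part after it is the id of the w-layer token, which is what makes the links strictly top-down.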

  43. Searching the Treebanks • TrEd extension: PML-TQ • Backend: database server • Frontend: TrEd or Web browser • Web access • http://euler.ms.mff.cuni.cz:8111 • Sample data (Czech, English [soon]): • anonymous / anonymous • Full access (LSA 2011 participants only, 2011): • LSA2011 / UC.Boulder • Full access: licence needed for the corpora • Available later this year at http://www.lindat.cz

  44. Using the Results: Parsing • Several parsers of Czech • Analytical layer dependency syntax • Trained on PDT 1.0 data, 1.2 mil. words • Collins (98), Charniak (00), Žabokrtský (02), Ribarov (04), Nivre (05), Zeman (05), McDonald (05), CoNLL’06 (19 parsers) • Best results • accuracy (percent of correct dependencies): • 84-85% for a single parser, > 86% for a combination
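The accuracy measure named on the slide above, the percent of correct dependencies, is simple to compute once gold and predicted heads are aligned token by token. The sketch below assumes each sentence is given as a list of head indices (0 for the root); the function name and the toy data are illustrative and are not taken from any of the cited parsers.

```python
def dependency_accuracy(gold_heads, predicted_heads):
    """Percent of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold_heads:
        raise ValueError("empty input")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

# Toy five-token sentence: the parser attaches the fourth token wrongly.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
score = dependency_accuracy(gold, pred)
```

In this toy case four of five heads agree, so `score` is 80.0; the figures on the slide are the same ratio computed over a whole test set.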

  45. Tectogrammatical Parsing • Newest results: • 4 phases • Transformation-based learning • FnTBL • Largely language independent • Coreference: >90% • Results by attribute:

      m- and a-layer:
      Attribute       manual    auto
      structure       89.3 %    76.4 %
      functor         85.5 %    77.4 %
      val_frame.rf    92.3 %    90.9 %
      t_lemma         93.5 %    90.9 %
      nodetype        94.5 %    92.6 %
      gram/sempos     93.8 %    91.5 %
      a/lex.rf        96.5 %    95.1 %
      a/aux.rf        94.3 %    90.3 %
      is_member       94.3 %    89.5 %
      is_generated    96.6 %    95.2 %
      deepord         68.0 %    66.7 %

  46. Tectogrammatical Layer in Machine Translation • The Translation (“Vauquois”) triangle • [Figure labels: source (En), target (Cz), transfer; Morphology, Surface Syntax, Tectogrammatical Representation; Generation]

  47. Dependency trees in MT • According to his opinion UAL's executives were misinformed about the financing of the original transaction. • Transfer: • structure (~0) • lexical • functions • grammatical • Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.

  48. Analytical Layer Correspondence

  49. Tectogrammatical Correspondence • The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River. • ‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.

  50. Valency and Translation • leave: • leave-1 • to leave [from] somewhere • leave-2 • to leave sth for sb • Translating (from English into Czech): • which equivalent to choose? • nechat vs. odjet/opustit • which prepositions, cases, ... to use? • accusative vs. “z” (“from”) with genitive vs. ...?
