
Beyond parse trees: The Prague Dependency Treebank

Beyond parse trees: The Prague Dependency Treebank. Jan Hajič. The Prague Dependency Treebank Project (Czech Language Treebank), 1996-2004. 1998: PDT v. 0.5 released (JHU workshop), 400k words annotated, unchecked. 2001: PDT 1.0 released (LDC), 1.3M words annotated, morphology & surface syntax.


Presentation Transcript


  1. Beyond parse trees: The Prague Dependency Treebank • Jan Hajič

  2. The Prague Dependency Treebank Project (Czech Language Treebank) • 1996-2004 • 1998: PDT v. 0.5 released (JHU workshop); 400k words annotated, unchecked • 2001: PDT 1.0 released (LDC); 1.3M words annotated, morphology & surface syntax • 2004: PDT 2.0 release planned; 0.8M words annotated, underlying (deep) syntax: the “tectogrammatical layer” • ?2004: MT Resources CD: RD, PTB Cz, Tools • CLSP Tuesday Seminar

  3. Annotation Layers • Morphology: tag (full morphology, 13 categories), lemma • Analytical layer (surface syntax): dependency, analytical function • Tectogrammatical layer (underlying syntax): dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order)

  4. Morphological Annotation • 13 categories: [table not preserved in the transcript]

  5. Layer 1: Morphology • Ex.: “(to) the most uninteresting” • Tag: 13 categories • Example: AAFP3----3N----, read position by position: (1) Adjective, (2) Regular, (3) Feminine, (4) Plural, (5) Dative, (6) no poss. Gender, (7) no poss. Number, (8) no person, (9) no tense, (10) superlative, (11) negated, (12) no voice, (13) reserve1, (14) reserve2, (15) base var. • Lemma: unique identifier • Books/verb -> book-1, went -> go, to/prep. -> To-1
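
The positional reading of the tag on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: the position names in `POSITIONS` are an assumption based on the usual PDT positional-tag layout (the slide gives only the decoded values), and `decode_tag` and `set_categories` are hypothetical helper names.

```python
# Position names are an assumption (usual PDT positional-tag layout);
# the slide itself only lists the decoded values.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map each character of a 15-position tag to its category name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


def set_categories(tag):
    """Return only the categories that carry a value (anything but '-')."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


# The slide's example: seven of the fifteen positions are filled.
print(set_categories("AAFP3----3N----"))
```

The seven filled positions correspond to Adjective, Regular, Feminine, Plural, Dative, superlative, and negated on the slide; the remaining eight are "-" (not applicable, reserve, or base variant).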

  6. Layer 2: Analytical syntax • Dependency + Analytical Function • Example sentence: “The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.” • [Tree diagram with edges labelled “governor” and “dependent” not preserved]

  7. Analytical functions • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom • AuxT, AuxR, AuxO, AuxZ, AuxY • AuxP, AuxC • AuxX, AuxS, AuxG, AuxK • AtrAdv, AdvAtr; AtrObj, ObjAtr, AtrAtr • ExD • Coord, Apos; ..._Co, ..._Ap; ..._Pa

  8. Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: (1) dependency structure, (detailed) functors; (2) topic/focus and deep word order; (3) coreference (mostly grammatical only); (4) all the rest (grammatemes): detailed functors, underlying gender, number, ...

  9. Dependency structure • Similar to the surface (analytical) layer, but: • certain nodes deleted: auxiliaries, non-autosemantic words, punctuation • some nodes added: based on word (mostly verb, noun) valency; some ellipsis resolution • detailed dependency relation labels (functors)
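
The "certain nodes deleted" step on slide 9 can be sketched as a tree operation. This is a simplified illustration under stated assumptions, not the PDT transformation itself: it treats every node whose analytical function starts with `Aux` as deletable, it does not add nodes or assign functors, and the three-word fragment with its labels is a hypothetical toy annotation. `Node` and `prune_auxiliaries` are made-up names.

```python
from typing import NamedTuple


class Node(NamedTuple):
    form: str
    afun: str    # analytical function, e.g. "Sb", "AuxP"
    parent: int  # id of the governor; 0 marks the root of this toy fragment


def prune_auxiliaries(nodes):
    """Drop Aux* nodes and attach their dependents to the nearest kept governor."""
    def kept_governor(node_id):
        # Climb past every deleted node until a kept node or the root (0).
        while node_id != 0 and nodes[node_id].afun.startswith("Aux"):
            node_id = nodes[node_id].parent
        return node_id

    return {
        node_id: node._replace(parent=kept_governor(node.parent))
        for node_id, node in nodes.items()
        if not node.afun.startswith("Aux")
    }


# Hypothetical toy fragment: the preposition governs the noun on the surface layer.
fragment = {
    1: Node("influence", "Sb", 0),
    2: Node("on", "AuxP", 1),
    3: Node("Europe", "Atr", 2),
}
pruned = prune_auxiliaries(fragment)
for node_id, node in pruned.items():
    print(node_id, node.form, "governor:", node.parent)
```

After pruning, "Europe" depends directly on "influence" and the preposition node is gone. In the real annotation the information the preposition carries would be reflected in the functor, which this sketch leaves out.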

  10. Analytical vs. Tectogrammatical annotation (TR: sublayer 1 only shown) • [Figure not preserved; callouts: Underlying verb + tense; Deep function; Elided Actor in; Another ellipsis...; Prepositions out]

  11. Tectogrammatical Functors • “Actants”: ACT, PAT, EFF, ADDR, ORIG (cannot repeat in a clause, usually compulsory) • Free modifications (~50): can repeat; optional, sometimes compulsory • Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RESTR, DESC; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR • Special: Coordination, Rhematizers, Foreign phrases, ...

  12. Tectogrammatical Example • Analytical verb form: směl by být zapsán, glossed “(he) allowed would-be to-be enrolled” • [Figure not preserved; callouts: Collapsed; Additional attributes (grammatemes): conditional + “allow”]

  13. Tectogrammatical Example • Passive construction (action): Kniha byla přeložena, glossed “(The) book has-been translated [by Mr. X]” • [Figure not preserved; callouts: Disappeared; Added]

  14. Tectogrammatical Example • Object: dal mu knihu, glossed “(he) gave him a-book” • Obj goes into ACT, PAT, ADDR, EFF or ORIG based on the governor’s valency frame

  15. Tectogrammatical Example • Incomplete phrases: Petr pracuje dobře, ale Pavel špatně, glossed “Peter works well, but Paul badly” • [Figure not preserved; callout: Added]

  16. The Valency Lexicon • Valency frames • each verb (+ some nouns, adjectives) has “slots” for functor/form pairs, e.g. give: ACT(Nom) PAT(Acc) ADDR(to+Dat) • Basic set prepared in advance, annotators add entries on-the-go, checking and approval process follows (consistency) • Compare: Levin’s Classes, Proposition Bank
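
The frame on slide 16, together with the rule on slide 14 that a surface object receives its functor from the governor's valency frame, can be sketched as a small lookup. The entry for "give" copies the slide; the dictionary layout, the first-match rule, and the name `functor_for` are assumptions made for illustration only.

```python
# Toy valency lexicon: each frame is a list of (functor, surface form) slots.
# The "give" entry copies the slide; everything else here is illustrative.
VALENCY = {
    "give": [("ACT", "Nom"), ("PAT", "Acc"), ("ADDR", "to+Dat")],
}


def functor_for(lemma, form):
    """Return the functor of the first frame slot whose form matches, else None."""
    for functor, slot_form in VALENCY.get(lemma, []):
        if slot_form == form:
            return functor
    return None


print(functor_for("give", "Acc"))     # accusative dependent of "give"
print(functor_for("give", "to+Dat"))  # recipient
print(functor_for("give", "Loc"))     # no slot with this form in the frame
```

A dependent whose form matches no slot returns `None` here; in the annotation scheme described on slide 11 such a dependent would be a candidate for one of the free modifications rather than an actant.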

  17. Deep word order, topic/focus • Deep word order: • from “old” information to the “new” one (left-to-right) at every level (head included) • projectivity by definition • i.e., partial level-based order -> total d.w.o. • Topic/focus/contrastive topic: • attribute of every node • restricted by d.w.o. and other constraints
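
The step "partial level-based order -> total d.w.o." on slide 17 can be sketched as a recursive flattening. This is a minimal sketch assuming each node stores its dependents in two ordered lists, those placed before the head and those placed after it; `TNode` and `deep_word_order` are hypothetical names, and the example trees use English lemmas for readability.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    label: str
    left: list = field(default_factory=list)   # dependents ordered before the head
    right: list = field(default_factory=list)  # dependents ordered after the head


def deep_word_order(node):
    """Flatten the per-level orderings into one total order over all nodes."""
    order = []
    for child in node.left:
        order.extend(deep_word_order(child))
    order.append(node.label)
    for child in node.right:
        order.extend(deep_word_order(child))
    return order


tree = TNode("bake", left=[TNode("baker")], right=[TNode("roll")])
print(deep_word_order(tree))
```

Because every subtree is emitted as one contiguous block, the resulting order is projective by construction, which is one way to read the slide's "projectivity by definition".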

  18. Deep word order, topic/focus • Example: “Baker bakes rolls.” vs. “Baker_IC bakes rolls.” • [Figure not preserved; label: Analytical dep. tree]

  19. TL: Current Status (Feb. 03) • Structure, functors, some grammatemes: 350,000 words • Coreference + topic-focus: started (~10,000 words) • Everything else: 300 sentences • Plan: 55,000 sentences ~ 800,000 words • English, German (automatically), Arabic

  20. The Future • Lexical semantics - WSD • Czech EuroWordnet (Brno, FI MU): 15000 nouns, 4000 verbs • currently being manually annotated (20kW) • Common representation • “language independent” • functors (ok), lemmas (??), grammatemes (?) • structure, TFA, coref: identical (?)

  21. Tools • Morphological dictionary + Tagger(s) • Collins parser (-> analytical level) + Afun • PTB -> AR, deterministic rules • Deterministic transformation AR->TR • Czech & English; for Cze, FUNC labeling • Baseline MT system Eng<->Cze • incl. large dictionary

  22. How can we use it?

  23. Machine Translation • Source --> intermediate --> Target • Intermediate representation: [Interlingua] -> tectogrammatical -> surface synt. • less “work” in the transfer phase • more work in parsing and generation • ...but advantage in multilingual MT application

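  24. The Basic Scheme • The additional three steps • [Pyramid diagram not preserved; labels, source side bottom-up: source sentence, morphology (tagging), morphological layer, parsing, analytical layer, (tectogrammatical) parsing, tectogrammatical layer; top: Transfer; target side top-down: tectogrammatical layer, Generation, analytical layer, linearization (trivial), morphological layer, morph. synthesis (easy), target sentence]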
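
The scheme on slide 24 reads as a chain of stages from source sentence to target sentence. The sketch below only fixes that ordering. It is an assumption read off the flattened diagram labels, and every stage is a hypothetical placeholder that records its own name instead of doing any linguistic processing.

```python
from functools import reduce


def stage(name):
    """Build a placeholder stage that only appends its name to a trace."""
    def run(trace):
        return trace + [name]
    return run


# Stage names come from the slide; the ordering is read off the flattened diagram.
ANALYSIS = [
    stage("morphology (tagging)"),
    stage("parsing"),
    stage("(tectogrammatical) parsing"),
]
TRANSFER = [stage("transfer")]
SYNTHESIS = [
    stage("generation"),
    stage("linearization"),
    stage("morph. synthesis"),
]


def translate(trace=None):
    """Run all stages in order and return the trace of stage names."""
    steps = ANALYSIS + TRANSFER + SYNTHESIS
    return reduce(lambda state, step: step(state), steps, trace or [])


print(" -> ".join(translate()))
```

Keeping analysis, transfer, and synthesis as separate lists mirrors the point on slide 23: a deeper intermediate layer moves work out of transfer and into parsing and generation.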

  25. Types of Correspondence • Original Czech translation too far... • 50% 1:1 • 5% 1:2, 1:0, 0:1, 2:1 • each of the other type (~90 types!) once or twice • Retranslated Czech • 90% 1:1, 1:0, 1:2, 2:1, 0:1 • rest is bad (~40 types)

  26. Comparing Czech and English • Original Czech: Do tohoto “mikrofonu” pak začal zpívat. • English: ‘this „mike“ Les began to sing. • Retranslated Czech: Do tohoto “mikrofonu” začal Les zpívat.

  27. Comparing Czech and Arabic

  28. Comparing Czech and Arabic • English: The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River. • Arabic (transliterated): ‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.

  29. MT Results Czech - English

  30. Question Answering • Question: • Answer: • [Example trees not preserved; figure label: answer]

  31. Question Answering • Subtree match (except wh- words) • Inclusion: question in answer • Answer: subtree corresponding to the wh- word • Yes/no questions: as above, but no wh- word
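
The matching rule on slide 31 can be sketched as tree inclusion with a wildcard. This is a minimal sketch under stated assumptions: nodes are compared by lemma only (no functors), child matching is greedy rather than exhaustive, and `QNode`, `match_at`, and `answer_question` are hypothetical names. The example trees use English lemmas for readability.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    lemma: str
    children: list = field(default_factory=list)
    wh: bool = False  # True for the wh- word in a question tree


def match_at(question, answer):
    """Check whether the question tree is included in the answer tree at this node.

    Returns (matched, subtree bound to the wh- word or None).
    """
    if question.wh:
        return True, answer
    if question.lemma != answer.lemma:
        return False, None
    binding, used = None, set()
    # Match ordinary children first so the wh- word takes a leftover dependent.
    for q_child in sorted(question.children, key=lambda n: n.wh):
        for i, a_child in enumerate(answer.children):
            if i in used:
                continue
            ok, found = match_at(q_child, a_child)
            if ok:
                used.add(i)
                if found is not None:
                    binding = found
                break
        else:
            return False, None  # no unused answer child fits this question child
    return True, binding


def answer_question(question, answer_root):
    """Try the match at every node of the answer tree."""
    stack = [answer_root]
    while stack:
        node = stack.pop()
        ok, binding = match_at(question, node)
        if ok:
            return True, binding
        stack.extend(node.children)
    return False, None


answer = QNode("bake", [QNode("baker", [QNode("remaining")]),
                        QNode("roll", [QNode("famous")])])
who = QNode("bake", [QNode("who", wh=True), QNode("roll")])
ok, found = answer_question(who, answer)
print(ok, found.lemma if found else None)
```

For a yes/no question the same call is used with no wh- node in the question tree, so the result is just the boolean and the binding stays `None`. A fuller version would backtrack over child assignments and compare functors as well as lemmas.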

  32. Question Answering • Synonymy: Be ~ become ~ work ~ ... • Inferences: France got Louis XIV as its king in .... ~ Louis XIV was the king of France in ... • Partial answers: nonempty intersection • More info: Coling ‘82 paper by Jirku

  33. Some pointers • Current version of PDT: v1.0 • morphology + analytical level • 1.3M words (train/dev test/eval test) • http://ufal.mff.cuni.cz/pdt • Projects • http://www.ldc.upenn.edu • LDC2001T10 (PDT v1.0) • http://www.clsp.jhu.edu: Workshop 2002 • Using TL for MT Generation
