Columbia University NLP Colloquium

Generation-Heavy Hybrid Machine Translation. Nizar Habash, Postdoctoral Researcher, Center for Computational Learning Systems, Columbia University. Columbia University NLP Colloquium, October 28, 2004.

Presentation Transcript


  1. Generation-Heavy Hybrid Machine Translation. Nizar Habash, Postdoctoral Researcher, Center for Computational Learning Systems, Columbia University. Columbia University NLP Colloquium, October 28, 2004.

  2. The Intuition: Generation-Heavy Machine Translation [Figure labels: Español, عربي, Dictionary, gist, English]

  3. Introduction: Research Contributions • A general, reusable, and extensible Machine Translation (MT) model that transcends the need for large amounts of deep symmetric knowledge • Development of reusable large-scale resources for English • A large-scale Spanish-English MT system, Matador, which is more robust across genres and produces more grammatical output than simple statistical or symbolic techniques

  4. Roadmap • Introduction • Generation-Heavy Machine Translation • Evaluation • Conclusion • Future Work

  5. Introduction: MT Pyramid [Figure: analysis climbs the source side from source word through source syntax to source meaning; generation descends the target side from target meaning through target syntax to target word. Gisting, Transfer, and Interlingua label the crossings between source and target.]

  6. Introduction: MT Pyramid [Figure: the same pyramid, annotated with the resources Dictionaries/Parallel Corpora, Transfer Lexicons, and Interlingual Lexicons.]

  7. Introduction: MT Pyramid [Figure: the same pyramid, showing Gisting between source word and target word and Transfer between source syntax and target syntax.]

  8. Introduction: Why gisting is not enough • Spanish: Sobre la base de dichas experiencias se estableció en 1988 una metodología. • Gist: Envelope her basis out speak experiences them settle at 1988 one methodology. • English: On the basis of these experiences, a methodology was arrived at in 1988.

  9. Introduction: Translation Divergences • Divergences occur in 35% of sentences in the TREC El Norte Corpus (Dorr et al 2002) • Divergence Types • Categorial (X tener hambre → X be hungry) • Conflational (X dar puñaladas a Z → X stab Z) • Structural (X entrar en Y → X enter Y) • Head Swapping (X cruzar Y nadando → X swim across Y) • Thematic (X gustar a Y → Y like X)

  10. Roadmap • Introduction • Generation-Heavy Machine Translation • Evaluation • Conclusion • Future Work

  11. Generation-Heavy Hybrid Machine Translation • Problem: asymmetric resources • High quality, broad coverage, semantic resources for target language • Low quality resources for source language • Low quality (many-to-many) translation lexicon • Thesis: we can approximate interlingual MT without the use of symmetric interlingual resources

  12. Relevant Background Work • Hybrid Natural Language Generation: Constrained Overgeneration → Statistical Ranking. Nitrogen (Langkilde and Knight 1998), Halogen (Langkilde 2002), FERGUS (Rambow and Bangalore 2000) • Lexical Conceptual Structure (LCS) based MT (Jackendoff 1983), (Dorr 1993)

  13. LCS-based MT: Example (Dorr, 1993)

  14. Generation-Heavy Hybrid Machine Translation [Figure: pipeline with stages Analysis, Translation, and Generation; step labels under Generation: Theta Linking, Expansion, Assignment, Linearization, Pruning, Ranking, …]

  15. Matador: Spanish-English GHMT [Figure: the same pipeline from Spanish to English, with Generation performed by EXERGE (Expansive Rich Generation for English); step labels: Theta Linking, Expansion, Assignment, Linearization, Pruning, Ranking.]

  16. GHMT: Analysis • Source language syntactic dependency • Example: Yo le di puñaladas a Juan. • Features of representation • Approximation of predicate-argument structure • Long-distance dependencies [Figure: dependency tree headed by dar, with :subj Yo, :obj puñalada, and :mod a, which takes :obj Juan.]

  17. GHMT: Translation • Lexical transfer but NO structural change • Translation Lexicon
      (tener V) ((have V) (own V) (possess V) (be V))
      (deber V) ((owe V) (should AUX) (must AUX))
      (soler V) ((tend V) (usually AV))
      [Figure: the dependency tree of slide 16 with each Spanish word replaced by its English candidates. dar: ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER. Yo: I, MY, MINE. puñalada: STAB, KNIFE_WOUND. a: AT, BY, INTO, THROUGH, TO. Juan: JOHN.]
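A minimal sketch of what "lexical transfer but no structural change" could look like in code. The tuple-based tree layout, the `LEXICON` dictionary, and the `transfer` function are hypothetical names chosen for illustration; only the word lists come from the slide, and nothing here is Matador's actual implementation.

```python
# Hypothetical sketch: a tree is (word, [(relation, subtree), ...]).
# Candidate lists are copied from the slide; the data layout is an assumption.
LEXICON = {
    "dar": ["ADMINISTER", "CONFER", "DELIVER", "EXTEND", "GIVE",
            "GRANT", "HAND", "LAND", "RENDER"],
    "yo": ["I", "MY", "MINE"],
    "puñalada": ["STAB", "KNIFE_WOUND"],
    "a": ["AT", "BY", "INTO", "THROUGH", "TO"],
    "juan": ["JOHN"],
}


def transfer(tree):
    """Swap each source word for its list of target candidates.

    Relations and tree shape are copied unchanged, so every structural
    decision is deferred to the generation stage.
    """
    word, children = tree
    # Unknown words fall back to themselves (an assumption of this sketch).
    candidates = LEXICON.get(word.lower(), [word.upper()])
    return (candidates, [(rel, transfer(child)) for rel, child in children])


# Dependency tree for "Yo le di puñaladas a Juan."
source = ("dar", [
    (":subj", ("Yo", [])),
    (":obj", ("puñalada", [])),
    (":mod", ("a", [(":obj", ("Juan", []))])),
])

target = transfer(source)
```

The output tree has the same relations in the same places as the input; only the node labels have become sets of English candidates.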

  18. GHMT: Thematic Linking • Syntactic Dependency → Thematic Dependency • Which divergence [Figure: the syntactic tree (:subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO with :obj JOHN) under ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER, beside the thematic tree (Agent I, MY, MINE; Theme STAB, KNIFE_WOUND; Goal JOHN) under EXTEND, GIVE, GRANT, RENDER.]

  19. GHMT: Thematic Linking Resources • Word Class Lexicon
      (:NUMBER "V.13.1.a.ii"
       :NAME "Give - No Exchange"
       :POS V
       :THETA_ROLES (((ag obl) (th obl) (goal obl to)) ((ag obl) (goal obl) (th obl)))
       :LCS_PRIMS (cause go)
       :WORDS (feed give pass pay peddle refund render repay serve))
      • Syntactic-Thematic Linking Map
      (:subj ag instr th exp loc src goal perc mod-poss poss)
      (:obj2 goal src th perc ben)
      (across goal loc)
      (in loc mod-poss perc goal poss prop)
      (to prop goal ben info th exp perc pred loc time)
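One way to read the Syntactic-Thematic Linking Map is as a table from a syntactic relation (or preposition) to the thematic roles it may realize. The sketch below encodes the five entries shown on the slide and filters a verb frame's roles against them. The names `LINKING_MAP` and `possible_roles` are hypothetical, and the real linking procedure is not specified on the slides.

```python
# Entries copied from the slide; the dict-of-sets layout is an assumption.
LINKING_MAP = {
    ":subj": {"ag", "instr", "th", "exp", "loc", "src", "goal",
              "perc", "mod-poss", "poss"},
    ":obj2": {"goal", "src", "th", "perc", "ben"},
    "across": {"goal", "loc"},
    "in": {"loc", "mod-poss", "perc", "goal", "poss", "prop"},
    "to": {"prop", "goal", "ben", "info", "th", "exp", "perc",
           "pred", "loc", "time"},
}


def possible_roles(relation, frame_roles):
    """Return the roles of a verb frame that `relation` is allowed to fill."""
    allowed = LINKING_MAP.get(relation, set())
    return [role for role in frame_roles if role in allowed]


# Roles of the "Give - No Exchange" frame from the Word Class Lexicon.
give_roles = ["ag", "th", "goal"]

subject_options = possible_roles(":subj", give_roles)
to_options = possible_roles("to", give_roles)
across_options = possible_roles("across", give_roles)
```

Under this reading, a subject could fill any of the three roles of the frame, while the preposition "to" cannot introduce an agent.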

  20. GHMT: Thematic Linking • Syntactic Dependency → Thematic Dependency [Figure: syntactic tree under ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER with :subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO with :obj JOHN.]
      ((ADMINISTER V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (CONFER V.37.6.b ((EXP OBL)))
       (DELIVER V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
       (EXTEND V.47.1 ((TH OBL) (MOD-LOC OPT . T)))
       (EXTEND V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (EXTEND V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
       (EXTEND V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (GIVE V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
       (GIVE V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
       (GRANT V.29.5.e ((AG OBL) (INFO OBL THAT)))
       (GRANT V.29.5.d ((AG OBL) (TH OBL) (PROP OBL TO)))
       (GRANT V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (GRANT V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
       (HAND V.11.1 ((AG OBL) (TH OBL) (GOAL OPT TO) (SRC OPT FROM)))
       (HAND V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
       (LAND V.9.10 ((AG OBL) (TH OBL)))
       (RENDER V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
       (RENDER V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
       (RENDER V.10.6.a ((AG OBL) (TH OBL) (MOD-POSS OPT OF)))
       (RENDER V.10.6.a.LOCATIVE ((AG OPT) (SRC OBL) (TH OPT OF))))

  21. GHMT: Thematic Linking • Syntactic Dependency → Thematic Dependency [Figure: syntactic tree under ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER with :subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO with :obj JOHN.]
      ((ADMINISTER V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (CONFER V.37.6.b ((EXP OBL)))
       (DELIVER V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
       (EXTEND V.47.1 ((TH OBL) (MOD-LOC OPT . T)))
       (EXTEND V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (EXTEND V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
       (EXTEND V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (GIVE V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
       (GIVE V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
       (GRANT V.29.5.e ((AG OBL) (INFO OBL THAT)))
       (GRANT V.29.5.d ((AG OBL) (TH OBL) (PROP OBL TO)))
       (GRANT V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
       (GRANT V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
       (HAND V.11.1 ((AG OBL) (TH OBL) (GOAL OPT TO) (SRC OPT FROM)))
       (HAND V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
       (LAND V.9.10 ((AG OBL) (TH OBL)))
       (RENDER V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
       (RENDER V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
       (RENDER V.10.6.a ((AG OBL) (TH OBL) (MOD-POSS OPT OF)))
       (RENDER V.10.6.a.LOCATIVE ((AG OPT) (SRC OBL) (TH OPT OF))))

  22. GHMT: Thematic Linking • Syntactic Dependency → Thematic Dependency [Figure: the syntactic tree (:subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO with :obj JOHN) under ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER, beside the thematic tree (Agent I, MY, MINE; Theme STAB, KNIFE_WOUND; Goal JOHN) under EXTEND, GIVE, GRANT, RENDER.]

  23. Interlingua Approximation through Expansion Operations [Figure: expansion operations with examples. Categorial Variation: development (N) ↔ develop (V). Node Conflation / Inflation: put (V) with butter (N) ↔ butter (V). Relation Conflation / Inflation and Relation Variation: dependency trees over enter, go, in, John, and room with subj and obj links.]

  24. Interlingua Approximation: 2nd Degree Expansion [Figure: cross (subj John, obj river, mod swimming) becomes go (subj John, across river, mod swimming) by Relation Inflation, which becomes swim (subj John, across river) by Node Conflation.]

  25. GHMT: Structural Expansion • Conflation Example [Figure: GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN, beside STAB (V) with Agent I and Goal JOHN.]

  26. GHMT: Structural Expansion • Conflation and Inflation • Structural Expansion Resources • Word Class Lexicon
      (:NUMBER "V.42.2"
       :NAME "Poison Verbs"
       :POS V
       :THETA_ROLES (((ag obl) (goal obl)))
       :LCS_PRIMS (cause go)
       :WORDS (crucify electrocute garrotte hang knife poison shoot smother stab strangle))
      • Categorial Variation Database (Habash and Dorr 2003)
      (:V (hunger) :N (hunger hungriness) :AJ (hungry))
      (:V (validate) :N (validation validity) :AJ (valid))
      (:V (cross) :N (crossing cross) :P (across))
      (:V (stab) :N (stab))
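The Categorial Variation Database entries above group words that share a stem across parts of speech. A small sketch of how such clusters could be queried is given below; the `CATVAR` list mirrors the four entries on the slide, while the `variants` function and its signature are hypothetical.

```python
# The four clusters shown on the slide, keyed by part-of-speech tag.
CATVAR = [
    {"V": ["hunger"], "N": ["hunger", "hungriness"], "AJ": ["hungry"]},
    {"V": ["validate"], "N": ["validation", "validity"], "AJ": ["valid"]},
    {"V": ["cross"], "N": ["crossing", "cross"], "P": ["across"]},
    {"V": ["stab"], "N": ["stab"]},
]


def variants(word, pos, target_pos):
    """Return the `target_pos` forms in the cluster containing `word` as `pos`.

    Returns an empty list when the word is unknown or the cluster has no
    entry for the requested part of speech.
    """
    for cluster in CATVAR:
        if word in cluster.get(pos, []):
            return list(cluster.get(target_pos, []))
    return []


noun_to_verb = variants("stab", "N", "V")
verb_to_prep = variants("cross", "V", "P")
```

A lookup of this kind is what lets an expansion step propose the verb stab when it sees the noun stab, or the preposition across when it sees the verb cross.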

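  27. GHMT: Structural Expansion • Conflation Example [Figure: GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN; STAB (V) is shown alongside.]

  28. GHMT: Structural Expansion • Conflation Example [Figure: GIVE (V) and STAB (V) are both annotated [CAUSE GO]. GIVE takes Agent I, Theme STAB (N), and Goal JOHN; STAB takes Agent and Goal slots marked *.]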

  29. GHMT: Structural Expansion • Conflation Example [Figure: GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN, beside STAB (V) with Agent I and Goal JOHN.]
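The conflation example can be sketched as a small check: the theme noun has a verbal variant, that verb shares the head verb's LCS primitives, and its theta roles match the arguments that remain once the theme is absorbed. This is a simplified reading of the slides, not the conditions Matador actually applies. `WORD_CLASSES`, `NOUN_TO_VERB`, and `conflate` are hypothetical names; the role and primitive values come from the two Word Class Lexicon entries shown in the deck.

```python
# Simplified word-class data: LCS primitives and theta roles per verb,
# taken from the "Give - No Exchange" and "Poison Verbs" entries.
WORD_CLASSES = {
    "give": {"lcs_prims": ("cause", "go"), "roles": ("ag", "th", "goal")},
    "stab": {"lcs_prims": ("cause", "go"), "roles": ("ag", "goal")},
}

# Noun-to-verb categorial variation, here only (:V (stab) :N (stab)).
NOUN_TO_VERB = {"stab": "stab"}


def conflate(verb, args):
    """Try to fold the theme noun into the verb.

    `args` maps theta roles to words. Returns (new_verb, new_args) when
    conflation applies under this sketch's assumptions, otherwise None.
    """
    candidate = NOUN_TO_VERB.get(args.get("th"))
    head = WORD_CLASSES.get(verb)
    target = WORD_CLASSES.get(candidate)
    if candidate is None or head is None or target is None:
        return None
    # Both verbs must share the same LCS primitives, e.g. (cause go).
    if head["lcs_prims"] != target["lcs_prims"]:
        return None
    # The arguments left after absorbing the theme must fit the new frame.
    remaining = {role: word for role, word in args.items() if role != "th"}
    if set(remaining) != set(target["roles"]):
        return None
    return candidate, remaining


result = conflate("give", {"ag": "I", "th": "stab", "goal": "John"})
```

With the example arguments, the GIVE structure with theme STAB collapses to a STAB structure with only an agent and a goal, which matches the pair of trees on the slide.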

  30. GHMT: Syntactic Assignment • Thematic → Syntactic Mapping [Figure: the thematic trees for GIVE (V) (Agent I, MY …; Theme STAB (N), KNIFE_WOUND (N); Goal JOHN) and STAB (V) (Agent I, MY …; Goal JOHN) are mapped to syntactic trees: GIVE with Subject, IObject, and Object; GIVE with Subject, Object, and a Mod headed by TO, AT, … that takes Object JOHN; and STAB with Subject and Object.]

  31. GHMT: Structural N-gram Pruning • Statistical lexical selection [Figure: each syntactic tree of slide 30 is shown before and after pruning; candidate lists such as I, MY …; STAB (N), KNIFE_WOUND (N); and TO, AT, … are reduced to I, STAB (N), and TO.]

  32. GHMT: Target Statistical Resources • Structural N-gram Model • Long-distance • Lexemes • Surface N-gram Model • Local • Surface-forms [Figure: "every cloud has a silver lining" shown twice, once as a dependency tree over lexemes (have) and once as a surface string (has).]
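The contrast between the two models can be illustrated by extracting bigrams both ways from the slide's example sentence. In the sketch below, the dependency tree for "every cloud has a silver lining" is an assumed standard analysis rather than something read off the slide, and the function names are hypothetical.

```python
def structural_bigrams(tree):
    """Collect (head lexeme, child lexeme) pairs from a dependency tree.

    A tree is (lexeme, [subtrees]). Pairs follow tree edges, so they can
    relate words that are far apart in the surface string.
    """
    head, children = tree
    pairs = []
    for child in children:
        pairs.append((head, child[0]))
        pairs.extend(structural_bigrams(child))
    return pairs


def surface_bigrams(tokens):
    """Collect adjacent pairs of surface forms."""
    return list(zip(tokens, tokens[1:]))


# Assumed analysis: "have" heads "cloud" and "lining".
tree = ("have", [
    ("cloud", [("every", [])]),
    ("lining", [("a", []), ("silver", [])]),
])
tokens = "every cloud has a silver lining".split()

structural = structural_bigrams(tree)
surface = surface_bigrams(tokens)
```

The structural list contains the pair (have, lining), a verb-object relation over lexemes, whereas the surface list never pairs "has" with "lining" because three positions separate them.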

  33. GHMT: Linearization & Ranking • Oxygen Linearization (Habash 2000) • Halogen Statistical Ranking (Langkilde 2002)
      I stabbed John .               [-1.670270]
      I gave a stab at John .        [-2.175831]
      I gave the stab at John .      [-3.969686]
      I gave an stab at John .       [-4.489933]
      I gave a stab by John .        [-4.803054]
      I gave a stab to John .        [-5.045810]
      I gave a stab into John .      [-5.810673]
      I gave a stab through John .   [-5.836419]
      I gave a knife wound by John . [-6.041891]
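The final step amounts to sorting the linearized candidates by score and keeping the best one. The sketch below assumes the bracketed numbers are log-scale scores where a higher (less negative) value is better; the `rank` function is a hypothetical stand-in, not Halogen's interface. A subset of the slide's candidates is used, deliberately shuffled.

```python
# (sentence, score) pairs copied from the slide, in scrambled order.
candidates = [
    ("I gave a stab to John .", -5.045810),
    ("I gave a knife wound by John .", -6.041891),
    ("I stabbed John .", -1.670270),
    ("I gave the stab at John .", -3.969686),
    ("I gave a stab at John .", -2.175831),
]


def rank(scored):
    """Sort candidates from best to worst score (assumed: higher is better)."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank(candidates)
best_sentence = ranked[0][0]
```

Under that assumption the conflated form "I stabbed John ." comes out on top, as on the slide.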

  34. Roadmap • Introduction • Generation-Heavy Machine Translation • Evaluation • Overall Evaluation • Component Evaluation • Conclusion • Future Work

  35. Overall Evaluation: Systems [Table not preserved; sources cited on the slide: (Resnik 1997), (Brown et al 1990), (Al-Onaizan et al 1999), (Germann and Marcu 2000)]

  36. Overall Evaluation: Bleu Metric • Bleu: BiLingual Evaluation Understudy (Papineni et al 2001) • Modified n-gram precision with length penalty • Quick, inexpensive, and language independent • Correlates highly with human evaluation • Bias against synonyms and inflectional variations
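As a rough illustration of "modified n-gram precision with length penalty", here is a sentence-level sketch. Bleu as published is computed over a whole test corpus, so this per-sentence version is a simplification, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the single most generous reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Geometric mean of modified precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    c = len(candidate)
    # Reference length closest to the candidate length (shorter on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "a methodology was arrived at in 1988".split()
perfect = bleu(reference, [reference])
repeated = modified_precision("the the the the the the the".split(),
                              ["the cat is on the mat".split()], 1)
```

Clipping is what keeps a degenerate output such as "the the the the the the the" from scoring a perfect unigram precision, and exact n-gram matching is also the source of the bias against synonyms and inflectional variants noted on the slide.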

  37. Overall Evaluation: Test Sets

  38. Overall Evaluation: Results

  39. Overall Evaluation: Results • Systran is overall best • Gist is overall worst • Matador is more robust than IBM4 • Matador is more grammatical than IBM4 • Matador has less information loss than IBM4

  40. Overall Evaluation: Grammaticality • Example
      SP: Ademàs dijo que solamente una inyecciòn masiva de capital extranjero ...
      EN: Further, he said that only a massive injection of foreign capital ...
      IBM4: further stated that only a massive inyecciòn of capital abroad ...
      MTDR: Also he spoke only a massive injection of foreign capital ...
      • Parsed all sentences (Spanish, English reference, and English output) • Can we find main verb? • Pro Drop Restoration

  41. Overall Evaluation: Grammaticality: Verb Determination

  42. Overall Evaluation: Grammaticality: Subject Realization

  43. Overall Evaluation: Loss of Information • Example
      SP: El daño causado al pueblo de Sudáfrica jamás debe subestimarse.
      EN: The damage caused to the people of his country should never be underestimated.
      IBM4: the damage * the people of south * must never underestimated .
      MTDR: Never the causado damage to the people of South Africa should be underestimated.

  44. Component Evaluation • Conducted several component evaluations • Parser: ~75% correct (labeled dependency links) • Categorial Variation Database: 81% Precision-Recall • Structural Expansion • Structural N-grams

  45. Component Evaluation: Structural Expansion • Insignificant increase in Bleu score • 40% of divergences pragmatic • LCS lexicon coverage issues • Minimal handling of nominal divergences • Over-expansion
      SP: Además, destruyó totalmente sus cultivos de subsistencia …
      EN: It had totally destroyed Samoa's staple crops ...
      MTDR: Furthermore, it totaled their cultivations of subsistence …
      SP: Dicha adición se publica sólo en años impares.
      EN: That addendum is issued in odd-numbered years only.
      MTDR: concerned addendum is excluded in odd years.

  46. Component Evaluation: Structural N-grams • 60% speed-up with no effect on quality

  47. Roadmap • Introduction • Generation-Heavy Machine Translation • Evaluation • Conclusion • Future Work

  48. Conclusion: Research Contributions • A general, reusable, and extensible MT model that transcends the need for large amounts of symmetric knowledge • A systematic non-interlingual/non-transfer framework for handling translation divergences • Extending the concept of symbolic overgeneration to include conflation and head-swapping of structural variations • A model for language-independent syntactic-to-thematic linking

  49. Conclusion: Research Contributions • Development of reusable large-scale modules and resources: Exerge, Categorial Variation Database, etc. • A large-scale Spanish-English GHMT implementation • An evaluation of Matador against four models of machine translation found it to be robust across genres and to produce more grammatical output.

  50. Ongoing Work • Retargetability to new languages: Chinese, Arabic • Extending system to use bi-texts: phrase dictionary, weighted translation pairs • Generation-Heavy parsing: small dependency grammar for foreign language, English structural n-grams to rank parses • Extending system with new optional modules • Cross-lingual headline generation: DepTrimmer (work with Bonnie Dorr), extending Trimmer (Dorr et al. 2003) to dependency representation
