
AMTEXT: Extraction-based MT for Arabic



Presentation Transcript


  1. AMTEXT: Extraction-based MT for Arabic
     Faculty: Alon Lavie, Jaime Carbonell
     Students and Staff: Laura Kieras, Peter Jansen
     Informant: Loubna El Abadi

  2. Goals and Approach
     • Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
     • Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
     • But how do we extract just the relevant parts in the source language?
     • AMTEXT approach:
       • Learn extraction patterns and their translations from small amounts of human-translated and aligned data
       • Combine with broad-coverage Named-Entity translation lexicons
     • System output: translation of extracted information + a structured representation
     DoD KDL Visit

  3. AMTEXT Extraction-based MT
     [Figure: system architecture diagram. Labels: Word-aligned elicited data; Source Text; Learning Module; Run Time Extract Transfer System; Transfer Rules; Filled Template; Partial Parser & Transfer Engine; Post-processor; Extractor; Extracted Target Text; NE Translation Lexicon; Word Translation Lexicon]
     Sample transfer rule shown in the figure:
       S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
       ((X1::Y1) (X4::Y4) (X5::Y5))

  4. Elicitation Example

  5. Learning Extraction Translation Patterns
     • Elicited example:
       Sharon nifgash hayom im bush
       Sharon met with Bush today
     • After generalization:
       <PERSON> <MEET-V> <TE> im <PERSON>
       <PERSON> <MEET-V> with <PERSON> <TE>
     • Resulting learned pattern rule:
       S::S : [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
       ((X1::Y1) (X2::Y2) (X3::Y5) (X5::Y4))
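The generalization step on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the AMTEXT learning module: the type lexicons `SRC_TYPES` and `TGT_TYPES`, the function names, and the 1-based alignment list are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical type lexicons (assumption): word -> generalized slot type.
SRC_TYPES: Dict[str, str] = {
    "sharon": "PERSON", "bush": "PERSON", "nifgash": "MEET-V", "hayom": "TE",
}
TGT_TYPES: Dict[str, str] = {
    "sharon": "PERSON", "bush": "PERSON", "met": "MEET-V", "today": "TE",
}


def generalize(tokens: List[str], types: Dict[str, str]) -> List[str]:
    """Replace each known word by its type; unknown words stay as literals."""
    return [types.get(tok.lower(), tok) for tok in tokens]


def learn_rule(src: str, tgt: str, alignment: List[Tuple[int, int]]) -> str:
    """Build a pattern rule from one word-aligned example.

    `alignment` holds 1-based (source index, target index) pairs.
    """
    src_pat = generalize(src.split(), SRC_TYPES)
    tgt_pat = generalize(tgt.split(), TGT_TYPES)
    typed = set(SRC_TYPES.values())
    # Keep only links between generalized slots of the same type;
    # literal words such as "im" / "with" stay lexical in the rule.
    links = [
        (x, y) for x, y in alignment
        if src_pat[x - 1] in typed and src_pat[x - 1] == tgt_pat[y - 1]
    ]
    constraints = " ".join(f"(X{x}::Y{y})" for x, y in sorted(links))
    return (f"S::S : [{' '.join(src_pat)}] -> "
            f"[{' '.join(tgt_pat)}] ( {constraints})")


rule = learn_rule(
    "Sharon nifgash hayom im bush",
    "Sharon met with Bush today",
    [(1, 1), (2, 2), (3, 5), (4, 3), (5, 4)],
)
print(rule)
```

With the alignment given above, the printed rule has the same slots and alignments as the learned pattern rule on the slide.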

  6. Transfer Rule Formalism
     A rule specifies:
     • Type information
     • Part-of-speech/constituent information
     • Alignments
     • x-side constraints
     • y-side constraints
     • xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
     Example rule:
       ;SL: the old man, TL: ha-ish ha-zaqen
       NP::NP [DET ADJ N] -> [DET N DET ADJ]
       (
       (X1::Y1)
       (X1::Y3)
       (X2::Y4)
       (X3::Y2)
       ((X1 AGR) = *3-SING)
       ((X1 DEF) = *DEF)
       ((X3 AGR) = *3-SING)
       ((X3 COUNT) = +)
       ((Y1 DEF) = *DEF)
       ((Y3 DEF) = *DEF)
       ((Y2 AGR) = *3-SING)
       ((Y2 GENDER) = (Y4 GENDER))
       )
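One way to picture the parts of such a rule is as a small data structure. The sketch below is an assumption-laden illustration, not the transfer engine's internal representation: the `TransferRule` class, the `satisfied` function, and the gender values "M"/"F" are invented for the example, only a subset of the slide's constraints is encoded, and constraints are checked by plain equality rather than by unification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Ref = Tuple[str, str]       # e.g. ("Y2", "GENDER")
Value = Union[str, Ref]     # a constant such as "*DEF", or another reference


@dataclass
class TransferRule:
    src_type: str
    tgt_type: str
    src_rhs: List[str]
    tgt_rhs: List[str]
    alignments: List[Tuple[int, int]]          # (X index, Y index), 1-based
    constraints: List[Tuple[Ref, Value]] = field(default_factory=list)


def satisfied(rule: TransferRule, feats: Dict[str, Dict[str, str]]) -> bool:
    """Check every constraint by simple equality (a simplification).

    `feats` maps constituent names such as "X1" or "Y4" to feature dicts.
    """
    def lookup(ref: Ref):
        node, attr = ref
        return feats.get(node, {}).get(attr)

    for left, right in rule.constraints:
        expected = lookup(right) if isinstance(right, tuple) else right
        if lookup(left) != expected:
            return False
    return True


# The NP rule from the slide, with a subset of its constraints.
NP_RULE = TransferRule(
    src_type="NP", tgt_type="NP",
    src_rhs=["DET", "ADJ", "N"],
    tgt_rhs=["DET", "N", "DET", "ADJ"],
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
    constraints=[
        (("X1", "DEF"), "*DEF"),                # x-side constraint
        (("X3", "AGR"), "*3-SING"),             # x-side constraint
        (("Y2", "AGR"), "*3-SING"),             # y-side constraint
        (("Y2", "GENDER"), ("Y4", "GENDER")),   # agreement between y-side nodes
    ],
)

feats = {
    "X1": {"DEF": "*DEF"},
    "X3": {"AGR": "*3-SING"},
    "Y2": {"AGR": "*3-SING", "GENDER": "M"},
    "Y4": {"GENDER": "M"},
}
print(satisfied(NP_RULE, feats))
```

Note that X1 appears in two alignments, which is how the single source determiner is realized twice on the target side.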

  7. The Transfer Engine

  8. Partial Parsing
     • Input: full text in the foreign language
     • Output: translation of extracted/matched text
     • Goal: extract by effectively matching transfer rules against the full text
     • Identify/parse NEs and words in the restricted vocabulary
     • Identify transfer-rule (source-side) patterns
     • Transfer Engine produces a complete lattice of transfer translations
     Example:
       Source: Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
       Tags:   NE-P NE-P NE-P TE
       Output: Sharon will meet with Bush today
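The idea of matching a source-side pattern while skipping unrelated material can be sketched as follows. This is a greedy, single-pattern toy, whereas the slide describes an engine that produces a full lattice. The lexicon entries, the pattern `[PERSON MEET-V im PERSON TE]`, and all function names are assumptions made for the illustration.

```python
import re
from typing import List, Optional

# Hypothetical restricted lexicon (assumption): word -> (type, translation).
LEXICON = {
    "sharon": ("PERSON", "Sharon"),
    "shalom": ("PERSON", "Shalom"),
    "bush": ("PERSON", "Bush"),
    "yipagesh": ("MEET-V", "will meet"),
    "hayom": ("TE", "today"),
}

# Hypothetical rule in the spirit of the slides; "im" and "with" are literals.
SRC_PATTERN = ["PERSON", "MEET-V", "im", "PERSON", "TE"]
TGT_PATTERN = ["PERSON", "MEET-V", "with", "PERSON", "TE"]
ALIGN = {1: 1, 2: 2, 4: 4, 5: 5}   # target slot -> source slot (1-based)


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a word character or hyphen."""
    return re.findall(r"[\w-]+", text.lower())


def match_with_skips(tokens: List[str]) -> Optional[List[str]]:
    """Greedily match SRC_PATTERN left to right, skipping unmatched tokens."""
    matched: List[str] = []
    slot = 0
    for tok in tokens:
        if slot == len(SRC_PATTERN):
            break
        want = SRC_PATTERN[slot]
        tok_type = LEXICON.get(tok, (None, None))[0]
        if tok == want or tok_type == want:
            matched.append(tok)
            slot += 1
    return matched if slot == len(SRC_PATTERN) else None


def transfer(matched: List[str]) -> str:
    """Fill the target pattern using the alignments and the lexicon."""
    out = []
    for y, item in enumerate(TGT_PATTERN, start=1):
        if y in ALIGN:
            out.append(LEXICON[matched[ALIGN[y] - 1]][1])
        else:
            out.append(item)   # target-side literal
    return " ".join(out)


sentence = "Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom"
hit = match_with_skips(tokenize(sentence))
print(transfer(hit) if hit else "no match")
```

On the slide's example sentence, the material between the first name and the verb is skipped and the sketch prints "Sharon will meet with Bush today".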

  9. Post Processing
     • Translation Selection Module:
       • Select the most complete and coherent translation from the lattice based on scoring heuristics
     • Structure Extraction:
       • Extract translated entities from the pattern and display them in a structured table format
     • Output Display:
       • Perl scripts construct an HTML page for displaying complete translation results

  10. Translation Selection Module: Features
      • Goal: a scoring function that can identify the most likely best match
      • Lattice arc features from the transfer engine:
        • matched range of source
        • matched parts of target
        • transfer score
        • partial parse

  11. Lattice Example
      Arafat to meet Peres in Brussels on Monday
      ErfAt yltqy byryz msAA AlAvnyn fy brwksl

      (1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
      (2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
      (3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
      (1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
      (4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
      (5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
      (4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
      (1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
      (1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
      (6 6 "fy" 3 "fy" "(UNK,2 "fy")")
      (7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
      (6 7 "in Brussels" 2.9 "fy brwksl" "((LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) )")
      (1 7 "Arafat will meet with Peres in Brussels on Monday" 3.4 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
      (1 7 "Arafat will meet with Peres in Brussels Monday" 3.3 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,5 (DAY,0 "Monday") ) ) )")
      (1 7 "Arafat will meet with Peres in Brussels" 3.2 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,8 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) ) )")
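A transfer structure string like the ones in this lattice can be turned into a filled template with a small recursive reader. The sketch below is only an illustration of the idea, not the project's post-processor (which the slides say was built with Perl scripts). The function names and the `P1`/`P2` numbering of PERSON constituents are assumptions.

```python
import re
from typing import Dict, List, Union

Node = Union[str, List["Node"]]

# Tokens: parentheses, double-quoted strings, or bare labels such as "S,9".
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(text: str) -> Node:
    """Read one parenthesized structure into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node: List[Node] = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1          # consume the closing parenthesis
            return node
        return tok

    return read()


def leaves(node: Node) -> List[str]:
    """Collect the quoted target words under a node, in order."""
    if isinstance(node, str):
        return [node[1:-1]] if node.startswith('"') else []
    words: List[str] = []
    for child in node:
        words.extend(leaves(child))
    return words


def fill_template(structure: str) -> Dict[str, str]:
    """Map each top-level constituent of the S frame to its translated text."""
    frame = parse(structure)[0]        # the outer list wraps a single frame
    template: Dict[str, str] = {}
    persons = 0
    for child in frame[1:]:            # frame[0] is the "S,<n>" label
        label = child[0].split(",")[0]
        if label == "PERSON":          # number the participants (assumption)
            persons += 1
            label = f"P{persons}"
        template[label] = " ".join(leaves(child))
    return template


structure = ('((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
             '(PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )')
print(fill_template(structure))
```

For the structure shown, the template maps `P1` to "Arafat", `MEET-V` to "will meet with", `P2` to "Peres", and `TE` to "on Monday".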

  12. Example: Extracting Features
      • 1 5 → length (tokens) of source segment (ar) (1)
      • "Arafat will meet with Peres Monday" → length of translated segment (2)
      • 3.1 → transfer engine score (3)
      • "ErfAt yltqy byryz msAA AlAvnyn" (tokens 1 2 3 4 5) → length of source segment (4)
      • "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )" → transfer structure: full frame (S) or not? (5)
      • Secondary feature (6): relative length of (2) over (4): the smaller, the more concise the source language match (less extraneous material, i.e. less chance of mistranslation)
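The six features listed on this slide could be computed from a lattice arc roughly as below. The `Arc` class and `extract_features` function are hypothetical names, and the slide does not state the units of features (2) and (4); this sketch assumes whitespace-separated token counts for both.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Arc:
    """One lattice arc, with fields in the order they appear in the lattice."""
    start: int
    end: int
    target: str
    score: float
    source: str
    structure: str


def extract_features(arc: Arc) -> Dict[str, Union[int, float, bool]]:
    source_range = arc.end - arc.start + 1                    # feature (1)
    target_len = len(arc.target.split())                      # feature (2), assumed tokens
    source_len = len(arc.source.split())                      # feature (4), assumed tokens
    full_frame = arc.structure.lstrip().startswith("((S,")    # feature (5)
    ratio = target_len / source_len if source_len else 0.0    # feature (6)
    return {
        "source_range": source_range,
        "target_len": target_len,
        "engine_score": arc.score,                            # feature (3)
        "source_len": source_len,
        "full_frame": full_frame,
        "length_ratio": ratio,
    }


arc = Arc(
    start=1, end=5,
    target="Arafat will meet with Peres Monday",
    score=3.1,
    source="ErfAt yltqy byryz msAA AlAvnyn",
    structure='((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
              '(PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )',
)
print(extract_features(arc))
```

Under the token-count assumption, the example arc has a source range of 5, a target length of 6, and a full S frame.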

  13. Selecting Best Translation
      For each parse Pj in the lattice, calculate a score Sj based on features fi with weight coefficients wi [formula not preserved in the transcript]
      Weights wi trained by hill climbing (training set / manual reference parse)
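The scoring formula itself does not survive in this transcript. A minimal sketch, assuming the score is a weighted sum of the feature values (Sj = sum over i of wi * fi(Pj)) and assuming a simple coordinate-wise hill climber over the weights, might look like this. The data layout, step size, and function names are all hypothetical, and the toy data is invented for the illustration.

```python
from typing import List, Sequence, Tuple

Features = Sequence[float]
# One training example (assumed layout): feature vectors for the candidate
# parses of one lattice, plus the index of the manual reference parse.
Example = Tuple[List[Features], int]


def score(weights: Sequence[float], feats: Features) -> float:
    """Assumed scoring form: weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, feats))


def best_index(weights: Sequence[float], candidates: List[Features]) -> int:
    """Index of the highest-scoring candidate (first one wins ties)."""
    return max(range(len(candidates)), key=lambda j: score(weights, candidates[j]))


def accuracy(weights: Sequence[float], data: List[Example]) -> float:
    """Fraction of lattices whose top-scored parse is the reference parse."""
    hits = sum(1 for cands, ref in data if best_index(weights, cands) == ref)
    return hits / len(data)


def hill_climb(data: List[Example], n_features: int,
               step: float = 0.5, max_rounds: int = 50):
    """Nudge one weight at a time; keep a change only if accuracy improves."""
    weights = [1.0] * n_features
    best = accuracy(weights, data)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_features):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                acc = accuracy(trial, data)
                if acc > best:
                    weights, best, improved = trial, acc, True
        if not improved:
            break
    return weights, best


# Toy data, invented for illustration only.
toy = [
    ([(1.0, 0.0), (0.0, 1.0)], 1),
    ([(2.0, 1.0), (1.0, 3.0)], 1),
]
print(hill_climb(toy, n_features=2))
```

Hill climbing of this kind only finds a local optimum, so the starting weights and step size matter; the slide does not say which variant was used.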

  14. “Proof-of-Concept” System
      • Arabic-to-English
      • Newswire text (available from TIDES)
      • Very limited set of actions: (X meet Y)
      • Limited collection of translation patterns:
        • <Person-NE> <meet-verb> <Person-NE> <LOC>* <TE>*
      • Limited vocabulary and NE lexicon

  15. System Development
      • Training corpus of 535 short sentences translated and aligned by a bilingual informant:
        • 258 simple meeting sentences
        • 120 Temporal Expressions
        • 105 Location Expressions
        • 52 Title Expressions
      • Translation lexicon of Named Entities (person names, organizations and locations) converted from Fei Huang’s NE translation/transliteration work
      • Pattern generalizations semi-automatically “learned” from the training data
      • Patterns manually enhanced with “skipping markers”
      • Initial system integrated
      • Development with the informant on a 74-sentence development set

  16. Resulting System
      • Transfer Grammar contains:
        • 21 transfer pattern rules
        • 12 Meet Verb rules
        • 4/17/11/17 Person/TE/LOC/PTitle “high-level” rules
      • Transfer Lexicon contains 3070 entries (mostly names and locations)
      • Estimated development effort/time:
        • ~20 hours with informant
        • ~50 hours of lexical and rule development

  17. Evaluation
      • Development set of 74 sentences
      • Test set of 76 unseen sentences with meeting information
      • Identified the subset of each set on which meeting patterns could potentially apply (“Good”):
        • 53 development sentences
        • 44 test sentences

  18. Evaluation
      • Translation-based:
        • Unigram token-based retrieval metrics: precision / recall / F1
      • Entity-based:
        • Recall for each role in the meeting frame (V, P1, P2, LOC and TE)
        • Partial recall credit for partial matches
        • Partial credit (50%) for P1/P2 role interchange
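Both kinds of metric can be sketched briefly. The unigram precision/recall/F1 below uses the standard bag-of-words overlap. The role-level scoring is a guess at how the partial credits might be applied: the slide gives the 50% figure for P1/P2 interchange but not the exact procedure, so the token-overlap credit for partial matches, the swap handling, and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, Tuple

ROLES = ("V", "P1", "P2", "LOC", "TE")


def unigram_prf(hypothesis: str, reference: str) -> Tuple[float, float, float]:
    """Unigram precision, recall and F1 with clipped (multiset) overlap."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_recall(hyp: str, ref: str) -> float:
    """Fraction of reference tokens found in the hypothesis (partial match)."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    return sum((h & r).values()) / sum(r.values()) if r else 0.0


def role_recall(hyp_frame: Dict[str, str], ref_frame: Dict[str, str]) -> Dict[str, float]:
    """Recall credit per role present in the reference frame."""
    credit: Dict[str, float] = {}
    for role in ROLES:
        if role in ref_frame:
            credit[role] = token_recall(hyp_frame.get(role, ""), ref_frame[role])
    # Assumed handling of role interchange: if a participant slot got no
    # direct credit, give half the credit earned by the opposite slot.
    for a, b in (("P1", "P2"), ("P2", "P1")):
        if credit.get(a) == 0.0 and b in hyp_frame:
            credit[a] = 0.5 * token_recall(hyp_frame[b], ref_frame[a])
    return credit


p, r, f1 = unigram_prf(
    "Arafat will meet with Peres in Brussels Monday",
    "Arafat will meet with Peres in Brussels on Monday",
)
print(round(p, 3), round(r, 3), round(f1, 3))
```

In the usage example the hypothesis omits one reference token ("on"), so precision is 1.0 while recall is 8/9.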

  19. Evaluation Results

  20. Demonstration
      http://www-2.cs.cmu.edu/afs/cs/user/alavie/Avenue/tmp/demo20sep/met.dev.htm

  21. Conclusions
      • Attractive methodology for joint extraction + translation of Essential Elements of Information (EEIs) from full foreign-language texts
      • Rapid development: circumvents the need to develop high-quality full MT or high-quality IE technology for the foreign source language
      • Effective use of bilingual informants
      • Main open question: scalability
        • Can this methodology be effective with much broader and more complex types of extracted EEIs?
        • Is automatic learning of generalized patterns feasible and effective in such more complex scenarios?
        • Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?
