This project aims to create a framework for high-accuracy machine translation (MT) of extracted entities, objects, and their relationships from Arabic text. The approach focuses on rapid adaptability to new source languages and scalable entity types. Key methods include designing a targeted elicitation corpus, developing generalized transfer rules for extraction, and using partial parsing techniques for translating matched portions of the source language text. The system will be evaluated against standard MT methods to measure precision, recall, and F1 scores for entities and relationships.
AMTEXT: Extraction-based MT for Arabic
Alon Lavie, Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
Email: {alavie,jgc}@cs.cmu.edu
Project Members: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
Objective
• Develop a framework for high-accuracy MT of extracted entities, objects and their relationships, which is:
  • Rapidly portable and adaptable to new source languages
  • Easily expandable to new types of entities and relationships
ITIC MT Integration Meeting
AMTEXT Approach
• Develop an elicitation corpus specifically designed for targeted extraction patterns
• Learn generalized transfer rules for targeted extraction patterns from elicitation corpus
• Acquire high accuracy Named-Entity translation lexicon + limited translation lexicon for targeted vocabulary
• Runtime: use partial parser + transfer rules to translate only the matched portions of SL text
Elicitation Example
Learning Transfer Rules
• Different notion of rule generalization than in our full XFER approach
• Generalize from examples to NEs that play specific roles in target extraction pattern
• Verbs and function words may not be generalized
• Example:
  Peres will meet with Bush today
  peres yipagesh &im bush hayom
• Goal Rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]
  ((X1::Y1) (X4::Y5) (X5::Y6))
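The goal rule above can be read as a small data structure: a source-side pattern, a target-side pattern, and 1-based alignments saying which source slots fill which target slots. The following is only a minimal sketch of that reading, not the AMTEXT implementation. The names `TransferRule`, `apply_rule`, and the toy `lexicon` are hypothetical, and the input is assumed to be already tagged with NE-P and TE labels.

```python
from dataclasses import dataclass

# Category labels used in the slide's rule; every other slot is a literal word.
CATEGORIES = {"NE-P", "TE"}


@dataclass(frozen=True)
class TransferRule:
    source: tuple      # source-side pattern (literal words or category labels)
    target: tuple      # target-side pattern
    alignments: tuple  # 1-based (source_index, target_index) pairs


# The goal rule from the slide: (X1::Y1) (X4::Y5) (X5::Y6).
GOAL_RULE = TransferRule(
    source=("NE-P", "yipagesh", "&im", "NE-P", "TE"),
    target=("NE-P", "will", "meet", "with", "NE-P", "TE"),
    alignments=((1, 1), (4, 5), (5, 6)),
)


def apply_rule(rule, tagged_tokens, lexicon):
    """Translate tagged_tokens with rule, or return None if the rule does not match.

    tagged_tokens is a list of (word, tag) pairs; tag is a category label or None.
    """
    if len(tagged_tokens) != len(rule.source):
        return None
    for (word, tag), slot in zip(tagged_tokens, rule.source):
        if slot in CATEGORIES:
            if tag != slot:
                return None
        elif word != slot:
            return None
    output = list(rule.target)
    for x, y in rule.alignments:
        word = tagged_tokens[x - 1][0]
        # Aligned slots are filled from the (hypothetical) translation lexicon.
        output[y - 1] = lexicon.get(word, word)
    return " ".join(output)


# Toy lexicon and tagging, assumed for illustration only.
lexicon = {"peres": "Peres", "bush": "Bush", "hayom": "today"}
sentence = [("peres", "NE-P"), ("yipagesh", None), ("&im", None),
            ("bush", "NE-P"), ("hayom", "TE")]

print(apply_rule(GOAL_RULE, sentence, lexicon))  # Peres will meet with Bush today
```

Keeping the verb and function words as literals in both patterns mirrors the slide's point that only the NE and TE slots are generalized.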
Partial Parsing
• Input: Full text in the foreign language
• Output: Translation of extracted/matched text
• Goal: Extract by effectively matching transfer rules with the full text
• Identify/parse NEs and words in restricted vocabulary
• Identify transfer-rule (source-side) patterns
• Handle expected high levels of ambiguity
Peres, meluve b-sar ha-xucshalom, yipagesh im bush hayom
NE-P NE-P NE-P TE
Peres will meet with Bush today
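One way to picture the matching step is as a search for the rule's source-side pattern as an ordered subsequence of the tagged text, skipping material the rule does not cover. The sketch below is illustrative only and is not the AMTEXT partial parser. The function names are hypothetical, tokens are lowercased with punctuation dropped, and the tokenization assumes that "ha-xucshalom" on the slide is two tokens ("ha-xuc", "shalom") with "shalom" carrying the second NE-P tag.

```python
CATEGORIES = {"NE-P", "TE"}

# Source-side pattern of the example rule, using the slide's spelling "im".
PATTERN = ("NE-P", "yipagesh", "im", "NE-P", "TE")

# Assumed tokenization and tagging of the slide's example sentence.
TOKENS = [
    ("peres", "NE-P"), ("meluve", None), ("b-sar", None), ("ha-xuc", None),
    ("shalom", "NE-P"), ("yipagesh", None), ("im", None),
    ("bush", "NE-P"), ("hayom", "TE"),
]


def slot_matches(slot, token):
    """A category slot matches on the tag; a literal slot matches on the word."""
    word, tag = token
    return tag == slot if slot in CATEGORIES else word == slot


def candidate_matches(pattern, tokens, start=0):
    """Yield index tuples that match pattern as an ordered subsequence of tokens.

    Tokens between matched slots are skipped, which is what makes the parse partial.
    """
    if not pattern:
        yield ()
        return
    for i in range(start, len(tokens)):
        if slot_matches(pattern[0], tokens[i]):
            for rest in candidate_matches(pattern[1:], tokens, i + 1):
                yield (i,) + rest


for match in candidate_matches(PATTERN, TOKENS):
    print([TOKENS[i][0] for i in match])
```

Under these assumptions the search yields two candidates, one with "peres" and one with "shalom" in the subject slot, which illustrates the ambiguity the slide mentions. Choosing between candidates is not covered by this sketch.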
Input/Output
• Input:
  • Full text in source language (Arabic)
• Output:
  • English translation of extracted entities and relationships
  • (Possibly also a structured representation)
أعلنت صحيفة القدس العربي ومقرها لندن أنها تلقت الأحد بيانا يتبنى فيه تنظيم القاعدة بزعامة أسامة بن لادن الهجومين اللذين استهدفا كنيسين يهوديين في إسطنبول واللذين أسفرا عن مقتل 23 شخصا وإصابة 300 آخرين. وهدد البيان بتوجيه مزيد من الضربات للولايات المتحدة وحلفائها في جميع أنحاء العالم.
The Abu Hafz al-Masri Brigades - al-Qaida warned car bombs killed 23 people injured 300 others
AMTEXT System
Scope of Pilot System
• Arabic-to-English
• Newswire text (available from TIDES)
• Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
• Limited translation patterns:
  • <subj-NE> <verb> <obj> <LOC>* <TE>*
• Limited vocabulary
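The translation pattern above has the shape of a regular expression over slot labels: one subject NE, one verb, one object, then any number of location and time expressions. As a rough illustration, a scope check could be written as below. The tag names `NE`, `V`, `OBJ`, `LOC`, `TE` and the function `in_scope` are assumptions made for this sketch, and treating `<obj>` as a single tag is a simplification the slide does not spell out.

```python
import re

# <subj-NE> <verb> <obj> <LOC>* <TE>* written over space-separated tag names.
SCOPE_PATTERN = re.compile(r"NE V OBJ( LOC)*( TE)*")


def in_scope(tags):
    """Return True if the tag sequence fits the pilot system's pattern."""
    return SCOPE_PATTERN.fullmatch(" ".join(tags)) is not None


print(in_scope(["NE", "V", "OBJ"]))                     # True
print(in_scope(["NE", "V", "OBJ", "LOC", "TE", "TE"]))  # True
print(in_scope(["NE", "V", "OBJ", "TE", "LOC"]))        # False: LOC after TE
```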
Evaluation Plan
• Compare AMTEXT approach to full-text Arabic-to-English SMT, on a limited task of translation of relations within the scope of coverage
• Establish a test set for evaluation
• Define an appropriate metric: Precision/Recall/F1 of relations and entities
• Compare performance
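For reference, precision, recall, and F1 over extracted relations can be computed as set overlap between system output and a reference set. The sketch below assumes relations are represented as (X, action, Y) tuples and counted as correct only on exact match. The slides do not define the matching criterion, so that choice, the function name, and the toy data are all assumptions.

```python
def precision_recall_f1(predicted, reference):
    """Return (precision, recall, F1) for two collections of hashable items."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data for illustration only; not results from the project.
reference = {("Peres", "meet", "Bush"), ("A", "attend", "B")}
predicted = {("Peres", "meet", "Bush"), ("A", "hold", "B")}

print(precision_recall_f1(predicted, reference))  # (0.5, 0.5, 0.5)
```

The same function applies to entities by passing sets of entity strings instead of relation tuples.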