This paper presents a novel approach to training a dependency parser optimized for machine translation (MT) reordering. It describes a targeted self-training algorithm that improves downstream performance, particularly for English-to-Japanese translation, evaluating both the intrinsic quality of the parses (attachment scores against gold annotations) and their extrinsic quality (reordering score, BLEU, and human judgments). Experiments indicate that targeted self-training consistently yields better translation quality than baseline systems, demonstrating the importance of effective reordering and parsing in MT.
Training a Parser for Machine Translation Reordering
Jason Katz-Brown, Slav Petrov, Ryan McDonald, Franz Och, David Talbot, Hiroshi Ichikawa, Masakazu Seno, Hideto Kazawa
Dependency Parsing
• Given a sentence, label the dependencies between its words
• (example from nltk.org)
• Output is useful for downstream tasks like machine translation
• Also of interest to NLP researchers
Overview of Paper
• Motivation
• Targeted Self-Training Algorithm
• MT experiments
• Domain adaptation
Motivation - Evaluation
• Intrinsic
  • How well does the system replicate gold annotations?
  • Precision/recall/F1, accuracy, BLEU, ROUGE, etc.
• Extrinsic
  • How useful is the system for some downstream task?
• High performance on one doesn't necessarily mean high performance on the other
• Can be hard to evaluate extrinsically
Motivation
• Parsing is not a stand-alone task
• Useful as part of a larger system
• High-fidelity replication of gold parses won't necessarily yield the best downstream performance
• Try to train a model that will yield better downstream performance than a model trained to replicate the gold standard
• Maximize extrinsic quality, rather than intrinsic
Targeted Self-Training Algorithm
• For each sentence S in a corpus:
  • Parse S with a baseline parser, get the k-best parses
  • Choose the parse of S that optimizes some function F, add it to the training data
• Retrain the parser (sketched below)
• F measures the extrinsic quality of the parse
• Finding a good F can be challenging!
• Standard self-training: just choose the 1-best parse
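A minimal sketch of this loop in Python, assuming hypothetical interfaces: parse_kbest, extrinsic_score, and retrain are placeholders standing in for the baseline parser, the function F, and the training procedure, none of which come from the paper's code.

def targeted_self_training(parse_kbest, extrinsic_score, retrain,
                           treebank, corpus, k=8):
    """One round of targeted self-training.

    parse_kbest(sentence, k): baseline parser, returns k candidate parses
    extrinsic_score(sentence, parse): the function F (downstream quality)
    retrain(examples): trains a new parser on (sentence, parse) pairs
    """
    selected = []
    for sentence in corpus:
        candidates = parse_kbest(sentence, k)
        # Standard self-training would simply keep candidates[0];
        # targeted self-training keeps the candidate that maximizes F.
        best = max(candidates,
                   key=lambda parse: extrinsic_score(sentence, parse))
        selected.append((sentence, best))
    # Retrain on the original treebank plus the newly selected parses.
    return retrain(treebank + selected)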
Reordering
• Reordering is changing source-language word order into target-language word order
• Here: English (SVO) to Japanese (SOV)
• Metrics that account for word order correlate better with human judgment than those that only measure word choice
• Can use manually or automatically derived tree transforms to reorder (see the toy sketch below)
• Reordering is useful as a preprocessing step
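As a toy illustration (not the paper's actual rule set), here is one hand-crafted head-final transform over a dependency tree: a verb's right-side dependents are moved in front of the head, turning English SVO into Japanese-like SOV order.

def reorder(node):
    """node: {'word': str, 'pos': str, 'left': [...], 'right': [...]}
    with 'left'/'right' holding dependent subtrees on each side."""
    left = [w for child in node["left"] for w in reorder(child)]
    right = [w for child in node["right"] for w in reorder(child)]
    if node["pos"] == "VERB":
        # Head-final rule: the verb's right dependents (e.g. the object)
        # come before the head, so the verb is emitted last.
        return left + right + [node["word"]]
    return left + [node["word"]] + right

noun = lambda w: {"word": w, "pos": "NOUN", "left": [], "right": []}
clause = {"word": "ate", "pos": "VERB",
          "left": [noun("John")], "right": [noun("sushi")]}
print(" ".join(reorder(clause)))  # John sushi ate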
Reordering
• Reordering is evaluated as its own step
• Function to evaluate reordering quality, given a gold reordering: 1 - ((#chunks - 1) / (#words - 1))
• Chunks are maximal spans that are contiguous in both the predicted and gold orders
• Example - Prediction: A B E C D; Gold: A B C D E
  • 3 chunks (A B | E | C D), so the score is 1 - ((3 - 1) / (5 - 1)) = 0.5
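A minimal implementation of this score, assuming the predicted and gold orders are permutations of the same unique tokens:

def reordering_score(predicted, gold):
    """1 - ((#chunks - 1) / (#words - 1)); a chunk is a maximal span
    that is contiguous and in the same order in both sequences."""
    gold_position = {word: i for i, word in enumerate(gold)}
    indices = [gold_position[word] for word in predicted]
    # A new chunk starts whenever two adjacent predicted words are
    # not adjacent (in order) in the gold sequence.
    chunks = 1 + sum(1 for a, b in zip(indices, indices[1:]) if b != a + 1)
    return 1 - (chunks - 1) / (len(predicted) - 1)

print(reordering_score(list("ABECD"), list("ABCDE")))  # 0.5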
Parsing and Reordering
• Different parses yield different reorderings
• Reordering systems tend to be sensitive to parse errors
MT Experiment Setup
• Train a baseline Nivre-style dependency parser on the WSJ (a Berkeley parser is also compared)
• English/Japanese corpus with literal translations and manual word alignments
  • 6,268 training / 7,327 test sentences
• Alignment annotators need very little training
  • Makes this data relatively cheap
  • In contrast, annotating dependency parses requires a lot of training
MT Experiment Setup
• Use hand-crafted rules for reordering
• Phrase-based MT system
• Train the parser in 3 ways:
  • Baseline
  • Standard self-training
  • Targeted self-training
• Look at:
  • Labeled attachment score (LAS; intrinsic)
  • Reordering score
  • MT quality (BLEU and human)
Results
• Evaluated MT quality with BLEU and human raters
• Varied the training of the dependency parser that feeds into the reordering component
• Experiments in Korean, Japanese, and Turkish (all SOV languages)
• In all cases, BLEU and human opinion improve with targeted self-training (10x) compared to the baseline parser
• Humans still rate the translation quality in the "some meaning/grammar" range (~2.5/6)
• The improvement is not drastic
Domain Adaptation Experiment
• Goal: use the Question Treebank (QTB) to make the MT system translate questions better than the baseline system
• Have 2k questions parsed
• Have 2k questions translated and annotated for reordering
• Compare translation output from systems whose parsers were trained in different ways
Results
• BLEU score and human opinion of Japanese translations of QTB test sentences were higher with targeted self-training than with the baseline parser
• Training on gold QTB parses yielded a better reordering score, but gold parses are more expensive to produce than alignments
  • BLEU/human opinion on the resulting translations was not reported