
Domain Adaptation for Statistical Machine Translation






Presentation Transcript


  1. Domain Adaptation for Statistical Machine Translation University of Macau Master Defense By Longyue WANG, Vincent MT Group, NLP2CT Lab, FST, UM Supervised by Prof. Lidia S. Chao, Prof. Derek F. Wong 20/08/2014

  2. Research Scope • Domain-Specific • Statistical MT Figure 1: Our Research Scope [1] [2]

  3. Agenda • Introduction • Proposed Method I: New Criterion • Proposed Method II: Combination • Proposed Method III: Linguistics • Domain-Specific Online Translator • Conclusion

  4. Part I: Introduction

  5. The First Question What is Statistical Machine Translation?

  6. Statistical Machine Translation • SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3]. • Currently, the most successful approach to SMT is phrase-based SMT, where the smallest translation unit is an n-gram, i.e., a sequence of consecutive words. Figure 2: Phrase-based SMT Framework

  7. Statistical Machine Translation Parallel Corpus Monolingual Corpus • A corpus is a collection of texts, e.g., the IWSLT2012 official corpus. • A bilingual corpus is a collection of texts paired with their translations into another language. A monolingual corpus is in one language (mostly the target side). • Corpora may come from different genres, topics, etc. Figure 2: Phrase-based SMT Framework

  8. Statistical Machine Translation • Word alignments can be mined with the help of the EM algorithm. • Phrase pairs are then extracted from the word alignments to generate the translation table. • The distance-based reordering model is a penalty on changing the position of translated phrases. Word Alignment Translation Table Reordering Model Figure 2: Phrase-based SMT Framework

  9. Statistical Machine Translation • The language model assigns a probability to a sequence of words (n-gram model) [4]. Language Model Figure 2: Phrase-based SMT Framework (1)
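To make the n-gram idea concrete, here is a minimal bigram language model sketch in Python. It uses toy maximum-likelihood estimates with no smoothing, and all names (`train_bigram_lm`, `sentence_prob`) are my own illustration, not code from the thesis:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """MLE bigram model from tokenized sentences (toy sketch, no smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                    # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))    # bigram counts
    def prob(word, prev):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

def sentence_prob(prob, sent):
    """P(sentence) via the chain rule over bigram probabilities."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= prob(word, prev)
    return p

# Tiny toy corpus; a real LM would be trained on the monolingual corpus.
lm = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
```

A real SMT system would use a smoothed higher-order model (e.g., 5-gram with Kneser-Ney), but the chain-rule scoring is the same idea.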

  10. Statistical Machine Translation The decoding function consists of three components: the phrase translation table, which matches foreign phrases to target ones; the reordering model, which reorders the phrases appropriately; and the language model, which ensures the output is fluent. Source Text Decoding Target Text Translation Candidates Searching Figure 2: Phrase-based SMT Framework (2)

  11. The Second Question What is Domain-Specific SMT System?

  12. Typical SMT vs. Domain-Specific SMT • Typical SMT systems are trained on a large and broad corpus (i.e., general-domain) and deal with texts while ignoring their domain. • Performance depends heavily upon the quality and quantity of the training data. • Outputs preserve the semantics of the source side but lack morphological and syntactic correctness. • Understandable translation quality. BBC News Example [5]. Input: Hollywood actor Jackie Chan has apologised over his son's arrest on drug-related charges, saying he feels "ashamed" and "sad". Google Output: 好萊塢影星成龍已經道歉了他兒子的被捕與毒品有關的指控,說他感覺“羞恥”和“悲傷”。

  13. Typical SMT vs. Domain-Specific SMT • Domain-specific SMT systems are trained on a small but relevant corpus (i.e., in-domain) and deal with texts from one specific domain. • They consider the relevance between the training data and what we want to translate (the test data). • Outputs preserve the semantics of the source side with morphological and syntactic correctness. • Publishable quality. Patent Document Example [6] Input: 本发明涉及新的tetramic酸型化合物,它从CCR-5活性复合物中分离出来,在控制条件下通过将生物纯的微生物培养液(球毛壳霉Kunze SCH 1705 ATCC 74489)发酵来制备复合物。[5] ICONIC Translator Output: Novel tetramic acid-type compounds isolated from a CCR-5 active complex produced by fermentation under controlled conditions of a biologically pure culture of the microorganism, Chaetomium globosum Kunze SCH 1705, ATCC 74489, pharmaceutical compositions containing the compounds.

  14. The Third Question What are the Challenges of Domain-Specific Translation?

  15. Challenge 1 – Ambiguity • Word senses may not coincide in a bilingual environment. The English word mouse refers to both an animal and an electronic device, while on the Chinese side these are two distinct words. Choosing the wrong translation variant is a potential cause of miscomprehension. Figure 3: Translation ambiguity example

  16. Challenge 2 – Language Style News Domain • Tries to deliver rich information with very economical language. • Short, simply structured sentences make it easy to understand. • Many abbreviations, dates and named entities. China's Li Duihong won the women's 25-meter sport pistol Olympic gold with a total of 687.9 points early this morning Beijing time. (Guangming Daily, 1996/07/02) 我国女子运动员李对红今天在女子运动手枪决赛中,以687.9环战胜所有对手,并创造新的奥运记录。(《光明日报》 1996年7月2日)

  17. Challenge 2 – Language Style Law Domain • Very rigorous, even at the cost of duplicated terms. • Uses fewer pronouns, abbreviations, etc. to avoid any ambiguity. • High frequency of the words shall, may, must, be to. • Long sentences with long subordinate clauses. When an international treaty that relates to a contract and which the People's Republic of China has concluded or participated in contains provisions differing from the law of the People's Republic of China, the provisions of the said treaty shall be applied, with the exception of clauses to which the People's Republic of China has declared reservation. 中华人民共和国缔结或者参加的与合同有关的国际条约同中华人民共和国法律有不同规定的,适用该国际条约的规定。但是,中华人民共和国声明保留的条款除外。

  18. Challenge 3 – Out-Of-Vocabulary • Terminology: words or phrases that mainly occur in specific contexts with specific meanings. • Term variants, constant growth, combinations, etc. BHT 2,6-二叔丁基-4-甲基苯酚 8.36% 91.64% Figure 4: Out-of-Vocabulary Example

  19. Domain Adaptation • As SMT is corpus-driven, the domain-specificity of the training data with respect to the test data is a significant factor that we cannot ignore. • There is a mismatch between the domain of the available training data and the target domain. • Unfortunately, training resources in specific domains are usually relatively scarce. In such scenarios, various domain adaptation techniques are employed to improve domain-specific translation quality by leveraging general-domain data.

  20. Domain Adaptation for SMT Domain adaptation can be employed in different SMT components: word-alignment model, language model, translation model and reordering model. [6] [7] Model Figure 5: Domain Adaptation Approaches

  21. Domain Adaptation for SMT Various resources can be used for domain adaptation: monolingual corpora, parallel corpora, comparable corpora, and dictionaries. [8] Resources Figure 5: Domain Adaptation Approaches

  22. Domain Adaptation for SMT Considering supervision, domain adaptation approaches can be divided into supervised, semi-supervised and unsupervised. [9] Supervision Figure 5: Domain Adaptation Approaches

  23. My Thesis • Data Selection: solves the ambiguity and language style problems by moving the data distribution of the training corpora toward the target domain. • Domain-Focused Web-Crawling: reduces OOVs by mining in-domain dictionaries, parallel sentences and monolingual sentences from a comparable corpus (the web). Figure 6: My Domain Adaptation Approaches

  24. Part II: Data Selection

  25. Definition Selecting data suitable for the domain at hand from large general-domain corpora, under the assumption that a general corpus is broad enough to contain sentences that are similar to those that occur in the target domain. SMT System … Spoken Domain Figure 7: Data Selection Definition

  26. Framework – TM Adaptation Source Language Target Language Domain Estimation Target Language Source Language We define the set {<Si>, <Ti>, <Si,Ti>} as Vi. MR is an abstract model representing the target domain. Figure 8: My Data Selection Framework

  27. Framework – TM Adaptation Source Language Target Language Source Language Target Language Domain Estimation • Rank sentence pairs according to score. • Select top K% of general-domain data. • K is a tunable threshold. Target Language Source Language Figure 8: My Data Selection Framework
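The rank-and-select step described above can be sketched as follows. This is an illustrative Python fragment under the assumption that each sentence pair has already been scored by some domain-relevance function; all names and the sample data are hypothetical:

```python
def select_top_k(scored_pairs, k_percent):
    """Rank (sentence_pair, score) items by score and keep the top K%."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    cutoff = max(1, int(len(ranked) * k_percent / 100))  # K is a tunable threshold
    return [pair for pair, _ in ranked[:cutoff]]

# Hypothetical pre-scored general-domain sentence pairs.
pairs = [(("src1", "tgt1"), 0.9), (("src2", "tgt2"), 0.2),
         (("src3", "tgt3"), 0.7), (("src4", "tgt4"), 0.4)]
```

The selected subset then serves as the pseudo in-domain corpus for model training.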

  28. Framework – TM Adaptation Source Language Target Language • Translation Model (IN) • Translation Model • (Final) Log-linear /linear Interpolation Source Language Target Language • Translation Model (Pseudo) Domain Estimation Target Language Source Language Figure 8: My Data Selection Framework

  29. Framework – LM Adaptation Target Language • Language Model (IN) Target Language Domain Estimation • Language Model (Pseudo) Log-linear/Linear Interpolation Target Language • Language Model (Final) Figure 8: My Data Selection Framework

  30. Framework – LM Adaptation Figure 8: My Data Selection Framework

  31. Related Work Vector space model (VSM), which converts sentences into term-weighted vectors and then applies a vector similarity function to measure domain relevance. Each sentence Si is represented as a vector (wi1, wi2, …, win), where n is the size of the vocabulary. Standard tf-idf weight: wij = tfij × log(N / dfj), where tfij is the frequency of term j in Si, N is the number of sentences, and dfj is the number of sentences containing term j. Cosine measure: the similarity between two sentences is then defined as the cosine of the angle between their vectors, sim(S1, S2) = (V1 · V2) / (|V1| |V2|). (3) (4) (5)
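A minimal sketch of the VSM criterion, assuming sentence-level tf-idf (each sentence treated as a "document") and the standard cosine measure; function names are my own:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """One sparse tf-idf vector per tokenized sentence."""
    n = len(sentences)
    df = Counter()                       # document frequency per term
    for sent in sentences:
        df.update(set(sent))
    vecs = []
    for sent in sentences:
        tf = Counter(sent)               # term frequency within the sentence
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy sentences; in practice these would be in-domain vs. general-domain.
sents = [["mouse", "device"], ["mouse", "animal"], ["keyboard", "device"]]
vecs = tfidf_vectors(sents)
```

Note that, as the slides observe, this only rewards single overlapping words; word order and collocations are ignored.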

  32. Related Work Perplexity-based model, which employs an n-gram in-domain language model to score the perplexity of each sentence in the general-domain corpus. • Cross-entropy H is the average of the negative logarithm of the word probabilities. • Perplexity pp can be simply obtained as pp = b^H, where b is the base with respect to which the cross-entropy is measured (e.g., bits or nats). • Perplexity and cross-entropy are therefore monotonically related. (6) (7)
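The two quantities can be computed directly from the per-word probabilities an LM assigns to a sentence. A minimal sketch in bits (base 2), with hypothetical names:

```python
import math

def cross_entropy(word_probs):
    """H = average negative log2 probability of the words in a sentence."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """pp = 2 ** H; monotonically related to cross-entropy."""
    return 2 ** cross_entropy(word_probs)
```

Because the two are monotone in each other, ranking general-domain sentences by either gives the same selection order.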

  33. Related Work Until now, there are three perplexity-based variants: • The first, basic one [13]: score(s) = Hin(s). • The second is called Moore-Lewis [14]: score(s) = Hin(s) − Hgen(s), which tries to select the sentences that are most similar to the in-domain data but different from the out-of-domain data. • The third is modified Moore-Lewis [15]: score(s) = [Hin-src(s) − Hgen-src(s)] + [Hin-tgt(s) − Hgen-tgt(s)], which considers both the source and target languages. (8) (9) (10)
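Given sentence-level cross-entropies under the in-domain and general-domain LMs, the three scoring criteria reduce to simple arithmetic (lower score = more in-domain in all three). A sketch with hypothetical function names:

```python
def basic_score(h_in):
    """Basic criterion [13]: in-domain cross-entropy alone."""
    return h_in

def moore_lewis(h_in, h_gen):
    """Moore-Lewis [14]: cross-entropy difference against a general-domain LM."""
    return h_in - h_gen

def modified_moore_lewis(h_in_src, h_gen_src, h_in_tgt, h_gen_tgt):
    """Modified Moore-Lewis [15]: sum the difference over both sides."""
    return (h_in_src - h_gen_src) + (h_in_tgt - h_gen_tgt)
```

A negative Moore-Lewis score means the sentence looks more like the in-domain data than the general data, which is exactly what the method selects for.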

  34. Discussion: Grain Level By reviewing their work, I found • VSM-based methods can obtain about a 1 BLEU point improvement using 60% of the general-domain data [10, 11, 12]. • Perplexity-based approaches allow discarding 50%–99% of the general corpus, resulting in an increase of 1.0–1.8 BLEU points [13, 14, 15, 16, 17].

  35. Discussion: Grain Level • VSM-based similarity is a simple co-occurrence based matching, which only weights single overlapping words. • Perplexity-based similarity considers not only the distribution of terms but also n-gram word collocations. • String-difference can comprehensively consider word overlap, n-gram collocation and word position. Figure 9: Data Selection Pyramid

  36. The First Proposed Method Edit Distance: A New Data Selection Criterion for SMT Domain Adaptation

  37. New Criterion A string-difference metric is a better similarity function [21], with a higher grain level. Edit distance is proposed as a new selection criterion. Given a sentence sG from the general-domain corpus and a sentence sI from the in-domain corpus, the edit distance between these two sequences is defined as the minimum number of edits, i.e., symbol insertions, deletions and substitutions, needed to transform sG into sI. The normalized similarity score (fuzzy matching score, FMS), FMS(sG, sI) = 1 − ED(sG, sI) / max(|sG|, |sI|), is given by Koehn and Senellart [22] in their translation memory work. (11)

  38. New Criterion For each sentence in the general-domain corpus, we traverse all in-domain sentences to calculate the FMS scores and then average them. (12) General-domain Corpus In-domain Corpus • • • Figure 10: Edit-distance based data selection
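The full criterion can be sketched in three steps: word-level Levenshtein distance, the Koehn-Senellart normalization, and the average over the in-domain corpus. This is a naive O(|G|·|I|) sketch for illustration; a real implementation would need pruning to scale, and the function names are my own:

```python
def edit_distance(a, b):
    """Levenshtein distance over word sequences (insert/delete/substitute)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fms(s_g, s_i):
    """Fuzzy matching score: 1 - ED / max sentence length (in [0, 1])."""
    return 1.0 - edit_distance(s_g, s_i) / max(len(s_g), len(s_i))

def avg_fms(s_g, in_domain):
    """Average FMS of one general-domain sentence over all in-domain sentences."""
    return sum(fms(s_g, s_i) for s_i in in_domain) / len(in_domain)
```

General-domain sentences are then ranked by this average score and the top K% kept.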

  39. Experiment: Corpora (Chinese-English) • The general-domain parallel corpus (in-house) includes sentences covering various genres such as movie subtitles, law literature, news and novels. • The in-domain parallel corpus, dev set and test set are randomly selected from the IWSLT2010 Dialog corpus [37], consisting of transcriptions of conversational speech in the travel domain. • We use the parallel corpora for TM training and their target side for LM training. Table 1: Corpora Statistics (English-Chinese)

  40. Experiment: System Setting • Baseline: SMT trained on the entire general-domain corpus; • VSM-based system (VSM): SMT trained on the top K% of the general-domain corpus ranked by the cosine tf-idf metric; • Perplexity-based system (PL): SMT trained on the top K% ranked by the basic cross-entropy metric; • String-difference system (SD): SMT trained on the top K% ranked by the edit-distance metric. We investigate K = {20, 40, 60, 80}% of the ranked general-domain data as the pseudo in-domain corpus for SMT training, where K% means K percent of the general corpus is selected as a subset.

  41. Experiment: Results • All three adaptation methods do better than the baseline. • VSM improves by nearly 1 BLEU point using 80% of the entire data (the most). • PL is a simple but effective method, which increases BLEU by 1.1 points using 60% of the data (less). • SD performs best: it achieves higher BLEU than the other two methods with less data. Table 2: Translation Quality of Adapted Models

  42. Discussion • SD > PL > VSM > Baseline. • Higher-grained similarity metrics perform better than lower-grained ones. • However, the differently grained methods each have their own advantages. • How about combining the individual models? Figure 9: Data Selection Pyramid

  43. The Second Proposed Method A Hybrid Data Selection Model for SMT Domain Adaptation

  44. Combination We investigate the combination of the above three individual models at two levels [23]. • Corpus level: weight the pseudo in-domain sub-corpora selected by the different methods and then join them together. General-domain Corpus VSM Combined Corpus • • • General-domain Corpus ED Figure 11: Combination Approach
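One simple way to realize corpus-level weighting is to replicate each selected sub-corpus by an integer weight before concatenation. The slides do not specify the exact weighting scheme, so this is an assumption for illustration only:

```python
def combine_corpora(subsets, weights):
    """Corpus-level combination: replicate each sub-corpus by an integer
    weight, then concatenate into one training corpus (a simple sketch)."""
    combined = []
    for subset, w in zip(subsets, weights):
        combined.extend(subset * w)   # list repetition = integer weighting
    return combined
```

The combined corpus is then used to train a single translation model as usual.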

  45. Combination • Model level: perform linear interpolation on the translation models trained on the different sub-corpora, where i = 1, 2, 3 denotes the phrase translation probabilities and lexical weights trained on the VSM, perplexity and edit-distance subsets. αi and βi are the tunable interpolation parameters, subject to the constraint that they sum to 1. (13) (14)
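Model-level interpolation of the phrase translation probabilities can be sketched as follows, assuming each model is a mapping from phrase pairs to probabilities and the weights sum to 1 (the same scheme applies to the lexical weights with the βi):

```python
def interpolate_models(models, weights):
    """Linear interpolation of phrase translation probabilities:
    p(e|f) = sum_i alpha_i * p_i(e|f), with the alphas summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "interpolation weights must sum to 1"
    all_phrases = set().union(*models)   # union of phrase pairs seen by any model
    return {ph: sum(w * m.get(ph, 0.0) for m, w in zip(models, weights))
            for ph in all_phrases}
```

A phrase pair missing from one model simply contributes probability 0 from that component.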

  46. Experiment: Corpora (Chinese-English) • The general-domain parallel corpus includes sentences covering various genres such as movie subtitles, law literature, news, novels, etc. • The in-domain parallel corpus, dev set and test set are disjoint sets randomly selected from an LDC corpus [38] (Hong Kong law domain). Table 3: Translation Quality of Adapted Models

  47. Experiment: Corpora (Chinese-English) • The corpus size, data-type distribution and in-domain/general-domain ratio are different from the first experiment. • Data selection performance may therefore differ. • We use the parallel corpora for TM training and their target side for LM training. Table 4: Corpora Statistics

  48. Experiment: System Setting • Baseline: the general-domain baseline (GC-Baseline) is trained on the entire general corpus. • Individual models: cosine tf-idf (Cos), the proposed edit-distance based method (ED) and three perplexity-based variants: cross-entropy (CE), Moore-Lewis (ML) and modified Moore-Lewis (MML). • Combined models: Cos, ED and the best perplexity-based model combined at the corpus level (iCPE-C) and at the model level (iCPE-M). We report selected corpora in steps of 2×, starting from 3.75% of the general corpus: K = {3.75, 7.5, 15, 30, 60}%.

  49. Experiment: Individual Model Results • The perplexity-based variants are all effective methods. • MML performs best: it achieves the highest improvement (nearly 2 BLEU points) with the least data (15%). • MML > ED > CE > ML > Cos > Baseline. Table 5: Translation Quality of Adapted Models
