This paper explores statistical approaches to translating collocations between English and French using an aligned corpus, specifically the Hansards. It details an incremental translation process that employs the Dice coefficient to measure correlation, achieving accuracy rates between 65% and 78%. The algorithm preprocesses aligned sentences and identifies sets of words highly correlated with the source collocation to build translations. Future work aims at handling low-frequency collocations, incorporating translation length into the evaluation, and refining the techniques toward a more comprehensive bilingual lexicon.
Word and Phrase Alignment Presenters: Marta Tatu and Mithun Balakrishna
Translating Collocations for Bilingual Lexicons: A Statistical Approach Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou CL-1996
Overview – Champollion • Translates collocations from English into French using an aligned corpus (Hansards) • The translation is constructed incrementally, adding one word at a time • Correlation method: the Dice coefficient • Accuracy between 65% and 78%
The Similarity Measure • Dice coefficient (Dice, 1945): $\mathrm{Dice}(X,Y) = \frac{2\,p(X,Y)}{p(X) + p(Y)}$, where $p(X,Y)$, $p(X)$, and $p(Y)$ are the joint and marginal probabilities of X and Y • If the probabilities are estimated using maximum likelihood, then $\mathrm{Dice}(X,Y) = \frac{2 f_{XY}}{f_X + f_Y}$, where $f_X$, $f_Y$, and $f_{XY}$ are the absolute frequencies of "1"s for X and Y
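A minimal sketch of the maximum-likelihood Dice estimate (the helper name and argument layout are illustrative, not from the paper):

```python
def dice(f_x: int, f_y: int, f_xy: int) -> float:
    """Dice coefficient from absolute frequencies: f_x and f_y count
    the events where X (resp. Y) is "1", f_xy counts joint "1"s."""
    if f_x + f_y == 0:
        return 0.0
    return 2 * f_xy / (f_x + f_y)

# e.g. X fires in 40 sentences, Y in 60, both in 25: dice(40, 60, 25) == 0.5
```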
Algorithm - Preprocessing • Source and target language sentences must be aligned (Gale and Church 1991) • List of collocations to be translated must be provided (Xtract, Smadja 1993)
Algorithm 1/3 • Champollion identifies a set S of k words highly correlated with the source collocation • The target collocation is assumed to lie in the powerset of S • Each such word has a Dice score of at least Td (= 0.10) with the source collocation and appears at least Tf (= 5) times • Form all pairs of words from S • Evaluate the correlation between each pair and the source collocation (Dice)
Algorithm 2/3 • Keep pairs that score above the threshold Td • Construct 3-word elements containing one of the highly correlated pairs plus a member of S • Continue growing the candidates one word at a time • Until, for some n ≤ k, no n-word element scores above the threshold
Algorithm 3/3 • Champollion selects the best translation among the top candidates • In case of ties, the longer collocation is preferred • For multiword translations, determine whether the selected translation is a flexible or a rigid collocation • Are the words used consistently in the same order and at the same distance?
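A schematic sketch of this incremental search; `dice_with_source` is a hypothetical scorer returning the Dice score between a candidate word set and the source collocation, and the candidate generation is simplified:

```python
from itertools import combinations

def champollion_search(S, dice_with_source, Td=0.10):
    """Grow candidate translations one word at a time, keeping any
    word set whose Dice score with the source collocation exceeds Td."""
    surviving = []
    current = [frozenset(p) for p in combinations(S, 2)
               if dice_with_source(frozenset(p)) > Td]
    while current:
        surviving.extend(current)
        # extend each surviving n-word set with one more word from S
        extended = {c | {w} for c in current for w in S if w not in c}
        current = [c for c in extended if dice_with_source(c) > Td]
    if not surviving:
        return None
    # best score wins; ties go to the longer collocation, as on the slide
    return max(surviving, key=lambda c: (dice_with_source(c), len(c)))
```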
Experimental Setup • DB1 = 3.5×10^6 words (8 months of 1986) • DB2 = 8.5×10^6 words (1986 and 1987) • C1 = 300 collocations from DB1 of mid-range frequency • C2 = 300 collocations from 1987 • C3 = 300 collocations from 1988 • Three fluent bilingual speakers judged the translations • Canadian French vs. continental French
Future Work • Translating closed-class words • Tools for the target language • Separating corpus-dependent translations from general ones • Handling low-frequency collocations • Analysis of the effects of the thresholds • Incorporating the length of the translation into the score • Using nonparallel corpora
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Pascal Fung ACL-1995
Goal of the Paper • Create a bilingual lexicon of nouns and proper nouns • From unaligned, noisy parallel texts of Asian/Indo-European language pairs • A pattern matching method
Introduction • Previous research relied on sentence-aligned, parallel texts • Alignment is not always practical • Unclear sentence boundaries in corpora • Noisy text segments present in only one language • Two main steps • Find a small primary bilingual lexicon • Compute a better secondary lexicon from the resulting partially aligned texts
Algorithm • Tag the English half of the parallel text • Focus on nouns and proper nouns (they have consistent translations over the entire text) • The English part is tagged with a modified POS tagger • Find translations for nouns, plural nouns, and proper nouns only
Algorithm • Positional Difference Vectors • Correspondence between a word and its translated counterpart • In their frequency • In their positions • Correspondence need not be linear • Calculation • p – position vector of a word • V – positional difference vector • V[i-1] = p[i] – p[i-1]
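A small sketch of the positional difference vector computation from the slide's definition (the helper name is illustrative):

```python
def positional_difference_vector(p):
    """Given the sorted positions p of a word in the text, return V with
    V[i-1] = p[i] - p[i-1], the gaps between successive occurrences."""
    return [p[i] - p[i - 1] for i in range(1, len(p))]

# e.g. a word at positions [3, 10, 14, 30] yields gaps [7, 4, 16]
```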
Algorithm • Match pairs of positional difference vectors, giving scores • Dynamic Time Warping (Fung & McKeown, 1994) • For non-identical vectors • Trace correspondence between all points in V1 and V2 • No penalty for deletions and insertions • Statistical filters
Dynamic Time Warping • Given V1 and V2, which point in V1 corresponds to which point in V2?
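A compact dynamic time warping sketch under simple assumptions (absolute-difference point cost, no insertion/deletion penalty as the slides note; this is generic DTW, not the paper's exact formulation):

```python
import math

def dtw(v1, v2):
    """Align two positional difference vectors; return the total cost
    and the warping path of index pairs (i, j) matching v1 to v2."""
    n, m = len(v1), len(v2)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    # backtrack the lowest-cost path to recover the correspondence
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]
```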
Algorithm • Finding anchor points and eliminating noise • Run DTW on every selected word pair • Obtain a DTW score and a DTW path for each • Plot the DTW paths of all such word pairs • Keep the highly reliable points as anchors; a point (i, j) that fails the reliability test is discarded as noise
Algorithm • Finding low-frequency bilingual word pairs • Binary vectors over the non-linear segments between anchor points • V1[i] = 1 if the word occurs in the ith segment • A binary vector correlation measure scores candidate pairs
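A sketch of the segment binary vectors; the overlap score below is a Dice-style stand-in, since the slides do not spell out the paper's exact correlation measure:

```python
def binary_vector(occupied_segments, num_segments):
    """V[i] = 1 if the word occurs anywhere in the i-th segment."""
    return [1 if i in occupied_segments else 0 for i in range(num_segments)]

def binary_correlation(v1, v2):
    """Illustrative overlap on binary segment vectors (Dice-style)."""
    both = sum(a & b for a, b in zip(v1, v2))
    total = sum(v1) + sum(v2)
    return 2 * both / total if total else 0.0
```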
Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation Ralf D. Brown TMIMT-1997
Goal of the Paper • Extract a bilingual dictionary • Using an aligned bilingual corpus • Perform tests comparing the performance of PanEBMT using • the Collins Spanish-English dictionary + a WordNet English root/synonym list • various automatically extracted bilingual dictionaries
Extracting Bilingual Dictionary • Extracted from the corpus using • a correspondence table • a thresholding schema • Correspondence Table • A two-dimensional array • Indexed by source-language words on one axis and target-language words on the other • For each sentence pair, every (source word, target word) entry in the cross-product is incremented
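A minimal sketch of filling the correspondence table (counting each word once per sentence is an assumption; the paper may count repeated tokens differently):

```python
from collections import defaultdict
from itertools import product

def build_correspondence_table(sentence_pairs):
    """For every aligned sentence pair, increment each
    (source word, target word) cell in the cross-product."""
    table = defaultdict(int)
    for src_sent, tgt_sent in sentence_pairs:
        for s, t in product(set(src_sent), set(tgt_sent)):
            table[(s, t)] += 1
    return table
```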
Extracting Bilingual Dictionary • Biased toward language pairs with similar word order • Threshold setting • A step function • Unreachably high for co-occurrence < MIN • Constant otherwise • A sliding scale • Starts at 1.0 for co-occurrence = 1 • Slides smoothly down to the MIN threshold value
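An illustrative sliding-scale threshold; the MIN value and the count at which the scale bottoms out are assumptions, since the slide gives only the endpoints of the scheme:

```python
def sliding_threshold(cooccurrence, min_thresh=0.5, max_count=10):
    """Start at 1.0 for a co-occurrence count of 1 and ease linearly
    down to min_thresh at max_count and beyond (parameters assumed)."""
    if cooccurrence >= max_count:
        return min_thresh
    frac = (cooccurrence - 1) / (max_count - 1)
    return 1.0 - frac * (1.0 - min_thresh)
```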
Extracting Bilingual Dictionary • Filtering • A symmetric threshold test • An asymmetric threshold test • Any element of the correspondence table that fails both tests is set to zero • The non-zero elements are added to the dictionary
Extracting Bilingual Dictionary - Errors • High-frequency, error-ridden terms • Shortlist the high-frequency words (all words that appear in at least 20% of source sentences) • Shortlist the sentence pairs containing exactly one or two high-frequency words • Results in zero error for 7 of the 16 words • Merge with the results from the first pass
Experimental Setup • Manually created tokenization – 47 equivalence classes, 880 words and translations of each word • Two test texts • 275 UN corpus sentences: in-domain • 253 newswire sentences: out-of-domain
Extracting Paraphrases from a Parallel Corpus Regina Barzilay and Kathleen R. McKeown ACL-2001
Overview • A corpus-based unsupervised learning algorithm for paraphrase extraction • Lexical paraphrases (single and multi-word) • (refuse, say no) • Morpho-syntactic paraphrases • (king's son, son of the king) • (start to talk, start talking) • Hypothesis: phrases that appear in similar contexts are paraphrases
Data • Multiple English translations of literary texts written by foreign authors • Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc. • 11 translations
Preprocessing • Sentence alignment • Translations of the same source contain a number of identical words • 42% of the words in corresponding sentences are identical (on average) • Dynamic programming (Gale & Church, 1991) • 94.5% correct alignments (on 127 sentences) • A POS tagger and chunker identify NP and VP chunks
Algorithm – Bootstrapping • Co-training method: DLCoTrain (Collins & Singer, 1999) • Similar contexts surrounding two phrases → paraphrase • Good paraphrase-predicting contexts → new paraphrases • Analyze the contexts surrounding identical words in aligned sentence pairs • Use these contexts to learn new paraphrases
Feature Extraction • Paraphrase features • Lexical: the tokens of each phrase in the paraphrase pair • Syntactic: POS tags • Contextual features: left and right syntactic contexts surrounding the paraphrase (POS n-grams) • tried to comfort her → left1 = "VB1 TO2", right1 = "PRP$3" • tried to console her → left2 = "VB1 TO2", right2 = "PRP$3"
Algorithm • Initialization • Identical words are the seeds (positive paraphrasing examples) • Negative examples are created by pairing each word with all the other words in the sentence • Training of the context classifier • Record the contexts (POS n-grams of length ≤ 3) around positive and negative paraphrases • Identify the strong predictors based on their strength and frequency
Algorithm • Keep the most frequent k = 10 contexts with a strength > 95% • Training of the paraphrasing classifier • Using the context rules extracted previously, derive new pairs of paraphrases • When no more paraphrases are discovered, stop
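A self-contained sketch of the predictor selection step; this is one plausible reading of "strength and frequency", and the function and argument names are illustrative:

```python
from collections import Counter

def strong_contexts(pos_counts, neg_counts, k=10, min_strength=0.95):
    """Keep the k most frequent contexts whose strength
    pos / (pos + neg) exceeds min_strength."""
    kept = []
    for ctx, pos in pos_counts.items():
        neg = neg_counts.get(ctx, 0)
        if pos / (pos + neg) > min_strength:
            kept.append((pos, ctx))          # frequency = positive count
    kept.sort(key=lambda t: t[0], reverse=True)
    return [ctx for _, ctx in kept[:k]]

# pos_counts / neg_counts are Counters over context signatures, e.g.
# Counter({("VB TO", "PRP$"): 40}) from positive paraphrase examples
```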
Results • 9483 paraphrases, 25 morpho-syntactic rules • Of 500 sampled: 86.5% (without context) vs. 91.6% (with context) correct paraphrases • 69% recall, evaluated on 50 sentences
Future Work • Extract paraphrases from comparable corpora (news reports about the same event) • Improve the context representation