This paper explores statistical approaches to translating collocations between English and French using an aligned corpus, specifically the Hansards. It details an incremental translation process that employs the Dice coefficient to measure correlation, achieving accuracy rates between 65% and 78%. The algorithm preprocesses aligned sentences and identifies sets of words highly correlated with the source collocation to build translations. Future work aims at handling low-frequency collocations, incorporating translation length into the evaluation, and refining the techniques toward a more comprehensive bilingual lexicon.
Word and Phrase Alignment Presenters: Marta Tatu and Mithun Balakrishna
Translating Collocations for Bilingual Lexicons: A Statistical Approach Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou CL-1996
Overview – Champollion • Translates collocations from English into French using an aligned corpus (Hansards) • The translation is constructed incrementally, adding one word at a time • Correlation method: the Dice coefficient • Accuracy between 65% and 78%
The Similarity Measure • Dice coefficient (Dice, 1945): $\mathrm{Dice}(X,Y) = \frac{2\,p(X,Y)}{p(X) + p(Y)}$, where $p(X,Y)$, $p(X)$, and $p(Y)$ are the joint and marginal probabilities of X and Y • If the probabilities are estimated using maximum likelihood, then $\mathrm{Dice}(X,Y) = \frac{2 f_{XY}}{f_X + f_Y}$, where $f_X$, $f_Y$, and $f_{XY}$ are the absolute frequencies of "1"s for X and Y
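A minimal sketch of the maximum-likelihood Dice estimate (the helper name and argument layout are illustrative, not from the paper):

```python
def dice(f_x: int, f_y: int, f_xy: int) -> float:
    """Dice coefficient from absolute frequencies: f_x and f_y count
    the events where X (resp. Y) is "1", f_xy counts joint "1"s."""
    if f_x + f_y == 0:
        return 0.0
    return 2 * f_xy / (f_x + f_y)

# e.g. X fires in 40 sentences, Y in 60, both in 25: dice(40, 60, 25) == 0.5
```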
Algorithm - Preprocessing • Source and target language sentences must be aligned (Gale and Church 1991) • List of collocations to be translated must be provided (Xtract, Smadja 1993)
Algorithm 1/3 • Champollion identifies a set S of k words highly correlated with the source collocation • The target collocation is assumed to lie in the powerset of S • Each such word has a Dice score of at least Td (= 0.10) with the source collocation and appears at least Tf (= 5) times • Form all pairs of words from S • Evaluate the correlation between each pair and the source collocation (Dice)
Algorithm 2/3 • Keep pairs that score above the threshold Td • Construct 3-word elements containing one of the highly correlated pairs plus a member of S • Continue growing the candidates one word at a time • Until, for some n ≤ k, no n-word element scores above the threshold
Algorithm 3/3 • Champollion selects the best translation among the top candidates • In case of ties, the longer collocation is preferred • For multiword translations, determine whether the selected translation is a flexible or a rigid collocation • Are the words used consistently in the same order and at the same distance?
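A schematic sketch of this incremental search; `dice_with_source` is a hypothetical scorer returning the Dice score between a candidate word set and the source collocation, and the candidate generation is simplified:

```python
from itertools import combinations

def champollion_search(S, dice_with_source, Td=0.10):
    """Grow candidate translations one word at a time, keeping any
    word set whose Dice score with the source collocation exceeds Td."""
    surviving = []
    current = [frozenset(p) for p in combinations(S, 2)
               if dice_with_source(frozenset(p)) > Td]
    while current:
        surviving.extend(current)
        # extend each surviving n-word set with one more word from S
        extended = {c | {w} for c in current for w in S if w not in c}
        current = [c for c in extended if dice_with_source(c) > Td]
    if not surviving:
        return None
    # best score wins; ties go to the longer collocation, as on the slide
    return max(surviving, key=lambda c: (dice_with_source(c), len(c)))
```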
Experimental Setup • DB1 = 3.5×10^6 words (8 months of 1986) • DB2 = 8.5×10^6 words (1986 and 1987) • C1 = 300 collocations from DB1 of mid-range frequency • C2 = 300 collocations from 1987 • C3 = 300 collocations from 1988 • Three fluent bilingual speakers judged the translations • Canadian French vs. continental French
Future Work • Translating closed-class words • Tools for the target language • Separating corpus-dependent translations from general ones • Handling low-frequency collocations • Analysis of the effects of the thresholds • Incorporating the length of the translation into the score • Using nonparallel corpora
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Pascal Fung ACL-1995
Goal of the Paper • Create a bilingual lexicon of nouns and proper nouns • From unaligned, noisy parallel texts of Asian/Indo-European language pairs • A pattern matching method
Introduction • Previous research relied on sentence-aligned, parallel texts • Alignment is not always practical • Unclear sentence boundaries in corpora • Noisy text segments present in only one language • Two main steps • Find a small primary bilingual lexicon • Compute a better secondary lexicon from the resulting partially aligned texts
Algorithm • Tag the English half of the parallel text • Focus on nouns and proper nouns (they have consistent translations over the entire text) • The English part is tagged with a modified POS tagger • Find translations for nouns, plural nouns, and proper nouns only
Algorithm • Positional Difference Vectors • Correspondence between a word and its translated counterpart • In their frequency • In their positions • Correspondence need not be linear • Calculation • p – position vector of a word • V – positional difference vector • V[i-1] = p[i] – p[i-1]
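A small sketch of the positional difference vector computation from the slide's definition (the helper name is illustrative):

```python
def positional_difference_vector(p):
    """Given the sorted positions p of a word in the text, return V with
    V[i-1] = p[i] - p[i-1], the gaps between successive occurrences."""
    return [p[i] - p[i - 1] for i in range(1, len(p))]

# e.g. a word at positions [3, 10, 14, 30] yields gaps [7, 4, 16]
```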
Algorithm • Match pairs of positional difference vectors, giving scores • Dynamic Time Warping (Fung & McKeown, 1994) • For non-identical vectors • Trace correspondence between all points in V1 and V2 • No penalty for deletions and insertions • Statistical filters
Dynamic Time Warping • Given V1 and V2, which point in V1 corresponds to which point in V2?
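A compact dynamic time warping sketch under simple assumptions (absolute-difference point cost, no insertion/deletion penalty as the slides note; this is generic DTW, not the paper's exact formulation):

```python
import math

def dtw(v1, v2):
    """Align two positional difference vectors; return the total cost
    and the warping path of index pairs (i, j) matching v1 to v2."""
    n, m = len(v1), len(v2)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    # backtrack the lowest-cost path to recover the correspondence
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]
```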
Algorithm • Finding anchor points and eliminating noise • Run DTW on every selected word pair • Obtain a DTW score and a DTW path for each • Plot the DTW paths of all such word pairs • Keep the highly reliable points as anchors; a point (i, j) that fails the reliability test is discarded as noise
Algorithm • Finding low-frequency bilingual word pairs • Binary vectors over the non-linear segments between anchor points • V1[i] = 1 if the word occurs in the ith segment • A binary vector correlation measure scores candidate pairs
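A sketch of the segment binary vectors; the overlap score below is a Dice-style stand-in, since the slides do not spell out the paper's exact correlation measure:

```python
def binary_vector(occupied_segments, num_segments):
    """V[i] = 1 if the word occurs anywhere in the i-th segment."""
    return [1 if i in occupied_segments else 0 for i in range(num_segments)]

def binary_correlation(v1, v2):
    """Illustrative overlap on binary segment vectors (Dice-style)."""
    both = sum(a & b for a, b in zip(v1, v2))
    total = sum(v1) + sum(v2)
    return 2 * both / total if total else 0.0
```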
Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation Ralf D. Brown TMIMT-1997
Goal of the Paper • Extract a bilingual dictionary • Using an aligned bilingual corpus • Perform tests comparing the performance of PanEBMT using • the Collins Spanish-English dictionary + a WordNet English root/synonym list • various automatically extracted bilingual dictionaries
Extracting Bilingual Dictionary • Extracted from the corpus using • a correspondence table • a thresholding schema • Correspondence Table • A two-dimensional array • Indexed by source-language words on one axis and target-language words on the other • For each sentence pair, every (source word, target word) entry in the cross-product is incremented
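A minimal sketch of filling the correspondence table (counting each word once per sentence is an assumption; the paper may count repeated tokens differently):

```python
from collections import defaultdict
from itertools import product

def build_correspondence_table(sentence_pairs):
    """For every aligned sentence pair, increment each
    (source word, target word) cell in the cross-product."""
    table = defaultdict(int)
    for src_sent, tgt_sent in sentence_pairs:
        for s, t in product(set(src_sent), set(tgt_sent)):
            table[(s, t)] += 1
    return table
```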
Extracting Bilingual Dictionary • Biased toward language pairs with similar word order • Threshold setting • A step function • Unreachably high for co-occurrence < MIN • Constant otherwise • A sliding scale • Starts at 1.0 for co-occurrence = 1 • Slides smoothly down to the MIN threshold value
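An illustrative sliding-scale threshold; the MIN value and the count at which the scale bottoms out are assumptions, since the slide gives only the endpoints of the scheme:

```python
def sliding_threshold(cooccurrence, min_thresh=0.5, max_count=10):
    """Start at 1.0 for a co-occurrence count of 1 and ease linearly
    down to min_thresh at max_count and beyond (parameters assumed)."""
    if cooccurrence >= max_count:
        return min_thresh
    frac = (cooccurrence - 1) / (max_count - 1)
    return 1.0 - frac * (1.0 - min_thresh)
```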
Extracting Bilingual Dictionary • Filtering • A symmetric threshold test • An asymmetric threshold test • Any element of the correspondence table that fails both tests is set to zero • The non-zero elements are added to the dictionary
Extracting Bilingual Dictionary - Errors • High-frequency, error-ridden terms • Shortlist the high-frequency words (all words that appear in at least 20% of source sentences) • Shortlist the sentence pairs containing exactly one or two high-frequency words • Results in zero error for 7 of the 16 words • Merge with the results from the first pass
Experimental Setup • Manually created tokenization – 47 equivalence classes, 880 words and translations of each word • Two test texts • 275 UN corpus sentences: in-domain • 253 newswire sentences: out-of-domain
Extracting Paraphrases from a Parallel Corpus Regina Barzilay and Kathleen R. McKeown ACL-2001
Overview • A corpus-based unsupervised learning algorithm for paraphrase extraction • Lexical paraphrases (single and multi-word) • (refuse, say no) • Morpho-syntactic paraphrases • (king's son, son of the king) • (start to talk, start talking) • Hypothesis: phrases that appear in similar contexts are paraphrases
Data • Multiple English translations of literary texts written by foreign authors • Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc. • 11 translations
Preprocessing • Sentence alignment • Translations of the same source contain a number of identical words • 42% of the words in corresponding sentences are identical (on average) • Dynamic programming (Gale & Church, 1991) • 94.5% correct alignments (on 127 sentences) • A POS tagger and chunker identify NP and VP chunks
Algorithm – Bootstrapping • Co-training method: DLCoTrain (Collins & Singer, 1999) • Similar contexts surrounding two phrases → paraphrase • Good paraphrase-predicting contexts → new paraphrases • Analyze the contexts surrounding identical words in aligned sentence pairs • Use these contexts to learn new paraphrases
Feature Extraction • Paraphrase features • Lexical: the tokens of each phrase in the paraphrase pair • Syntactic: POS tags • Contextual features: left and right syntactic contexts surrounding the paraphrase (POS n-grams) • tried to comfort her → left1 = "VB1 TO2", right1 = "PRP$3" • tried to console her → left2 = "VB1 TO2", right2 = "PRP$3"
Algorithm • Initialization • Identical words are the seeds (positive paraphrasing examples) • Negative examples are created by pairing each word with all the other words in the sentence • Training of the context classifier • Record the contexts (POS n-grams of length ≤ 3) around positive and negative paraphrases • Identify the strong predictors based on their strength and frequency
Algorithm • Keep the most frequent k = 10 contexts with a strength > 95% • Training of the paraphrasing classifier • Using the context rules extracted previously, derive new pairs of paraphrases • When no more paraphrases are discovered, stop
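A self-contained sketch of the predictor selection step; this is one plausible reading of "strength and frequency", and the function and argument names are illustrative:

```python
from collections import Counter

def strong_contexts(pos_counts, neg_counts, k=10, min_strength=0.95):
    """Keep the k most frequent contexts whose strength
    pos / (pos + neg) exceeds min_strength."""
    kept = []
    for ctx, pos in pos_counts.items():
        neg = neg_counts.get(ctx, 0)
        if pos / (pos + neg) > min_strength:
            kept.append((pos, ctx))          # frequency = positive count
    kept.sort(key=lambda t: t[0], reverse=True)
    return [ctx for _, ctx in kept[:k]]

# pos_counts / neg_counts are Counters over context signatures, e.g.
# Counter({("VB TO", "PRP$"): 40}) from positive paraphrase examples
```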
Results • 9483 paraphrases, 25 morpho-syntactic rules • Of 500 sampled: 86.5% (without context) vs. 91.6% (with context) correct paraphrases • 69% recall, evaluated on 50 sentences
Future Work • Extract paraphrases from comparable corpora (news reports about the same event) • Improve the context representation