Statistical Alignment and Machine Translation • Artificial Intelligence Laboratory • 정성원
Contents • Machine Translation • Text Alignment • Length-based methods • Offset alignment by signal processing techniques • Lexical methods of sentence alignment • Word Alignment • Statistical Machine Translation
Different Strategies for MT (1) • Figure: levels at which translation can take place, from bottom to top • Word-for-word: English text (word string) -> French text (word string) • Syntactic transfer: English (syntactic parse) -> French (syntactic parse) • Semantic transfer: English (semantic representation) -> French (semantic representation) • Interlingua (knowledge representation): knowledge-based translation
Different Strategies for MT (2) • Machine translation: an important but hard problem • Why is MT hard? • Word-for-word approach • Lexical ambiguity • Different word order across languages • Syntactic transfer approach • Can solve word-order problems • Suffers from syntactic ambiguity • Semantic transfer approach • Can fix cases of syntactic mismatch • Output can still be unnatural or unintelligible • Interlingua
MT & Statistical Methods • In theory, each of the arrows in the previous figure can be implemented with a probabilistic model. • Most MT systems are a mix of probabilistic and non-probabilistic components. • Text alignment • Used to create lexical resources such as bilingual dictionaries and parallel grammars, which improve the quality of MT • Statistical NLP has produced more work on text alignment than on MT itself.
Text Alignment • Parallel texts or bitexts • The same content is available in several languages • Official documents of countries with multiple official languages -> translations tend to be literal and consistent • Alignment • Paragraph to paragraph, sentence to sentence, word to word • Uses of aligned text • Bilingual lexicography • Machine translation • Word sense disambiguation • Multilingual information retrieval • Tools that assist translators
Aligning sentences and paragraphs(1) • Problems • Not always one sentence to one sentence • Reordering • Large pieces of material can disappear • Methods • Length-based vs. lexical-content-based • Matching corresponding points vs. forming sentence beads
Aligning sentences and paragraphs(3) • Bead: an n:m grouping of sentences • S, T : texts in two languages • S = (s1, s2, … , si) • T = (t1, t2, … , tj) • Bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2 … • Each sentence can occur in only one bead • No crossing alignments • Figure: the sentences s1 … si of S and t1 … tj of T grouped into consecutive beads b1, b2, … , bk
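The two bead constraints above (each sentence in exactly one bead, no crossing) can be checked mechanically. The following is a minimal sketch, assuming a bead is stored as a pair of sentence-index lists; the `Bead` type and the function name `is_valid_alignment` are illustrative and do not come from the slides.

```python
from typing import List, Tuple

# Assumed representation: (source sentence indices, target sentence indices).
Bead = Tuple[List[int], List[int]]


def is_valid_alignment(beads: List[Bead], n_source: int, n_target: int) -> bool:
    """Return True if the beads use every sentence exactly once, in order."""
    src = [i for s, _ in beads for i in s]
    tgt = [j for _, t in beads for j in t]
    # Reading the beads left to right must enumerate 0..n-1 on both sides.
    # This enforces "one bead per sentence" and "no crossing" at the same time.
    return src == list(range(n_source)) and tgt == list(range(n_target))


# A 1:1 bead, a 2:1 bead, a 0:1 bead and a 1:2 bead over 4 source / 5 target sentences.
example = [([0], [0]), ([1, 2], [1]), ([], [2]), ([3], [3, 4])]
print(is_valid_alignment(example, 4, 5))
```

Representing beads by index lists keeps the check independent of the sentence text, so the same function works for any alignment method discussed later.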
Dynamic Programming(2) • Computing the shortest path
Length-based methods • Rationale • Short sentence -> short sentence • Long sentence -> long sentence • Ignores the richer information in the text, but is quite effective • Length • # of words or # of characters • Pros • Efficient (for similar languages) • Fast
Gale and Church (1) • Find the alignment A ( S, T : parallel texts ) • Decompose the aligned texts into a sequence of aligned beads (B1,…Bk) • The method • Length of source and translation sentences measured in characters • Suited to similar languages and literal translations • Used for the Union Bank of Switzerland (UBS) corpus • English, French, German • Aligned at the paragraph level
Gale and Church (2) • D(i,j) : the lowest cost alignment between sentences s1,…,si and t1,…,tj
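D(i,j) can be filled in by dynamic programming: each cell takes the cheapest way of ending with one more bead. The sketch below assumes the six bead types usually attributed to Gale and Church (1:1, 1:0, 0:1, 2:1, 1:2, 2:2) and takes the bead cost as a parameter. `align`, `BEAD_TYPES` and `toy_cost` are illustrative names, and `toy_cost` is a placeholder, not the cost function of the method.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed bead types (source sentences, target sentences) per bead.
BEAD_TYPES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]
Cost = Callable[[Sequence[int], Sequence[int]], float]


def align(src_lens: Sequence[int], tgt_lens: Sequence[int], cost: Cost):
    """Return (lowest total cost, beads) for sentences given by their lengths."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Tuple[int, int]]] = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in BEAD_TYPES:
                if i >= di and j >= dj and D[i - di][j - dj] < inf:
                    # D(i,j) = min over bead types of D(i-di, j-dj) + cost(bead)
                    c = D[i - di][j - dj] + cost(src_lens[i - di:i], tgt_lens[j - dj:j])
                    if c < D[i][j]:
                        D[i][j] = c
                        back[i][j] = (di, dj)
    # Follow the back-pointers from (n, m) to (0, 0) to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return D[n][m], beads


def toy_cost(src: Sequence[int], tgt: Sequence[int]) -> float:
    """Placeholder cost: length mismatch plus a flat penalty for non-1:1 beads."""
    penalty = 0 if (len(src), len(tgt)) == (1, 1) else 1
    return abs(sum(src) - sum(tgt)) + penalty


total, beads = align([10, 20, 30], [10, 50], toy_cost)
print(total, beads)
```

With the toy lengths above, the second and third source sentences together match the second target sentence, so the cheapest path uses one 1:1 bead followed by one 2:1 bead.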
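Gale and Church (3) • Figure: two candidate alignments of the L1 sentences s1 … s4 with the L2 sentences t1 … t3, and the total cost of each • Alignment using 1:1 and 1:0 beads: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3)) • Alignment using a 2:1 bead: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))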
Gale and Church (4) • l1, l2 : the lengths in characters of the sentences of each language in the bead • Ratio of character lengths between the two languages • Modeled as a normal distribution ~ N(μ, s²) • Average 4% error rate • 2% error rate for 1:1 alignments
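One way to turn the normal-distribution assumption into a bead cost is to standardize the length difference, δ = (l2 − l1·μ) / sqrt(l1·s²), and charge the negative log of the two-sided tail probability of δ. The sketch below assumes exactly that. The default values `mu=1.0` and `s2=6.8` are illustrative assumptions that are not given in the slides, the bead-type prior is left out, and beads with l1 = 0 are not handled.

```python
import math


def delta(l1: int, l2: int, mu: float = 1.0, s2: float = 6.8) -> float:
    """Standardized length difference; mu and s2 are assumed, not from the slides."""
    if l1 <= 0:
        raise ValueError("this sketch needs l1 > 0; 0:1 beads need separate handling")
    return (l2 - l1 * mu) / math.sqrt(l1 * s2)


def length_cost(l1: int, l2: int, mu: float = 1.0, s2: float = 6.8) -> float:
    """Negative log of the two-sided tail probability of delta under N(0, 1)."""
    d = abs(delta(l1, l2, mu, s2))
    # 2 * (1 - Phi(d)) equals erfc(d / sqrt(2)) for the standard normal CDF Phi.
    p = math.erfc(d / math.sqrt(2.0))
    # Clamp to avoid log(0) for extreme mismatches.
    return -math.log(max(p, 1e-300))


print(length_cost(100, 105), length_cost(100, 130))
```

Sentence pairs whose lengths follow the expected ratio get a cost near zero, and the cost grows as the mismatch grows, which is what the dynamic program needs in order to prefer plausible beads.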
Other Research • Brown et al.(1991c) • Corpus : Canadian Hansard(English, French) • Method : Comparing sentence lengths in words rather than characters • Goal : produce an aligned subset of the corpus • Feature : EM algorithm • Wu(1994) • Corpus : Hong Kong Hansard(English, Cantonese) • Method : Gale and Church(1993) method • Result : the assumptions of the method are not as clearly met when dealing with unrelated languages • Feature : uses lexical cues
Offset alignment by signal processing techniques • Showing roughly what offset in one text aligns with what offset in the other. • Church(1993) • Background : noisy text(OCR output) • Method • Cognates defined at the level of character sequences -> true cognates + proper names + numbers • Dot plot method(character 4-grams) • Result : very small error rate • Drawbacks • Does not work across different character sets • Fails when there are no or extremely few identical character sequences
DOT-PLOT • Figure: dot plots built from character unigrams and from character bigrams
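The idea behind a dot plot can be sketched in a few lines: place a dot at (x, y) whenever the same character n-gram starts at offset x in one text and offset y in the other. This is a simplified illustration under that assumption, not a reproduction of the procedure in Church(1993), which adds weighting and signal processing steps. The function name `dot_plot` is hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


def dot_plot(source: str, target: str, n: int = 4) -> List[Tuple[int, int]]:
    """Return (x, y) offsets where the same character n-gram starts in both texts."""
    # Index every n-gram of the target text by its start offsets.
    positions = defaultdict(list)
    for y in range(len(target) - n + 1):
        positions[target[y:y + n]].append(y)
    # Look up each n-gram of the source text in that index.
    dots = []
    for x in range(len(source) - n + 1):
        for y in positions.get(source[x:x + n], []):
            dots.append((x, y))
    return dots


print(dot_plot("abcdef", "xxabcd", n=4))
```

With small n (unigrams, bigrams) almost everything matches and the plot is dense; longer n-grams thin the dots out so that matching passages show up as diagonal runs.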
Fung and McKeown • Conditions • Without having found sentence boundaries • In only roughly parallel texts • With unrelated languages • Corpus : English and Cantonese • Method : • Arrival vector • Small bilingual dictionary • A word's offsets : (1,263,267,519) => arrival vector : (262,4,252). • Choose English, Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment • Strong signal in a line along the diagonal in the dot plot => good alignment
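The arrival vector in the example above is simply the list of gaps between consecutive occurrences of a word. A minimal sketch, with `arrival_vector` as an illustrative name:

```python
from typing import List, Sequence


def arrival_vector(offsets: Sequence[int]) -> List[int]:
    """Differences between consecutive word offsets."""
    return [later - earlier for earlier, later in zip(offsets, offsets[1:])]


print(arrival_vector([1, 263, 267, 519]))  # the offsets used on the slide
```

Because the vector records gaps rather than absolute positions, a word and its translation can have similar vectors even when the two texts are only roughly parallel.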
Lexical methods of sentence alignment(1) • Align beads of sentences in robust ways using lexical information • Kay and Röscheisen(1993) • Features : lexical cues, a process of convergence • Algorithm • Set initial anchors • Until most sentences are aligned • Form an envelope of possible alignments • Choose pairs of words that tend to co-occur in these potential partial alignments • Find pairs of source and target sentences which contain many possible lexical correspondences.
Lexical methods of sentence alignment(2) • 96% coverage after four passes on Scientific American articles • 7 errors after 5 passes on 1000 Hansard sentences • Drawbacks • Computationally intensive • Pillow-shaped envelope => problems when text has been moved or deleted
Lexical methods of sentence alignment(3) • Chen(1993) • Similar to the model of Gale and Church(1993) • A simple translation model is used to estimate the cost of an alignment. • Corpus • Canadian Hansard, European Economic Community proceedings (millions of sentences) • Estimated error rate : 0.4 % • Most of the errors are due to the sentence boundary detection method => no further improvement
Lexical methods of sentence alignment(4) • Haruno and Yamazaki(1996) • Align structurally different languages. • A variant of Kay and Röscheisen(1993) • Do lexical matching on content words only • POS tagger • To align short texts, use an online dictionary • Knowledge-rich approach • The combined methods • Good results even on short texts between very different languages
Word Alignment • Uses • Terminology databases, bilingual dictionaries • Methods • Text alignment -> word alignment • χ2 measure • EM algorithm • Use of existing bilingual dictionaries
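The χ2 measure can be computed from a 2x2 table that counts, over aligned sentence pairs, how often a source word and a target word occur together or apart. The sketch below assumes the standard closed form for a 2x2 table; the function name `chi_square_2x2` and the counts in the usage line are made up for illustration.

```python
def chi_square_2x2(o11: int, o12: int, o21: int, o22: int) -> float:
    """Chi-square statistic for a 2x2 contingency table.

    o11: aligned pairs containing both words
    o12: pairs with the source word only
    o21: pairs with the target word only
    o22: pairs with neither word
    """
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # A word that never (or always) occurs carries no association signal.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


# Hypothetical counts: two words that always appear in the same aligned pairs.
print(chi_square_2x2(10, 0, 0, 10))
```

A large value indicates that the two words co-occur far more often than chance would predict, which makes the pair a candidate entry for a bilingual dictionary.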
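Statistical Machine Translation(1) • Noisy channel model in MT • Language model : P(e) • Translation model : P(f|e) • Decoder : ê = arg max_e P(e|f) • Figure: the language model generates e with probability P(e), the translation model turns e into f with probability P(f|e), and the decoder recovers ê from f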
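The three components fit together through Bayes' rule. Because P(f) is fixed for a given French sentence f, the decoder only needs the product of the language model and the translation model:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        = \arg\max_{e} P(e)\, P(f \mid e)
```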
Statistical Machine Translation(2) • Translation model • Compute P(f|e) by summing the probabilities of all alignments • e : English sentence • l : the length of e in words • f : French sentence • m : the length of f in words • fj : word j in f • aj : the position in e that fj is aligned with • eaj : the word in e that fj is aligned with • P(wf|we) : translation probability • Z : normalization constant • Figure: alignment links from each word fj in f to the word eaj in e
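Summing over all alignments looks expensive, since there are (l+1)^m of them when position 0 stands for an empty English word. If each French word is assumed to depend only on the English word it is aligned with, the sum of products can be rewritten as a product of sums. The sketch below checks this on a toy example. The table `t`, the `NULL` token, the default `Z=1.0` and both function names are assumptions made for illustration.

```python
from itertools import product
from math import isclose, prod
from typing import Dict, Sequence, Tuple

NULL = "<null>"  # assumed token for position 0 (the empty English word)
Table = Dict[Tuple[str, str], float]  # (wf, we) -> P(wf|we)


def p_f_given_e_bruteforce(f: Sequence[str], e: Sequence[str], t: Table, Z: float = 1.0) -> float:
    """Sum over every alignment a = (a1, ..., am) of the product of P(fj | e_aj)."""
    e0 = [NULL] + list(e)
    total = 0.0
    for a in product(range(len(e0)), repeat=len(f)):
        total += prod(t.get((fj, e0[aj]), 0.0) for fj, aj in zip(f, a))
    return total / Z


def p_f_given_e_factored(f: Sequence[str], e: Sequence[str], t: Table, Z: float = 1.0) -> float:
    """Same quantity, computed as a product over j of a sum over positions."""
    e0 = [NULL] + list(e)
    return prod(sum(t.get((fj, ei), 0.0) for ei in e0) for fj in f) / Z


# Toy translation probabilities (made up).
t = {
    ("la", "the"): 0.7, ("maison", "the"): 0.3,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
    ("la", NULL): 0.5, ("maison", NULL): 0.5,
}
slow = p_f_given_e_bruteforce(["la", "maison"], ["the", "house"], t)
fast = p_f_given_e_factored(["la", "maison"], ["the", "house"], t)
print(slow, fast, isclose(slow, fast))
```

The factored form needs on the order of l·m lookups instead of (l+1)^m products, which is what makes this kind of model trainable on real corpora.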
Statistical Machine Translation(3) • Decoder • Search space is infinite => stack search • Translation probability : P(wf|we) • Assume that we have a corpus of aligned sentences. • Estimated with the EM algorithm
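The EM estimation of the translation probabilities can be sketched as follows. This is a simplified estimator in the style of IBM Model 1, which is an assumption about the model the slide has in mind: it has no empty word, no distortion and no fertility. `train_translation_probs` and the two toy sentence pairs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[Sequence[str], Sequence[str]]  # (English words, French words)


def train_translation_probs(pairs: List[Pair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(wf, we)] = P(wf|we) from aligned sentence pairs with EM."""
    french_vocab = {wf for _, f in pairs for wf in f}
    uniform = 1.0 / len(french_vocab)
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: split each French word's count over the English words
        # of the same sentence pair, in proportion to the current t.
        for e, f in pairs:
            for wf in f:
                norm = sum(t[(wf, we)] for we in e)
                for we in e:
                    fraction = t[(wf, we)] / norm
                    count[(wf, we)] += fraction
                    total[we] += fraction
        # M-step: renormalize the expected counts for each English word.
        t = defaultdict(float, {(wf, we): c / total[we] for (wf, we), c in count.items()})
    return t


pairs = [(["the", "house"], ["la", "maison"]), (["the", "flower"], ["la", "fleur"])]
t = train_translation_probs(pairs)
print(t[("la", "the")], t[("maison", "house")])
```

In the toy corpus "the" co-occurs with "la" in both pairs, so EM gradually shifts probability mass toward P(la|the), which in turn frees "house" to explain "maison".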
Statistical Machine Translation(4) • Problems • Distortion • Fertility : the number of French words one English word generates. • Experiment • 48% of French sentences were decoded correctly • Incorrect decodings • Ungrammatical decodings
Statistical Machine Translation(5) • Detailed Problems • model problems • Fertility is asymmetric • Independence assumption • Sensitivity to training data • Efficiency • lack of linguistic knowledge • No notion of phrase • Non-local dependencies • Morphology • Sparse data problems