Parallel Data & Sentence Alignment

Parallel Data & Sentence Alignment Declan Groves, DCU dgroves@computing.dcu.ie

Parallel Corpus • Seen why from week 2 that data-driven Machine Translation (MT), based on using real-world translation examples/data, is now the most prevelant approach • 3 approaches to data-driven MT: • Example Based (EBMT) • Statistical Based (SMT) • Hybrid models (a mix of different approaches; may include non-data driven approaches such as rule-based) which use some probabilistic processing • All need a parallel copus (or bitext) of aligned sentences • Can create the resource manually, otherwise if we have un-aligned bilinugal texts, we can automate the alignment

Automatic Alignment (1/2) • Most alignments are one-to-one: E1: Often, in the textile industry, businesses close their plant in Montreal to move to the Eastern townships. F1: Dans le domaine du textile souvent, dans Montréal, on ferme et on va s’installer dans les Cantons de l’Est. E2: There is no legislation to prevent them from doing so, for it is a matter of internal economy. F2: Il n’ya aucune loi pour empêcher cela, c’est de la r´egie interne. E3: That is serious. F3: C’est grave.

Automatic Alignment (2/2) • But not always: E1: Honourary members opposite scoff at the freeze suggested by this party; to them it is laughable. F1: Les deputés d’en face se moquent du gel que a propose notre parti. F2: Pour eux, c’est une mesure risible.

Automatic Alignment (2/2) • But not always: • So some (like this) have one sentence correspond to two, or more, or none. Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences. E1: Honourary members opposite scoff at the freeze suggested by this party; to them it is laughable. F1: Les deputés d’en face se moquent du gel que a propose notre parti. F2: Pour eux, c’est une mesure risible. { F1 E1 F2

Good Language Models Come from Good Copora (1/2) • Any statistical approach to MT requires the availability of aligned bilingual corpora which are: • large; • good-quality; • representative Class Question • Assmume the following (tiny) corpus: • Q1: what’s P(have) vs. P(has) in a general particular corpus? Which is more likely? • Q2: what’s P(have | John) vs. P(has | John) in a general corpus? • Q3: what’s P(have) vs. P(has) in this corpus? What’s their relative probability? • Q4: what’s P(have | John) vs. P(has | John) in this corpus? * SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Jurafsky & Martin) Mary and John have two children. The children that Mary and John have are aged 3 and 4. John has blue eyes.

Good Language Models Come from Good Copora (2/2) • Assume a different small corpus: Am I right, or am I wrong? Peter and I are seldom wrong. I am sometimes right. Sam and I are often mistaken. • Q5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred?

Good Language Models Come from Good Copora (2/2) • Assume a different small corpus: Am I right, or am I wrong? Peter and I are seldom wrong. I am sometimes right. Sam and I are often mistaken. • Q5: What two generalisations would a probabilistic language model (based on bigrams, say) infer from this data, which are not true of English as a whole? Are there any other generalisations that could be inferred? • Q6: Try to think of some trigrams (and 4-grams, if you can) that cannot be ‘discovered’ by a bigram model? What you’re looking for here is a phrase where thethird (or subsequent) word depends on the first word, which in a bigram model is ‘too far away’... • Note that all the sentences in these corpora are well-formed. If, on the other hand, the corpus contains ill-formed input, then that too will skew our probability models ...

Bilingual Corpora (1/2) • Previous examples, all monolingual, but the same applies w.r.t. bilingual corpora. • One issue that we can think about now is what corpora we’re going to extract our probabilistic language (and translation) models from. What sorts of large, good quality, representative bilingual corpora exist? • Canadian Hansards • Proceedings of the Hong Kong parliament • Dáil Proceedings • … • i.e. while the statistical techniques can be applied to any pair of languages, this approach is currently limited to only a few language pairs.

Bilingual Corpora (1/2) • W.r.t. ‘representativeness’, consider the following from a 9000 word experiment (Brown et al., 1990) on the Canadian Hansards ... • Q1: For what English word are these possible candidate translations? • Q2: What’s “???”: • Beware of sparse data and unrepresentative corpora!! • Ditto poor quality language ... though I’ll come back to this one!

Bilingual Corpora (1/2) • W.r.t. ‘representativeness’, consider the following from a 9000 word experiment (Brown et al., 1990) on the Canadian Hansards ... • Q1: For what English word are these possible candidate translations? • Q2: What’s “???”: • Beware of sparse data and unrepresentative corpora!! • Ditto poor quality language ... though I’ll come back to this one! • If the corpora are small, or of poor quality, or are unrepresentative, then our statistical language models will be poor, so any results we achieve will be poor.

Is the World Wide Web a good corpus to use? • Let’s imagine you want to find the correct translation of the French compound (a word that consists of more than one token/stem) “groupe de travail”. • For each of the two French nouns here, there are several translations: • groupe: cluster, group, grouping, concern, collective • travail: work, labor, labour • If groupe de travail is not in our dictionary (cf. proliferation of compounds), but the two composite nouns are, then any compositional translation is multiply ambiguous. • Now here’s the trick: let’s search for all 15 possible pairs on the WWW. We might find: • labour cluster: 2 • labour concern: 9 • ..... • work group: 66593

Is the World Wide Web a good corpus to use? • There are at least two ways in which we could attempt to resolve the potential ambiguity: • simply take the most often occurring (frequency) term as the translation; • see which candidate translations surpass some threshold (in terms of relative frequency, say) • cf. share price, stock price: assuming these two massively outrank all other candidates, we can keep both as synonymous translations. • So, we can use the WWW to create a very large bilingual lexicon ...

Can we use the WWW to extract parallel corpora? • Lots of pages where you can click on a link to get a version of that page in a different language – how to find these? • Query a search engine for a string “English” ‘not too far away’ from the string “Spanish”. This might get you things like: • Click here for English version vs. Click here for Spanish version But also: • English literature vs. Spanish literature • What then? Need a process to evaluate these candidates. Using something like ‘diff’, you could align each case and manually remove poor candidates:

Can we use the WWW to extract parallel corpora? • Comparison using ‘diff’ • Or automate the comparison fully using string compare routines measured against some % threshold.... sunsite{away}691: diff a b 1c1 < Mary and John have two children. --- > Maria und Johannes haben zwei Kinder. 3c3 < The children that Mary and John have are aged 3 and 4. --- > Die Kinder, die Maria und Johannes haben, sind 3 und 4 Jahre alt. 5c5 < John has blue eyes. --- > Johannes hat blaue Augen.

Sentence Alignment • Manual construction of aligned corpora (which are essential for probabilistic techniques) also avoids the considerable problem of trying to align source and target texts. • But what if we already have bilingual corpora which are not aligned?! Automation!

Automatic Alignment • We’ve just seen one (novel) way of automating this process (using simple string comparison techniques). • What are the problems? • Most alignments are one-to-one ... but some are not, as we saw previously: • Some have one sentence correspond to two, or more, or none. • Or there may be a two-to-two sentence alignment without there being a one-to-one relation between the component sentences. • Let’s look at (some of) the major algorithms for aligning pairs of sentences ...

Gale & Church’s Algorithm (1/4) • Gale & Church (1993): ‘A program for aligning sentences in bilingual corpora’, in Computational Linguistics 19(1):75–102. • All sentence alignment algorithms are premised on the notion that good candidates are not dissimilar with regards to length (i.e. shorter sentences tend to have shorter translations then longer sentences). • G&C use this length-based notion, together with the probability that a typical translation is one of various many-to-many sentence relations (0-1, 1-0, 1-1, 2-1, 1-2, 2-2 etc). • Distance measure: P(match | d), where match = 1-1, 0-1 etc., and d = difference in length • How to calculate text length? • word tokens; • characters. • G&C use characters. No language maps character by character to another language, so first calculate the ratio of chars. in L1 to chars. in L2. e.g. English text = 100,000 chars; Spanish text = 110,000, then scaling ratio = 1.1

Gale & Church’s Algorithm (2/4) • Q: What’s the difference in length between an English text of 50 chars. and a Spanish text of 56 chars given this scaling ratio (1.1)? • Typically, the longer the text, the bigger the difference, so we need to normalize the difference in length to take into account longer texts: where ltand lsare the lengths of the target and source texts respectively, andlm is the average of the two lengths, c is the scaling factor, and s2 is the variance for all values dof in a typical aligned corpus, which G&C suggest is 6.8.

Gale & Church’s Algorithm (3/4) • Terms like P(match | d) are called conditional probabilities • Q: P(throwing at least 7 with 2 die)? • Q: P(throwing at least 7 | 1st dice thrown was a 6)? • Bayes’ Theorem is used to to relate probabilities: P(match | d) = P(d|match).P(match) P(d) • P(d) is a normalizing constant, so can be ignored. P(match) is estimated from correctly aligned corpora • G&C give one-to-one alignments a probability of 0.89, with the rest of the probability space assigned to other possible alignments. P(d| match) has to be estimated (cf. Trujillo, 1999:71).

Gale & Church’s Algorithm (4/4) Comment: • Aligning complete texts is computationally expensive. • Luckily, we can use paragraphs, sections,headings, chapters etc to identify chunks of texts and align these smaller elements. • That is, we’re looking for anchors which can identify text chunks in both languages. Other things can be tags, proper names, acronyms, cognates, figures, dates…(Q: can you think of others?) Results? • G&C report error rates of 4.2% on 1316 alignments. • Most errors occur in non 1-1 alignments. • Selection of the best 80% alignments reduces the error rate to 0.7%. • Interestingly, for European languages at least, the algorithm is not sensitive to different values for c and s2, but ‘noisy’ texts or from very different languages can degrade performance.

Recap • Data-driven (e.g. statistical) MT relies on the availability of parallel data that is: • Of sufficient quantity • Of sufficient (i.e. high) quality • Representative of the language we’re trying to model • Poor quality data = poor quality models • Too little data: data sparseness & poor coverage • Too poor quality: bad quality translations • Not representative: model will make wrong assumptions, produce incorrect translations • Data extracted from the web can be used to create some parallel texts • Best for dictionary extraction • Finding corresponding documents can be difficult • Parallel data needs to be aligned • Document-level alignment • Sentence-level alignment (manual vs automatic, using rel. sentence length ratio) • Gale & Church (1993) character-based sentence alignment

Brown et al.’s Algorithm (1/3) • Brown et al. (1991): ‘Aligning sentences in parallel corpora’, in Proceedings of 29th Annual Meeting of the Association for Computational Linguistics, University of California, Berkeley, pp.169–176. • Brown et al. measure the length of a text in terms of the number of word tokens. • Their method assumes alignments of text fragments are produced by a Hidden Markov Model (HMM), and the correct alignment is that which maximizes the output of the HMM • The HMM of Brown et al. models alignment by determining probabilities for eight types of alignments, or beads: s, t, st, sst, stt, ¶s, ¶t, ¶s ¶t • i.e. sst indicates that two source strings align with one target string (2-1), and ¶s signifies that a source paragraph delimiter matches nothing in the target language (i.e. it is deleted) (1-0).

Brown et al.’s Algorithm (2/3) English text (s) 17s 25s 12s … French text (t) 19t 20t 8t … • The probability of s bead emitting a sentence of length ls = P(ls) • We allow P(ls) = probability of finding sentence of length l in source text • Similarly, probability of a t bead emitted sentence of length lt = P(lt) • We allow P(lt) = probability of finding sentence of length l in target text • For st (i.e. a 1-1 alignment), P(ls,lt) depends on P(ls) and P(lt| ls) • Log of ratio (r) of ltto lsis normally distributed with a mean of and variance thus, if , then , with a constant ensuring that for all values of , 1 st stt

Brown et al.’s Algorithm (3/3) • For stt or sst beads, r is calculated by adding the lengths of the two source (or two target) strings. • The probability of both sentences is calculated by and , respectively • For this method, Brown el al. calculated and based on a sample corpus of 1000 sentences with a maximum length of 81 words. • The methodused anchor points (major: paragraph markers; minor: session points, time stamps, names of speakers…) to guide the alignment process • only consider sections that contain the same number of minor anchor points • Results: Reported error rates for st beads for English-French are 0.9% (very similar to Gale & Church’s approach)

Kay & Röscheisen’s Algorithm (1/2) • Kay & Röscheisen’s (1993): “Text-translation alignment” in Computational Linguistics 19(1):121-142 • Text-based alignment assumes set of anchors, parallel text and access to an English-French dictionary • Two sentences will be aligned iff word translation pairs from bilingual dictionary are found with sufficient frequency between two sentences (relative to size and to other sentences in corpus) • Compare with length based algorithms. Disadvantages: requires bilingual dictionary – always available? • Uses information on relative sentence position for English-French Canadian Hansards. • Previously seen relative sentence length (length-based algorithms)

Kay & Röscheisen’s Algorithm (2/2) • Assume 10 source and 12 target sentences. • Initial estimate alignment maps and • Let’s look at : Assume this aligned with +/- 3 sentences • i.e. is estimated to align with

Kay & Röscheisen’s Algorithm (2/2) • Assume 10 source and 12 target sentences. • Initial estimate alignment maps and • Let’s look at : Assume this aligned with +/- 3 sentences • i.e. is estimated to align with • Produce a set of word alignments by selecting word pairs in currently aligned sentences. • For example: if file occurs in and and fichier occurs in and , then sentence alignments and are made and the alignments become additional anchors. • Caveat: sentence anchors may not cross • e.g. if we have and then can only align with or but nothing after as this is already aligned with • Results: algorithm converges after 4/5 iterations and on first 1000 sentences all but 7 were aligned correctly, but high time/space complexity … file … … file … … fichier … … fichier …

Exact Matches • In Translation Memory (TM) (and EBMT systems) we would like to first look for exact matches for input or source language sentences • Some non-exact matches: Different spelling: Change the color of the font. Change the color of the font. Different punctuation: Open the file and select the text. Open the file, and select the text. Different inflection: Delete the document. Delete the documents. Different numbers: Use version 1.1. Use version 1.2. Different formatting: Click on OK. Click on OK.

Fuzzy Matches • Exact matches often do not occur; TM systems then use “fuzzy” matching: A Fuzzy Match New input segment: The specified operation failed because it requires the Character to be active. Stored TM Unit: EN: The specified language for the Character is not supported on the computer FR: La langue spécifiée pour le Compagnonn’estpas prise en charge par cetordinateur.

Fuzzy Matches • Exact matches often do not occur; TM systems then use “fuzzy” matching: A Fuzzy Match New input segment: The specified operation failed because it requires the Character to be active. Stored TM Unit: EN: The specifiedlanguage for the Character is not supported on the computer FR: La langue spécifiée pour le Compagnonn’estpas prise en charge par cetordinateur.

Fuzzy Matches • Exact matches often do not occur; TM systems then use “fuzzy” matching: Multiple Fuzzy Matches in Ranked Order New input segment: The operation was interrupted because the Character was hidden. Best Match: EN: The operation was interrupted because the Listening key was pressed. FR: L’opération a étéinterrompue car la touched’écout a étéenfoncée.

Fuzzy Matches • Exact matches often do not occur; TM systems then use “fuzzy” matching: Multiple Fuzzy Matches in Ranked Order New input segment: The operation was interrupted because the Character was hidden. Best Match: EN: The operation was interrupted because the Listening key waspressed. FR: L’opération a étéinterrompue car la touched’écout a étéenfoncée.

Fuzzy Matches • Exact matches often do not occur; TM systems then use “fuzzy” matching: Multiple Fuzzy Matches in Ranked Order New input segment: The operation was interrupted because the Character was hidden. Best Match: EN: The operation was interrupted because the Listening key was pressed. FR: L’opération a étéinterrompue car la touched’écout a étéenfoncée. 2nd Best Match: EN: The specified method failed because the Character is hidden. FR: La méthodespécifiée a échoué cat le Compagnonestmasqué 3nd Best Match: EN: The operation was interrupted by the application. FR: L’opération a étéinterrompue par l’application

Subsentential Alignment (1/4) • Words and terms tend to have similar distributions within two texts and can be expected to occur in the same aligned sentences. • Word alignment -> bilingual dictionaries. • e.g. ‘computer’ occurs in 30 English strings and ‘computadora’ in 29 Spanish strings and nowhere else: strong chance they are translations of each other. • Can measure this co-occurrence using Mutual Information: • e.g. given an aligned corpus of 100 strings, if ‘store’ occurs in 30 aligned English strings, ‘speichern’ occurs in 20 aligned German sentences and these alignments coincide in 19 cases, then for this corpus:

Subsentential Alignment (2/4) • Mutual information values: • positive values indicate the occurrence of one word strongly predicts that the other word will occur • A value near 0 indicates that they occur independently • Negative values predict that if one word occurs the other does not • Problem with MI: doesn’t provide probabilities (difficult to use in SMT systems) • Trick: use MI to align source-target words, then if a threshold can be determined empirically, rank them. If a has a higher MI score with then with then is deleted as a less like translation.

Subsentential Alignment (3/4) • Different technique: translation of source word can be found in ‘about the same position’ as the target sentence to which it is aligned, if not then within a fixed distance. • Algorithm: • Estimate word translation and offset (relates to word position) probabilities • Find the most probable alignment for the given text. • Naïve alignment: for a source word located 25% of the way through a source text, its translation is located 25% of the way +/- 20 words. • Ignoring function words (e.g. prepositions, determiners etc.) taken from a stoplist, we might induce the following alignments: A Simple Aligned Text EN: Start the operating system ES:Comenzar el sistemaoperativo EN: Launch the program via the ES: Empezar el programamediante keyboard el telcando

Subsentential Alignment (4/4) • comenzar start • *sistema  start • *empezat  system • telcado  keyboard • Two are right, two are wrong… but their correct translations lie close by • Assuming enough text, we assume the pairings (sistema,system) and (empezar,launch) can be confirmed as correct alignments. This leaves fewer candidates to be aligned. • For European languages at least, we can assume that words which are close together in the source translate as words which are close together in the target. • Similarly, words at the start/end of a source sentence map to words at the start/end of the target sentence. • Other subsentential alignment techniques: • Distance between occurrences of source word mirrors distance between occurrence of a target word. • Confirm aligned word pairs, using a third corpus as reference. A Simple Aligned Text EN: Start the operating system ES:Comenzar el sistemaoperativo EN: Launch the program via the ES: Empezar el programamediante keyboard el telcando

Parallel Data & Sentence Alignment