
Comparable Corpora


Presentation Transcript


  1. Comparable Corpora Kashyap Popat (113050023) Rahul Sharnagat (11305R013)

  2. Outline • Motivation • Introduction: Comparable Corpora • Types of corpora • Methods to extract information from comparable corpora • Bilingual dictionary • Parallel sentences • Conclusion

  3. Motivation • Corpus: the most basic requirement in statistical NLP • Large amounts of bilingual text are available on the web • Bilingual dictionary generation • One-to-one correspondence between words • Parallel corpus generation • One-to-one correspondence between sentences • Parallel corpora are a very rare resource for some language pairs (e.g., Hindi-Chinese)

  4. Comparable corpora[7] • Definition by EAGLES: "A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora." • Characteristics of comparable corpora: • No parallel sentences • No parallel paragraphs • Fewer overlapping terms and words

  5. Spectrum of Corpora • Unrelated corpora → Comparable corpora → Transcription → Parallel corpora (sentence-by-sentence aligned)

  6. A comparable corpus

  7. Application of comparable corpora • Generating bilingual lexical entries (dictionary) • Creating parallel corpora

  8. Generating bilingual lexical entries

  9. Basic postulates[1] • Words with a productive context in one language translate to words with a productive context in the second language, e.g., table → मेज़ • Words with a rigid context translate into words with a rigid context, e.g., Haemoglobin → रक्ताणु • Correlation between co-occurrence patterns in different languages Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus, Fung, 1995

  10. Co-occurrence patterns[4] • If a term A co-occurs with another term B in some text T, then its translation A' also co-occurs with B' (the translation of B) in some other text T' Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

  11. Co-occurrence Histogram[2] • Histogram of co-occurrence counts for the word ‘Debenture’ (figure) Finding terminology translations from non-parallel corpora. Fung, 1997

  12. Basic Approach[3] • Calculate the co-occurrence matrix for all the words in source language L1 and target language L2 • The word order of the L1 matrix is permuted until the resulting pattern is most similar to that of the L2 matrix Identifying word translations in nonparallel texts, Rapp, R., 1995
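The co-occurrence counting underlying this approach can be sketched as follows. This is a minimal illustration, not Rapp's exact setup: the window size, tokenization, and sample sentence are all assumptions.

```python
from collections import Counter

def cooccurrence_matrix(tokens, window=2):
    """Symmetric co-occurrence counts: how often each pair of words
    appears within `window` tokens of each other."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            counts[(w, v)] += 1
            counts[(v, w)] += 1
    return counts

tokens = "the cat sat on the mat".split()
m = cooccurrence_matrix(tokens, window=2)
# m[("the", "cat")] counts how often "the" and "cat" fall in one window
```

The full approach would build one such matrix per language and search for the row/column permutation of the L1 matrix that best matches the L2 matrix, which is what makes it expensive.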

  13. English co-occurrence matrix • L1 matrix

  14. Hindi co-occurrence matrix • L2 matrix

  15. Hindi co-occurrence matrix • L2 matrix: after permutations

  16. Result • Comparing the order of the words in L1 matrix and permuted L2 matrix

  17. Problems • Permuting the co-occurrence matrix is expensive • Size of the vector = number of unique terms in the language

  18. A New Method[2] • Dictionary entries are used as seed words to generate correlation matrices • Algorithm: • A bilingual list of known translation pairs (seed words) is given • Step 1: For every word ‘e’ in L1, find its correlation vector (M1) with every L1 word in the seed list • Step 2: For every word ‘c’ in L2, find its correlation vector (M2) with every L2 word in the seed list • Step 3: Compute correlation(M1, M2); if it is high, ‘e’ and ‘c’ are considered a translation pair Finding terminology translations from non-parallel corpora. Fung, 1997
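The three steps above can be sketched as a toy program. The seed list, window size, sample sentences, and the use of cosine as the correlation measure are illustrative choices, not the paper's exact configuration.

```python
import math

def correlation_vector(word, texts, seeds, window=3):
    """Step 1/2: count co-occurrences of `word` with each seed word
    within a +/-window token distance."""
    vec = [0] * len(seeds)
    for tokens in texts:
        for i, t in enumerate(tokens):
            if t != word:
                continue
            context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
            for j, s in enumerate(seeds):
                vec[j] += context.count(s)
    return vec

def cosine(u, v):
    """Step 3: similarity between two correlation vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Known translation pairs (seed words): flower/फूल, water/पानी
seeds_en = ["flower", "water"]
seeds_hi = ["फूल", "पानी"]
en_texts = [["the", "rose", "is", "a", "flower", "rose", "needs", "water"]]
hi_texts = [["गुलाब", "एक", "फूल", "है", "गुलाब", "को", "पानी", "चाहिए"]]

v1 = correlation_vector("rose", en_texts, seeds_en)
v2 = correlation_vector("गुलाब", hi_texts, seeds_hi)
score = cosine(v1, v2)  # a high score suggests a translation pair
```

Because both vectors are indexed by the seed list, "rose" and "गुलाब" become comparable across languages even though no sentence is parallel.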

  19. Co-occurrence with the seed word list • Example seed pair: Flower ↔ फूल (figure)

  20. Crux of the Algorithm • Two main steps: • Formation of the co-occurrence matrix • Measuring the similarity between vectors • Different methods are possible for each of these two steps • Advantage: the vector size reduces to the number of unique words in the seed list

  21. Improvements • Window size for the co-occurrence calculation[2] • Should it be the same for all words? • Co-occurrence counts • Similarity measure Finding terminology translations from non-parallel corpora. Fung, 1997

  22. Co-occurrence count • Mutual Information (Church & Hanks, 1989) • Conditional Probability (Rapp, 1996) • Chi-Square Test (Dunning, 1993) • Log-likelihood Ratio (Dunning, 1993) • TF-IDF (Fung et al 1998) Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999

  23. Mutual Information[2] (1/2) • k11 = # of segments where both ws and wt occur • k12 = # of segments where only ws occurs • k21 = # of segments where only wt occurs • k22 = # of segments where neither word occurs • Segments: sentences, paragraphs, or string groups delimited by anchor points Finding terminology translations from non-parallel corpora. Fung, 1997

  24. Mutual Information[2] (2/2) • Weighted mutual information
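From the four segment counts, (pointwise) mutual information can be computed as below. The weighted variant shown scales MI by the joint probability so that frequent pairs are not swamped by rare high-MI pairs; treat its exact form as an assumption, since the slide's formula is not reproduced here.

```python
import math

def mutual_information(k11, k12, k21, k22):
    """Pointwise MI of ws and wt from segment counts:
    k11 = both occur, k12 = only ws, k21 = only wt, k22 = neither."""
    n = k11 + k12 + k21 + k22
    p_joint = k11 / n            # Pr(ws, wt)
    p_s = (k11 + k12) / n        # Pr(ws)
    p_t = (k11 + k21) / n        # Pr(wt)
    return math.log2(p_joint / (p_s * p_t))

def weighted_mutual_information(k11, k12, k21, k22):
    """MI weighted by the joint probability (assumed form of the
    weighting, not necessarily Fung's exact definition)."""
    n = k11 + k12 + k21 + k22
    return (k11 / n) * mutual_information(k11, k12, k21, k22)

mi = mutual_information(10, 10, 10, 70)   # log2(0.1 / (0.2 * 0.2))
```
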

  25. Similarity Measures (1/2) • Cosine similarity (Fung and McKeown,1997) • Jaccard similarity (Grefenstette,1994) • Dice similarity (Rapp, 1999) • L1 norm / City block distance (Jones & Furnas, 1987) • L2 norm / Euclidean distance (Fung, 1997) Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999

  26. Similarity Measures (2/2) • L1 norm / City block distance • L2 norm / Euclidean distance • Cosine Similarity • Jaccard Similarity
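The four measures can be written directly over co-occurrence vectors. The Jaccard version below is the set-overlap form over nonzero dimensions; weighted variants over the raw counts also exist, so this is one reasonable reading rather than the slide's exact formulas.

```python
import math

def l1_distance(u, v):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between the two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def jaccard_similarity(u, v):
    """Ratio of shared to total nonzero dimensions (set-overlap form)."""
    su = {i for i, a in enumerate(u) if a}
    sv = {i for i, b in enumerate(v) if b}
    return len(su & sv) / len(su | sv) if su | sv else 0.0
```

Note the first two are distances (lower is more similar) while the last two are similarities (higher is more similar).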

  27. Problems with the approach[5] • Coverage: only a few corpus words are covered by the dictionary • Synonymy / Polysemy: several entries have the same meaning (synonymy), or an entry has several meanings (polysemy) • Similarities w.r.t. synonyms should not be independent • Improvements in the form of geometric approaches • Project the co-occurrence vectors of the source and target words onto the dictionary entries • Measure the similarity between the projected vectors A geometric view on bilingual lexicon extraction from comparable corpora. Gaussier, et al., 2004

  28. Results

  29. Generating parallel corpora

  30. Generating Parallel Corpora • Involves aligning the sentences in the comparable corpora to form a parallel corpus • Ways to do it: • Dictionary matching • Statistical methods

  31. Ways to do alignment • Dictionary matching • If the words in two given sentences are translations of each other, it is most likely that the sentences are translations of each other • The process is very slow • Accuracy is high, but it cannot be applied to a large corpus • Statistical methods • To predict the alignment, these methods use the distribution of sentence lengths in the corpus, measured either in words (Brown et al., 1991) or characters (Gale and Church, 1991) • Make no use of any lexical resources • Fast and accurate

  32. Length based statistical approach • Preprocessing • Segment the text into tokens • Combine the tokens into groups (i.e., sentences) • Find anchor points • Find points in the corpus where we are sure that start and end points in one language align to start and end points in the other language • Finding these points requires analysis of the corpus

  33. Example • Brown et al. already had anchors in their corpus • Used the Canadian parliament proceedings (‘Hansards’) as a parallel corpus • Each proceeding starts with a comment, the time of the proceeding, who was giving the speech, etc. • This information provides the anchor points. Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

  34. Aligning anchor points • Anchor points are not always perfect • Some may be missing • Some may be garbled • To find the alignment between these anchors, a dynamic programming technique is used • We find an alignment of the major anchors in the two corpora with the least total cost
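A least-total-cost alignment by dynamic programming can be sketched as below. The cost function, the skip penalty, and the restriction to 1-1 matches plus skips are simplifying assumptions for illustration, not Brown et al.'s actual model (which also allows 1-2 and 2-1 beads and estimates its parameters by EM).

```python
import math

def length_cost(le, lf, mu=0.0, sigma=0.6):
    """Cost of matching lengths le and lf, assuming the log length
    ratio r = log(lf / le) is roughly normal; mu and sigma are
    placeholder values, not estimated parameters."""
    r = math.log(lf / le)
    return ((r - mu) / sigma) ** 2

SKIP = 5.0  # penalty for leaving an item unaligned (assumed value)

def align(en_lens, fr_lens):
    """Least-cost alignment of two length sequences, allowing 1-1
    matches ("match") and skips on either side."""
    n, m = len(en_lens), len(fr_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + length_cost(en_lens[i], fr_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j, "match")
            if i < n:            # skip an English item
                c = cost[i][j] + SKIP
                if c < cost[i + 1][j]:
                    cost[i + 1][j] = c
                    back[i + 1][j] = (i, j, "skip-e")
            if j < m:            # skip a French item
                c = cost[i][j] + SKIP
                if c < cost[i][j + 1]:
                    cost[i][j + 1] = c
                    back[i][j + 1] = (i, j, "skip-f")
    # Trace back the least-cost path
    path, ij = [], (n, m)
    while back[ij[0]][ij[1]] is not None:
        i, j, op = back[ij[0]][ij[1]]
        path.append(op)
        ij = (i, j)
    return path[::-1]
```

The same skeleton works whether the items being aligned are anchor blocks or individual sentences; only the cost function changes.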

  35. Beads • A higher-level view: the corpus is a sequence of sentence lengths occasionally separated by paragraph markers • Each of these groupings is called a bead • A bead is a type of sentence grouping Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

  36. Beads • Example of beads (figure)

  37. Problem Formulation • Sentences between the anchor points are generated by two random processes • Producing a sequence of beads • Choosing the length of the sentence(s) in each bead • Bead generation can be modeled using a two-state Markov model • One sentence can align to zero, one, or two sentences on the other side • Allows any of the eight beads shown in the previous table

  38. Modeling length of sentence • Model the probability of a sentence's length given its bead • Assumptions: • e-beads and f-beads: the probability of le or lf is the same as its probability in the whole corpus • ef-bead: • English sentence: length le with probability Pr(le) • French sentence: the log ratio of French to English sentence length, r = log(lf / le), is normally distributed with mean µ and variance σ²
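The ef-bead assumption can be made concrete as a match score: compute r = log(lf / le) and score it under a normal density. The values of mu and sigma below are placeholders; Brown et al. estimate the actual parameters from data.

```python
import math

def log_length_ratio(lf, le):
    """r = log(lf / le): log ratio of French to English sentence length."""
    return math.log(lf / le)

def normal_logpdf(r, mu=0.0, sigma=0.6):
    """Log density of r under the assumed normal model N(mu, sigma^2);
    mu and sigma are illustrative, not the estimated parameters."""
    return -0.5 * ((r - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

r = log_length_ratio(21, 20)   # nearly equal lengths, so r is near 0
score = normal_logpdf(r)       # less negative = better length match
```

Under this model, sentence pairs with similar lengths get high scores, which is exactly why length alone is a usable alignment signal.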

  39. Contd. • eef-bead: • English sentences: drawn from Pr(le) • French sentence: r is distributed according to the same normal distribution • eff-bead: the same normal distribution holds, with r = log((lf1 + lf2) / le) • English sentence: drawn from Pr(le) • French sentences: given the sum of the lengths of the French sentences, the probability of a particular pair lf1 and lf2 is proportional to Pr(lf1) Pr(lf2)

  40. Parameter Estimation • Using the EM algorithm, estimate the parameters of the Markov model • The following results were obtained Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

  41. Results • In a random sample of 1000 sentences, only 6 were not translations of each other • Brown et al. also studied the effect of anchor points • According to them: • with paragraph markers but no anchor points, a 2.0% error rate is expected • with anchor points but no paragraph markers, a 2.3% error rate is expected • with neither anchor points nor paragraph markers, a 3.2% error rate is expected

  42. Conclusion • Comparable corpora can be used to generate bilingual dictionaries and parallel corpora • Generating bilingual dictionaries • Polysemy and sense disambiguation remain a major challenge • Generating parallel corpora • Given the anchor points, the aligner is likely to give good results • The experiments were very specific to the corpora used; it is hard to generalize the accuracy figures • Sentence pairs whose lengths make them highly likely to align, but which are completely wrong translations, might confuse the aligner

  43. References • Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. Proceedings of the 3rd Annual Workshop on Very Large Corpora, Boston, Massachusetts, 173-183 • Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202. • R. Rapp (1995). Identifying word translations in nonparallel texts. In: Proceedings of the 33rd Meeting of the Association for Computational Linguistics. Cambridge, Massachusetts, 320-322. • R. Rapp. (1999). Automatic Identification of Word Translations from Unrelated English and German Corpora. Proceedings of the ACL-99. pp. 1–17. College Park, USA.

  44. References • Gaussier, Eric, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Herve Dejean. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 527–534, Barcelona, Spain. • Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91). • http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html
