
Identifying Translations

Philip Resnik, Noah Smith (University of Maryland)


Presentation Transcript


  1. Identifying Translations Philip Resnik, Noah Smith University of Maryland

  2. Reasons to identify translations • Locating parallel text on the Web • Filtering out poor quality translations • Cross-language duplicate detection/caching

  3. Identifying translations using structure: STRAND (Resnik, 1999)

     Comparison       N     %      κ
     J1, J2           267   0.98   0.95
     J1, STRAND       273   0.88   0.70
     J2, STRAND       315   0.88   0.69
     J1J2, STRAND     261   0.90   0.75

  4. Related Work • Web mining for parallel text (Nie et al. 1999) • Sentence alignment (Fluhr et al. 2000) • Duplicate detection (e.g. Broder et al. 1997)

  5. Translational Equivalence as a Function over Sets • Broder et al. (1997): Document representation as a set of “shingles” S(D), with resemblance $r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}$ • Cross-language generalization: partial equality with confidence value t(e,f); t is used to define $\cap$ and $\cup$
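
As a rough illustration of the monolingual starting point on this slide, the sketch below computes a Broder-style resemblance between two documents as the overlap of their shingle sets. The function names, the whitespace tokenizer, and the shingle width `w` are assumptions made for this example and are not taken from the slides; the cross-language version with t(e,f) is not shown.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token shingles in text.

    Tokenization by lowercasing and whitespace splitting is an assumption
    of this sketch, not something the slides specify.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A document shorter than w tokens yields a single short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(d1, d2, w=4):
    """Size of the shingle-set intersection divided by the size of the union."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)


if __name__ == "__main__":
    print(resemblance("a b c d", "a b c e", w=2))
```

With `w=2`, the two toy documents share two of four distinct shingles, so the printed resemblance is 0.5.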

  6. Ways of computing equivalence • Bilingual dictionaries • t(e,f) = 1 if (e,f) present in dictionary, 0 otherwise • Translation model (Melamed 2000, model A) • t(e,f) = Pr(e,f) • String similarity for cognates • t(e,f) = Longest common substring ratio (LCSR) variant • Trained on non-zero entries in translation model
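
Two of the scoring options above can be sketched in a few lines. The slide does not specify which LCSR variant was used or how it was trained, so this example assumes the definition commonly associated with the name: the length of the longest common subsequence divided by the length of the longer string. The function names and the toy dictionary are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Assumed LCSR: LCS length over the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def dictionary_t(e, f, dictionary):
    """t(e,f) = 1 if the pair is present in the bilingual dictionary, else 0."""
    return 1.0 if (e, f) in dictionary else 0.0


if __name__ == "__main__":
    toy_dictionary = {("house", "maison")}  # hypothetical entry for illustration
    print(dictionary_t("house", "maison", toy_dictionary))
    print(lcsr("abcd", "abed"))
```

The dictionary score is all-or-nothing, while the string-similarity score gives graded credit to cognate-like pairs that no dictionary lists.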

  7. Evaluation task • Given segmented corpus C1 in L1, C2 in L2 • Assume each segment has 0 or 1 translation equivalents • Match up the equivalents • Equivalent to maximum bipartite matching problem • Exhaustive solution available for small sets • Approximated using competitive linking (Melamed) • True equivalence pairs give precision/recall curve
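
The greedy approximation mentioned on this slide can be sketched as follows, assuming competitive linking means: repeatedly link the highest-scoring remaining pair and remove both segments from further consideration. The score matrix, the `min_score` threshold, and the tie-breaking rule are assumptions of this example rather than details from the slides.

```python
def competitive_linking(scores, min_score=0.0):
    """Greedy one-to-one matching over a score matrix.

    scores[i][j] is the equivalence score between segment i of C1 and
    segment j of C2. Pairs scoring at or below min_score are never linked,
    so a segment may end up with no translation equivalent.
    """
    candidates = [
        (s, i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s > min_score
    ]
    # Highest score first; ties broken by index (an arbitrary choice here).
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    used_i, used_j = set(), set()
    links = []
    for s, i, j in candidates:
        if i in used_i or j in used_j:
            continue
        links.append((i, j, s))
        used_i.add(i)
        used_j.add(j)
    return links


if __name__ == "__main__":
    toy_scores = [[0.9, 0.8],
                  [0.85, 0.1]]
    print(competitive_linking(toy_scores))
    print(competitive_linking(toy_scores, min_score=0.5))
```

On the toy matrix the greedy pass links (0, 0) first and is then left with (1, 1), even though pairing (0, 1) with (1, 0) has a higher total score, which shows why this is an approximation to the exhaustive solution. Raising `min_score` trades recall for precision, which is one way to trace out a precision/recall curve.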

  8. Some results: sentence matching • Task corpora: • Chinese-English: Hong Kong Laws sentences • 5622 training sentences, 191 test sentences • Spanish-English: U.N. Parallel Corpus • 4695 training sentences, 200 test sentences [Figure panels: English-Chinese, English-Spanish]

  9. Some results: document matching • Task corpora: • 232 English-French Web documents

  10. New directions • Exploiting the Internet Archive • 100-200 million pages (4TB) on disk • Exhaustive URL matching within site • STRAND now adapted for disk-based access • Combining structure and content • Improving document-level matching • Selecting good chunks within documents
