
Multilingual Information Retrieval

Doug Oard, College of Information Studies and UMIACS, University of Maryland, College Park, USA.



Presentation Transcript


  1. Multilingual Information Retrieval • Doug Oard, College of Information Studies and UMIACS, University of Maryland, College Park, USA • AFIRM

  2. Global Trade • USA, EU, China, Japan, Hong Kong, South Korea • Source: Wikipedia (mostly 2017 estimates)

  3. Most Widely-Spoken Languages • Source: Ethnologue (SIL), 2018

  4. Global Internet Users Web Pages

  5. What Does “Multilingual” Mean? • Mixed-language document • Document containing more than one language • Mixed-language collection • Collection of documents in different languages • Multi-monolingual systems • Can retrieve from a mixed-language collection • Cross-language system • Query in one language finds document in another • (Truly) multilingual system • Queries can find documents in any language

  6. A Story in Two Parts • IR from the ground up in any language • Focusing on document representation • Cross-Language IR • To the extent time allows

  7. [Diagram] Query → Representation Function → Query Representation • Documents → Representation Function → Document Representation → Index • Query Representation + Index → Comparison Function → Hits

  8. ASCII • American Standard Code for Information Interchange • ANSI X3.4-1968

     | 0 NUL | 32 SPACE | 64 @ | 96 ` |
     | 1 SOH | 33 ! | 65 A | 97 a |
     | 2 STX | 34 " | 66 B | 98 b |
     | 3 ETX | 35 # | 67 C | 99 c |
     | 4 EOT | 36 $ | 68 D | 100 d |
     | 5 ENQ | 37 % | 69 E | 101 e |
     | 6 ACK | 38 & | 70 F | 102 f |
     | 7 BEL | 39 ' | 71 G | 103 g |
     | 8 BS | 40 ( | 72 H | 104 h |
     | 9 HT | 41 ) | 73 I | 105 i |
     | 10 LF | 42 * | 74 J | 106 j |
     | 11 VT | 43 + | 75 K | 107 k |
     | 12 FF | 44 , | 76 L | 108 l |
     | 13 CR | 45 - | 77 M | 109 m |
     | 14 SO | 46 . | 78 N | 110 n |
     | 15 SI | 47 / | 79 O | 111 o |
     | 16 DLE | 48 0 | 80 P | 112 p |
     | 17 DC1 | 49 1 | 81 Q | 113 q |
     | 18 DC2 | 50 2 | 82 R | 114 r |
     | 19 DC3 | 51 3 | 83 S | 115 s |
     | 20 DC4 | 52 4 | 84 T | 116 t |
     | 21 NAK | 53 5 | 85 U | 117 u |
     | 22 SYN | 54 6 | 86 V | 118 v |
     | 23 ETB | 55 7 | 87 W | 119 w |
     | 24 CAN | 56 8 | 88 X | 120 x |
     | 25 EM | 57 9 | 89 Y | 121 y |
     | 26 SUB | 58 : | 90 Z | 122 z |
     | 27 ESC | 59 ; | 91 [ | 123 { |
     | 28 FS | 60 < | 92 \ | 124 | |
     | 29 GS | 61 = | 93 ] | 125 } |
     | 30 RS | 62 > | 94 ^ | 126 ~ |
     | 31 US | 63 ? | 95 _ | 127 DEL |

  9. The Latin-1 Character Set • ISO 8859-1 8-bit characters for Western Europe • French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English • Chart legend: Printable Characters, 7-bit ASCII; Additional Defined Characters, ISO 8859-1

  10. Other ISO-8859 Character Sets • Charts shown for parts -2, -3, -4, -5, -6, -7, -8, and -9

  11. East Asian Character Sets • More than 256 characters are needed • Two-byte encoding schemes (e.g., EUC) are used • Several countries have unique character sets • GB in the People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam • Many characters appear in several languages • Research Libraries Group developed EACC • Unified “CJK” character set for USMARC records

  12. Unicode • Single code for all the world’s characters • ISO Standard 10646 • Separates “code space” from “encoding” • Code space extends Latin-1 • The first 256 positions are identical • UTF-7 encoding will pass through email • Uses only the 64 printable ASCII characters • UTF-8 encoding is designed for disk file systems
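
The split between code space and encoding can be seen directly in a few lines of Python. This is a minimal sketch that assumes Python 3's built-in `latin-1`, `utf-8`, and `utf-7` codecs; the helper name `encodings_of` is made up for illustration and is not from the slides.

```python
def encodings_of(ch):
    """Return the code point of one character and its bytes under three encodings."""
    return {
        "code_point": ord(ch),            # position in the Unicode code space
        "latin-1": ch.encode("latin-1"),  # one byte; only defined for code points below 256
        "utf-8": ch.encode("utf-8"),      # one to four bytes per character
        "utf-7": ch.encode("utf-7"),      # restricted to 7-bit ASCII bytes
    }


info = encodings_of("é")
for key, value in info.items():
    print(key, value)
```

The code point stays the same in every case; only the byte sequence changes with the encoding. For "é" the Latin-1 byte value equals the code point, which illustrates the shared first 256 positions.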

  13. Limitations of Unicode • Produces larger files than Latin-1 • Fonts may be hard to obtain for some characters • Some characters have multiple representations • e.g., accents can be part of a character or separate • Some characters look identical when printed • But they come from unrelated languages • Encoding does not define the “sort order”
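
Two of these limitations are easy to demonstrate with Python's standard `unicodedata` module. The sketch below assumes NFC normalization is an acceptable way to reconcile the two accent representations before indexing; the choice of NFC and the example characters are illustrative assumptions.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed character
decomposed = "e\u0301"  # e followed by a combining acute accent


def normalize(text):
    """Map equivalent character sequences to one canonical (NFC) form."""
    return unicodedata.normalize("NFC", text)


# The two strings render the same but compare unequal until normalized.
same_before = composed == decomposed
same_after = normalize(composed) == normalize(decomposed)

# Characters that look identical in print can be distinct code points.
latin_a, cyrillic_a = "A", "\u0410"
names = [unicodedata.name(c) for c in (latin_a, cyrillic_a)]

print(same_before, same_after, names)
```

Normalizing both documents and queries the same way avoids missed matches from the first problem. The look-alike problem is not solved by normalization, since the two letters are different characters.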

  14. Strings and Segments • Retrieval is (often) a search for concepts • But what we actually search are character strings • What strings best represent concepts? • In English, words are often a good choice • Well-chosen phrases might also be helpful • In German, compounds may need to be split • Otherwise queries using constituent words would fail • In Chinese, word boundaries are not marked • Thissegmentationproblemissimilartothatofspeech

  15. Tokenization • Words (from linguistics): • Morphemes are the units of meaning • Combined to make words • Anti (disestablishmentarian) ism • Tokens (from computer science) • Doug ’s running late !
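
A token split like the one on this slide can be approximated with a single regular expression. This is only a rough sketch: the pattern below is an assumption chosen to reproduce this one example, not a general-purpose tokenizer.

```python
import re

# Words, then clitics such as 's (straight or curly apostrophe), then single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|['’]\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Doug’s running late!")
print(tokens)
```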

  16. Morphological Segmentation • Swahili Example • Credit: Ramy Eskander

  17. Morphological Segmentation • Somali Example • Credit: Ramy Eskander

  18. Stemming • Conflates words, usually preserving meaning • Rule-based suffix-stripping helps for English • {destroy, destroyed, destruction}: destr • Prefix-stripping is needed in some languages • Arabic: {alselam}: selam [Root: SLM (peace)] • Imperfect: goal is to usually be helpful • Overstemming: • {centennial, century, center}: cent • Understemming: • {acquire, acquiring, acquired}: acquir • {acquisition}: acquis • Snowball: rule-based system for making stemmers
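
The understemming example can be reproduced with a toy suffix stripper. This is a deliberately tiny sketch, not the Porter or Snowball rules: the suffix list and the minimum stem length are assumptions picked to match the slide's example.

```python
# Toy suffix list, checked in order (longest first). Illustrative only.
SUFFIXES = ("ition", "ing", "ed", "e")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = {w: toy_stem(w) for w in ["acquire", "acquiring", "acquired", "acquisition"]}
print(stems)
```

The three verb forms conflate to `acquir`, while `acquisition` ends up as `acquis`, so a query on one would miss documents containing the other.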

  19. Longest Substring Segmentation • Greedy algorithm based on a lexicon • Start with a list of every possible term • For each unsegmented string • Remove the longest single substring in the list • Repeat until no substrings are found in the list

  20. Longest Substring Example • Possible German compound term (!): • washington • List of German words: • ach, hin, hing, sei, ton, was, wasch • Longest substring segmentation • was-hing-ton • Roughly translates as “What tone is attached?”
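
The greedy procedure and the `washington` example can be sketched as a short recursive function. The tie-breaking rule (prefer the leftmost match among equally long words) and keeping unmatched leftovers as their own segments are assumptions; the slides do not specify either.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split text around the longest lexicon word it contains."""
    if not text:
        return []
    best = None  # (start index, word)
    for word in lexicon:
        if not word:
            continue
        start = text.find(word)
        if start < 0:
            continue
        if (best is None or len(word) > len(best[1])
                or (len(word) == len(best[1]) and start < best[0])):
            best = (start, word)
    if best is None:
        return [text]  # nothing in the lexicon matches; keep the piece whole
    start, word = best
    left = longest_substring_segment(text[:start], lexicon)
    right = longest_substring_segment(text[start + len(word):], lexicon)
    return left + [word] + right


german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_words)
print("-".join(segments))
```

On this input the longest matching word is `hing`, which leaves `was` and `ton` on either side, reproducing the over-eager split shown on the slide.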

  21. oil petroleum probe survey take samples probe survey take samples cymbidium goeringii oil petroleum restrain

  22. Probabilistic Segmentation • For an input string c1 c2 c3 … cn • Try all possible partitions into words w1 w2 w3 … • [c1] [c2 c3 … cn] • [c1 c2] [c3 … cn] • [c1] [c2] [c3 … cn] • etc. • Choose the highest-probability partition • Compute Pr(w1 w2 w3 …) using a language model • Challenges: search, probability estimation
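
One way to handle the search challenge is dynamic programming instead of enumerating every partition. The sketch below assumes a unigram language model (word probabilities are independent), which is a simplification; the probability table is invented for illustration.

```python
import math


def best_segmentation(text, word_probs, max_word_len=20):
    """Return the highest-probability split of text under a unigram model, or None."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    back = [0] * (n + 1)                # start index of the last word in that best split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(word_probs[word])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return None  # no partition uses only known words
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical unigram probabilities, for illustration only.
probs = {"this": 0.2, "his": 0.05, "t": 0.001, "segment": 0.05,
         "at": 0.1, "ion": 0.01, "segmentation": 0.1, "problem": 0.1}
result = best_segmentation("thissegmentationproblem", probs)
print(result)
```

Competing partitions such as `t his …` or `segment at ion` are all considered, but each extra low-probability word lowers the product, so the three-word split wins.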

  23. Non-Segmentation: N-gram Indexing • Consider a Chinese document c1 c2 c3 … cn • Don’t segment (you could be wrong!) • Instead, treat every character bigram as a term: c1 c2, c2 c3, c3 c4, …, cn-1 cn • Break up queries the same way
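
Overlapping character bigrams take only a list comprehension. In this sketch the overlap count is just a stand-in for a real ranking function, and the example strings are made up.

```python
def char_ngrams(text, n=2):
    """Return every overlapping character n-gram in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(query, document, n=2):
    """Count distinct n-grams shared by query and document (a crude match signal)."""
    return len(set(char_ngrams(query, n)) & set(char_ngrams(document, n)))


query_terms = char_ngrams("信息检索")
shared = ngram_overlap("信息检索", "中文信息检索系统")
print(query_terms, shared)
```

Because the query is broken up the same way as the document, a match needs no decision about where words begin or end.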

  24. A “Term” is Whatever You Index • Word sense • Token • Word • Stem • Character n-gram • Phrase

  25. Summary • A term is whatever you index • So the key is to index the right kind of terms! • Start by finding fundamental features • We have focused on character coded text • Same ideas apply to handwriting, OCR, and speech • Combine characters into easily recognized units • Words where possible, character n-grams otherwise • Apply further processing to optimize results • Stemming, phrases, …

  26. A Story in Two Parts • IR from the ground up in any language • Focusing on document representation • Cross-Language IR • To the extent time allows

  27. Query-Language CLIR • [Diagram] Somali Document Collection → Translation System → English Document Collection → Retrieval Engine • English queries → Retrieval Engine → Results (select, examine)

  28. Document-Language CLIR • [Diagram] English queries → Translation System → Somali queries → Retrieval Engine • Somali Document Collection → Retrieval Engine → Results: Somali documents (select, examine)

  29. Query vs. Document Translation • Query translation • Efficient for short queries (not relevance feedback) • Limited context for ambiguous query terms • Document translation • Rapid support for interactive selection • Need only be done once (if query language is same)

  30. Indexing Time: Statistical Document Translation

  31. Language-Neutral Retrieval Somali Query Terms Query “Translation” English Document Terms “Interlingual” Retrieval 1: 0.91 2: 0.57 3: 0.36 Document “Translation”

  32. Translation Evidence • Lexical Resources • Phrase books, bilingual dictionaries, … • Large text collections • Translations (“parallel”) • Similar topics (“comparable”) • Similarity • Similar writing (if the character set is the same) • Similar pronunciation • People • May be able to guess topic from lousy translations

  33. Types of Lexical Resources • Ontology • Organization of knowledge • Thesaurus • Ontology specialized to support search • Dictionary • Rich word list, designed for use by people • Lexicon • Rich word list, designed for use by a machine • Bilingual term list • Pairs of translation-equivalent terms

  34. Full Query Named entities added Named entities from term list Named entities removed

  35. Backoff Translation • Lexicon might contain stems, surface forms, or some combination of the two • Matching stages (document term → lexicon entry): surface form → surface form; stem → surface form; surface form → stem; stem → stem • Examples (document term | translation lexicon entry): mangez | mangez - eat • mangez | mange - eats • mange | mangez - eat • mangez | mangent - eat
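
The four combinations named on this slide (surface form or stem on the document side, against surface form or stem on the lexicon side) can be tried in order until one matches. This is a sketch under stated assumptions: the stage order is inferred from the slide labels, and `toy_french_stem` is a made-up two-rule stemmer that only covers the slide's examples.

```python
def toy_french_stem(word):
    """Hypothetical stemmer: maps mangez and mangent to mange, leaves other words alone."""
    for suffix in ("ent", "ez"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + "e"
    return word


def backoff_translate(term, lexicon, stem=toy_french_stem):
    """Return (stage name, translations) from the first stage that matches."""
    # Index the lexicon a second time by the stem of each source-side entry.
    stemmed_lexicon = {}
    for source, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(source), []).extend(translations)
    term_stem = stem(term)
    stages = (
        ("surface-to-surface", term, lexicon),
        ("stem-to-surface", term_stem, lexicon),
        ("surface-to-stem", term, stemmed_lexicon),
        ("stem-to-stem", term_stem, stemmed_lexicon),
    )
    for name, key, table in stages:
        if key in table:
            return name, list(table[key])
    return None, []


print(backoff_translate("mangez", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mange": ["eats"]}))
print(backoff_translate("mange", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mangent": ["eat"]}))
```

Trying exact surface matches first keeps precision high; the later stages trade some precision for coverage when the lexicon and the document use different inflections.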

  36. [Image: the Rosetta Stone] Hieroglyphic Egyptian • Demotic • Greek

  37. Types of Bilingual Corpora • Parallel corpora: translation-equivalent pairs • Document pairs • Sentence pairs • Term pairs • Comparable corpora: topically related • Collection pairs • Document pairs

  38. Some Modern Rosetta Stones • News: • DE-News (German-English) • Hong-Kong News, Xinhua News (Chinese-English) • Government: • Canadian Hansards (French-English) • Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish) • UN Treaties (Russian, English, Arabic, …) • Religion: • Bible, Koran, Book of Mormon

  39. Word-Level Alignment • English: Diverging opinions about planned tax reform • German: Unterschiedliche Meinungen zur geplanten Steuerreform • English: Madam President , I had asked the administration … • Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

  40. A Translation Model • From word-aligned bilingual text, we induce a translation model • Example: p(探测|survey) = 0.4, p(试探|survey) = 0.3, p(测量|survey) = 0.25, p(样品|survey) = 0.05

  41. Using Multiple Translations • Weighted Structured Query Translation • Takes advantage of multiple translations and translation probabilities • TF and DF of query term e are computed using TF and DF of its translations:
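
The formula itself did not survive in this transcript. The sketch below assumes one common formulation, in which the TF and DF of a query term are probability-weighted sums over its translations: TF(e, d) = Σ p(f|e)·TF(f, d) and DF(e) = Σ p(f|e)·DF(f). The translation probabilities are the ones from the translation model example; the document and collection counts are invented.

```python
def weighted_tf(translation_probs, doc_tf):
    """TF of a query term in one document, as a weighted sum over its translations."""
    return sum(p * doc_tf.get(f, 0) for f, p in translation_probs.items())


def weighted_df(translation_probs, collection_df):
    """DF of a query term, as a weighted sum over its translations."""
    return sum(p * collection_df.get(f, 0) for f, p in translation_probs.items())


# p(f | survey), taken from the translation model example.
survey = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical counts, for illustration only.
doc_tf = {"探测": 2, "测量": 4}
collection_df = {"探测": 100, "试探": 50, "测量": 200, "样品": 400}

tf_survey = weighted_tf(survey, doc_tf)
df_survey = weighted_df(survey, collection_df)
print(tf_survey, df_survey)
```

The resulting fractional TF and DF values can then be fed to any ranking function that accepts real-valued statistics.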

  42. BM-25 term frequency document frequency document length
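
The slide's exact BM25 variant is not recoverable from this transcript. As a sketch, the function below uses a widely used form with an always-positive IDF, and `k1=1.2`, `b=0.75` as conventional default values; all of these choices are assumptions.

```python
import math


def bm25_term_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 contribution of one term, from term frequency, document frequency, and document length."""
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


def bm25_score(query_terms, doc_tf, df, doc_len, avg_doc_len, num_docs):
    """Sum the BM25 weights of all query terms for one document."""
    return sum(
        bm25_term_weight(doc_tf.get(t, 0), df.get(t, 0), doc_len, avg_doc_len, num_docs)
        for t in query_terms
    )


# Hypothetical statistics, for illustration only.
low = bm25_term_weight(tf=1, df=125, doc_len=300, avg_doc_len=300, num_docs=10000)
high = bm25_term_weight(tf=3, df=125, doc_len=300, avg_doc_len=300, num_docs=10000)
long_doc = bm25_term_weight(tf=3, df=125, doc_len=900, avg_doc_len=300, num_docs=10000)
print(low, high, long_doc)
```

The weight grows with term frequency but saturates, and the same term frequency counts for less in a longer document. Nothing here requires integer inputs, so fractional TF and DF values are accepted as well.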

  43. Retrieval Effectiveness CLEF French

  44. Bilingual Query Expansion • [Diagram] source language query → Source Language IR (source language collection) → expanded source language query → Query Translation → Target Language IR (target language collection) → expanded target language terms → results • Pre-translation expansion • Post-translation expansion

  45. Query Expansion Effect Paul McNamee and James Mayfield, SIGIR-2002

  46. Cognate Matching • Dictionary coverage is inherently limited • Translation of proper names • Translation of newly coined terms • Translation of unfamiliar technical terms • Strategy: model derivational translation • Orthography-based • Pronunciation-based

  47. Matching Orthographic Cognates • Retain untranslatable words unchanged • Often works well between European languages • Rule-based systems • Even off-the-shelf spelling correction can help! • Subword (e.g., character-level) MT • Trained using a set of representative cognates
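
Matching an untranslated term against similarly spelled words in the target vocabulary can be sketched with the standard library's `difflib.SequenceMatcher`. The similarity measure, the `0.8` threshold, and the example vocabulary are all assumptions chosen for illustration; a trained character-level model would replace this in practice.

```python
from difflib import SequenceMatcher


def best_cognate(term, vocabulary, threshold=0.8):
    """Return the most similarly spelled vocabulary word, or the term unchanged."""
    best, best_score = None, 0.0
    for candidate in vocabulary:
        score = SequenceMatcher(None, term.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    # Retain the untranslatable word unchanged when nothing is close enough.
    return best if best_score >= threshold else term


english_vocabulary = ["parliament", "parchment", "department"]
match = best_cognate("parlament", english_vocabulary)
fallback = best_cognate("xyz", english_vocabulary)
print(match, fallback)
```

This only helps when both languages share a writing system; otherwise the comparison has to happen in phonetic space instead.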

  48. Matching Phonetic Cognates • Forward transliteration • Generate all potential transliterations • Reverse transliteration • Guess source string(s) that produced a transliteration • Match in phonetic space

  49. Cross-Language “Retrieval” • [Diagram] Query → Query Translation → Translated Query → Search → Ranked List

  50. Query Uses of “MT” in CLIR Term Translation Term Matching Query Formulation Translated Query Snippet Translation Query Translation Indicative Translation Search Ranked List Informative Translation Selection Document Examination Document Query Reformulation Use
