Cross-Language Retrieval

Cross-Language Retrieval. LBSC 708A/CMSC 838L. Philip Resnik and Douglas W. Oard. Session 9, November 13, 2001.



Presentation Transcript


  1. Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001

  2. Agenda • Questions • Overview • The information • The users • Cross-Language Search • User Interaction

  3. The Grand Plan • Phase 1: What makes up an IR system? • perspectives on the elephant • Phase 2: Representations • words, ratings • Phase 3: Beyond English text • ideas applied in many settings

  4. A Driving Example • Visual History Foundation • Interviews with Holocaust survivors • 39 years’ worth of audio/video • 32 languages; accented, emotional speech • 30 people, 2 years: $12 million • Joint project: MALACH • VHF, IBM, JHU, UMD • http://www.clsp.jhu.edu/research/malach

  5. Information Access Information Use Translingual Search Translingual Browsing Translation Select Examine Query Document

  6. A Little (Confusing) Vocabulary • Multilingual document • Document containing more than one language • Multilingual collection • Collection of documents in different languages • Multilingual system • Can retrieve from a multilingual collection • Cross-language system • Query in one language finds document in another • Translingual system • Queries can find documents in any language

  7. Who needs Cross-Language Search? • When users can read several languages • Eliminate multiple queries • Query in most fluent language • Monolingual users can also benefit • If translations can be provided • If it suffices to know that a document exists • If text captions are used to search for images

  8. Motivations • Commerce • Security • Social

  9. Global Internet Hosts Source: Network Wizards Jan 99 Internet Domain Survey

  10. Global Web Page Languages Source: Jack Xu, Excite@Home, 1999

  11. European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

  12. European Web Size Projection Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

  13. Global Internet Audio Almost 2000 Internet-accessible Radio and Television Stations source: www.real.com, Feb 2000

  14. 13 Months Later About 2500 Internet-accessible Radio and Television Stations source: www.real.com, Mar 2001

  15. User Needs Assessment • Who are the potential users? • What goals do we seek to support? • What language skills must we accommodate?

  16. Global Languages Source: http://www.g11n.com/faq.html

  17. Global Trade Billions of US Dollars (1999) Source: World Trade Organization 2000 Annual Report

  18. Global Internet User Population 2000 2005 English English Chinese Source: Global Reach

  19. Agenda • Questions • Overview • Cross-Language Search • User Interaction

  20. The Search Process (diagram labels: Monolingual Searcher; Cross-Language Searcher; Choose Document-Language Terms; Choose Query-Language Terms; Infer Concepts; Select Document-Language Terms; Query; Author; Choose Document-Language Terms; Query-Document Matching; Document)

  21. Some history: from controlled vocabulary to free text • 1964 International Road Research • Multilingual thesauri • 1970 SMART • Dictionary-based free-text cross-language retrieval • 1978 ISO Standard 5964 (revised 1985) • Guidelines for developing multilingual thesauri • 1990 Latent Semantic Indexing • Corpus-based free-text translingual retrieval

  22. Multilingual Thesauri • Build a cross-cultural knowledge structure • Cultural differences influence indexing choices • Use language-independent descriptors • Matched to language-specific lead-in vocabulary • Three construction techniques • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri

  23. Free Text CLIR • What to translate? • Queries or documents • Where to get translation knowledge? • Dictionary or corpus • How to use it?

  24. Translingual Retrieval Architecture Chinese Term Selection Monolingual Chinese Retrieval 1: 0.72 2: 0.48 Language Identification Chinese Term Selection Chinese Query English Term Selection Cross- Language Retrieval 3: 0.91 4: 0.57 5: 0.36

  25. Evidence for Language Identification • Metadata • Included in HTTP and HTML • Word-scale features • Which dictionary gets the most hits? • Subword features • Character n-gram statistics
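
The subword evidence on this slide can be illustrated with a small sketch. It is a minimal example, assuming character trigram profiles and a tiny hand-made training set; the function names, the sample sentences, and the smoothing floor are illustrative assumptions, not something taken from the lecture.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with a space padded on each end."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profile(samples, n=3):
    """Build a relative-frequency n-gram profile from sample strings."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def identify_language(text, profiles, n=3, floor=1e-6):
    """Return the language whose profile gives the highest log-likelihood.

    Unseen n-grams get a small floor probability (an assumed, crude smoothing).
    """
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(math.log(profile.get(gram, floor)) for gram in grams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real system would use far larger samples.
TRAINING = {
    "en": ["the cat sat on the mat", "where is the train station",
           "this is a short english sentence"],
    "fr": ["le chat est sur la table", "où est la gare",
           "ceci est une courte phrase en français"],
}
PROFILES = {lang: train_profile(samples) for lang, samples in TRAINING.items()}

print(identify_language("the station is here", PROFILES))
print(identify_language("la gare est sur la table", PROFILES))
```

Word-scale evidence (which dictionary gets the most hits) could be scored the same way by replacing the n-gram profile with a word list per language.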

  26. Query-Language Retrieval Chinese Query Terms English Document Terms Monolingual Chinese Retrieval 3: 0.91 4: 0.57 5: 0.36 Document Translation

  27. Example: Modular use of MT • Select a single query language • Translate every document into that language • Perform monolingual retrieval

  28. Is Machine Translation Enough? TDT-3 Mandarin Broadcast News Systran Balanced 2-best translation

  29. Document-Language Retrieval Chinese Query Terms Query Translation English Document Terms Monolingual English Retrieval 3: 0.91 4: 0.57 5: 0.36

  30. Query vs. Document Translation • Query translation • Efficient for short queries (not relevance feedback) • Limited context for ambiguous query terms • Document translation • Rapid support for interactive selection • Need only be done once (if query language is same) • Merged query and document translation • Can produce better effectiveness than either alone
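
The query-translation option can be sketched as a dictionary lookup per query term. This is a minimal illustration, assuming a hypothetical French-to-English toy dictionary and a made-up `translate_query` helper; it keeps several alternatives per term because a short query gives little context for choosing among them.

```python
def translate_query(query_terms, bilingual_dict, max_alternatives=2):
    """Translate each query term, keeping up to max_alternatives translations.

    Terms missing from the dictionary are passed through unchanged, which
    helps with proper names (an assumed policy, not one stated on the slide).
    """
    translated = []
    for term in query_terms:
        alternatives = bilingual_dict.get(term)
        if alternatives:
            translated.extend(alternatives[:max_alternatives])
        else:
            translated.append(term)
    return translated


# Hypothetical toy dictionary: "avocat" is ambiguous without more context.
FR_EN = {
    "avocat": ["lawyer", "avocado"],
    "pétrole": ["oil", "petroleum"],
}

print(translate_query(["avocat", "pétrole", "Oard"], FR_EN))
```

Only the few query terms are translated at search time, which is why this route is cheap for short queries; the cost is that every ambiguous term adds unrelated translations to the document-language query.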

  31. The Short Query Challenge Source: Jack Xu, Excite@Home, 1999

  32. Interlingual Retrieval Chinese Query Terms Query Translation English Document Terms Interlingual Retrieval 3: 0.91 4: 0.57 5: 0.36 Document Translation

  33. Key Challenges in CLIR • Wrong segmentation • Which translation? • No translation? (example glosses shown on the slide: probe survey take samples cymbidium goeringii oil petroleum restrain)

  34. Sources of Evidence for Translation • Corpus statistics • Lexical resources • Algorithms • The user

  35. Hieroglyphic Egyptian Demotic Greek

  36. Types of Bilingual Corpora • Parallel corpora: translation-equivalent pairs • Document pairs • Sentence pairs • Term pairs • Comparable corpora: topically related • Collection pairs • Document pairs

  37. Exploiting Parallel Corpora • Automatic acquisition of translation lexicons • Statistical machine translation • Corpus-guided translation selection • Document-linked techniques

  38. Word alignment (GIZA) STRAND … cannot understand crew commands… ne comprenez pas les instructions de l’ equip… Association stats Chunk-level alignment Frequency-based thresholding Lexicon acquisition from the WWW 63K chunks 500K words 3378 document pairs 170K entries

  39. Corpus-Guided Translation Selection • Rank translation alternatives for each term • Pick English word e that maximizes Pr(e) • Pick English word e that maximizes Pr(e|c) • Pick English words e1…en maximizing Pr(e1…en|c1…cm) = statistical machine translation! • Unigram language models are easy to build • Can use the collection being searched • Limits uncommon translation and spelling error effects
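
The simplest option on this slide, picking the English word e that maximizes Pr(e), can be sketched with a unigram model estimated from the collection being searched. The collection, the candidate list, and the function names below are hypothetical; this is a sketch under those assumptions rather than the method used in any particular system.

```python
from collections import Counter


def unigram_model(documents):
    """Estimate Pr(e) as relative word frequency over a document collection."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def select_translation(candidates, model):
    """Pick the candidate translation with the highest unigram probability.

    Candidates never seen in the collection score 0.0, so rare translations
    and misspelled dictionary entries are unlikely to be selected.
    """
    return max(candidates, key=lambda e: model.get(e, 0.0))


# Hypothetical toy collection standing in for the collection being searched.
COLLECTION = [
    "oil prices rose as oil exports fell",
    "the oil minister discussed petroleum reserves",
]
MODEL = unigram_model(COLLECTION)

print(select_translation(["petroleum", "oil", "petrolium"], MODEL))
```

Moving to Pr(e|c) would replace the unigram lookup with a translation probability table keyed on the source term c, which requires a parallel corpus rather than just the target collection.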

  40. Corpus-Based CLIR Example French Query Terms Top ranked French Documents Top ranked English Documents Parallel Corpus English Translations French IR System English IR System

  41. Exploiting Comparable Corpora • Blind relevance feedback • Existing CLIR technique + collection-linked corpus • Lexicon enrichment • Existing lexicon + collection-linked corpus • Dual-space techniques • Document-linked corpus

  42. Blind Relevance Feedback • Augment a representation with related terms • Find related documents, extract distinguishing terms • Multiple opportunities: • Before doc translation: Enrich the vocabulary • After doc translation: Mitigate translation errors • Before query translation: Improve the query • After query translation: Mitigate translation errors • Short queries get the most dramatic improvement
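
A minimal sketch of the feedback loop described here, assuming a crude term-count ranker and a tf × idf weight for picking distinguishing terms. The ranking function, the weighting, the parameter values, and the toy documents are all assumptions chosen for brevity; real systems use stronger ranking and term-selection formulas.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query_terms, docs):
    """Rank document indices by total query-term occurrences (ties by index)."""
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(tokenize(doc))
        scored.append((sum(tf[t] for t in query_terms), i))
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ordered if score > 0]


def expand_query(query_terms, docs, top_docs=2, new_terms=2):
    """Add the most distinguishing terms from the top-ranked documents."""
    df = Counter(t for doc in docs for t in set(tokenize(doc)))
    top = rank(query_terms, docs)[:top_docs]
    tf_top = Counter(t for i in top for t in tokenize(docs[i]))

    def weight(term):
        # Frequent in the top documents, rare in the collection overall.
        return tf_top[term] * math.log(len(docs) / df[term])

    candidates = [t for t in tf_top if t not in query_terms]
    candidates.sort(key=lambda t: (-weight(t), t))
    return list(query_terms) + candidates[:new_terms]


# Hypothetical toy collection.
DOCS = [
    "oil prices rise as petroleum exports fall",
    "petroleum exports and oil production",
    "the orchestra played a symphony",
    "football match results",
    "election results announced",
]

print(expand_query(["oil"], DOCS))
```

The same function can be applied before translation (to a query-language collection) or after translation (to the document-language collection); only the `docs` argument changes.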

  43. English Query Example: Post-Translation “Document Expansion” IR System Document to be Indexed Term Selection Top 5 IR System Results Single Document Term-to-Term Translation English Corpus Automatic Segmentation Mandarin Chinese Documents

  44. Post-Translation Document Expansion Mandarin Newswire Text

  45. Why Document Expansion Works • Story-length objects provide useful context • Ranked retrieval finds signal amid the noise • Selective terms discriminate among documents • Enrich index with low DF terms from top documents • Similar strategies work well in other applications • CLIR query translation • Monolingual spoken document retrieval

  46. … Cross-Language Evaluation Forum … ? … Solto Extunifoc Tanixul Knadu … Lexicon Enrichment Similar techniques can guide translation selection

  47. Lexicon Enrichment • Use a bilingual lexicon to align “context regions” • Regions with high coincidence of known translations • Pair unknown terms with unmatched terms • Unknown: language A, not in the lexicon • Unmatched: language B, not covered by translation • Treat the most surprising pairs as new translations • Not yet tested in a CLIR application

  48. Learning From Document Pairs (term-by-document count table; column alignment lost in extraction) English Terms: E1 E2 E3 E4 E5 • Spanish Terms: S1 S2 S3 S4 • Doc 1: 4 2 2 1 • Doc 2: 8 4 4 2 • Doc 3: 2 2 1 2 • Doc 4: 2 1 2 1 • Doc 5: 4 1 2 1

  49. Similarity “Thesauri” • For each term, find most similar in other language • Terms E1 & S1 (or E3 & S4) are used in similar ways • Treat top related terms as candidate translations • Applying dictionary-based techniques • Performed well on comparable news corpus • Automatically linked based on date and subject codes
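
One way to read "used in similar ways" is that two terms have similar occurrence patterns across linked document pairs. The sketch below assumes cosine similarity over per-document-pair counts; the terms and counts are invented for illustration and are not the values from the slide's table, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_thesaurus(source_vectors, target_vectors, top_n=1):
    """For each source term, list the top_n most similar target-language terms."""
    thesaurus = {}
    for term, vec in source_vectors.items():
        ranked = sorted(
            target_vectors,
            key=lambda t: (-cosine(vec, target_vectors[t]), t),
        )
        thesaurus[term] = ranked[:top_n]
    return thesaurus


# Hypothetical counts: position i is the count in linked document pair i.
ENGLISH = {"oil": [4, 0, 2, 0], "music": [0, 3, 0, 1]}
SPANISH = {
    "petróleo": [2, 0, 1, 0],
    "música": [0, 6, 0, 2],
    "precio": [1, 1, 1, 1],
}

print(similarity_thesaurus(ENGLISH, SPANISH))
```

The top-ranked terms are only candidate translations; as the slide notes, they can then be fed to dictionary-based techniques in place of a hand-built lexicon.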
