A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective

by Ying Alvarado (24401693) • CSE 8337 • Lecturer: Dr. Margaret Dunham • April 26, 2007

Outline • Introduction • Concept • Why important • Approach • CLIR problems • Resource • Approaches • Example Techniques • A CLIR application system • CLIR effectiveness • CLIR future tasks • CLIR communities • References

Cross Language IR • Definition: Users enter their query in one language and the system retrieves relevant documents in other languages. • For example, a user may pose their query in English but retrieve relevant documents written in French. • Example CLIR applications • Cross-Language retrieval from texts • Cross-Language retrieval from audio and images In this presentation, we focus on text IR only! [1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

Monolingual vs. Bilingual vs. Multilingual • Monolingual IR: documents and user requests are in the same language. [Diagram: Request (L1) → IR system over Documents (L1) → Results (L1)] • Cross-language IR: documents and user requests are in different languages (bilingual IR). [Diagram: Request (L1, source language) → Cross-language IR (CLIR) system over Documents (L2, target language) → Results (L2)] [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

Monolingual vs. Bilingual vs. Multilingual (cont.) • Multilingual IR: the collection contains documents in different languages, and search requests may be in any language, e.g. the Web. [Diagram: Request (L?) → Multilingual IR (MLIR) system over Documents (L2), Documents (L3), Documents (L4) → Results (L2, L3 or L4)]

Why CLIR? Mar. 10, 2007 [3] Internet World Stats, http://www.internetworldstats.com/stats7.htm

Why CLIR? (cont.) • A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language. • A single document may be written in more than one language. For example: • Technical documents in which English jargon is intermixed with narrative text in another language. • Academic works that cite the titles of documents in different languages. • The user is not fluent enough to express a query in a language, but is able to make use of the documents that are identified. • The user is monolingual and wants to query in their native language, because they: • can judge relevance even if the results are not translated • have access to document translation [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996

CLIR problems • Handling non-ASCII character sets • Untranslatable search keys (out-of-vocabulary, OOV), e.g. compound words, proper names, special terms • Multi-word concepts, e.g. phrases and idioms • Ambiguity, e.g. homonymy and polysemy • Word inflections, e.g. plurals and gender [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001

Resources for Translation • Ontology • A representation of concepts and the relationships between them • Thesaurus • A listing of words with similar, related, or opposite meanings • Does not include definitions of the words • Bilingual dictionary • A list of words together with additional word-specific information • Bilingual controlled vocabulary • A carefully selected list of words and phrases used to tag units of information (documents or works) so that they can be retrieved more easily by a search • Corpora • The document collection itself [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006 [1] Wikipedia. Related pages. [7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004

An example of controlled vocabulary • Relationship types illustrated: the hierarchical relationships and the equivalence relationship • Hierarchical example (BT = broader term, NT = narrower term): Women's Pants: BT Pants; NT Casual Pants; NT Dress Pants; NT Sports Pants [14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary

What to translate? • Document translation • Text translation • E.g., translate entire document collection into English → search collection in English • Vector translation • Query translation • E.g., translate English query into Chinese query → search Chinese document collection [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

Tradeoffs • Document translation • Documents can be translated and stored offline • Depends on a high-quality automatic machine translation (MT) system • Does not easily deal with changing document sets • Query translation • Often easier • Disambiguation of query terms may be difficult with short queries [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

Approaches to query translation • Knowledge-based: several aspects of domain knowledge are manually encoded into a lexicon. • Ontology-based (concept driven) • Thesaurus-based • Dictionary-based • Drawbacks: lexicons are expensive to construct and lag behind the common use of terminology. • Corpus-based: directly exploit statistical information about term usage in a corpus to construct a lexicon automatically. • Parallel corpora: document pairs, sentence pairs, term pairs • Comparable corpora: document pairs, similar content • Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001 [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
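As an illustration of the dictionary-based approach, here is a minimal sketch of query translation by dictionary lookup. The dictionary entries, the function name `translate_query`, and the pass-through handling of out-of-vocabulary terms are assumptions made for this example and do not come from the cited sources.

```python
from typing import Dict, List

# Toy English-to-French dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "bank": ["banque", "rive"],  # ambiguous: financial institution vs. river bank
    "interest": ["intérêt"],
    "rate": ["taux"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each source term with all of its dictionary translations.

    Terms missing from the dictionary (out-of-vocabulary) are passed through
    unchanged, a simple fallback that suits proper names.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


french_query = translate_query("Bank interest rate Dallas", TOY_DICTIONARY)
print(french_query)
```

Keeping every translation of "bank" shows the ambiguity problem directly: a short query gives little context for choosing between "banque" and "rive".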

Applying monolingual IR techniques • Query expansion • Relevance feedback • Stemming • Latent semantic analysis • Parsing • Part of speech tagging …… [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996

Multilingual Thesauri • Three construction techniques • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri • For example EuroWordNet • 7 languages • Built from existing lexical resources • Has the same structure as Princeton WordNet [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001 [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

Pseudo-Relevance Feedback • Also called blind feedback • Assumes that the top n documents in the result set actually are relevant. • Steps: • Enter query terms in French • Find the top French documents in a parallel corpus • Construct a query from their English translations • Perform a monolingual free text search [Diagram: French query terms → French text retrieval system over the parallel corpus → top-ranked French documents → English translations → AltaVista → English Web pages] [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
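The first three steps of this procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the three-document parallel corpus, the term-overlap scoring function, and the names `score` and `build_english_query` are all invented for illustration. A real system would use a proper ranking function and remove stopwords.

```python
from collections import Counter
from typing import List, Tuple

# Toy parallel corpus: (French document, aligned English translation).
PARALLEL_CORPUS: List[Tuple[str, str]] = [
    ("le taux de chômage augmente", "unemployment rate rises"),
    ("le chômage des jeunes", "youth unemployment"),
    ("la coupe du monde de football", "football world cup"),
]


def score(query_terms: List[str], document: str) -> int:
    """Crude relevance score: how many times the query terms occur."""
    doc_terms = document.split()
    return sum(doc_terms.count(term) for term in query_terms)


def build_english_query(french_query: str,
                        corpus: List[Tuple[str, str]],
                        top_n: int = 2,
                        num_terms: int = 3) -> List[str]:
    """Rank the French side, then harvest terms from the English side."""
    query_terms = french_query.lower().split()
    ranked = sorted(corpus,
                    key=lambda pair: score(query_terms, pair[0]),
                    reverse=True)
    # Blind feedback: the top_n documents are assumed relevant, unverified.
    counts: Counter = Counter()
    for _french, english in ranked[:top_n]:
        counts.update(english.split())
    return [term for term, _count in counts.most_common(num_terms)]


english_query = build_english_query("chômage", PARALLEL_CORPUS)
print(english_query)
```

The resulting English terms would then be submitted to an ordinary monolingual English search engine, which is the final step on the slide.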

Levels of alignment in parallel corpora • Document alignment • Already exists • Collected from existing corpora • Examine document external features • Examine document internal features • Sentence alignment • Easily constructed from aligned documents • Match patterns of relative sentence lengths • Good first step for term alignment • Term alignment • Using co-occurrence-based translation [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

CSE8337是一门关于信息存储和检索的课程。 CSE8337 is a class about information storage and retrieval. Example of term alignment

Co-occurrence-based translation • Align terms using co-occurrence statistics • Assumes that the correct translations of query terms tend to co-occur in target-language documents • How often does a term pair occur in sentence pairs? • Weighted by relative position in the sentences • Retain term pairs that occur unusually often [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
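One possible reading of "occur unusually often" is a Dice coefficient computed over aligned sentence pairs, sketched below. The Dice measure, the threshold of 0.9, the toy English-French sentence pairs, and the name `dice_alignment` are assumptions for illustration. The position weighting mentioned on the slide is left out to keep the sketch short.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# Toy sentence-aligned corpus (English, French); invented for illustration.
SENTENCE_PAIRS: List[Tuple[str, str]] = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]


def dice_alignment(pairs: List[Tuple[str, str]],
                   threshold: float = 0.9) -> Dict[Tuple[str, str], float]:
    """Score each (source, target) term pair by its Dice coefficient."""
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src_sentence, tgt_sentence in pairs:
        src_terms = set(src_sentence.split())
        tgt_terms = set(tgt_sentence.split())
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        # Every source term co-occurs with every target term of the pair.
        pair_count.update(product(src_terms, tgt_terms))
    scores = {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }
    # Retain only the pairs that co-occur unusually often.
    return {pair: value for pair, value in scores.items() if value >= threshold}


aligned = dice_alignment(SENTENCE_PAIRS)
print(sorted(aligned))
```

On this toy corpus the retained pairs are the five correct translations, while a pair such as ("the", "chat") scores 0.8 and is dropped.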

Exploiting Unaligned Corpora • Example approach: category-based translation • Extract a large number of terms from unaligned corpora of the first and second languages • Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages • Estimate category-to-category translation probabilities • Estimate term-to-term translation probabilities using these category-to-category translation probabilities [15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
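The last step can be sketched as follows. This is one plausible way to combine the quantities, not necessarily the formula used in the cited patent. It assumes each term has exactly one category and that the category-to-category probabilities are already estimated. All data values and names (`EN_CATEGORY`, `CATEGORY_TRANSLATION`, `term_translation_probs`, and so on) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Assumed inputs: thesaurus categories per term and target-corpus term counts.
EN_CATEGORY: Dict[str, str] = {"bank": "FINANCE", "river": "NATURE"}
FR_CATEGORY: Dict[str, str] = {
    "banque": "FINANCE_FR", "prêt": "FINANCE_FR", "rivière": "NATURE_FR",
}
FR_TERM_COUNTS: Dict[str, int] = {"banque": 6, "prêt": 2, "rivière": 4}

# Assumed category-to-category translation probabilities P(c2 | c1).
CATEGORY_TRANSLATION: Dict[Tuple[str, str], float] = {
    ("FINANCE", "FINANCE_FR"): 0.9, ("FINANCE", "NATURE_FR"): 0.1,
    ("NATURE", "FINANCE_FR"): 0.1, ("NATURE", "NATURE_FR"): 0.9,
}


def term_translation_probs(source_term: str) -> Dict[str, float]:
    """Estimate P(t2 | t1) as P(cat(t2) | cat(t1)) * P(t2 | cat(t2))."""
    category_totals: Dict[str, int] = defaultdict(int)
    for term, count in FR_TERM_COUNTS.items():
        category_totals[FR_CATEGORY[term]] += count
    source_category = EN_CATEGORY[source_term]
    probs: Dict[str, float] = {}
    for term, count in FR_TERM_COUNTS.items():
        target_category = FR_CATEGORY[term]
        p_category = CATEGORY_TRANSLATION.get(
            (source_category, target_category), 0.0)
        # P(t2 | cat(t2)) is the term's relative frequency within its category.
        probs[term] = p_category * count / category_totals[target_category]
    return probs


bank_probs = term_translation_probs("bank")
print(bank_probs)
```

Under these assumptions "banque" receives the highest probability for "bank", because it is both in the likely target category and frequent within it.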

In Summary [Diagram: taxonomy of cross-language text retrieval. Node labels: Cross-Language Text Retrieval; Query Translation; Document Translation (Text Translation, Vector Translation); Controlled Vocabulary; Free Text; Knowledge-based (Ontology-based, Thesaurus-based, Dictionary-based); Corpus-based (Parallel, Comparable, Unaligned); Term-aligned, Sentence-aligned, Document-aligned] [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001

An experimental system: automatic construction of a parallel English-Chinese corpus for CLIR • A parallel text mining system: PTMiner • Finds parallel text on the Web • Parallel text mining algorithm: • Search for candidate sites - Using existing Web search engines, search for candidate sites that may contain parallel pages (by using anchor text); • File name fetching - For each candidate site, fetch the URLs of Web pages that are indexed by the search engines; • Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs; • Pair scan - From the obtained URLs of each site, scan for possible parallel pairs (by analyzing document external features); • Download and verification - Download the parallel pages, determine file size, language and character set, text length, and HTML structure, and filter out non-parallel pairs. [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

The workflow of the mining process • Sample anchor texts: “english version” [“in english”, ……] • Sample document external features: “file-ch.html” vs. “file-en.html”; “…/chinese/…/file.html” vs. “…/english/…file.html” • Sample document internal features: character set, HTML structure [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
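The pair scan step, which relies on external features such as the file and directory names above, could be sketched like this. It is not PTMiner's actual implementation: the marker list, the regular expression, the `example.org` URLs, and the name `pair_scan` are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

# Language markers assumed for this sketch; a real system would use a longer list.
LANGUAGE_MARKER = re.compile(r"(?<![a-z])(english|chinese|en|ch)(?![a-z])")


def pair_scan(urls: List[str]) -> List[Tuple[str, str]]:
    """Pair URLs that differ only in their first language marker."""
    candidates: Dict[str, Dict[str, str]] = {}
    for url in urls:
        lowered = url.lower()
        match = LANGUAGE_MARKER.search(lowered)
        if match is None:
            continue  # no language marker, so no basis for pairing
        language = "en" if match.group(1) in ("en", "english") else "ch"
        # Replace the marker with a placeholder so both versions share a key.
        key = LANGUAGE_MARKER.sub("<lang>", lowered, count=1)
        candidates.setdefault(key, {})[language] = url
    return [(found["en"], found["ch"])
            for found in candidates.values() if len(found) == 2]


SAMPLE_URLS = [
    "http://example.org/file-ch.html",
    "http://example.org/file-en.html",
    "http://example.org/chinese/news/file.html",
    "http://example.org/english/news/file.html",
    "http://example.org/about.html",
]
pairs = pair_scan(SAMPLE_URLS)
print(pairs)
```

Pairs found this way are only candidates. The download and verification step still has to filter them using internal features such as character set, text length, and HTML structure.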

An alignment example [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

Part of the lexicons • t: true • f: false Other techniques and tools used: • Encoding scheme transformation (for Chinese) • Sentence-level segmentation • Chinese word segmentation • English expression extraction • SILC: language and encoding identification system [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

Results • 14820 pairs of texts (lexicon) • C-E has a precision of 77% • E-C has a precision of 81.5% • CLIR results • Test corpus: TREC5 and TREC6 Chinese track [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

Does CLIR work? • Best systems at TREC-6 (1997): • English-French: 49% of highest French monolingual • English-German: 64% of highest German monolingual • Best systems at CLEF (2002): • English-French: 83% of highest French monolingual • English-German: 86% of highest German monolingual • Best systems at CLEF (2006): • English-French: 93.82% of best French monolingual • English-Portuguese: 90.91% of best Portuguese monolingual [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006

Future tasks • Extend the scope of study: • Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings… • Lower cost, improve efficiency • Pay more attention to indexing-time optimizations that improve query-time efficiency • Consider the user's perspective • Improve the utility of ranked lists • Define suitable criteria for the construction of a valid multilingual Web corpus • Obtain resources for resource-poor languages [11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR [12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

CLIR Communities • TREC Cross Language Track currently focuses on the Arabic language, • Cross-Language Evaluation Forum (CLEF) – a spinoff from TREC - covering many European languages, • NTCIR Asian Language Evaluation (covering Chinese, Japanese and Korean). [12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

CLEF In CLEF 2006, eight tracks were offered to evaluate the performance of systems: • multilingual document retrieval on news collections (Ad-hoc) • cross-language structured scientific data (Domain-specific) • interactive cross-language retrieval • multiple language question answering • cross-language retrieval on image collections • cross-language speech retrieval • multilingual web retrieval • cross-language geographic retrieval. [13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006

References
[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006
[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006

Thank you!
