Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval

Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/11/24 • References: • Wen-Hsiang Lu (advisors: Lee-Feng Chien and Hsi-Jian Lee) (2003). Term Translation Extraction Using Web Mining Techniques. PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University.

Outline • Background & Research Problems • Anchor Text Mining for Term Translation Extraction • Transitive Translation for Multilingual Translation • Web Mining for Cross-Language Information Retrieval and Web Search Applications

Part I Background & Research Problems

Motivation • Demand for multilingual translation lexicons • Machine translation (MT) • Cross-language information retrieval (CLIR) • Information exchange in electronic commerce (EC) • Web mining • Explore the multilingual, wide-scoped hypertext resources on the Web

Research Problems • Difficulties in automatic construction of multilingual translation lexicons • Techniques: parallel/comparable corpora • Bottleneck: lack of diverse, multilingual resources • Difficulties in query translation for cross-language information retrieval (CLIR) [Fig1] • Techniques: bilingual dictionaries, machine translation, parallel corpora • Bottleneck: queries are often ambiguous (multiple senses), short, diverse, or unknown [Fig2]

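Cross-Language Information Retrieval • Query in a source language and retrieve relevant documents in target languages • Pipeline: Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents • Example: source query "Hussein" • Target translations (traditional Chinese, TC): 海珊/侯賽因/哈珊/胡笙 • Target translations (simplified Chinese, SC): 侯赛因/海珊/哈珊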

Difficulties in Query Translation Using Machine Translation Systems • English source query: National Palace Museum • Chinese translation: 全國宮殿博物館 (a word-by-word rendering; the museum's actual Chinese name is 國立故宮博物院)

Research Paradigm • New approach: Web mining of the Internet • Anchor-text mining • Search-result mining • Term-translation extraction • Live translation lexicon • Applications • Cross-language information retrieval • Cross-language Web search

Multilingual Anchor Texts & Hyperlink Structure

Language-Mixed Texts in Search Result Pages

Research Results • Anchor text mining for term translation extraction • ACM SIGIR’01 (poster), IEEE ICDM’01, ACM Trans. on Asian Language Information Processing 2002 • Reviewers’ encouraging comments • “… the approach seems to be quite novel. To my knowledge, there has not been a proposal of uses of anchor texts like this one.” • Transitive translation for multilingual translation • COLING’02, ACM Trans. on Information Systems (first paper from Taiwan since 1986), ACL’04 • Reviewers’ encouraging comments • “This is a nicely written, technically sound paper that pursues a clever and original idea …” • “… the idea of using anchor texts from the Web to learn cross-lingual information retrieval algorithms is very good …” • “I enjoyed the paper and thought the underlying work was interesting and valuable …”

Research Results (cont.) • Web mining for cross-language Web search • ROCLING’03, ACM SIGIR’04 • Improve precision rate from 0.207 (dictionary-based) to 0.241 on NTCIR-2 Chinese-English CLIR evaluation task • Reviewers’ encouraging comments • “It gives us insight into the value of the Web as a dynamic information source. Although the experiments are restricted to Chinese-English documents, also developers for other languages may find this work stimulating.” • “The idea is interesting, and is relatively new. It may give inspiration to other researchers working in the same area.” • LiveTrans: Experimental CLWS system [LiveTrans]

LiveTrans: Cross-Language Web Search System • http://livetrans.iis.sinica.edu.tw/lt.html [LiveTrans] • Mirror: http://wmmks.csie.ncku.edu.tw/lt.html [LiveTrans] • System functions • Query-translation suggestion • Retrieval of Web pages and images • Multilingual search: English, traditional Chinese, simplified Chinese, Japanese, or Korean • Gloss translation for retrieved page titles • Fusion of retrieval results

Research Results (cont.) • Summary of contributions • Present an innovative approach • Significantly reduces the difficulty of unknown-term translation • Improves CLIR, especially for short queries • Develop a practical cross-language Web search engine • Does not rely on a translation dictionary • Yields a live dictionary with a significant number of multilingual term translations • Present a new problem for further investigation in Web mining

Related Research • Automatic extraction of multilingual translations • Statistical translation model (Brown 1993) • Parallel corpus (Melamed 2000; Wu & Chang 2003) • Non-parallel/comparable corpus (Fung 1998; Rapp 1999) • Web mining • Parallel corpus collection (Nie 1999; Resnik 1999) • Comparable corpus collection: Anchor texts and search-result pages (Lu et al. 2002, 2003) • Strength: Huge amounts of Web data with link structure

Related Research (cont.) • Query translation for cross-language information retrieval • Dictionary-/MT-based approach (Ballesteros & Croft 1997; Hull & Grefenstette 1996) • Corpus-based approach (Dumais 1997; Nie 1999) • Combined approach (Chen & Bian 1999; Kwok 2001) • Improving techniques • Query expansion and phrase translation (Ballesteros & Croft 1997) • Translation disambiguation (Ballesteros & Croft 1998; Chen & Bian 1999) • Proper name transliteration (Chen et al. 1998; Lin & Chang 2003) • Probabilistic retrieval/language models (Hiemstra & de Jong 1999; Lavrenko 2002) • Unknown query translation (Lu et al. 2002, 2003)

Related Research (cont.) • Cross-language Web search (CLWS) • Practical CLWS services have not lived up to expectations • Keizai (Ogden et al. 1999): English query/Japanese, Korean Web news • MTIR (Bian & Chen 1999): Chinese query/English pages/translation • MuST: Multilingual Summarization and Translation (Hovy & Lin 1998) • English/Indonesian/Spanish/Arabic/Japanese, Web news summarization or translation • TITAN (Hayashi et al. 1997): English-Japanese retrieval/translated page titles • Challenges • Web queries are often • Short: 2-3 words (Silverstein et al. 1998) • Diverse: wide-scoped topics • Unknown (out of vocabulary): 74% are unavailable in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries • E.g. • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein) • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infections)

Part II Anchor Text Mining for Term Translation Extraction

Anchor-Text Set • Anchor text (link text) • The descriptive text of a link on a Web page • Anchor-text set • A set of anchor texts pointing to the same page (URL) • Multilingual translations • Yahoo/雅虎/야후 • America/美国/アメリカ • Anchor-text-set corpus • A collection of anchor-text sets • Example: anchor texts pointing to http://www.yahoo.com • English: "Yahoo Search Engine", "Yahoo! America" • Korea: 야후-USA • Japan: アメリカのYahoo! • China: 美国雅虎 • Taiwan: 雅虎搜尋引擎
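The anchor-text-set definition above can be illustrated with a small sketch. It assumes a crawler has already produced (anchor text, target URL) pairs; the function name `build_anchor_text_sets` and the input format are hypothetical and not taken from the lecture.

```python
from collections import defaultdict

def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to.

    `links` is an iterable of (anchor_text, target_url) pairs,
    a hypothetical crawler output format.
    """
    sets = defaultdict(set)
    for anchor_text, target_url in links:
        text = anchor_text.strip()
        if text:  # skip empty anchors, e.g. image-only links
            sets[target_url].add(text)
    return dict(sets)

# Anchor texts in several languages that point to the same page
links = [
    ("Yahoo Search Engine", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("アメリカのYahoo!", "http://www.yahoo.com"),
    ("야후-USA", "http://www.yahoo.com"),
    ("台灣雅虎", "http://www.yahoo.com.tw"),
]
corpus = build_anchor_text_sets(links)
```

Each value in `corpus` is one anchor-text set; the whole dictionary plays the role of the anchor-text-set corpus.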

Processing of Term Translation Extraction • Input: source query term; output: target translation, stored in the translation lexicon • Anchor-text extraction: a Web spider collects Web pages from the Internet and builds up the anchor-text-set corpus. • Term extraction: extract key terms as translation candidates. • Term similarity estimation: compute similarity using the probabilistic inference model.

Example for Term Translation Extraction • s: source query term (Yahoo); t: target translation (雅虎) • Chinese-English anchor-text-set corpus • Set u1: www.yahoo.com (#in-links = 187), with anchor texts such as "Yahoo" and "搜尋引擎雅虎" • Set u2: www.yahoo.com.tw (#in-links = 21), with anchor texts such as "Yahoo Taiwan" and "台灣雅虎" • Evidence used: co-occurrence of s and t within a set, and page authority (number of in-links)

Probabilistic Inference Model • Conventional translation model: asymmetric translation models • Symmetric model with link information, combining co-occurrence and page authority
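The formulas for this slide are not reproduced in the text, so the following is only a sketch of one way a symmetric score with link information could be computed. It assumes the score is the ratio P(s and t) / P(s or t), summed over anchor-text sets and weighted by each page's share of in-links; it further assumes that P(s|u) is the fraction of anchor texts in set u containing s, and that s and t are independent within a set. The function name, the data layout, and the anchor texts in the example are illustrative (only the in-link counts 187 and 21 come from the lecture).

```python
def symmetric_similarity(s, t, anchor_sets):
    """Symmetric co-occurrence score weighted by page authority.

    anchor_sets: list of {"in_links": int, "anchors": [str, ...]}
    (hypothetical layout). Returns a value in [0, 1].
    """
    total_links = sum(u["in_links"] for u in anchor_sets)
    if total_links == 0:
        return 0.0
    joint = 0.0
    union = 0.0
    for u in anchor_sets:
        anchors = u["anchors"]
        if not anchors:
            continue
        authority = u["in_links"] / total_links  # assumed P(u)
        p_s = sum(s in a for a in anchors) / len(anchors)
        p_t = sum(t in a for a in anchors) / len(anchors)
        p_both = p_s * p_t                # independence assumption
        p_either = p_s + p_t - p_both
        joint += p_both * authority
        union += p_either * authority
    return joint / union if union > 0 else 0.0

anchor_sets = [
    {"in_links": 187,
     "anchors": ["Yahoo", "搜尋引擎雅虎", "Yahoo Search Engine", "雅虎"]},
    {"in_links": 21, "anchors": ["Yahoo Taiwan", "台灣雅虎"]},
]
score = symmetric_similarity("Yahoo", "雅虎", anchor_sets)
```

Because the numerator and denominator treat s and t identically, swapping the two arguments gives the same score, which is the property that distinguishes a symmetric model from the asymmetric ones.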

Experimental Environment • Anchor-text-set corpora • 109,416 traditional-Chinese-English sets (from 1,980,816 pages) • 157,786 simplified-Chinese-English sets (from 2,179,171 pages) • Test query set • Query logs: • Dreamer log: 228,566 unique query terms • GAIS log: 114,182 unique query terms • Core terms: the 9,709 most popular query terms, with frequency > 10 in the two logs • Test set: 622 English terms selected from the core terms • Evaluation metric: average top-n inclusion rate (ATIR)
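The slide names the top-n inclusion rate without defining it. The sketch below assumes it is the fraction of test terms for which at least one correct translation appears among the top n extracted candidates; the function name and the toy data are illustrative.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of test terms with a correct translation in the top n.

    extracted: {term: [candidate, ...]} ranked best first
    gold:      {term: set of acceptable translations}
    """
    if not gold:
        return 0.0
    hits = 0
    for term, correct in gold.items():
        top = extracted.get(term, [])[:n]
        if any(candidate in correct for candidate in top):
            hits += 1
    return hits / len(gold)

extracted = {
    "sakura": ["台灣櫻花", "櫻花", "蜘蛛網"],
    "yahoo": ["雅虎", "搜尋引擎"],
}
gold = {"sakura": {"櫻花"}, "yahoo": {"雅虎"}}

rate_top1 = top_n_inclusion_rate(extracted, gold, 1)
rate_top2 = top_n_inclusion_rate(extracted, gold, 2)
```

Here "sakura" is only matched at rank 2, so the rate rises from 0.5 at n = 1 to 1.0 at n = 2.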

Performance with Different Estimation Models • Models compared • MA: Asymmetric model • MAL: Asymmetric model with link information • MS: Symmetric model • MSL: Symmetric model with link information • Finding: the symmetric inference model with link information improved translation accuracy.

Performance with Different Term Extraction Methods and Query-Log-Set Sizes • The query-log-based method achieved better performance. • The medium-sized query-log set achieved the best performance

Performance Comparison • Example: test term "sakura" • Query-log set (9,709 terms) • Top 5 extracted translations: 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護 • Query-log set (228,566 terms) • Top 10 extracted translations: 庫洛魔法使, 櫻花建設, 模仿, 櫻花大戰, 美夕, 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護 • Test results of 9,709 core terms [TTE9709] • Promising results

Part III Transitive Translation for Multilingual Translation

Transitive Translation for Multilingual Translation • Problem • Insufficient anchor-text-set corpus for certain language pairs • E.g., Chinese-Japanese, Chinese-French, etc. • Goal • A generalized model for multilingual translation • Idea • Transitive translation model: extract translations via an intermediate (third) language, e.g., English (Borin 2000; Gollins & Sanderson 2001) • Integrate a competitive linking algorithm to reduce interference errors.

Transitive Translation: Combining Direct and Indirect Translation • Direct Translation Model • Indirect Translation Model • Transitive Translation Model • Notation: s: source term; t: target translation; m: intermediate translation • Example: s = 新力 (Traditional Chinese), t = ソニー (Japanese), m = Sony (English) • Direct translation links s to t; indirect translation goes from s to t through m
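The model equations are not preserved on this slide, so the following is a sketch under stated assumptions rather than the lecture's formulation. It assumes the indirect score sums, over intermediate terms m, the product of the s-to-m and m-to-t similarities, and that the transitive model uses the direct score when it exceeds a threshold and falls back to the indirect score otherwise. Function names, the threshold, and all scores are illustrative.

```python
def indirect_score(s, t, sim_sm, sim_mt):
    """Sum over intermediate terms m of sim(s, m) * sim(m, t).

    sim_sm: {(s, m): score}; sim_mt: {(m, t): score}
    """
    return sum(
        score * sim_mt.get((m, t), 0.0)
        for (source, m), score in sim_sm.items()
        if source == s
    )

def transitive_score(s, t, sim_direct, sim_sm, sim_mt, threshold=0.0):
    """Use the direct score if it is reliable, else go through m."""
    direct = sim_direct.get((s, t), 0.0)
    if direct > threshold:
        return direct
    return indirect_score(s, t, sim_sm, sim_mt)

# No direct Chinese-Japanese evidence, so English serves as the bridge
sim_direct = {}
sim_sm = {("新力", "Sony"): 0.6, ("新力", "Sonic"): 0.1}
sim_mt = {("Sony", "ソニー"): 0.5, ("Sonic", "ソニック"): 0.4}

score_sony = transitive_score("新力", "ソニー", sim_direct, sim_sm, sim_mt)
```

With an empty `sim_direct`, the score for 新力 and ソニー comes entirely from the path through "Sony".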

Promising Results for Automatic Construction of Multilingual Translation Lexicons

Indirect Association Problem • Indirect association error (Melamed 2000) • An incorrect candidate t1 co-occurs with s more often than the correct translation t does • E.g., 思科 → system (translation error) • Figure: s = 思科; t1 = system (weight 0.11); t = Cisco (weight 0.07)

Competitive Linking Algorithm • Concepts of the competitive linking (CL) algorithm (Melamed 2000) • Determine the most probable translation pairs between the source and target sets. • Assumption: each term has only one translation. • Method: • Greedily select the most probable edges. • Select less probable edges only when they do not conflict with previous selections. • Integration of anchor-text mining and the CL algorithm • Build a bipartite graph using our proposed translation model. • Use the extended CL algorithm to filter out indirect association errors.
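The greedy one-to-one selection described above can be sketched as follows. This is a minimal illustration of basic competitive linking, not the extended variant used in the lecture; the function name is hypothetical, and only the weights 0.11 and 0.07 for 思科 come from the slides (the other edges and their assignment are assumed for the example).

```python
def competitive_linking(weights):
    """Greedy one-to-one linking of source terms to target terms.

    weights: {(source, target): score}
    Returns {source: target}, each term used at most once.
    """
    links = {}
    used_targets = set()
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (source, target), _ in ranked:
        # Accept an edge only if neither endpoint is already linked
        if source not in links and target not in used_targets:
            links[source] = target
            used_targets.add(target)
    return links

weights = {
    ("思科", "system"): 0.11,  # indirect association
    ("思科", "Cisco"): 0.07,   # correct translation
    ("系統", "system"): 0.23,  # assumed stronger competitor for "system"
    ("系統", "Cisco"): 0.01,
}
links = competitive_linking(weights)
```

Picking the best candidate for 思科 in isolation would return "system"; once 系統 claims "system" through its stronger edge, 思科 is left with "Cisco".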

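Bipartite Graph Construction • Bipartite graph G = (S∪T, E), with source terms S and target terms T • Step 1: connect the source term s (思科) to its candidate translations t1 (system) and t2 (Cisco) • Step 2: add the source-term sets St1 and St2 associated with t1 and t2 (系統, 資訊, 網路, 電腦)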

Extended Competitive Linking Algorithm • Pick the k most probable translations for a source term • Figure (Step 1 and Step 2 panels): bipartite graph for s = 思科 with candidates t1 = system and t2 = Cisco, source-term sets St1 and St2 (系統, 資訊, 網路, 電腦), and edge weights 0.11, 0.07, 0.23, 0.01, 0.03, 0.004

Direct_Translation_with_CL (s, U, Vt)
Input: source term s; Web pages of concern U; translation vocabulary set Vt
Output: target translation set R
1. Construct bipartite graph G = (S∪T, E).
2. Compute edge weight wij.
3. Sort wij and choose the edge ei*j* with the highest weight.
4. If si* = s: set R = R ∪ {tj*}. If |R| = k, return R; otherwise remove all edges linking to tj* and re-estimate wij for the remaining edges.
5. If si* ≠ s: remove all edges linking to si* or tj* and re-estimate wij for the remaining edges.
6. If |E| = 0, return R; otherwise go back to step 3.
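The Direct_Translation_with_CL procedure can be sketched in a few lines. This version takes precomputed edge weights instead of the Web pages U and vocabulary Vt, and it skips the re-estimation of wij after each removal; both simplifications are assumptions made for the example. The edge weights reuse numbers from the slides, but which edge each number belongs to is assumed (apart from 0.11 and 0.07 for 思科).

```python
def direct_translation_with_cl(s, weights, k):
    """Collect up to k translations of s by extended competitive linking.

    weights: {(source, target): score} for the bipartite graph.
    Weight re-estimation after each removal is omitted here.
    """
    edges = dict(weights)
    result = []
    while edges and len(result) < k:
        (si, tj), _ = max(edges.items(), key=lambda kv: kv[1])
        if si == s:
            result.append(tj)
            # Keep the other edges of s so it can gain more translations
            edges = {e: w for e, w in edges.items() if e[1] != tj}
        else:
            # A competing source term wins tj; drop both endpoints
            edges = {e: w for e, w in edges.items()
                     if e[0] != si and e[1] != tj}
    return result

weights = {
    ("思科", "system"): 0.11,
    ("思科", "Cisco"): 0.07,
    ("系統", "system"): 0.23,
    ("系統", "Cisco"): 0.01,
    ("資訊", "system"): 0.03,
    ("網路", "Cisco"): 0.004,
}
translations = direct_translation_with_cl("思科", weights, k=1)
```

The strongest edge overall belongs to 系統 and "system", so "system" is removed before 思科 is considered, and the indirect association is filtered out.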

Performance of Proposed Models with CL Algorithm • Test query set: 258 terms (from 9,709 core terms) • Anchor-text-set corpora • Traditional Chinese-Simplified Chinese: 4,516 sets • Traditional Chinese-English: 109,416 sets • Simplified Chinese-English: 157,786 sets • Source/target/intermediate languages: Traditional Chinese/Simplified Chinese/English

Effective Translation Using CL Algorithm

Part IV Web Mining for Cross-Language Information Retrieval and Web Search Applications

Web Mining for Cross-Language Information Retrieval and Web Search Applications • Goal: Web mining to benefit CLIR and CLWS • Mining query translations from the Web • Idea: Integrated Web mining approach • Anchor-text-mining approach • Probabilistic inference model • Transitive translation model • Search-result-mining approach • Chi-square test • Context-vector analysis
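As an illustration of the chi-square test listed under search-result mining, the sketch below scores a source term s and a candidate t from a 2x2 contingency table using the standard chi-square statistic. It assumes the counts are page counts (for example, search-engine hit counts) for pages containing both terms, each term, and the whole collection; the function and parameter names are hypothetical, and the lecture does not specify how the counts are obtained.

```python
def chi_square_score(n_st, n_s, n_t, n_total):
    """Chi-square association between terms s and t.

    n_st:    pages containing both s and t
    n_s:     pages containing s
    n_t:     pages containing t
    n_total: pages in the collection
    """
    a = n_st                        # s and t
    b = n_s - n_st                  # s without t
    c = n_t - n_st                  # t without s
    d = n_total - n_s - n_t + n_st  # neither
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

associated = chi_square_score(n_st=10, n_s=20, n_t=20, n_total=100)
independent = chi_square_score(n_st=4, n_s=20, n_t=20, n_total=100)
```

When the two terms co-occur exactly as often as chance predicts (4 of 100 pages here), the score is zero; stronger co-occurrence gives a larger score.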
