
Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval

Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/11/24


Presentation Transcript


  1. Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval • Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/11/24 • References: • Wen-Hsiang Lu (2003). Term Translation Extraction Using Web Mining Techniques. PhD thesis, Department of Computer Science and Information Engineering, National Chiao Tung University. Advisors: Lee-Feng Chien and Hsi-Jian Lee.

  2. Outline • Background & Research Problems • Anchor Text Mining for Term Translation Extraction • Transitive Translation for Multilingual Translation • Web Mining for Cross-Language Information Retrieval and Web Search Applications

  3. Part I Background & Research Problems

  4. Motivation • Demands on multilingual translation lexicons • Machine translation (MT) • Cross-language information retrieval (CLIR) • Information exchange in electronic commerce (EC) • Web mining • Explore multilingual and wide-scoped hypertext resources on the Web

  5. Research Problems • Difficulties in automatic construction of multilingual translation lexicons • Techniques: Parallel/comparable corpora • Bottlenecks: Lacking diverse/multilingual resources • Difficulties in query translation for cross-language information retrieval (CLIR) [Fig1] • Techniques: Bilingual dictionary/machine translation/ parallel corpora • Bottlenecks: Multiple-senses/short/diverse/unknown query [Fig2]

  6. Cross-Language Information Retrieval • Query in source language and retrieve relevant documents in target languages • [Figure: Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents] • Example: source query "Hussein" → 海珊/侯賽因/哈珊/胡笙 (Traditional Chinese, TC); 侯赛因/海珊/哈珊 (Simplified Chinese, SC)

  7. Difficulties in Query Translation using Machine Translation Systems • English source query: National Palace Museum • Chinese translation produced by MT: 全國宮殿博物館 (a word-by-word rendering; the museum's actual Chinese name is 國立故宮博物院)

  8. Research Paradigm • [Figure: new approach. Web mining of the Internet (anchor-text mining and search-result mining) feeds term-translation extraction, which builds a live translation lexicon. Applications: cross-language information retrieval and cross-language Web search.]

  9. Multilingual Anchor Texts & Hyperlink Structure

  10. Language-Mixed Texts in Search Result Pages

  11. Research Results • Anchor text mining for term translation extraction • ACM SIGIR’01(poster), IEEE ICDM’01, ACM Trans. on Asian Language Information Processing 2002 • Reviewers’ encouraging comments • “… the approach seems to be quite novel. To my knowledge, there has not been a proposal of uses of anchor texts like this one.” • Transitive translation for multilingual translation • COLING’02, ACM Trans. on Information Systems (first paper from Taiwan since 1986), ACL’04 • Reviewers’ encouraging comments • “This is a nicely written, technically sound paper that pursues a clever and original idea …” • “… the idea of using anchor texts from the Web to learn cross-lingual information retrieval algorithms is very good …” • “I enjoyed the paper and thought the underlying work was interesting and valuable …”

  12. Research Results (cont.) • Web mining for cross-language Web search • ROCLING’03, ACM SIGIR’04 • Improve precision rate from 0.207 (dictionary-based) to 0.241 on NTCIR-2 Chinese-English CLIR evaluation task • Reviewers’ encouraging comments • “It gives us insight into the value of the Web as a dynamic information source. Although the experiments are restricted to Chinese-English documents, also developers for other languages may find this work stimulating.” • “The idea is interesting, and is relatively new. It may give inspiration to other researchers working in the same area.” • LiveTrans: Experimental CLWS system [LiveTrans]

  13. LiveTrans: Cross-Language Web Search System • http://livetrans.iis.sinica.edu.tw/lt.html [LiveTrans] • Mirror: http://wmmks.csie.ncku.edu.tw/lt.html [LiveTrans] • System functions • Query-translation suggestion • Retrieval of Web pages and images • Multilingual search: English, traditional Chinese, simplified Chinese, Japanese or Korean • Gloss translation for retrieved page titles • Fusion of retrieval results

  14. Research Results (cont.) • Summary of contributions • Present an innovative approach • Significantly reduce the difficulty of unknown-term translation. • CLIR can be improved especially for short queries. • Develop a practical cross-language Web search engine • Without relying on translation dictionary • A live dictionary with a significant number of multilingual term translations obtained. • Present a new problem for further investigation in Web Mining

  15. Related Research • Automatic extraction of multilingual translations • Statistical translation model (Brown 1993) • Parallel corpus (Melamed 2000; Wu & Chang 2003) • Non-parallel/comparable corpus (Fung 1998; Rapp 1999) • Web mining • Parallel corpus collection (Nie 1999; Resnik 1999) • Comparable corpus collection: Anchor texts and search-result pages (Lu et al. 2002, 2003) • Strength: Huge amounts of Web data with link structure

  16. Related Research (cont.) • Query translation for cross-language information retrieval • Dictionary-/MT-based approach (Ballesteros & Croft 1997; Hull & Grefenstette 1996) • Corpus-based approach (Dumais 1997; Nie 1999) • Combined approach (Chen & Bian 1999; Kwok 2001) • Improving techniques • Query expansion and phrase translation (Ballesteros & Croft 1997) • Translation disambiguation (Ballesteros & Croft 1998; Chen & Bian 1999) • Proper name transliteration (Chen et al. 1998; Lin & Chang 2003) • Probabilistic retrieval/language models (Hiemstra & de Jong 1999; Lavrenko 2002) • Unknown query translation (Lu et al. 2002, 2003)

  17. Related Research (cont.) • Cross-language Web search (CLWS) • Practical CLWS services have not lived up to expectations • Keizai (Ogden et al. 1999): English query/Japanese, Korean Web news • MTIR (Bian & Chen 1999): Chinese query/English pages/translation • MuST: Multilingual Summarization and Translation (Hovy & Lin 1998) • English/Indonesian/Spanish/Arabic/Japanese, Web news summarization or translation • TITAN (Hayashi et al. 1997): English-Japanese retrieval/translated page titles • Challenges • Web queries are often • Short: 2-3 words (Silverstein et al. 1998) • Diverse: wide-scoped topics • Unknown (out of vocabulary): 74% are not found in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries • E.g. • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein) • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infections)

  18. Part II: Anchor Text Mining for Term Translation Extraction

  19. Anchor-Text Set • Anchor text (link text) • The descriptive text of a link on a Web page • Anchor-text set • A set of anchor texts pointing to the same page (URL) • Multilingual translations • Yahoo/雅虎/야후 • America/美国/アメリカ • Anchor-text-set corpus • A collection of anchor-text sets • [Figure: anchor texts on pages in the USA, Korea, Japan, Taiwan and China all pointing to http://www.yahoo.com: Yahoo Search Engine, Yahoo! America, 야후-USA, アメリカのYahoo!, 美国雅虎, 雅虎搜尋引擎]
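The definition of an anchor-text set can be sketched in a few lines of Python. This is only an illustration: the function name `build_anchor_text_sets`, the `(anchor_text, target_url)` pair format, and the `http://www.yahoo.com.tw` entry are assumptions made for the example, not part of the lecture.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to.

    `links` is an iterable of (anchor_text, target_url) pairs; this input
    format is an assumption made for the sketch.
    """
    sets_by_url = defaultdict(set)
    for anchor_text, url in links:
        sets_by_url[url].add(anchor_text.strip())
    return dict(sets_by_url)


# Hypothetical crawl output, reusing the anchor texts shown on the slide.
links = [
    ("Yahoo! America", "http://www.yahoo.com"),
    ("美国雅虎", "http://www.yahoo.com"),
    ("アメリカのYahoo!", "http://www.yahoo.com"),
    ("雅虎搜尋引擎", "http://www.yahoo.com"),
    ("Yahoo Taiwan", "http://www.yahoo.com.tw"),
]

# The anchor-text-set corpus: one set of anchor texts per target URL.
corpus = build_anchor_text_sets(links)
```

Each value in `corpus` is one anchor-text set; texts in different languages that land in the same set are the candidates for mutual translation.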

  20. Processing of Term Translation Extraction • Collect Web pages and build up the anchor-text-set corpus • Extract key terms as translation candidates • Compute similarity using the probabilistic inference model • [Figure: Internet → Web Spider → Web Pages → Anchor-Text Extraction → Anchor-Text-Set Corpus → Term Extraction → Term Similarity Estimation → Translation Lexicon; term translation extraction maps a source query term to a target translation]

  21. Example for Term Translation Extraction • s: source query term (Yahoo) • t: target translations (雅虎) • [Figure: Chinese-English anchor-text-set corpus. Set u1: www.yahoo.com (#in-link = 187), with anchor texts containing Yahoo, 雅虎 and 搜尋引擎. Set u2: www.yahoo.com.tw (#in-link = 21), with anchor texts containing Yahoo Taiwan, 台灣 and 雅虎. Term translation extraction uses co-occurrence within a set and page authority.]

  22. Probabilistic Inference Model • Conventional translation model • Asymmetric translation models: [formula on slide] • Symmetric model with link information: [formula on slide] • Components: co-occurrence and page authority
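The formulas on this slide are not reproduced in the transcript, so the following is only a sketch of one way a symmetric, link-weighted similarity could be computed from the two ingredients the slide names, co-occurrence and page authority. The specific estimators are assumptions: page authority is taken as a page's share of all in-links, the per-page probabilities are fractions of anchor texts containing a term, and s and t are treated as independent within a page. The in-link counts 187 and 21 come from the example slide; the anchor texts are illustrative.

```python
def page_authority(in_links):
    """Normalize in-link counts into a distribution P(u) over pages (assumed estimator)."""
    total = sum(in_links.values())
    return {url: count / total for url, count in in_links.items()}


def symmetric_similarity(s, t, anchor_sets, in_links):
    """Estimate P(s and t) / P(s or t), weighting each page by its authority."""
    p_u = page_authority(in_links)
    numerator = 0.0
    denominator = 0.0
    for url, texts in anchor_sets.items():
        if not texts:
            continue
        # Fraction of this page's anchor texts that contain each term.
        p_s = sum(s in text for text in texts) / len(texts)
        p_t = sum(t in text for text in texts) / len(texts)
        joint = p_s * p_t  # independence within a page is an assumption
        numerator += joint * p_u[url]
        denominator += (p_s + p_t - joint) * p_u[url]
    return numerator / denominator if denominator else 0.0


# Illustrative anchor texts for the two pages in the example slide.
anchor_sets = {
    "www.yahoo.com": ["Yahoo", "雅虎", "Yahoo 搜尋引擎", "雅虎 - in USA"],
    "www.yahoo.com.tw": ["Yahoo Taiwan", "台灣 雅虎"],
}
in_links = {"www.yahoo.com": 187, "www.yahoo.com.tw": 21}

sim_yahoo = symmetric_similarity("Yahoo", "雅虎", anchor_sets, in_links)
sim_taiwan = symmetric_similarity("Yahoo", "台灣", anchor_sets, in_links)
```

Under these assumptions 雅虎 scores higher than 台灣 as a translation of Yahoo, because 台灣 appears only in the anchor texts of the less authoritative page.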

  23. Experimental Environment • Anchor-text-set corpora • 109,416 traditional-Chinese-English sets (from 1,980,816 pages) • 157,786 simplified-Chinese-English sets (from 2,179,171 pages) • Test query set • Query logs: • Dreamer log: 228,566 unique query terms • GAIS log: 114,182 unique query terms • Core terms: 9,709 most popular query terms, frequencies >10 in two logs • Test set: 622 English terms selected from core terms • Average top-n inclusion rate (ATIR)
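The slide names the average top-n inclusion rate (ATIR) without defining it. A minimal sketch, assuming it means the fraction of test terms for which at least one correct translation appears among the top n extracted candidates; the function name and the toy data are hypothetical.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of test terms with a correct translation among the top-n candidates.

    extracted: {term: ranked list of candidate translations}
    gold:      {term: set of acceptable translations}
    """
    if not extracted:
        return 0.0
    hits = 0
    for term, candidates in extracted.items():
        acceptable = gold.get(term, set())
        if any(candidate in acceptable for candidate in candidates[:n]):
            hits += 1
    return hits / len(extracted)


# Toy data for illustration only; these are not the lecture's test results.
extracted = {
    "sakura": ["台灣櫻花", "櫻花", "蜘蛛網"],
    "yahoo": ["雅虎"],
    "cisco": ["系統", "思科"],
}
gold = {"sakura": {"櫻花"}, "yahoo": {"雅虎"}, "cisco": {"思科"}}

top1 = top_n_inclusion_rate(extracted, gold, 1)
top2 = top_n_inclusion_rate(extracted, gold, 2)
```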

  24. Performance with Different Estimation Models • Using different models • MA: Asymmetric model • MAL: Asymmetric model with link information • MS: Symmetric model • MSL: Symmetric model with link information • The symmetric inference model with link information helped improve translation accuracy.

  25. Performance with Different Term Extraction Methods and Query-Log-Set Sizes • The query-log-based method achieved better performance. • The medium-sized query-log set achieved the best performance.

  26. Performance Comparison • Example: test term "sakura" • Query-log set (9,709 terms) • Top 5 extracted translations: 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護 • Query-log set (228,566 terms) • Top 10 extracted translations: 庫洛魔法使, 櫻花建設, 模仿, 櫻花大戰, 美夕, 台灣櫻花, 櫻花, 蜘蛛網, 純愛, 螢幕保護 • Test results of 9,709 core terms [TTE9709] • Promising results

  27. Part III: Transitive Translation for Multilingual Translation

  28. Transitive Translation for Multilingual Translation • Problem • Insufficient anchor-text-set corpus for certain language pairs • E.g., Chinese-Japanese, Chinese-French, etc. • Goal • A generalized model for multilingual translation • Idea • Transitive translation model: extract translations via an intermediate (third) language, e.g., English (Borin 2000; Gollins & Sanderson 2001) • Integrate a competitive linking algorithm to reduce interference errors

  29. Transitive Translation: Combining Direct and Indirect Translation • Direct Translation Model • Indirect Translation Model • Transitive Translation Model • s: source term • t: target translation • m: intermediate translation • [Figure: direct translation links s = 新力 (Traditional Chinese) to t = ソニー (Japanese); indirect translation goes from s through m = Sony (English) to t]
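The three models are only named on the slide, so the combination rule below is an assumption rather than the lecture's formula: the indirect score sums, over intermediate terms m, the product of the s-to-m and m-to-t similarities, and the transitive score falls back to the indirect score when the direct score does not exceed a threshold. All function names, the threshold, and the similarity values are hypothetical.

```python
def indirect_score(s, t, sim_source_mid, sim_mid_target):
    """Sum over intermediate terms m of sim(s, m) * sim(m, t)."""
    total = 0.0
    for m, score_sm in sim_source_mid.get(s, {}).items():
        total += score_sm * sim_mid_target.get(m, {}).get(t, 0.0)
    return total


def transitive_score(s, t, direct, sim_source_mid, sim_mid_target, threshold=0.1):
    """Use the direct score when it is reliable enough, else go through m."""
    direct_score = direct.get(s, {}).get(t, 0.0)
    if direct_score > threshold:
        return direct_score
    return indirect_score(s, t, sim_source_mid, sim_mid_target)


# Hypothetical similarity tables for 新力 (Traditional Chinese) → Sony → ソニー.
direct = {"新力": {"ソニー": 0.0}}  # no usable Chinese-Japanese evidence
sim_source_mid = {"新力": {"Sony": 0.6, "Sony Music": 0.1}}
sim_mid_target = {"Sony": {"ソニー": 0.5}, "Sony Music": {"ソニー": 0.2}}

score = transitive_score("新力", "ソニー", direct, sim_source_mid, sim_mid_target)
```

The fallback design reflects the problem stated on the previous slide: for language pairs with too few shared anchor-text sets, the direct estimate is unreliable, so evidence is routed through a well-covered intermediate language such as English.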

  30. Promising Results for Automatic Construction of Multilingual Translation Lexicons

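  31. Indirect Association Problem • Indirect association error (Melamed 2000) • A wrong candidate t1 co-occurs with s more often than the correct translation t does • E.g., 思科 → system (translation error) • [Figure: s = 思科 is linked to t1 = system with weight 0.11 and to t = Cisco with weight 0.07]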

  32. Competitive Linking Algorithm • Concepts of the competitive linking (CL) algorithm (Melamed 2000) • Determine the most likely translation pairs between the source and target sets. • Assumption: each term has only one translation. • Method: • Greedily select the most likely edges. • Select less likely edges only when they do not conflict with previous selections. • Integration of anchor-text mining and the CL algorithm • Build a bipartite graph using our proposed translation model. • Use the extended CL algorithm to filter out indirect association errors.

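  33. Bipartite Graph Construction • Bipartite graph G = (S∪T, E) • [Figure: two construction steps. Step 1 links the source term s = 思科 to its candidate translations t1 = system and t2 = Cisco. Step 2 adds further source-side terms (系統, 資訊, 網路, 電腦; sets St1 and St2) that are linked to t1 and t2.]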

  34. Extended Competitive Linking Algorithm • Pick the k most likely translations for a source term • [Figure: two steps on the bipartite graph with s = 思科, t1 = system, t2 = Cisco and source-side terms 系統, 資訊, 網路, 電腦 (sets St1 and St2); edge weights shown: 0.11, 0.07, 0.23, 0.01, 0.03, 0.004]

  35. Direct_Translation_with_CL (s, U, Vt) • Input: source term s; Web pages of concern U; translation vocabulary set Vt • Output: target translation set R • Flowchart: (1) Construct bipartite graph G = (S∪T, E). (2) Compute edge weights wij and sort them. (3) Choose the edge ei*j* with the highest weight. (4) If si* = s, set R = R ∪ {tj*}; if |R| = k, return R; otherwise remove all edges linking to tj* and re-estimate wij for the remaining edges. (5) If si* ≠ s, remove all edges linking to si* or tj* and re-estimate wij for the remaining edges. (6) If |E| = 0, return R; otherwise go back to step 3.
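The flowchart can be approximated by a short greedy loop. This sketch makes two simplifications that should be read as assumptions: edge weights are given up front as a dictionary and are not re-estimated after each removal, and the assignment of weights to edges other than 思科-system (0.11) and 思科-Cisco (0.07) is invented for illustration.

```python
def direct_translation_with_cl(s, edges, k):
    """Greedy competitive linking for one source term.

    edges: {(source_term, target_term): weight}
    Returns up to k target translations for s.
    """
    remaining = dict(edges)
    result = []
    while remaining and len(result) < k:
        # Choose the edge with the highest weight among those still in play.
        (source, target), _ = max(remaining.items(), key=lambda item: item[1])
        if source == s:
            # Accept the target for s and block the target for everyone else.
            result.append(target)
            remaining = {e: w for e, w in remaining.items() if e[1] != target}
        else:
            # Another source term claims this target; both leave the graph.
            remaining = {
                e: w for e, w in remaining.items()
                if e[0] != source and e[1] != target
            }
    return result


# Illustrative weights; only the two 思科 edges follow the slides.
edges = {
    ("思科", "system"): 0.11,
    ("思科", "Cisco"): 0.07,
    ("系統", "system"): 0.23,
    ("資訊", "system"): 0.01,
    ("網路", "Cisco"): 0.03,
    ("電腦", "Cisco"): 0.004,
}

translations = direct_translation_with_cl("思科", edges, k=1)
```

In this toy graph, 系統 claims "system" first because its edge is the strongest, which removes the indirect association 思科-system and leaves "Cisco" as the translation selected for 思科.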

  36. Performance of Proposed Models with CL Algorithm • Test query set: 258 terms (from 9,709 core terms) • Anchor-text-set corpora • Traditional Chinese-Simplified Chinese: 4,516 sets • Traditional Chinese-English: 109,416 sets • Simplified Chinese-English: 157,786 sets • Source/Target/Intermediate languages: Traditional Chinese/Simplified Chinese/English

  37. Effective Translation Using CL Algorithm

  38. Part IV: Web Mining for Cross-Language Information Retrieval and Web Search Applications

  39. Web Mining for Cross-Language Information Retrieval and Web Search Applications • Goal: Web mining to benefit CLIR and CLWS • Mining query translations from the Web • Idea: Integrated Web mining approach • Anchor-text-mining approach • Probabilistic inference model • Transitive translation model • Search-result-mining approach • Chi-square test • Context-vector analysis
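The chi-square test listed under search-result mining can be illustrated with the standard Pearson statistic for a 2×2 contingency table. How the counts are obtained is an assumption here: the sketch supposes page counts for the source term, the candidate translation, and both together out of a known total. The function and parameter names are hypothetical.

```python
def chi_square_score(n_both, n_source, n_target, n_total):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    n_both:   pages containing both the source term and the candidate
    n_source: pages containing the source term
    n_target: pages containing the candidate translation
    n_total:  total number of pages considered
    """
    a = n_both                                   # both terms
    b = n_source - n_both                        # source term only
    c = n_target - n_both                        # candidate only
    d = n_total - n_source - n_target + n_both   # neither term
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denominator


# Terms that always appear together score high; independent terms score zero.
associated = chi_square_score(n_both=50, n_source=50, n_target=50, n_total=100)
independent = chi_square_score(n_both=25, n_source=50, n_target=50, n_total=100)
```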
