
Ferhan Ture. Dissertation defense, May 24th, 2013.



Presentation Transcript


  1. “Searching to Translate” and “Translating to Search”: When Information Retrieval Meets Machine Translation • Ferhan Ture • Dissertation defense • May 24th, 2013 • Department of Computer Science, University of Maryland at College Park

  2. Motivation • forum posts • multi-lingual text • clustered summaries • user’s native language • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form

  3. Information Retrieval Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores. Example weighted-term vector for this paragraph (stemmed terms): queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58. [The slide also shows the stop-worded, stemmed version of the paragraph itself.]
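The representation-and-scoring scheme just described can be sketched in a few lines; the tf-idf weighting and the tiny stop-word set below are illustrative assumptions, not the exact configuration used in the dissertation.

```python
import math
from collections import Counter

# Illustrative subset; real systems use a larger pre-determined stop-word list.
STOPWORDS = {"the", "an", "my", "a", "of", "in"}

def term_vector(doc_tokens, df, n_docs):
    """Represent a document as a vector of weighted terms (tf-idf here),
    after removing stop words. `df` maps term -> document frequency."""
    tf = Counter(t for t in doc_tokens if t not in STOPWORDS)
    return {t: c * math.log(n_docs / df.get(t, 1)) for t, c in tf.items()}

def score(query_terms, doc_vector):
    """Score each query term independently and aggregate the
    term-document scores."""
    return sum(doc_vector.get(t, 0.0) for t in query_terms)
```

Rare terms receive higher weights than common ones, which is why the stemmed-term example above assigns its largest weights to distinctive terms like "queri" and "ir".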

  4. Cross-Language Information Retrieval [The slide shows the German Wikipedia definition of IR as an example of content in another language; translated:] Information retrieval (IR), occasionally and imprecisely called information acquisition, is a field concerned with computer-supported search for complex content (i.e., not single words), spanning information science, computer science, and computational linguistics. As the meaning of "retrieval" suggests, complex text or image data stored in large databases are not initially accessible or retrievable by outsiders. Information retrieval is about finding existing information, not discovering new structures (as in knowledge discovery in databases, which includes data mining and text mining). [The slide also shows two columns of per-term statistics for this document: document frequencies (89,933; 2,345; 221,932; …) and term weights (3.4; 2.9; 2.7; …).]

  5. Machine Translation Machine translation (MT) is the task of translating text written in a source language into corresponding text in a target language. [The slide shows a machine-produced German translation of that sentence: "Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen."]

  6. Motivation Cross-language IR • multi-lingual text • user’s native language • MT • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form

  7. Outline (Ture et al., SIGIR’11) (Ture and Lin, NAACL’12) (Ture et al., SIGIR’12), (Ture et al., COLING’12) (Ture and Lin, SIGIR’13) • Introduction • Searching to Translate (IRMT) • Cross-Lingual Pairwise Document Similarity • Extracting Parallel Text From Comparable Corpora • Translating to Search (MTIR) • Context-Sensitive Query Translation • Conclusions

  8. Extracting Parallel Text from the Web • Phase 1: source collection F → preprocess → doc vectors (F) → signature generation → signatures (F); target collection E → preprocess → doc vectors (E) → signature generation → signatures (E); sliding window algorithm → cross-lingual document pairs • Phase 2: candidate generation → candidate sentence pairs → 2-step parallel text classifier → aligned bilingual sentence pairs (F–E parallel text)

  9. Pairwise Similarity • Pairwise similarity: finding similar pairs of documents in a large collection • Challenges • quadratic search space • measuring similarity effectively and efficiently • Focus on recall and scalability

  10. Locality-Sensitive Hashing Ne English articles → preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → signature generation → Ne signatures (e.g., [0111000010...]) → sliding window algorithm → similar article pairs

  11. Locality-Sensitive Hashing (Ravichandran et al., 2005) • LSH(vector) = signature • faster similarity computation s.t. similarity(vector pair) ≈ similarity(signature pair) e.g., • ~20 times faster than computing (cosine) similarity from vectors • similarity error ≈ 0.03 • Sliding window algorithm • approximate similarity search based on LSH • linear run-time
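A minimal sketch of the signature idea, using random-hyperplane LSH (the standard construction for cosine similarity; the specific choices below are assumptions): each hyperplane contributes one signature bit, and the Hamming distance between two signatures estimates the angle, hence the cosine, between the original vectors.

```python
import math
import random

def make_hyperplanes(vocab, n_bits, seed=0):
    """One random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [{t: rng.gauss(0.0, 1.0) for t in vocab} for _ in range(n_bits)]

def lsh_signature(vector, hyperplanes):
    """Bit i is 1 iff the vector lies on the positive side of hyperplane i."""
    return [1 if sum(w * h.get(t, 0.0) for t, w in vector.items()) > 0 else 0
            for h in hyperplanes]

def estimated_cosine(sig_a, sig_b):
    """similarity(signature pair) ~ similarity(vector pair):
    cos(pi * Hamming/bits) approximates the cosine of the vectors."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / len(sig_a))
```

Comparing bit vectors is what makes this roughly 20 times faster than computing cosine similarity from the term vectors, at the cost of a small similarity error.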

  12. Sliding window algorithm: Generating tables • Map: apply each of Q random bit permutations p1, …, pQ to every signature (e.g., signatures 1→11011011101, 2→01110000101, 3→10101010000 become permuted lists list1, …, listQ) • Reduce: sort each permuted list into a table (table1, …, tableQ) [the slide shows the sorted example tables]

  13. Sliding window algorithm: Detecting similar pairs • Map: within each sorted table (table1, …, tableQ), slide a fixed-size window over the sorted signatures and compare only signatures that fall within the same window [the slide steps through a sorted column of example signatures]

  14. Sliding window algorithm: Example (# tables = 2, window size = 2, # bits = 11) • Signatures: 1→11011011101, 2→01110000101, 3→10101010000 • Permute with p1 (Map) and sort into table1 (Reduce); likewise p2 into table2 • table1 (sorted order 3, 2, 1): Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓ • table2 (sorted order 2, 3, 1): Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
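The table-generation and window steps above can be sketched as one function; details such as the permutation mechanism and the distance threshold are assumptions, only the permute/sort/window structure comes from the slides.

```python
import random

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def sliding_window_pairs(signatures, n_tables, window, max_dist, seed=0):
    """For each of Q tables: permute every signature's bits, sort the
    permuted signatures, and compare each entry only with the next
    (window - 1) entries; keep pairs within the distance threshold."""
    rng = random.Random(seed)
    n_bits = len(next(iter(signatures.values())))
    found = set()
    for _ in range(n_tables):
        perm = rng.sample(range(n_bits), n_bits)   # a random bit permutation
        table = sorted(
            (tuple(sig[i] for i in perm), doc_id)
            for doc_id, sig in signatures.items()
        )
        for i, (_, a) in enumerate(table):
            for _, b in table[i + 1 : i + window]:
                if hamming(signatures[a], signatures[b]) <= max_dist:
                    found.add(tuple(sorted((a, b))))
    return found
```

Because each table costs one sort plus a linear window scan, the run time is close to linear in the number of signatures, rather than quadratic in the number of pairs.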

  15. MT vs. CLIR for Cross-lingual Pairwise Similarity • MT approach: translate German Doc A into English (MT), build English doc vector vA, compare with English Doc B's vector vB • CLIR approach: translate Doc A's German doc vector directly into an English doc vector vA (CLIR), then compare with vB

  16. MT vs. CLIR for Pairwise Similarity • [Figure: similarity distributions for clir-neg, clir-pos, mt-neg, mt-pos] • positive and negative pairs are clearly separated, despite low similarity values • MT slightly better than CLIR, but 600 times slower!

  17. Locality-Sensitive Hashing for Pairwise Similarity Ne English articles → preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → signature generation → Ne signatures (e.g., [0111000010...]) → sliding window algorithm → similar article pairs

  18. Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity Nf German articles → CLIR translate → combined with Ne English articles → preprocess → Ne+Nf English document vectors → signature generation → signatures → sliding window algorithm → similar article pairs

  19. Evaluation • Parameters: # bits (D) = 1000; # tables (Q) = 100–1500; window size (B) = 100–2000 • Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia • Collection: 3.44m En + 1.47m De Wikipedia articles • Task: for each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}

  20. Scalability

  21. Evaluation • Algorithm output: document vectors → signature generation → signatures → sliding window algorithm → similar article pairs • Two sources of error relative to ground truth: signature generation and the sliding window approximation • Ground truth: brute-force approach on document vectors → similar article pairs • Upper bound: brute-force approach on signatures → similar article pairs

  22. Evaluation • 100% recall → no savings = no free lunch! • 99% recall at 70% cost • 99% recall at 62% cost • 95% recall at 40% cost • 95% recall at 39% cost

  23. Outline (Ture et al., SIGIR’11) (Ture and Lin, NAACL’12) (Ture et al., SIGIR’12), (Ture et al., COLING’12) (Ture and Lin, SIGIR’13) • Introduction • Searching to Translate (IRMT) • Cross-Lingual Pairwise Document Similarity • Extracting Parallel Text From Comparable Corpora • Translating to Search (MTIR) • Context-Sensitive Query Translation • Conclusions

  24. Phase 2: Extracting Parallel Text • Approach • Generate candidate sentence pairs from each document pair • Classify each candidate as ‘parallel’ or ‘not parallel’ • Challenge: 10s of millions of doc pairs ≈ 100s of billions of sentence pairs • Solution: 2-step classification approach • a simple classifier efficiently filters out irrelevant pairs • a complex classifier effectively classifies remaining pairs
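The 2-step idea can be sketched generically; the classifier internals and threshold below are placeholders, only the cascade structure comes from the slide.

```python
def two_step_classify(candidate_pairs, simple_score, complex_decide, threshold):
    """Step 1: a cheap scorer filters out clearly irrelevant pairs.
    Step 2: an expensive classifier decides on the survivors only."""
    survivors = [p for p in candidate_pairs if simple_score(p) >= threshold]
    return [p for p in survivors if complex_decide(p)]
```

The cascade pays the cost of the complex classifier only on the small fraction of pairs the simple one lets through, which is what makes hundreds of billions of candidates tractable.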

  25. Parallel Text (Bitext) Classifier • cosine similarity of the two sentences • sentence length ratio: the ratio of the lengths of the two sentences • word translation ratio: the ratio of words in the source (target) sentence with a translation in the target (source) sentence
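The three features can be computed as below; the tokenization, sentence-vector weighting, and bilingual dictionary are assumed inputs, but the feature definitions follow the slide (the word translation ratio is shown in the source→target direction only).

```python
import math

def bitext_features(src_tokens, tgt_tokens, src_vec, tgt_vec, bilingual_dict):
    """Cosine similarity of the two sentence vectors, sentence length
    ratio, and word translation ratio (fraction of source words with at
    least one dictionary translation appearing in the target sentence)."""
    dot = sum(w * tgt_vec.get(t, 0.0) for t, w in src_vec.items())
    norm = (math.sqrt(sum(w * w for w in src_vec.values()))
            * math.sqrt(sum(w * w for w in tgt_vec.values())))
    tgt_set = set(tgt_tokens)
    translated = sum(1 for w in src_tokens
                     if bilingual_dict.get(w, set()) & tgt_set)
    return {
        "cosine": dot / norm if norm else 0.0,
        "length_ratio": len(src_tokens) / len(tgt_tokens),
        "translation_ratio": translated / len(src_tokens),
    }
```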

  26. Bitext Extraction Algorithm • Candidate generation (2.4 hours): Map — sentence detection + tf-idf on source and target documents, producing sentences and sentence vectors; Reduce — Cartesian product over the cross-lingual document pairs • shuffle & sort: 1.3 hours • simple classification (4.1 hours) → bitext S1 • complex classification (0.5 hours) → bitext S2 • Sentence-pair counts shown on the slide: 400 billion, 214 billion, 132 billion

  27. Extracting Bitext from Wikipedia

  28. Evaluation on MT

  29. Evaluation on MT

  30. Conclusions (Part I) • Summary • Scalable approach to extract parallel text from a comparable corpus • Improvements over state-of-the-art MT baseline • General algorithm applicable to any data format • Future work • Domain adaptation • Experimenting with larger web collections

  31. Outline (Ture et al., SIGIR’11) (Ture and Lin, NAACL’12) (Ture et al., SIGIR’12), (Ture et al., COLING’12) (Ture and Lin, SIGIR’13) • Introduction • Searching to Translate (IRMT) • Cross-Lingual Pairwise Document Similarity • Extracting Parallel Text From Comparable Corpora • Translating to Search (MTIR) • Context-Sensitive Query Translation • Conclusions

  32. Cross-Language Information Retrieval • Information Retrieval (IR): given an information need (a query), find relevant material (ranked documents) • Cross-language IR (CLIR): query and documents in different languages • “Why does China want to import technology to build Maglev Railway?” → relevant information in Chinese documents • “Maternal Leave in Europe” → relevant information in French, Spanish, German, etc.

  33. Machine Translation for CLIR • Statistical MT system: sentence-aligned parallel corpus → token aligner → token alignments → grammar extractor → translation grammar • The token aligner also yields token translation probabilities • Query “maternal leave in Europe” → decoder (with translation grammar and language model) → 1-best translation “congé de maternité en Europe”, or n-best translations

  34. Token-based CLIR • Aligned sentence pairs, e.g.: “… most leave their children in …” ↔ “… la plupart laisse leurs enfants …”; “… aim of extending maternity leave to …” ↔ “… l’objectif de l’extension des congé de maternité à …” • From these alignments: token-based probabilities and the token translation formula [shown on the slide]

  35. Token-based CLIR Maternal leave in Europe laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% …

  36. Document Retrieval • Query q1: “maternal leave in Europe” • Documents d1, … with a query term translated as [maternité: 0.74, maternel: 0.26] • How to score a document, given a query? Using tf(maternité), tf(maternel), df(maternité), df(maternel), …
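One way to make this concrete is in the style of probabilistic structured queries (Darwish and Oard, 2003), which this line of work builds on: the tf and df of an English query term against a French collection are probability-weighted combinations of the statistics of its translations. A sketch (retrieval-model details such as BM25 omitted; the combination step is the core idea):

```python
def projected_tf(query_term, prob, tf):
    """tf of an English query term in a French document: the
    probability-weighted sum over its French translations
    (slide example: a term translated as {maternité: 0.74, maternel: 0.26})."""
    return sum(p * tf.get(f, 0) for f, p in prob[query_term].items())

def projected_df(query_term, prob, df):
    """df combined the same way, over the whole collection."""
    return sum(p * df.get(f, 0) for f, p in prob[query_term].items())
```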

  37. Token-based CLIR Maternal leave in Europe laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% …

  38. Token-based CLIR Maternal leave in Europe laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% … laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% …

  39. Context-Sensitive CLIR Maternal leave in Europe 12% 70% 6% 5% laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% … laisser (Eng. forget) 49% congé (Eng. time off)  17% quitter (Eng. quit)  9% partir (Eng. disappear) 7% … This talk: MT for context-sensitive CLIR

  40. Previous approach: MT as a black box (token-based CLIR). Our approach: looking inside the box • Statistical MT system: sentence-aligned parallel corpus → token aligner → token alignments (→ token translation probabilities) → grammar extractor → translation grammar • Query “maternal leave in Europe” → decoder (with language model) → 1-best translation “congé de maternité en Europe” and n-best derivations

  41. MT for Context-Sensitive CLIR • sentence-aligned parallel corpus MT token aligner token translation probabilities token alignments grammar extractor query “maternal leave in Europe” translation grammar decoder language model 1-best translation “congé de maternité en Europe” n best translations

  42. CLIR from translation grammar • Synchronous Context-Free Grammar (SCFG) [Chiang, 2007], e.g.: S → [X1 : X1], 1.0; X → [X1 leave in europe : congé de X1 en europe], 0.9; X → [maternal : maternité], 0.9; X → [X1 leave : congé de X1], 0.74; X → [leave : congé], 0.17; X → [leave : laisser], 0.49; … • Synchronous hierarchical derivation: “maternal” / “X2 leave in Europe” ↔ “congé de X2 en Europe” / “maternité” • From the grammar rules: grammar-based probabilities and the token translation formula [shown on the slide]

  43. MT for Context-Sensitive CLIR • sentence-aligned parallel corpus MT token aligner token translation probabilities token alignments grammar extractor query “maternal leave in Europe” translation grammar decoder language model 1-best translation “congé de maternité en Europe” n best translations

  44. MT for Context-Sensitive CLIR • sentence-aligned parallel corpus MT token aligner token translation probabilities token alignments grammar extractor query “maternal leave in Europe” translation grammar decoder language model 1-best translation “congé de maternité en Europe” n best translations

  45. CLIR from n-best derivations • t(1): {derivation, 0.8}; t(2): {derivation, 0.11}; … ; t(k): {kth best derivation, score(t(k)|s)} • Example derivations cover “maternal leave” / “in Europe” ↔ “congé de maternité” / “en Europe” • From the n-best list: translation-based probabilities and the token translation formula [shown on the slide]

  46. MT for Context-Sensitive CLIR • From the MT pipeline (sentence-aligned bitext → token alignments → translation grammar → n-best derivations → 1-best translation): • Prtoken: token based, ambiguity preserved • PrSCFG: grammar based • Prnbest: translation based, context sensitive

  47. Combining Evidence • For best results, we compute an interpolated probability distribution, e.g. for “leave”: • PrSCFG: laisser 0.72, congé 0.10, quitter 0.09, … (weight 35%) • Prnbest: laisser 0.14, congé 0.70, quitter 0.06, … (weight 40%) • Prtoken: laisser 0.09, congé 0.90, quitter 0.11, … (weight 25%) • Printerp: laisser 0.33, congé 0.54, quitter 0.08, …
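The interpolation is a straightforward weighted mixture; the sketch below reproduces the slide's "leave" example, with the weight-to-distribution assignment (35% grammar-based, 40% n-best, 25% token-based) reconstructed to match the interpolated values shown.

```python
def interpolate(dists, weights):
    """Weighted mixture of several translation probability
    distributions for the same source token."""
    combined = {}
    for dist, w in zip(dists, weights):
        for token, p in dist.items():
            combined[token] = combined.get(token, 0.0) + w * p
    return combined
```

With all the weight on a single distribution (e.g., 100% on PrSCFG), the mixture reduces to that distribution alone.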

  48. Combining Evidence • With all the weight on one distribution, the interpolation reduces to it; e.g. for “leave”: • PrSCFG: laisser 0.72, congé 0.10, quitter 0.09, … (weight 100%) • Prnbest: laisser 0.14, congé 0.70, quitter 0.06, … (weight 0%) • Prtoken: laisser 0.09, congé 0.90, quitter 0.11, … (weight 0%) • Printerp: laisser 0.72, congé 0.10, quitter 0.09, …

  49. Combining Evidence • For best results, we compute an interpolated probability distribution [interpolation formula shown on the slide]
