Statistical Machine Translation Models for Personalized Search

Statistical Machine Translation Models for Personalized Search Rohini U AOL India R&D, Bangalore India Rohini.uppuluri@corp.aol.com Vamshi Ambati Language Technologies Institute Carnegie Mellon University Pittsburgh, USA vamshi@cs.cmu.edu Vasudeva Varma, SIEL, LTRC, IIIT Hyderabad, India vv@iiit.ac.in

Agenda • Introduction • Related Work • Background • User Profile as Translation Model • Personalized Search • Learning User Profile • Re-ranking • Experiments • Conclusions and Future Work

Introduction • Current Web Search engines • Provide users with documents “relevant” to their information need • Issues • Information overload • To cater Hundreds of millions of users • Terabytes of data • Poor description of Information need • Short queries - Difficult to understand • Word ambiguities • Users only see top few results • Relevance • subjective – depends on the user One size Fits all ???

Continued.. • Search is not a solved problem! • Poorly described information need • Java – (Java island / Java programming language ) • Jaguar – (cat /car) • Lemur – (animal / lemur tool kit) • SBH – (State bank of Hyderabad/Syracuse Behavioral Health care) • Given prior information • I am into biology – best guess for Jaguar? • past queries - { information retrieval, language modeling } – best guess for lemur?

Review of Personalized Search Personalized Search Query logs Machine learning Language modeling Community based Others

Statistical Language Modeling based Approaches: Introduction • Statistical language modeling : task of estimating probability distribution that captures statistical regularities of natural language • Applied to a number of problems – Speech, Machine Translation, IR, Summarization

Statistical Language Modeling based Approaches: Background Lemur Query Formulation Model Query Given a query, which is most likely to be the Ideal Document? P(Q/D) = P(q1….qn/D) = ΠP(qi/D) User Information need Ideal Document In spite of the progress, not much work to capture, model and integrate user context !

Noisy Channel based approach Motivation Query Generation Process (Noisy Channel) Ideal Document Retrieval Query Generation Process (Noisy Channel)

Similar to Statistical Machine Translation • Given an english sentence translate into french • Given a query, retrieve documents closer to ideal document Noisy channel 1 English Sentence French Sentence P(e/f) Noisy Channel 2 Ideal Document Query P(q/w)

Learning user profile • User profile: Translation Model Triples : (qw,dw,p(qw/dw)) • Use Statistical Machine Translation methods • Learning user profile training a translation model • In SMT: Training a translation model • From Parallel texts • Using EM algorithm

Learning User profile • Extracting Parallel Texts • From Queries and corresponding snippets from clicked documents • Training a Translation Model • GIZA++ - an open source tool kit widely used for training translation models in Statistical Machine Translation research.

Sample user profile

Reranking • Recall, in general LM for IR • Noisy Channel based approach P(Q/D) = Π P(qi/D) lemur P(lemur/retrieval) Lemur encyclopedia … brief … Lemur toolkit … information retireval … Lemur - Encyclopedia gives a brief description of the physical traits of this animal. The Lemur toolkit for language modeling and information retrieval is documented and made available for download. D1 : D4:

Experiments • Performed evaluation on explicit feedback data collected from 7 users • Experiments • Comparison with Contextless Ranking • Comparison between different training models and contexts

Data and Set up • Data • Explicit Feedback data collected from 7 users • For each query, each user examined top 10 documents and identified top 10 documents • Collected the top 10 results for all queries. Total documents 3469 documents • Set up • 3469 documents - created lucene index. • For reranking, first retrieve the results using lucene and then rerank them using the noisy channel approach. • We perform 10 fold cross validation

Data

Metrics • Precision@n • Number of documents relevant / n

Set up User Profile Learner Train Data User Profiles Data Test Data Reranker Reranked Results

Results

Results I - Document Training and Document Testing II - Document Training and Snippet Testing III - Snippet Training and Document Testing IV - Snippet Training and Snippet Testing

Conclusions and Future Work • Proposed a stat MT based approach for modeling user model • Captures Richer context, relations between q and w. • In future, • N-gram based method : trigrams etc • Noisy Channel based method : bigram

Questions?

Thank you

References • Adam Berger and John D. Lafferty. 1999. Information retrieval as statistical translation. In Research and Development in Information Retrieval, pages 222–229. • Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263–311. • W. Bruce Croft, Stephen Cronen-Townsend, and Victor Larvrenko. 2001. Relevance feedback and personalization: • A language modeling perspective. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries. • Jamie Allan et. al. 2003. Challenges in information retrieval language modeling. In SIGIR Forum, volume 37 Number 1. • K. Sugiyama K. Hatano and M. Yoshikawa. 2004. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of WWW 2004, page 675 684. • Victor Lavrenko and W. Bruce Croft. 2001. Relevance-based language models. In Research and Development in Information Retrieval, pages 120–127. • F. Liu, C. Yu, and W. Meng. 2002. Personalized web search by mapping user queries to categories. In Proceedings of the eleventh international conference on Information and knowledge management, ACM Press, pages 558–565. • Tom Mitchell. 1997. Machine Learning. McGrawHill.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51. • Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, pages 275–281. • A. Pretschner and S. Gauch. 1999. Ontology based personalized search. In ICTAI., pages 391–398. • J. J. Rocchio. 1971. Relevance feedback in information retrieval, the smart retrieval system. Experiments in Automatic Document Processing, pages 313–323. • G. Salton and C. Buckley. 1990. Improving retrieval performance by relevance feedback. Journal of the American Society of Information Science, 41:288–297. • Xuehua Shen, Bin Tan, and Chengxiang Zhai. 2005. Implicit user modeling for personalized search. In Proceedings of CIKM 2005. • F. Song and W. B. Croft. 1999. A general language model for information retrieval. In Proceedings on the 22nd annual international ACM SIGIR conference, page 279280. • Micro Speretta and Susan Gauch. 2004. Personalizing search based on user search histories. In Thirteenth International Conference on Information and Knowledge Management (CIKM 2004). • Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of ACM SIGIR’01, pages 334–342.

Statistical Machine Translation Models for Personalized Search

Statistical Machine Translation Models for Personalized Search

Presentation Transcript

Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation System

Morphological Preprocessing for Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation

Non-deterministic Tree Automata Models for Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation Part IV – Log-Linear Models

Statistical Machine Translation

Morphological Processing for Statistical Machine Translation

Statistical Machine Translation Part IV – Log-Linear Models

Statistical Machine Translation

Statistical Machine Translation

Cluster Computing for Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation

Statistical Machine Translation

Machine Translation, Statistical Approach

Statistical Machine Translation Part IV – Log-Linear Models

Statistical Machine Translation