
Personalized Web Search using Clickthrough History






Presentation Transcript


  1. Personalized Web Search using Clickthrough History U. Rohini 200407019 rohini@research.iiit.ac.in Language Technologies Research Center (LTRC) International Institute of Information Technology (IIIT) Hyderabad, India

  2. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web Search • Personalized Search using user Relevance Feedback: Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Personalized Search using user Relevance Feedback: Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback: Simple Statistical Language modeling based method • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  3. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web Search • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  4. Introduction • Current web search engines • Provide users with documents “relevant” to their information need • Issues • Information overload • Must cater to hundreds of millions of users • Terabytes of data • Poor description of the information need • Short queries – difficult to understand • Word ambiguities • Users only see the top few results • Relevance is subjective – depends on the user • One size fits all???

  5. Motivation • Search is not a solved problem! • Poorly described information need • Java – (Java island / Java programming language) • Jaguar – (cat / car) • Lemur – (animal / Lemur toolkit) • SBH – (State Bank of Hyderabad / Syracuse Behavioral Healthcare) • Given prior information • I am into biology – best guess for Jaguar? • Past queries – { information retrieval, language modeling } – best guess for Lemur?

  6. Background • Prior Information – user feedback

  7. Problem Description • Personalized Search • Customize search results according to each individual user • Personalized Search - Issues • What to use to Personalize? • How to Personalize? • When not to Personalize? • How to know Personalization helped?

  8. Problem Statement • Problem: How to personalize? • Our direction: • Use past search history • Long-term learning • Broken down into two sub-problems • How to model and represent past search contexts • How to use them to improve search results

  9. Solution Outline 1. How to model and represent past search contexts (User Profile Learning) • Past search history from the user over a period of time – query logs • User contexts – triples: {user, query, {relevant documents}} • Apply an appropriate method, learn from the user contexts, build a model – the user profile 2. How to use it to improve search results (Reranking) • Get the initial search results • Take the top few documents, re-score them using the user profile and sort again

  10. Contributions • I Search : A suite of approaches for Personalized Web Search • Proposed Personalized search approaches • Baseline • Basic Retrieval methods • Automatic Evaluation • Analysis of Query Log • Creating Simulated Feedback

  11. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web Search • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  12. Review of Personalized Search • Approaches to personalized search: Query logs • Machine learning • Language modeling • Community based • Others

  13. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web Search • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  14. I Search : A suite of approaches for Personalized Search • Suite of Approaches • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel Model based method • Machine learning based approach • Ranking SVM based method • Personalization without relevance feedback • Simple N-gram based method

  15. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple Language model based method • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  16. Statistical Language Modeling based Approaches: Introduction • Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language • Applied to a number of problems – speech, machine translation, IR, summarization

  17. Statistical Language Modeling based Approaches: Background • Diagram: user information need → ideal document → query formulation model → query (example query: “lemur”) • Given a query, which document is most likely to be the ideal document? • In spite of the progress, not much work to capture, model and integrate user context!

  18. Motivation for our approach • Two candidate ideal documents: an encyclopedia page (“Encyclopedia gives a brief description of the physical traits of this animal.”) and the Lemur toolkit page (“The Lemur toolkit for language modeling and information retrieval is documented and made available for download.”) • The user's past search context: the query “information retrieval” and a clicked snippet (“Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which …”) • Given this context, the toolkit page is the better guess for the ideal document

  19. Statistical Language Modeling based Approaches: Overview • From user contexts, capture statistical properties of the texts • Use these to improve search results • Different contexts captured: • Unigrams and bigrams – simple N-gram based approaches • Relationship between query and document words – Noisy Channel based approach

  20. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  21. N-gram based Approaches: Motivation • Past search context: the query “information retrieval” and a clicked snippet (“Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which …”) • Unigrams extracted from the context: information, retrieval, documents, … • Bigrams: information retrieval, searching documents, information documents, … • Candidate ideal documents: the encyclopedia page (“Lemur - Encyclopedia gives a brief description of the physical traits of this animal.”) and the Lemur toolkit page (“The Lemur toolkit for language modeling and information retrieval is documented and made available for download.”)

  22. Sample user profile

  23. Learning user profile • Given past search history Hu = {(q1, rf1), (q2, rf2), …, (qn, rfn)} • rfall = concatenation of all rf • For each unigram wi, estimate its probability from rfall • The resulting unigram distribution is the user profile
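A minimal sketch of how such a unigram profile could be estimated, assuming a plain maximum-likelihood estimate over the concatenated feedback text; the slide's actual estimation formula was an image, so any smoothing or stop-word handling used in the thesis is not reproduced here:

```python
from collections import Counter

def learn_unigram_profile(history):
    """Estimate a unigram user profile from past search history.

    history: list of (query, relevant_feedback_text) pairs, i.e. the
    (qi, rfi) pairs from the slide. The profile is a maximum-likelihood
    unigram distribution over the concatenation of all rf texts.
    """
    rf_all = " ".join(rf for _, rf in history)        # rfall on the slide
    tokens = rf_all.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# toy usage
history = [("lemur", "The Lemur toolkit for language modeling and information retrieval")]
profile = learn_unigram_profile(history)
print(profile["retrieval"])
```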

  24. Reranking • Recall, in general LM for IR • Our Approach
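The formulas on this slide were images. As a hedged illustration only, the sketch below assumes the standard query-likelihood score with the document model linearly interpolated against the unigram user profile; the interpolation weight lam and the floor eps are illustrative choices, not values from the thesis:

```python
import math
from collections import Counter

def rerank(query, documents, profile, lam=0.5, eps=1e-6):
    """Re-score the top retrieved documents with the user profile.

    score(D) = sum over query words w of
               log( lam * P(w|D) + (1 - lam) * P(w|profile) + eps )
    where P(w|D) is the maximum-likelihood estimate inside D.
    """
    q_words = query.lower().split()
    scored = []
    for doc in documents:
        tokens = doc.lower().split()
        counts, length = Counter(tokens), max(len(tokens), 1)
        score = sum(
            math.log(lam * counts[w] / length + (1 - lam) * profile.get(w, 0.0) + eps)
            for w in q_words
        )
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best score first
    return [doc for _, doc in scored]
```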

  25. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  26. Noisy Channel based Approach • Documents and queries are in different information spaces • Queries – short, concise • Documents – more descriptive • Most retrieval or personalized web search methods do not model this • We capture the relationship between query and document words

  27. Noisy Channel based Approach: Motivation • Diagram: the ideal document passes through a query generation process (a noisy channel) to produce the query • Retrieval reverses this query generation process to recover the ideal document

  28. Similar to Statistical Machine Translation • In SMT: given a sentence in one language, translate it into the other • In retrieval: given a query, retrieve documents close to the ideal document • Noisy channel 1: English sentence → French sentence; decoding uses P(e|f) • Noisy channel 2: ideal document → query; decoding uses P(q|w)
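The channel diagrams on this slide were images; the standard noisy-channel decompositions that the analogy relies on can be written as follows (textbook form, not copied from the thesis):

```latex
% Noisy channel 1 (SMT): the English sentence e is distorted into French f;
% decoding recovers the most likely e
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

% Noisy channel 2 (retrieval): the ideal document d is distorted into query q;
% decoding recovers the most likely d
\hat{d} = \arg\max_{d} P(d \mid q) = \arg\max_{d} P(q \mid d)\, P(d)
```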

  29. Learning user profile • User profile: a translation model – triples (qw, dw, p(qw|dw)) • Use statistical machine translation methods • Learning the user profile = training a translation model • In SMT, a translation model is trained • From parallel texts • Using the EM algorithm

  30. Learning User profile • Extracting parallel texts • From queries and the corresponding snippets of clicked documents • Training a translation model • GIZA++ – an open-source toolkit widely used for training translation models in statistical machine translation research. U. Rohini, Vamshi Ambati, and Vasudeva Varma. Statistical machine translation models for personalized search. Technical report, International Institute of Information Technology, 2007.
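The thesis trains the translation model with GIZA++ over the query/snippet "parallel texts"; the toy sketch below only illustrates the underlying EM idea with an IBM Model 1 style estimator, and is not the GIZA++ tool or its input format:

```python
from collections import defaultdict

def train_translation_model(pairs, iterations=10):
    """Toy IBM Model 1 style EM over (query, snippet) parallel texts.

    pairs: list of (query_text, snippet_text) tuples.
    Returns t[(qw, dw)], an estimate of p(qw | dw).
    """
    corpus = [(q.lower().split(), d.lower().split()) for q, d in pairs]
    q_vocab = {w for q, _ in corpus for w in q}
    t = defaultdict(lambda: 1.0 / max(len(q_vocab), 1))   # uniform start

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(qw, dw)
        total = defaultdict(float)   # expected counts c(dw)
        for q_words, d_words in corpus:
            for qw in q_words:
                z = sum(t[(qw, dw)] for dw in d_words)    # normaliser
                for dw in d_words:
                    frac = t[(qw, dw)] / z
                    count[(qw, dw)] += frac
                    total[dw] += frac
        for (qw, dw), c in count.items():                  # M-step
            t[(qw, dw)] = c / total[dw]
    return dict(t)

# toy usage: one query/snippet pair
model = train_translation_model(
    [("lemur", "the lemur toolkit for language modeling and information retrieval")]
)
```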

  31. Sample user profile

  32. Reranking • Recall, in general, LM for IR • Noisy Channel based approach: for the query “lemur”, the user profile contains P(retrieval|lemur), which favours D4 over D1 • D1: “Lemur - Encyclopedia gives a brief description of the physical traits of this animal.” • D4: “The Lemur toolkit for language modeling and information retrieval is documented and made available for download.”
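A hedged sketch of how such a translation-model profile could be used to re-score documents, assuming the common "sum over document words" expansion score; the exact combination with the baseline LM score used in the thesis is not shown on the slide. A document containing "retrieval" (D4) gets a higher score for the query "lemur" whenever the profile holds a large p(retrieval|lemur), which is the effect the slide illustrates.

```python
import math
from collections import Counter

def translation_score(query, doc, t, eps=1e-9):
    """Noisy-channel style score of a document for a query.

    score = sum over query words qw of
            log( sum over document words dw of p(qw|dw) * P(dw|D) + eps )
    t maps (query_word, document_word) -> p(qw|dw); eps avoids log(0).
    """
    tokens = doc.lower().split()
    counts, length = Counter(tokens), max(len(tokens), 1)
    score = 0.0
    for qw in query.lower().split():
        expanded = sum(t.get((qw, dw), 0.0) * c / length for dw, c in counts.items())
        score += math.log(expanded + eps)
    return score
```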

  33. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  34. Machine Learning based Approaches: Introduction • Most machine learning for IR treats it as a binary classification problem – “relevant” vs. “non-relevant” • Clickthrough data • A click is not an absolute relevance judgement but a relative one • i.e., assuming clicked = relevant and unclicked = irrelevant is wrong • Clicks are biased • Partial relative relevance – clicked documents are more relevant than the unclicked documents
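One common way to turn the "clicked > unclicked" observation into training data for a ranking learner is to emit pairwise preferences; the pairing strategy below (a clicked result preferred over unclicked results ranked above it) is a standard choice and an assumption here, not necessarily the exact strategy of the thesis:

```python
def click_preferences(ranked_urls, clicked_urls):
    """Derive partial relative relevance pairs from clickthrough data.

    Rather than treating clicks as absolute relevance, each clicked result is
    preferred over every unclicked result that was ranked above it.
    Returns a list of (preferred, less_preferred) url pairs.
    """
    clicked = set(clicked_urls)
    prefs = []
    for rank, url in enumerate(ranked_urls):
        if url in clicked:
            prefs.extend(
                (url, above) for above in ranked_urls[:rank] if above not in clicked
            )
    return prefs

# toy usage: the third result was clicked, the first two were skipped
pairs = click_preferences(["d1", "d2", "d3"], ["d3"])   # [("d3", "d1"), ("d3", "d2")]
```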

  35. Background • Ranking SVM • A variation of SVM • Learns from Partial Relevance Data • Learning similar to classification SVM

  36. Ranking SVMs based method • Use Ranking SVMs for learning user profile • Experimented • Different features • Unigram, bigram • Different Feature weights • Boolean, Term Frequency, Normalized Term Frequency

  37. Learning user profile • User profile: a weight vector • Learning: training an SVM model • Steps • Extracting features • Computing feature weights • Training the SVM 1. Uppuluri R., Ambati V. Improving web search results using collaborative filtering. In Proceedings of the 3rd International Workshop on Web Personalization (ITWP), held in conjunction with AAAI 2006, 2006. 2. U. Rohini and Vasudeva Varma. A novel approach for re-ranking of search results using collaborative filtering. In Proceedings of the International Conference on Computing: Theory and Applications (ICCTA’07), pages 491–495, Kolkata, India, March 2007.

  38. Extracting Features • Features: unigrams, bigrams • Given past search history Hu = {(q1, rf1), (q2, rf2), …, (qn, rfn)} • rfall = concatenation of all rf • Remove stop words from rfall • Extract all unigrams (or bigrams) from rfall
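A small sketch of the feature extraction step, assuming whitespace tokenisation and a tiny illustrative stop-word list; the tokenizer and stop-word list actually used in the thesis are not given on the slide:

```python
def extract_features(rf_all, n=1,
                     stop_words=frozenset({"the", "of", "and", "a", "in", "for", "is"})):
    """Extract unigram (n=1) or bigram (n=2) features from the concatenated
    relevant-feedback text rf_all, after removing stop words."""
    tokens = [t for t in rf_all.lower().split() if t not in stop_words]
    if n == 1:
        return sorted(set(tokens))
    return sorted({" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
```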

  39. Computing Feature Weights • In each relevant document (di), compute the weight of each feature w: • Boolean weighting • 1 or 0 • Term frequency weighting • tfw – number of times w occurs in di • Normalized term frequency weighting • tfw / |di|
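A sketch of the three weighting schemes for unigram features; the slide's rendering of the normalisation denominator is garbled, so plain division by the document length |di| is assumed here:

```python
from collections import Counter

def feature_weights(doc, features, scheme="tf_norm"):
    """Weight each (unigram) feature within one relevant document di.

    scheme: "boolean"  -> 1 if the feature occurs, else 0
            "tf"       -> raw term frequency tfw in di
            "tf_norm"  -> tfw divided by the document length |di|
                          (assumed normalisation; the slide is garbled here)
    """
    tokens = doc.lower().split()
    counts, length = Counter(tokens), max(len(tokens), 1)
    weights = {}
    for f in features:
        tf = counts[f]
        if scheme == "boolean":
            weights[f] = 1 if tf else 0
        elif scheme == "tf":
            weights[f] = tf
        else:
            weights[f] = tf / length
    return weights
```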

  40. Training SVM • Each relevant document is represented as a string of features and their corresponding weights • We used SVMlight for training
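A hedged sketch of writing the training examples out in the "&lt;target&gt; qid:&lt;n&gt; &lt;feature_id&gt;:&lt;value&gt;" line format commonly used with SVMlight's ranking mode; the exact target values, feature numbering and training options used in the thesis are not given on the slide:

```python
def to_svmlight_lines(examples, feature_index):
    """Serialise training examples into SVMlight-style lines.

    examples: list of (qid, target, weights) where weights maps feature
              string -> weight for one document.
    feature_index: maps each feature string to an integer id (SVMlight
              expects numeric feature ids in ascending order per line).
    """
    lines = []
    for qid, target, weights in examples:
        pairs = sorted((feature_index[f], w) for f, w in weights.items() if w)
        feats = " ".join(f"{fid}:{w:g}" for fid, w in pairs)
        lines.append(f"{target} qid:{qid} {feats}")
    return lines

# toy usage
idx = {"information": 1, "retrieval": 2}
print(to_svmlight_lines([(1, 2, {"retrieval": 0.5})], idx))   # ['2 qid:1 2:0.5']
```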

  41. Sample training data / Sample user profile

  42. Reranking • Sim(Q,D) = W · Φ(Q,D) • W – the weight vector / user profile • Φ(Q,D) – a vector of terms and their weights • A measure of similarity between Q and D • Each component corresponds to a term in the query • Its weight is the product of the term's weights in the query and in the document (boolean, term frequency, or normalized term frequency)
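A small sketch of this re-scoring step, assuming W, the query weights and the document weights are all sparse dicts keyed by term; it mirrors the dot product on the slide, with the feature bookkeeping simplified:

```python
def rank_svm_score(w, query_weights, doc_weights):
    """Sim(Q, D) = W . Phi(Q, D).

    Phi(Q, D) has one component per query term, whose value is the product of
    the term's weight in the query and its weight in the document; W is the
    learned weight vector (the user profile). All three are dicts here.
    """
    return sum(
        w.get(term, 0.0) * q_wt * doc_weights.get(term, 0.0)
        for term, q_wt in query_weights.items()
    )
```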

  43. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  44. Personalized Search without Relevance Feedback: Introduction • Can personalization be done without relevance feedback about which documents are relevant? • How informative are the queries posed by users? • Is the information contained in the queries enough to personalize?

  45. Approach • Past queries of the user available • Make effective use of past queries • Simple N-gram based approach

  46. Learning user profile • Given past search history Hu = {q1, q2, …, qn} • qconcat : concatenation of all queries • For each unigram wi, estimate its probability from qconcat • The resulting unigram distribution is the user profile
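The same kind of maximum-likelihood unigram estimate as in the earlier sketch, but over the concatenated queries only; reranking then proceeds as before. Any smoothing used in the thesis is not shown on the slide and is left out here:

```python
from collections import Counter

def profile_from_queries(past_queries):
    """Build a unigram user profile from past queries alone (no clicks).

    past_queries: list of query strings q1 ... qn.
    Returns P(w | profile) estimated over qconcat, the concatenation of all
    past queries.
    """
    tokens = " ".join(past_queries).lower().split()
    counts = Counter(tokens)
    total = max(sum(counts.values()), 1)
    return {w: c / total for w, c in counts.items()}
```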

  47. Sample user profile

  48. Reranking • In general LM for IR • Our Approach U. Rohini, Vamshi Ambati, and Vasudeva Varma. Personalized search without relevance feedback. Technical report, International Institute of Information Technology, 2007

  49. Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions

  50. Experiments: Introduction, Problems • Aim: to see how the proposed approaches perform by comparing them with a baseline • Problems • No standard evaluation framework • Data • Lack of standardization • Comparison with previous work is difficult • Difficult to repeat previously conducted experiments • Difficult to share results and observations • Effort to collect data is repeated over and over • Identified as a problem needing standardization (Allan et al. 2003) • Lack of standard personalized search baselines • In our work, we used a variation of the Rocchio algorithm • Metrics
