Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007
Introduction (1/2) • A search engine can track which of its search results were clicked for which query • Click records of query-document pairs can be viewed as a weak indication of relevance • The user decided to at least view the document, based on its description in the search results • We can use the clicks of past users to improve the current search results • The clicked set of documents is likely to differ from the current user’s relevance set
Introduction (2/2) • From the perspective of a user conducting a search: • Documents that are clicked but not relevant constitute noise • Documents that are relevant but not clicked constitute sparsity in the click data • Power law distribution: most queries in the click log have a small number of clicked documents • This paper focuses on the sparsity problem by giving a Markov random walk model, although the model also has noise reduction properties
Algorithm on the Click Graph • The current model uses click data alone, without considering document content or query content • The click graph: • Bipartite • Two types of nodes: queries and documents • An edge connects a query and a document if a click for that query-document pair is observed • The edge may be weighted according to the total number of clicks from all users
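The click graph described on this slide can be sketched in a few lines. This is only an illustration, not the authors' code: the click records, the function names `build_click_graph` and `adjacency`, and the `("q", ...)` / `("d", ...)` node tags are all assumptions made for the example.

```python
from collections import defaultdict

def build_click_graph(click_records):
    """Aggregate (query, document) click records into weighted edges.

    The weight of an edge is the total number of clicks observed for that
    query-document pair across all users.
    """
    counts = defaultdict(int)
    for query, doc in click_records:
        counts[(query, doc)] += 1
    return dict(counts)

def adjacency(edges):
    """Symmetric neighbour map; nodes are tagged ("q", name) or ("d", name)."""
    adj = defaultdict(dict)
    for (query, doc), weight in edges.items():
        adj[("q", query)][("d", doc)] = weight
        adj[("d", doc)][("q", query)] = weight
    return dict(adj)

# Hypothetical click log: one (query, document) pair per observed click.
clicks = [
    ("panda", "img_a"), ("panda", "img_a"), ("panda", "img_b"),
    ("giant panda", "img_b"), ("red panda", "img_c"),
]
edges = build_click_graph(clicks)
adj = adjacency(edges)
```

Because every edge joins a query node to a document node, the graph stays bipartite by construction.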
Application Areas for Algorithms on Click Graph • Query-to-document ‘search’ • Given a query, find relevant documents, as in ad hoc search • Query-to-query ‘suggestion’ • Given a query, find other queries that the user might like to run • Document-to-query ‘annotation’ • Given a document, attach related queries to it • Document-to-document ‘relevance feedback’ • Given an example document that is relevant to the user, find additional relevant documents
Random Walk Model • A basic query formulation model • Imagine a document (information need) • Think of a query associated with the document • Issue the query or imagine another document related to the query • Iterative thought process (noise process) • A Markov random walk which describes a probability distribution over queries • The retrieval model is obtained by inverting the query formulation model • Starts from an observed query, and attempts to undo the noise, inferring the underlying information need • Backward walks
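Random Walk Computation • $C_{jk}$: click count associating node $j$ with node $k$ • Define the transition probability from $j$ to $k$: $P_{t+1|t}(k|j) = (1-s)\,C_{jk} / \sum_i C_{ji}$ for $k \neq j$, and $P_{t+1|t}(j|j) = s$ • $s$ is the self-transition probability, which corresponds to the user favoring the current query or document • Transition matrix: $[A]_{jk} = P_{t+1|t}(k|j)$ • $t$-step transition probability: $P_{t|0}(k|j) = [A^t]_{jk}$ • $[A^t]_{jk}$ is a measure of the volume of paths between $j$ and $k$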
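A minimal sketch of this computation in plain Python follows. It assumes a small symmetric click-count matrix `C` with no self-clicks and no isolated nodes; the toy counts, the node ordering (two queries followed by two documents) and the helper names are hypothetical.

```python
def transition_matrix(C, s):
    """A[j][j] = s and A[j][k] = (1 - s) * C[j][k] / sum_i C[j][i] for k != j.

    Assumes C[j][j] == 0 and that every node has at least one click.
    """
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        row_total = sum(C[j])
        for k in range(n):
            A[j][k] = s if k == j else (1.0 - s) * C[j][k] / row_total
    return A

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]

def t_step(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result

# Hypothetical nodes: 0 = q1, 1 = q2, 2 = d1, 3 = d2.
# Clicks: q1-d1 twice, q1-d2 once, q2-d2 once.
C = [
    [0, 0, 2, 1],
    [0, 0, 0, 1],
    [2, 0, 0, 0],
    [1, 1, 0, 0],
]
A = transition_matrix(C, s=0.5)
A2 = t_step(A, 2)
```

In this toy graph `A[0][1]` is zero because q1 and q2 share no direct edge, while `A2[0][1]` is positive: two steps are enough to reach q2 from q1 through the shared document d2.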
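Random Walk Model for Retrieval • Backward random walk for retrieval: given that a $t$-step walk ended at node $j$, find the probability that it started at node $k$, $P_{0|t}(k|j)$ • Bayes' rule: $P_{0|t}(k|j) = P_{t|0}(j|k)\,P_0(k) / P_t(j)$, assuming a uniform prior $P_0(k) = 1/N$, so that $P_t(j) = \frac{1}{N}\sum_i [A^t]_{ij}$ • The $1/N$ factors cancel, giving $P_{0|t}(k|j) = [A^t Z^{-1}]_{kj}$, where $Z$ is diagonal with $Z_{jj} = \sum_i [A^t]_{ij}$ • Forward random walk: $P_{t|0}(k|j) = [v_j A^t]_k$, where $v_j$ is a row vector with a 1 in position $j$ and 0 elsewhere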
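The two walks differ only in how $A^t$ is read: the forward distribution for node $j$ is row $j$ of $A^t$, and the backward distribution is column $j$ of $A^t$ divided by its sum $Z_{jj}$. The sketch below illustrates this under a uniform prior; the 4-node transition matrix is a hypothetical example (two queries, two documents, $s = 0.5$) and the function names are assumptions.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]

def t_step(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result

def forward_distribution(At, j):
    """P_{t|0}(k|j): row j of A^t."""
    return list(At[j])

def backward_distribution(At, j):
    """P_{0|t}(k|j): column j of A^t divided by Z_jj = sum_i [A^t]_ij."""
    column = [row[j] for row in At]
    z_jj = sum(column)
    return [value / z_jj for value in column]

# Hypothetical row-stochastic matrix; nodes 0 = q1, 1 = q2, 2 = d1, 3 = d2.
A = [
    [0.50, 0.00, 1.0 / 3.0, 1.0 / 6.0],
    [0.00, 0.50, 0.00, 0.50],
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.00, 0.50],
]
At = t_step(A, 1)
forward = forward_distribution(At, 0)    # where a walk from q1 ends
backward = backward_distribution(At, 0)  # where a walk ending at q1 began
```

For retrieval, the entries of `backward` that belong to document nodes would be the ranking scores for the query at node 0.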
Forward vs. Backward Walks • PageRank: a query-independent forward random walk on the link graph, run until it reaches its stationary distribution • In statistics, the backward walk model is referred to as diagnostic and the forward walk model, in contrast, as predictive • When t → ∞: • The forward random walk approaches the stationary distribution • This gives high probability to nodes with a large number of clicks • The backward random walk approaches the prior starting distribution, which we have taken to be uniform
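The limiting behavior can be checked numerically on a toy graph. The sketch below uses a hypothetical 4-node transition matrix with $s = 0.5$ (built from click totals 3, 1, 2 and 2 per node) and a long walk of 200 steps; both the matrix and the step count are assumptions chosen for illustration.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][k] for m in range(n)) for k in range(n)]
            for i in range(n)]

def t_step(A, t):
    """Return A^t by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, A)
    return result

# Hypothetical row-stochastic matrix; nodes 0 = q1, 1 = q2, 2 = d1, 3 = d2.
A = [
    [0.50, 0.00, 1.0 / 3.0, 1.0 / 6.0],
    [0.00, 0.50, 0.00, 0.50],
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.00, 0.50],
]
click_totals = [3, 1, 2, 2]  # total clicks touching each node in the toy graph

At = t_step(A, 200)

# Forward walk from q2 (node 1): row 1 of A^t.
forward = list(At[1])

# Backward walk ending at q2 (node 1): column 1 of A^t, normalized.
column = [row[1] for row in At]
backward = [value / sum(column) for value in column]

# Stationary distribution of this walk: proportional to click totals.
stationary = [c / sum(click_totals) for c in click_totals]
```

Here `forward` matches `stationary` to within floating-point error, so the most-clicked node dominates regardless of the start node, while `backward` is flat at 1/4 per node. A nonzero $s$ matters for this check: with $s = 0$ the walk on a bipartite graph alternates between queries and documents and does not settle.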
Clustering Effect • Given an end node that is part of a cluster, we have similar probabilities of having started the walk from any node in the cluster
Walk Parameters Figure: Probability distribution of non-self transitions under different combinations of t and s
Experiment Data • A 14-day click log from a web image search engine • Judged images at distance 1 from the query had a precision of 75% • Pruning: remove URLs connected to only one query and queries connected to only one URL • After pruning: 505,000 URLs, 202,000 queries and 1.1 million edges • 45 queries sampled uniformly for evaluation • TREC-style pooled relevance judgments to depth 20 • 2278 relevance judgments identify 818 relevant images
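The pruning step can be sketched as a filter over the edge list. The slide does not say whether pruning was applied once or repeated until no single-link nodes remained, so the version below makes a single pass with degrees measured on the unpruned graph; that choice, the function name `prune_once` and the toy edges are assumptions.

```python
from collections import Counter

def prune_once(edges):
    """Drop URLs linked to only one query and queries linked to only one URL.

    `edges` maps (query, url) pairs to click counts. Degrees are counted on
    the original graph, so this is one pass rather than a fixed point.
    """
    url_degree = Counter(url for _, url in edges)
    query_degree = Counter(query for query, _ in edges)
    return {
        (query, url): count
        for (query, url), count in edges.items()
        if url_degree[url] > 1 and query_degree[query] > 1
    }

# Hypothetical edges: (query, url) -> click count.
edges = {
    ("panda", "a"): 3,
    ("panda", "b"): 1,
    ("panda", "c"): 1,
    ("bear", "a"): 2,
    ("bear", "b"): 1,
    ("zoo", "c"): 5,
}
pruned = prune_once(edges)
```

In this example the query "zoo" is removed because it links to a single URL. URL "c" is then left with one neighbour, which a repeated pruning would also remove.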
Experiment Result-1 Table 1. The furthest node from any of our test queries is at distance 41 (‘101-0.9-backward’). ‘dist’ and ‘1-0-forward’ are the baselines.
Experiment Result-2 Figure: The number of images retrieved at different distances from the query for each method. The 101-step walk with zero-self-transition possibly goes too far, returning too few distance-1 images.
Experiment Result-3 Figure: The precision at different distances from the query for each method.
Experiment Result-4 Figure: Precision-recall curves of forward and backward walks, with zero self-transition probability (1000 URLs retrieved)
Experiment Result-5 Figure: Parameter sensitivity for a backward walk. Each contour shows a 0.01 variation in MAP@20. Grid intersections indicate the parameter combinations tried. The large plateau has the highest MAP@20 (0.56-0.57).
Conclusion • We have applied a Markov random walk model to the click graph, giving us a high-quality ranking of documents for a given query, including documents not yet clicked for that query • A backward walk was more effective than a forward walk, which supports the idea of retrieval as inverting the query formulation process • We got the best results from a walk of 11 steps, or 101 steps with a high self-transition probability • We have studied ad hoc retrieval in this paper; the model could also be applied easily, and may be effective, in the other application areas listed • Given our model, another possible step would be to incorporate document content and query content through a language model, aiming to find documents that are not yet part of the click graph