Quicklink Selection for Navigational Query Results
Deepayan Chakrabarti (deepay@yahoo-inc.com), Ravi Kumar (ravikuma@yahoo-inc.com), Kunal Punera (kpunera@yahoo-inc.com)
What are quicklinks
• [Figure: a search result with labels “Result Website” and “Quicklinks”]
• Quicklinks (QLs) = URLs within the search result website
• Enable fast navigation to important parts of the website
• Which URLs should be QLs?
Quicklink Selection
• Some obvious strategies don’t work very well
• Top clicked URLs in search engine
  • URL may have low relevance in the QL context: lib.utexas.edu/maps is popular for searches on “maps” and not for searches on “Univ. of Texas”
  • URL may be too specific: automobiles.honda.com/civic-hybrid/exterior-photos.aspx for honda.com
  • URL popularity may be time sensitive: nytimes.com/election-guide/2008/ for nytimes.com
• Top visited URLs in toolbar data
  • May not relate to search activity, e.g., for nytimes.com:
    • #3 is nytimes.com/mem/emailthis.html
    • #6 is nytimes.com/auth/login
    • #8 is nytimes.com/gst/regi.html
• Top URLs from analysis of hyperlink graph
  • Ignores preferences of search users
  • Toolbar data is more representative
• Heavily tagged URLs (e.g., del.icio.us/digg)
  • Low coverage: too few websites
Quicklink Selection
• Need a combined approach
  • Search logs
  • Toolbar data
  • Web-server logs
  • Website hyperlink graph
  • User tags
• (Slide callout: “This paper”)
Related Work
• Sitemap generation [Perkowitz+/00]
• Detection of hard-to-find URLs [Srikant+/01]
• Improving website navigability [Doerr+/07]
• Mining Web usage patterns [Buchner/99, Cadez+/03]
• BrowseRank [Liu+/08]
• Post-search browsing behavior [Bilenko+/08]
• We focus on QLs in the context of Search
Outline
• Motivation and Related Work
• Problem Formulation
• Proposed Solution
• Experiments
• Conclusions
Problem Formulation
• Which k URLs should be QLs?
• “The greatest good for the greatest number”
• QLs save clicks
• Maximize the total number of clicks saved using at most k QLs
• But when exactly is a click “saved”?
Problem Formulation
• When does a QL get clicked by the user?
• Assumption: the user recognizes if SearchResult → QL → Destination, i.e., if the QL lies on the way to the page he/she wants to reach
• [Figure: graph of click trails (toolbar data) rooted at nasa.gov, with nodes such as “Hubble telescope” and “Photos”; one node is picked as a QL]
• Trails that pass through the QL save 1 or 2 clicks each; trails that miss it save 0
• Total savings = 1*3 + 2*2 = 7 clicks (3 trails saving 1 click each, 2 trails saving 2 clicks each)
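The click-saving count above can be sketched in a few lines. This is only an illustration under stated assumptions: a trail is modelled as a tuple of URLs starting at the website's landing page, a QL at position i of a trail saves i clicks, and the URLs and trail counts are hypothetical values chosen to reproduce the 7-click total.

```python
def clicks_saved(trail, ql):
    """Clicks saved on one trail if `ql` is shown as a quicklink.

    Assumption: the trail starts at the landing page (position 0), so
    jumping straight to position i skips i clicks. A QL that is not on
    the trail saves nothing.
    """
    return trail.index(ql) if ql in trail else 0


def total_savings(trail_counts, ql):
    """Sum the savings over all trails, weighted by how often each occurs."""
    return sum(n * clicks_saved(trail, ql) for trail, n in trail_counts.items())


# Hypothetical toolbar trails (URL paths and counts are made up).
ROOT = "nasa.gov"
HUBBLE = "nasa.gov/hubble"
trail_counts = {
    (ROOT, "nasa.gov/missions", HUBBLE): 2,    # QL at position 2: saves 2 each
    (ROOT, HUBBLE): 1,                         # QL at position 1: saves 1
    (ROOT, HUBBLE, "nasa.gov/hubble/photos"): 2,  # saves 1 each
    (ROOT, "nasa.gov/news"): 4,                # QL not on trail: saves 0
}

print(total_savings(trail_counts, HUBBLE))  # 1*3 + 2*2 = 7
```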
Problem Formulation
• However…
• Unknown pages might become QLs
  • [Figure: lyrics.com with index pages A, B, C, …, Z] These could become the “best” QLs
• Automatic-redirect pages might become QLs:
  • nytimes.com forces logging in
  • aaa.com forces zipcode entry
• We need QLs that are “noticeable” in a search context
Problem Formulation
• How can we estimate noticeability?
  • Via Search click-logs
• Noticeability of a URL u:
  • User notices a useful QL with probability α(u)
  • [Formula not recovered from the slide: α(u) is defined from the fraction of search clicks for u on the website and a tuning parameter (≈ 2)]
Problem Formulation
• Assumption: the user picks the best QL that he/she notices
• Example on the nasa.gov trail graph with two QLs, QL1 (noticeability α1) and QL2 (noticeability α2); trails through neither QL save 0:
  # trails x prob x # clicks
  2 x α1 x 2
  1 x α1 x 1
  2 x (1-α2)α1 x 1
  2 x α2 x 2
• Total = 5α1 + 4α2 + 2(1-α2)α1
• If only QL1 is perfectly noticeable (α1=1, α2=0): Total = 7 clicks (as if 1 QL only)
• If both QLs are perfectly noticeable (α1=1, α2=1): Total = 9 clicks
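A minimal sketch of the expected-savings computation, assuming each QL is noticed independently with probability α(u) and the user clicks the noticed QL that saves the most clicks. The trails, URL paths, and the function name `expected_savings` are hypothetical; they are chosen so that the two special cases on the slide (7 and 9 clicks) come out.

```python
def expected_savings(trail_counts, alpha):
    """Expected clicks saved by the QL set `alpha` (URL -> noticeability).

    Assumptions: a QL at position i of a trail saves i clicks, QLs are
    noticed independently, and the user takes the best QL he/she notices.
    """
    total = 0.0
    for trail, n in trail_counts.items():
        # QLs on this trail as (clicks saved, noticeability), best first.
        on_trail = sorted(
            ((trail.index(u), a) for u, a in alpha.items() if u in trail),
            reverse=True,
        )
        p_none_better = 1.0  # probability that no better QL was noticed
        for saved, a in on_trail:
            total += n * p_none_better * a * saved
            p_none_better *= 1.0 - a
    return total


# Hypothetical trails; QL1 is the Hubble page, QL2 the photos page below it.
ROOT, QL1, QL2 = "nasa.gov", "nasa.gov/hubble", "nasa.gov/hubble/photos"
trail_counts = {
    (ROOT, "nasa.gov/missions", QL1): 2,
    (ROOT, QL1): 1,
    (ROOT, QL1, QL2): 2,
    (ROOT, "nasa.gov/news"): 4,
}

print(expected_savings(trail_counts, {QL1: 1.0, QL2: 0.0}))  # 7.0
print(expected_savings(trail_counts, {QL1: 1.0, QL2: 1.0}))  # 9.0
```

Sorting the QLs on a trail from best to worst keeps the "best noticed QL" rule explicit: a QL only contributes when every better one went unnoticed.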
Problem Formulation
• Which k URLs should be QLs?
• Maximize the expected number of clicks saved using at most k QLs
  • while incorporating “noticeability”
Algorithms
• Maximize expected number of saved clicks using k QLs: NP-hard
• Theorem: This objective is non-decreasing submodular
  • Non-negative
  • Adding QLs never hurts
  • “Diminishing returns”: for S ⊆ S’, the marginal improvement from adding u to the superset S’ is at most the marginal improvement from adding u to the set S
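The two properties in the theorem can be checked by brute force on a small instance. The sketch below is not a proof; it assumes the independent-noticing model with hypothetical trails and α values, and simply enumerates every pair S ⊆ S’ and every candidate u.

```python
from itertools import combinations


def expected_savings(trail_counts, alpha):
    """Expected clicks saved; the user takes the best QL he/she notices."""
    total = 0.0
    for trail, n in trail_counts.items():
        on_trail = sorted(
            ((trail.index(u), a) for u, a in alpha.items() if u in trail),
            reverse=True,
        )
        p_none_better = 1.0
        for saved, a in on_trail:
            total += n * p_none_better * a * saved
            p_none_better *= 1.0 - a
    return total


def subsets(items):
    """All subsets of `items` as frozensets."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone_submodular(trail_counts, alpha, tol=1e-9):
    """Check 'adding QLs never hurts' and 'diminishing returns' exhaustively."""
    def f(s):
        return expected_savings(trail_counts, {u: alpha[u] for u in s})

    for big in subsets(alpha):
        for small in subsets(sorted(big)):
            for u in alpha:
                if u in big:
                    continue
                gain_small = f(small | {u}) - f(small)
                gain_big = f(big | {u}) - f(big)
                if gain_big < -tol or gain_big > gain_small + tol:
                    return False
    return True


# Hypothetical trails and noticeabilities.
ROOT = "nasa.gov"
trail_counts = {
    (ROOT, "nasa.gov/missions", "nasa.gov/hubble"): 2,
    (ROOT, "nasa.gov/hubble"): 1,
    (ROOT, "nasa.gov/hubble", "nasa.gov/hubble/photos"): 2,
    (ROOT, "nasa.gov/news"): 4,
}
alpha = {
    "nasa.gov/hubble": 0.8,
    "nasa.gov/hubble/photos": 0.5,
    "nasa.gov/missions": 0.6,
    "nasa.gov/news": 0.3,
}
print(is_monotone_submodular(trail_counts, alpha))
```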
Algorithms
• Greedy algorithm: iteratively pick the QL that increases the expected number of saved clicks the most
• Within a factor (1-1/e) of OPT [Nemhauser+/’78]
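The greedy step can be sketched as follows. This is an illustrative version, not the authors' implementation: the objective is the independent-noticing model assumed earlier in the talk, and the trails, α values, and function names are hypothetical.

```python
def expected_savings(trail_counts, alpha):
    """Expected clicks saved; the user takes the best QL he/she notices."""
    total = 0.0
    for trail, n in trail_counts.items():
        on_trail = sorted(
            ((trail.index(u), a) for u, a in alpha.items() if u in trail),
            reverse=True,
        )
        p_none_better = 1.0
        for saved, a in on_trail:
            total += n * p_none_better * a * saved
            p_none_better *= 1.0 - a
    return total


def greedy_quicklinks(trail_counts, alpha, k):
    """Pick up to k QLs, each time adding the URL with the largest gain."""
    chosen, current = [], 0.0
    for _ in range(k):
        best_url, best_value = None, current
        for u in alpha:
            if u in chosen:
                continue
            value = expected_savings(
                trail_counts, {v: alpha[v] for v in chosen + [u]}
            )
            if value > best_value:
                best_url, best_value = u, value
        if best_url is None:  # no candidate improves the objective
            break
        chosen.append(best_url)
        current = best_value
    return chosen, current


# Hypothetical trails and noticeabilities.
ROOT = "nasa.gov"
HUBBLE, PHOTOS = "nasa.gov/hubble", "nasa.gov/hubble/photos"
trail_counts = {
    (ROOT, "nasa.gov/missions", HUBBLE): 2,
    (ROOT, HUBBLE): 1,
    (ROOT, HUBBLE, PHOTOS): 2,
    (ROOT, "nasa.gov/news"): 4,
}
alpha = {HUBBLE: 1.0, PHOTOS: 1.0, "nasa.gov/missions": 1.0, "nasa.gov/news": 0.25}

print(greedy_quicklinks(trail_counts, alpha, k=2))
```

Each round re-evaluates the objective for every remaining candidate, which is simple but costs O(k · |candidates|) objective evaluations; a production version would likely cache marginal gains.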
Algorithms
• However…
• Inhomogeneous results: QLs for ea.com are
  • fifa08.ea.com and battlefield.ea.com (two games made by EA)
  • 6 webpages deep inside thesim2.ea.com
• Redundant results: QLs for senate.gov include
  • obama.senate.gov
  • obama.senate.gov/about
  • obama.senate.gov/contact
  • obama.senate.gov/votes
  • (the parent URL makes the child URLs redundant)
Algorithms
• Both can be specified as pairwise constraints on URLs allowed to belong to a QL set
• Pairwise-constrained QL selection is NP-hard
• Two-step process:
  • Heuristically find a large subset of trails that form a tree
  • Enforce constraints on the tree
  • A dynamic program is optimal on the tree
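As an illustration of what a pairwise constraint might look like, the sketch below encodes only the redundancy case (a parent URL together with one of its child URLs). The prefix rule and both function names are assumptions made for this example; the talk does not spell out how the constraints are defined, and the tree-based dynamic program is not shown.

```python
from itertools import combinations


def is_redundant_pair(a, b):
    """Hypothetical rule: two URLs conflict if one is a path-prefix of the other."""
    a, b = a.rstrip("/"), b.rstrip("/")
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def violates_constraints(ql_set):
    """True if any pair of URLs in the candidate QL set conflicts."""
    return any(is_redundant_pair(a, b) for a, b in combinations(ql_set, 2))


print(violates_constraints(["obama.senate.gov", "obama.senate.gov/about"]))
print(violates_constraints(["obama.senate.gov/about", "obama.senate.gov/votes"]))
```

Under this rule the first set is rejected (parent plus child), while two sibling pages are allowed together.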
Experiments
• Baseline Methods
• TopClicked:
  • URL score = # search clicks on URL
• TopVisited:
  • URL score = # occurrences on toolbar trails
• PageRank:
  • Build a weighted graph on URLs, where weight(i,j) = # trails using the i→j edge
  • URL score = PageRank on this graph
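The TopVisited and PageRank baselines can be sketched from toolbar trails alone (TopClicked is just a lookup of search-click counts). This is an illustrative version: the trails are hypothetical, and the damping factor of 0.85 and the uniform handling of dangling nodes are standard choices assumed here, not values taken from the talk.

```python
from collections import Counter, defaultdict


def top_visited(trail_counts):
    """URL score = number of occurrences on toolbar trails."""
    score = Counter()
    for trail, n in trail_counts.items():
        for url in trail:
            score[url] += n
    return score


def edge_weights(trail_counts):
    """weight(i, j) = number of trails using the i -> j edge."""
    weights = defaultdict(float)
    for trail, n in trail_counts.items():
        for i, j in zip(trail, trail[1:]):
            weights[(i, j)] += n
    return weights


def pagerank(weights, damping=0.85, iters=100):
    """Weighted PageRank by power iteration (damping value is an assumption)."""
    nodes = sorted({u for edge in weights for u in edge})
    out = defaultdict(float)
    for (i, _), w in weights.items():
        out[i] += w
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out[u] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {u: base for u in nodes}
        for (i, j), w in weights.items():
            new_rank[j] += damping * rank[i] * w / out[i]
        rank = new_rank
    return rank


# Hypothetical toolbar trails.
ROOT, HUBBLE = "nasa.gov", "nasa.gov/hubble"
trail_counts = {
    (ROOT, "nasa.gov/missions", HUBBLE): 2,
    (ROOT, HUBBLE): 1,
    (ROOT, HUBBLE, "nasa.gov/hubble/photos"): 2,
    (ROOT, "nasa.gov/news"): 4,
}

visited = top_visited(trail_counts)
ranks = pagerank(edge_weights(trail_counts))
print(visited.most_common(3))
print(sorted(ranks, key=ranks.get, reverse=True)[:3])
```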
Experiments
• Live Traffic dataset
  • Computed CTRs on QLs currently displayed by Yahoo! (1043 website subset)
• Measure:
  • Pick two equal-sized subsets of QLs
  • Use sum-of-scores and sum-of-CTRs to predict the better subset
  • Measure how often the predictions match
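The agreement measure can be sketched as below. The scores and CTRs are made-up numbers, and two details are assumptions of this sketch rather than facts from the talk: all pairs of equal-sized subsets are enumerated exhaustively, and pairs where either sum ties are skipped.

```python
from itertools import combinations


def agreement_rate(scores, ctrs, size):
    """Fraction of subset pairs where sum-of-scores and sum-of-CTRs
    pick the same subset as the better one."""
    subsets = list(combinations(sorted(scores), size))
    agree = total = 0
    for a, b in combinations(subsets, 2):
        score_diff = sum(scores[u] for u in a) - sum(scores[u] for u in b)
        ctr_diff = sum(ctrs[u] for u in a) - sum(ctrs[u] for u in b)
        if score_diff == 0 or ctr_diff == 0:
            continue  # assumption: ties carry no prediction, so skip them
        total += 1
        agree += (score_diff > 0) == (ctr_diff > 0)
    return agree / total if total else float("nan")


# Hypothetical method scores and live-traffic CTRs for three QLs.
scores = {"site.com/a": 3.0, "site.com/b": 2.0, "site.com/c": 1.0}
ctrs = {"site.com/a": 0.30, "site.com/b": 0.10, "site.com/c": 0.20}

print(agreement_rate(scores, ctrs, size=1))
```

With these numbers the method and the live traffic agree on two of the three single-QL comparisons and disagree on b versus c.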
Live Traffic Data Experiments
• [Figure: fraction of subset-pairs where predictions agree with live traffic, plotted against subset size]
• QL-ALG > TopVisited > PageRank > TopClicked
Experiments
• Tree-structured trails
  • Most dropped trails are very short
  • Tree-structured trails improve accuracy
• [Figure: distribution of dropped trails; x-axis: length of trail (1 to 10000), y-axis: number of trails dropped]
• [Figure: Live Traffic prediction quality comparison]
Conclusions
• Proposed a formulation for the QL selection problem
  • Both toolbar and search logs are used intuitively
• Proposed two algorithms:
  • Greedy: (1-1/e)-optimal
  • Tree-structured: empirically better
• Improvement of 22% over competing baselines