Discover effective strategies for selecting Quicklinks on search result websites to enable faster navigation to key sections. Explore the importance of relevance and noticeability in maximizing clicks saved. Learn about related work, problem formulation, proposed solutions, and experiment insights.
Quicklink Selection for Navigational Query Results
Deepayan Chakrabarti (deepay@yahoo-inc.com)
Ravi Kumar (ravikuma@yahoo-inc.com)
Kunal Punera (kpunera@yahoo-inc.com)
What are quicklinks?
[Figure: a search result with the labels Result, Website, and Quicklinks]
• Quicklinks (QLs) = URLs within the search result website
• Enable fast navigation to important parts of the website
• Which URLs should be QLs?
Quicklink Selection
• Some obvious strategies don’t work very well
  • Top clicked URLs in search engine
    • URL may have low relevance in the QL context: lib.utexas.edu/maps is popular for searches on “maps” and not for searches on “Univ. of Texas”
    • URL may be too specific: automobiles.honda.com/civic-hybrid/exterior-photos.aspx for honda.com
    • URL popularity may be time-sensitive: nytimes.com/election-guide/2008/ for nytimes.com
  • Top visited URLs in toolbar data
    • May not relate to search activity, e.g., for nytimes.com:
      • #3 is nytimes.com/mem/emailthis.html
      • #6 is nytimes.com/auth/login
      • #8 is nytimes.com/gst/regi.html
  • Top URLs from analysis of hyperlink graph
    • Ignores preferences of search users
    • Toolbar data is more representative
  • Heavily tagged URLs (e.g., del.icio.us/digg)
    • Low coverage: too few websites
Quicklink Selection
• Need a combined approach
  • Search logs
  • Toolbar data
  • Web-server logs
  • Website hyperlink graph
  • User tags
• This paper: search logs and toolbar data
Related Work
• Sitemap generation [Perkowitz+/00]
• Detection of hard-to-find URLs [Srikant+/01]
• Improving website navigability [Doerr+/07]
• Mining Web usage patterns [Buchner/99, Cadez+/03]
• BrowseRank [Liu+/08]
• Post-search browsing behavior [Bilenko+/08]
We focus on QLs in the context of search.
Outline
• Motivation and Related Work
• Problem Formulation
• Proposed Solution
• Experiments
• Conclusions
Problem Formulation
• Which k URLs should be QLs?
  • “The greatest good for the greatest number”
• QLs save clicks
• Maximize the total number of clicks saved using at most k QLs
• But when exactly is a click “saved”?
Problem Formulation
• When does a QL get clicked by the user?
• Assumption: the user recognizes whether a QL leads to the destination (Search Result → QL → Destination)
[Figure: graph of click trails (toolbar data) rooted at nasa.gov, with nodes such as “Hubble telescope” and “Photos”; one node is picked as a QL]
• With that node as the QL:
  • 3 trails save 1 click each
  • 2 trails save 2 clicks each
  • The remaining trails save 0
• Total savings = 1*3 + 2*2 = 7 clicks
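The count of 7 saved clicks can be reproduced with a small sketch. The trail data is hypothetical (only nasa.gov and a couple of node names survive from the figure), and the rule that a QL at position i on a trail saves i clicks is an assumption chosen to match the per-trail savings on the slide.

```python
from typing import List


def clicks_saved(trail: List[str], ql: str) -> int:
    """Clicks saved on one trail if `ql` is shown as a quicklink.

    Assumption: a trail starts at the website's homepage (index 0), and
    jumping straight to the page at index i saves i clicks.
    Trails that never visit `ql` save nothing.
    """
    return trail.index(ql) if ql in trail else 0


# Hypothetical trails rooted at nasa.gov; "hubble" is the chosen QL.
trails = [
    ["nasa.gov", "hubble", "photos"],              # QL at index 1
    ["nasa.gov", "hubble", "photos"],
    ["nasa.gov", "hubble"],
    ["nasa.gov", "missions", "hubble", "photos"],  # QL at index 2
    ["nasa.gov", "missions", "hubble"],
    ["nasa.gov", "news"],                          # QL not on the trail
    ["nasa.gov", "missions", "iss"],
]

total = sum(clicks_saved(t, "hubble") for t in trails)
print(total)  # 1*3 + 2*2 = 7
```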
Problem Formulation
• However…
  • Unknown pages might become QLs
    [Figure: lyrics.com linking to pages A, B, C, …, Z; these could become the “best” QLs]
  • Automatic-redirect pages might become QLs:
    • nytimes.com forces logging in
    • aaa.com forces zipcode entry
• We need QLs that are “noticeable” in a search context
Problem Formulation
• How can we estimate noticeability?
  • Via search click-logs
• Noticeability of a URL u: α(u)
  [Formula not preserved; its annotated inputs are the fraction of search clicks for u on the website and a tuning parameter (≈ 2)]
• User notices a useful QL with probability α(u)
Problem Formulation
• Assumption: the user picks the best QL that he/she notices
[Figure: click-trail graph rooted at nasa.gov with two QLs, QL1 and QL2; trails through neither QL save 0]

  # trails   prob         # clicks saved
  2          α1           2
  1          α1           1
  2          (1-α2)α1     1
  2          α2           2

Total = 5α1 + 4α2 + 2(1-α2)α1
• If only QL1 is perfectly noticeable (α1=1, α2=0): Total = 7 clicks (as if 1 QL only)
• If both QLs are perfectly noticeable (α1=1, α2=1): Total = 9 clicks
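The expected-savings arithmetic can be checked with a short sketch. It encodes the rule that the user takes the best QL he or she notices; summing the four table rows (# trails × prob × # clicks) gives 5α1 + 4α2 + 2α1(1-α2), which reproduces both stated totals. The grouping of trails is read off the table rows, and the function and variable names are illustrative assumptions.

```python
from typing import List, Tuple

# One trail group: (number of trails, [(QL index, clicks saved by that QL)]).
# QL index 0 plays the role of QL1 and index 1 the role of QL2.
Group = Tuple[int, List[Tuple[int, int]]]


def expected_savings(groups: List[Group], alpha: List[float]) -> float:
    """Expected clicks saved when the user takes the best QL noticed."""
    total = 0.0
    for n_trails, options in groups:
        p_none_better = 1.0  # probability that no better QL was noticed
        # Consider the QLs on these trails from most to least clicks saved.
        for ql, saved in sorted(options, key=lambda o: -o[1]):
            total += n_trails * saved * alpha[ql] * p_none_better
            p_none_better *= 1.0 - alpha[ql]
    return total


groups = [
    (2, [(0, 2)]),          # 2 trails: QL1 saves 2
    (1, [(0, 1)]),          # 1 trail: QL1 saves 1
    (2, [(1, 2), (0, 1)]),  # 2 trails: QL2 saves 2, otherwise QL1 saves 1
]

print(expected_savings(groups, [1.0, 0.0]))  # 7.0
print(expected_savings(groups, [1.0, 1.0]))  # 9.0
```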
Problem Formulation
• Which k URLs should be QLs?
• Maximize the expected number of clicks saved using at most k QLs, while incorporating “noticeability”
Algorithms
• Maximize expected number of saved clicks using k QLs: NP-hard
• Theorem: this objective is non-decreasing submodular
  • Non-negative
  • Adding QLs never hurts
  • “Diminishing returns”: the marginal improvement from adding a URL u to a set S is at least the marginal improvement from adding u to a superset S′
Algorithms
• Greedy algorithm: iteratively pick the QL that increases the number of saved clicks the most
• Within a factor (1-1/e) of OPT [Nemhauser+/’78]
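A minimal sketch of the greedy step, under stated assumptions: each trail is a list of URLs starting at the homepage with no URL repeated, a QL at index i on a trail saves i clicks, and the user takes the best QL noticed. The trails, the α values, and the function names are hypothetical; this is an illustration, not the authors' implementation.

```python
from typing import Dict, List, Set


def expected_saved(trails: List[List[str]], alpha: Dict[str, float],
                   qls: Set[str]) -> float:
    """Expected clicks saved by the QL set `qls` over all trails."""
    total = 0.0
    for trail in trails:
        p_none_better = 1.0  # probability that no deeper QL was noticed
        # Walk from the deepest page back: deeper QLs save more clicks.
        for depth in range(len(trail) - 1, 0, -1):
            url = trail[depth]
            if url in qls:
                total += depth * alpha[url] * p_none_better
                p_none_better *= 1.0 - alpha[url]
    return total


def greedy_quicklinks(trails: List[List[str]], alpha: Dict[str, float],
                      k: int) -> List[str]:
    """Pick up to k QLs, each time adding the URL with the largest gain."""
    chosen: List[str] = []
    candidates = {u for t in trails for u in t[1:]}  # homepage excluded
    for _ in range(k):
        base = expected_saved(trails, alpha, set(chosen))
        best_url, best_gain = None, 0.0
        for url in sorted(candidates - set(chosen)):
            gain = expected_saved(trails, alpha, set(chosen) | {url}) - base
            if gain > best_gain:
                best_url, best_gain = url, gain
        if best_url is None:  # no remaining URL adds any savings
            break
        chosen.append(best_url)
    return chosen


# Hypothetical toolbar trails and noticeability values.
trails = [
    ["nasa.gov", "hubble", "photos"],
    ["nasa.gov", "hubble", "photos"],
    ["nasa.gov", "hubble"],
    ["nasa.gov", "missions", "hubble", "photos"],
    ["nasa.gov", "missions", "hubble"],
    ["nasa.gov", "news"],
    ["nasa.gov", "missions", "iss"],
]
alpha = {"hubble": 0.8, "photos": 0.5, "missions": 0.3,
         "news": 0.2, "iss": 0.1}

print(greedy_quicklinks(trails, alpha, 2))  # ['hubble', 'photos']
```

Each round re-evaluates the objective for every remaining candidate, which is simple but costs one full pass over the trails per candidate; that is acceptable for a sketch on small data.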
Algorithms
• However…
  • Inhomogeneous results: QLs for ea.com are
    • fifa08.ea.com
    • battlefield.ea.com
    • 6 webpages deep inside thesim2.ea.com
    (slide annotation: two games made by EA)
  • Redundant results: QLs for senate.gov include
    • obama.senate.gov
    • obama.senate.gov/about
    • obama.senate.gov/contact
    • obama.senate.gov/votes
    (slide annotation: parent URL makes the child URLs redundant)
Algorithms
• Both can be specified as pairwise constraints on the URLs allowed to belong to a QL set
• Pairwise-constrained QL selection is NP-hard
• Two-step process:
  • Heuristically find a large subset of trails that form a tree
  • Enforce constraints on the tree
    • A dynamic program is optimal on the tree
Experiments
• Baseline methods
  • TopClicked: URL score = # search clicks on URL
  • TopVisited: URL score = # occurrences on toolbar trails
  • PageRank:
    • Build a weighted graph on URLs, where weight(i,j) = # trails using the i→j edge
    • URL score = PageRank on this graph
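The three baseline scores can be sketched on toy data as follows. The trails and search clicks are hypothetical, and the PageRank shown is a plain power iteration with damping factor 0.85 and uniform redistribution from dangling nodes; both of those choices are assumptions, since the slide does not specify them.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def top_visited(trails: List[List[str]]) -> Counter:
    """TopVisited: score = number of occurrences on toolbar trails."""
    return Counter(url for trail in trails for url in trail)


def edge_weights(trails: List[List[str]]) -> Dict[str, Counter]:
    """weight(i, j) = number of trails using the i -> j edge."""
    w: Dict[str, Counter] = defaultdict(Counter)
    for trail in trails:
        # set(): a trail counts once per edge even if it repeats the edge.
        for i, j in set(zip(trail, trail[1:])):
            w[i][j] += 1
    return w


def pagerank(trails: List[List[str]], damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Weighted PageRank by power iteration (damping is an assumption)."""
    nodes = sorted({u for t in trails for u in t})
    w = edge_weights(trails)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = sum(w[u].values())
            if out == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v, count in w[u].items():
                    new[v] += damping * rank[u] * count / out
        rank = new
    return rank


# Hypothetical data.
trails = [
    ["nasa.gov", "hubble", "photos"],
    ["nasa.gov", "hubble"],
    ["nasa.gov", "missions", "hubble"],
    ["nasa.gov", "news"],
]
search_clicks = ["hubble", "news", "hubble", "photos"]

top_clicked = Counter(search_clicks)  # TopClicked: # search clicks on URL
visited = top_visited(trails)
ranks = pagerank(trails)
print(top_clicked.most_common(1))  # [('hubble', 2)]
print(visited["hubble"])           # 3
```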
Experiments
• Live Traffic dataset
  • Computed CTRs on QLs currently displayed by Yahoo! (1043-website subset)
• Measure:
  • Pick two equal-size subsets of QLs
  • Use sum-of-scores and sum-of-CTRs to predict the better subset
  • Measure how often the predictions match
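The agreement measure can be sketched as below. The scores and CTRs are made-up numbers, and the random sampling of subset pairs and the handling of ties are assumptions; the slide only states that two equal-size subsets are compared by sum of scores and by sum of CTRs.

```python
import random
from typing import Dict


def agreement_rate(scores: Dict[str, float], ctrs: Dict[str, float],
                   size: int, n_pairs: int = 1000, seed: int = 0) -> float:
    """Fraction of subset pairs where scores and CTRs pick the same winner."""
    rng = random.Random(seed)
    urls = sorted(scores)
    agree = compared = 0
    for _ in range(n_pairs):
        # Sort each sample so identical subsets are summed in the same order.
        a = sorted(rng.sample(urls, size))
        b = sorted(rng.sample(urls, size))
        score_diff = sum(scores[u] for u in a) - sum(scores[u] for u in b)
        ctr_diff = sum(ctrs[u] for u in a) - sum(ctrs[u] for u in b)
        if score_diff == 0 or ctr_diff == 0:
            continue  # tie: neither subset is predicted to be better
        compared += 1
        agree += (score_diff > 0) == (ctr_diff > 0)
    return agree / compared if compared else 0.0


# Hypothetical per-QL method scores and live-traffic CTRs for one website.
scores = {"a": 5.0, "b": 3.5, "c": 2.0, "d": 1.5, "e": 0.7, "f": 0.2}
ctrs = {"a": 0.12, "b": 0.05, "c": 0.07, "d": 0.02, "e": 0.03, "f": 0.01}

print(agreement_rate(scores, ctrs, size=2))
```

A method whose scores order subsets exactly as the CTRs do would reach a rate of 1.0; a method that reverses the CTR ordering would reach 0.0.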
Live Traffic Data Experiments
[Figure: fraction of subset-pairs where predictions agree with live traffic (y-axis) vs. subset sizes (x-axis)]
• QL-ALG > TopVisited > PageRank > TopClicked
Experiments
• Tree-structured trails
  • Most dropped trails are very short
  • Tree-structured trails improve accuracy
[Figure: distribution of dropped trails; number of trails dropped (y-axis) vs. length of trail (x-axis, 1 to 10000)]
[Figure: Live Traffic prediction quality comparison]
Conclusions
• Proposed a formulation for the QL selection problem
  • Both toolbar and search logs are used intuitively
• Proposed two algorithms:
  • Greedy: (1-1/e)-optimal
  • Tree-structured: empirically better
• Improvement of 22% over competing baselines