Sandeep Pandey 1 , Sourashis Roy 2 , Christopher Olston 1 , Junghoo Cho 2 , Soumen Chakrabarti 3

Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results • Sandeep Pandey1, Sourashis Roy2, Christopher Olston1, Junghoo Cho2, Soumen Chakrabarti3 • 1 Carnegie Mellon, 2 UCLA, 3 IIT Bombay

Popularity as a Surrogate for Quality • Search engines want to measure the “quality” of pages • Quality is hard to define and measure • Various “popularity” measures are used in ranking • e.g., in-links, PageRank, user traffic

Relationship Between Popularity and Quality • Popularity : depends on the number of users who “like” a page • relies on both awareness and quality of the page • Popularity correlated with quality • when awareness is large

Problem • Popularity/quality correlation weak for young pages • Even if of high quality, may not (yet) be popular due to lack of user awareness • Plus, process of gaining popularity inhibited by “entrenchment effect”

Entrenchment Effect • Search engines show entrenched (already-popular) pages at the top • Users discover pages via search engines; tend to focus on top results • [Figure: ranked result list; user attention concentrates on the entrenched pages at the top]

Outline • Problem introduction • Evidence of entrenchment effect • Key idea: Mitigate entrenchment by introducing randomness into ranking • Model of ranking and popularity evolution • Evaluation • Summary

Evidence of the Entrenchment Effect • “Do search engines suppress controversy?” - Susan L. Gerhart • “More news, less diversity” - New York Times • “Googlearchy”: distinction of retrievability and visibility • “The politics of search engines” - IEEE Computer • “The political economy of linking on the Web” - ACM Conf. on Hypertext & Hypermedia • “Are search engines biased?” - Chris Sherman • “Bias on the Web” - Comm. of the ACM

Quantification of Entrenchment Effect • Impact of Search Engines on Page Popularity • Real Web study by Cho et al. [WWW’04] • Pages downloaded every week from 154 sites • Partitioned into 10 groups based on initial link popularity • After 7 months: • 70% of new links went to the top 20% of pages • PageRank decreased for the bottom 50% of pages

Alternative Approaches to Counteract Entrenchment Effect • Weight links to young pages more • [Baeza-Yates et al. SPIRE ’02] • Proposed an age-based variant of PageRank • Extrapolate quality based on increase in popularity • [Cho et al. SIGMOD ’05] • Proposed an estimate of quality based on the derivative of popularity

Our Approach: Randomized Rank Promotion • Select random (young) pages to promote to good rank positions • Rank position to promote to is chosen at random • [Figure: ranked list before and after promotion; the page at rank 500 moves to rank 2 and the pages in between shift down by one]

Our Approach: Randomized Rank Promotion • Consequence: Users visit promoted pages; improves quality estimate • Compared with previous approaches: • Does not rely on temporal measurements (+) • Sub-optimal (-)

Exploration/Exploitation Tradeoff • Exploration/Exploitation tradeoff • exploit known high-quality pages by assigning good rank positions • explore quality of new pages by promoting them in rank • Existing search engines only exploit (to our knowledge)

Possible Objectives for Rank Promotion • Fairness • Give each page an equal chance to become popular • Incentive for search engines to be fair? • Quality (our choice) • Maximize quality of search results seen by users (in aggregate) • Quality of page p: extent to which users “like” p • Q(p) ∈ [0,1]

Model of the Web • Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.) • A community is made up of a set of pages, interested users, and related queries

Model of the Web • Users visit pages only by issuing queries to search engine • Mixed surfing & searching considered in the paper • Query answer = ordered list containing all pages in the corresponding community • A single ranked list associated with each community • Since queries within a community are very similar

Model of the Web • Consequence: Each community evolves independently of the other communities • [Figure: separate ranked lists for the community on Squash and the community on Linux]

Quality-Per-Click Metric (QPC) • V(p,t): number of visits to page p at time t • QPC: average quality of pages viewed by users, amortized over time
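The QPC formula itself is not in the transcript. As a minimal sketch, assuming QPC is the visit-weighted average of Q(p) over a finite observation window, it could be computed as follows. The function name `quality_per_click` and the list-of-lists layout for V(p,t) are hypothetical choices for illustration.

```python
from typing import Sequence


def quality_per_click(visits: Sequence[Sequence[float]],
                      quality: Sequence[float]) -> float:
    """Visit-weighted average quality over a finite time window.

    visits[t][p] plays the role of V(p,t); quality[p] plays the role of Q(p).
    Assumption: QPC = sum_t sum_p V(p,t) * Q(p) / sum_t sum_p V(p,t).
    """
    total_quality = 0.0
    total_visits = 0.0
    for visits_at_t in visits:
        for v, q in zip(visits_at_t, quality):
            total_quality += v * q
            total_visits += v
    if total_visits == 0:
        raise ValueError("no visits recorded; QPC is undefined")
    return total_quality / total_visits


# Two pages with Q = 0.9 and Q = 0.1, observed over two time steps.
visits = [[10, 90], [50, 50]]
quality = [0.9, 0.1]
print(quality_per_click(visits, quality))  # roughly 0.34
```

In the example, most early visits go to the low-quality page, which pulls the average well below the quality of the better page.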

Desiderata for Randomized Rank Promotion • Want ability to: • Control exploration/exploitation tradeoff • “Select” certain pages as candidates for promotion • “Protect” certain pages from demotion • [Figure: ranked list before and after promotion, with the page at rank 500 moved to rank 2]

Randomized Rank Promotion Scheme • The set of all pages W is split into a promotion pool Wm and the remainder W-Wm • Promotion pool Wm: random ordering gives list Lm • Remainder W-Wm: ordering by popularity gives list Ld
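The split into the two lists can be sketched as below. This is an illustration only: the function name `build_lists`, the dictionary of popularity values, and the page identifiers are hypothetical, and how the pool Wm is chosen is left to the caller.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_lists(popularity: Dict[str, float], pool: List[str],
                rng: Optional[random.Random] = None) -> Tuple[List[str], List[str]]:
    """Return (Lm, Ld): a shuffled promotion list and a popularity-ordered remainder."""
    rng = rng or random.Random()
    pool_set = set(pool)
    unknown = [p for p in pool_set if p not in popularity]
    if unknown:
        raise ValueError(f"pool contains unknown pages: {unknown}")
    # Lm: pages of the promotion pool Wm in random order.
    lm = sorted(pool_set)
    rng.shuffle(lm)
    # Ld: all remaining pages (W - Wm), most popular first.
    ld = sorted((p for p in popularity if p not in pool_set),
                key=lambda p: -popularity[p])
    return lm, ld


popularity = {"a": 0.5, "b": 0.0, "c": 0.3, "d": 0.0}
lm, ld = build_lists(popularity, pool=["b", "d"], rng=random.Random(1))
print(lm, ld)
```

Here the pool holds the two pages with zero popularity, so Ld contains only the pages that already have some popularity.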

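Randomized Rank Promotion Scheme • [Figure: merging the remainder list Ld and the promotion list Lm into one result list; the first k-1 positions come from Ld, and each later position is filled from Lm with probability r or from Ld with probability 1-r; example with k = 3, r = 0.5]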
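A minimal sketch of the merge step follows, assuming the procedure suggested by the figure labels (k-1, r, 1-r): the top k-1 positions are taken from Ld, and every later position is drawn from Lm with probability r and from Ld otherwise. The function name `randomized_merge` and the handling of an exhausted list (the other list supplies the rest) are assumptions for illustration, and Lm is expected to arrive already in random order.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def randomized_merge(ld: Sequence[T], lm: Sequence[T], k: int, r: float,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Merge popularity-ordered list Ld with randomly ordered promotion list Lm.

    Assumed procedure: positions 1..k-1 are filled from Ld; every later
    position takes the next page of Lm with probability r, otherwise the
    next page of Ld. When one list runs out, the other supplies the rest.
    """
    if k < 1 or not 0.0 <= r <= 1.0:
        raise ValueError("need k >= 1 and 0 <= r <= 1")
    rng = rng or random.Random()
    n_protected = min(k - 1, len(ld))
    result = list(ld[:n_protected])
    i, j = n_protected, 0
    while i < len(ld) or j < len(lm):
        # Promote only if Lm still has pages; forced when Ld is exhausted.
        take_promoted = j < len(lm) and (i >= len(ld) or rng.random() < r)
        if take_promoted:
            result.append(lm[j])
            j += 1
        else:
            result.append(ld[i])
            i += 1
    return result


ld = ["d1", "d2", "d3", "d4"]
lm = ["m1", "m2"]
print(randomized_merge(ld, lm, k=3, r=0.5, rng=random.Random(0)))
```

With r = 0 the result is Ld followed by Lm, which matches plain popularity-based ranking with the pool appended at the end; with r = 1 the whole promotion list is placed directly after the k-1 protected positions.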

Parameters • Promotion pool (Wm) • Uniform rank promotion: give an equal chance to each page • Selective rank promotion: exclusively target zero-awareness pages • Start rank (k) • rank to start randomization from • Degree of randomization (r) • controls the tradeoff between exploration and exploitation

Tuning the Parameters • Objective: maximize quality-per-click (QPC) • Entrenchment in a community depends on many factors • Number of pages and users • Page lifetimes • Visits per user • Two ways to tune • set parameters per community • one parameter setting for all communities

Popularity Evolution Cycle • Popularity P(p,t) → Rank R(p,t) → Visit rate V(p,t) → Awareness A(p,t) → back to Popularity P(p,t)

DETAIL Popularity to Rank Relationship • Rank of a page under randomized rank promotion scheme • determined by a combination of popularity and randomness • Deterministic popularity-based ranking is a special case • i.e., r=0 • Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme • R(p,t) = FPR(P(p,t))

DETAIL Viewing Likelihood • Depends primarily on rank in list [Joachims KDD’02] • From AltaVista data [Lempel et al. WWW’03]: FRV(r) ∝ r^-1.5 • [Figure: probability of viewing vs. rank, for ranks 0 to 150]
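To get a feel for how steeply a power law with exponent 1.5 concentrates attention on the top ranks, the sketch below normalizes the weights over a result list of fixed length. The normalization (probabilities summing to 1 over the list) and the function name `view_probabilities` are assumptions for illustration; the slide only gives the shape of the curve.

```python
from typing import List


def view_probabilities(n_pages: int, exponent: float = 1.5) -> List[float]:
    """Probability that a single view lands on each rank 1..n_pages.

    Assumption: views follow a power law proportional to rank ** -exponent,
    normalized so the probabilities sum to 1 over the list.
    """
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    weights = [rank ** -exponent for rank in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(150)
print(f"rank 1: {probs[0]:.3f}, rank 10: {probs[9]:.4f}")
print(f"share of views on the top 10 ranks: {sum(probs[:10]):.2f}")
```

Under this form, rank 1 receives 8 times as many views as rank 4, since 4 ** 1.5 = 8.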

DETAIL Visit to Awareness Relationship • Awareness A(p,t): fraction of users who have visited page p at least once by time t

DETAIL Awareness to Popularity Relationship • Quality Q(p): extent to which users like page p (contribute towards its popularity) • Popularity P(p,t): depends on both the awareness A(p,t) and the quality Q(p) of the page
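The formulas for the awareness and popularity relationships are not in the transcript. One form consistent with the definitions given, offered here as an assumption rather than as the authors' exact expressions, treats popularity as the fraction of users who are both aware of the page and like it. For awareness, the second line assumes each visit is made by a user drawn uniformly at random from u users, with n the cumulative number of visits to the page; u and n are symbols introduced for this illustration.

```latex
% Assumed product form: aware users who also like the page
P(p,t) = A(p,t) \cdot Q(p)

% Expected awareness after n cumulative visits by uniformly random users
\mathbb{E}\left[A\right] = 1 - \left(1 - \frac{1}{u}\right)^{n}
```

The second expression follows because a given user misses all n visits with probability (1 - 1/u)^n.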

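Popularity Evolution Cycle • Popularity P(p,t) → Rank R(p,t), via FPR(P(p,t)) • Rank R(p,t) → Visit rate V(p,t), via FRV(R(p,t)) • Visit rate V(p,t) → Awareness A(p,t), via FVA(V(p,t)) • Awareness A(p,t) → Popularity P(p,t), via FAP(A(p,t))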
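The cycle can be made concrete with a small expected-value simulation. This is a toy sketch, not the analytical model from the talk. It assumes popularity = awareness × quality, ranking purely by popularity (no promotion, r = 0), visits split across ranks proportionally to rank^-1.5, and every visit coming from a user drawn uniformly at random. The function name `simulate_cycle` and all parameter names are hypothetical.

```python
from typing import List


def simulate_cycle(quality: List[float], awareness: List[float],
                   n_users: int, visits_per_day: float, days: int,
                   exponent: float = 1.5) -> List[float]:
    """Run the popularity -> rank -> visits -> awareness loop in expectation.

    All four relationships used here are illustrative assumptions
    (see the surrounding text); the function returns the final awareness.
    """
    awareness = list(awareness)
    n = len(quality)
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    for _ in range(days):
        # Awareness -> popularity (assumed product form).
        popularity = [a * q for a, q in zip(awareness, quality)]
        # Popularity -> rank; ties keep their index order (sort is stable).
        order = sorted(range(n), key=lambda p: -popularity[p])
        for position, page in enumerate(order):
            # Rank -> visit rate (assumed power law over rank).
            visits = visits_per_day * weights[position] / total
            # Visit rate -> awareness: expected share of users newly reached.
            newly_aware = 1.0 - (1.0 - 1.0 / n_users) ** visits
            awareness[page] += (1.0 - awareness[page]) * newly_aware
    return awareness


# Nine established pages of moderate quality and one new high-quality page.
quality = [0.4] * 9 + [0.9]
start = [0.5] * 9 + [0.0]
final = simulate_cycle(quality, start, n_users=1000, visits_per_day=1000, days=30)
print([round(a, 3) for a in final])
```

Because the new page starts at the bottom of the list, it collects only a small share of the daily visits, so its awareness grows slowly even though its quality is the highest in the community.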

Deriving Popularity Evolution Curve • Next step: derive formula for popularity evolution curve • Derive it using the awareness distribution of pages • [Figure: popularity P(p,t) vs. time t]

Deriving Popularity Evolution Curve • Assumptions • Number of pages is constant • Pages are created and retired according to a Poisson process with a given rate parameter • Quality distribution of pages is stationary • In the steady state, both the popularity and the awareness distributions of the pages are stationary

DETAIL Popularity Evolution Curve and Awareness Distribution • Awareness distribution: fraction of pages of quality q whose awareness is i / (#users) • Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x • Next: derive popularity evolution curve using the awareness distribution

DETAIL Popularity Evolution Curve and Awareness Distribution • Awareness distribution: interpret it as the probability that a page of quality q has awareness ai at any point in time • We know that : Hence,

DETAIL Deriving Awareness Distribution • Awareness distribution: fraction of pages of quality q whose awareness is i / (#users) • Steady-state analysis yields the awareness distribution in terms of FPR • But remember that we do not know FPR yet: R(p,t) = FPR(P(p,t))

DETAIL Deriving Awareness Distribution • Good news: since rank is a combination of popularity and randomness, we can derive FPR given the awareness distribution • Start with an initial form of FPR; iterate until convergence

Summary of Where We Stand • Formalized the popularity evolution cycle • Relationship between popularity evolution and awareness distribution • Derived the awareness distribution • Next step: tune parameters • Recall, goal is to obtain scheme that: • achieves high QPC (quality per click) • is robust across a wide range of community types

Tuning the Promotion Scheme • Parameters: k, r and Wm • Objective: maximize QPC • Influential factors: • Number of pages and users • Page lifetimes • Visits per user

Default Community Setting • Number of pages = 10,000 * • Number of users = 1000 • Visits per user = 1000 visits per day • Page lifetimes = 1.5 years [Ntoulas et al., WWW’04] • * How Much Information? SIMS, Berkeley, 2003

Tuning: Wm parameter • [Figure: comparison of no promotion, uniform promotion, and selective promotion, with k=1 and r=0.2]

Tuning: k and r • Optimal r lies in (0,1) • Optimal r increases with increasing k • Based on simulation (reason: analysis only accurate for small values of r)

Tuning: k and r Deciding k & r : • k >= 2 for “feeling lucky” • Minimize amount of “junk” perceived • Maximize QPC

Final Parameter Settings • Promotion pool (Wm ): zero-awareness pages • Start rank (k): 1 or 2 • Randomization (r) : 0.1

Tuning the Promotion Scheme • Parameters: k, r and Wm • Objective: maximize QPC • Influential factors: • Number of pages and users • Page lifetimes • Visits per user

Influence of Number of Pages and Users

Influence of Page Lifetime and Visit rate

Influence of Visit Rate 1000 visits/day per user

Summary • Entrenchment effect hurts search result quality • Solution: Randomized rank promotion • Model of Web evolution and QPC metric • Used to tune & evaluate randomized rank promotion • Initial results • Significantly increases QPC • Robust across wide range of Web communities • More study required

THE END • Paper available at : www.cs.cmu.edu/~spandey
