
Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data




Presentation Transcript


  1. Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data • WWW 2010 • Yisong Yue (Cornell Univ.) • Rajan Patel (Google Inc.) • Hein Roehrig (Google Inc.)

  2. User Feedback in Search Systems • Cheap & representative feedback • Evaluation metrics • Optimization criterion • How to interpret feedback accurately? • Clicks on (web) search results • Data plentiful • Important domain

  3. Interpreting Clicks • What does a click mean? • Does a click mean the result was good? • If so, how good?

  4. How Are Clicks Biased? • In what ways do clicks not directly reflect user utility or preferences? • Presentation bias: users can only click on what they pay attention to • E.g., position bias (more clicks at the top of the ranking) • Understanding presentation bias is essential to interpreting feedback more accurately

  5. Maybe the 3rd result looked more relevant • I.e., users judging a book by its cover • Maybe the 3rd result attracted more attention • E.g., an eye-catching summary with many matching query terms (shown in bold)

  6. Summary Attractiveness • Goal: quantify the effect of summary attractiveness on click behavior • Web search context • First study to conduct a rigorous statistical analysis of summary attractiveness bias

  7. Controlling for Position • Position bias is the largest biasing effect • Need to control for it in order to analyze other biasing effects • Use FairPairs randomization • [Radlinski & Joachims, 2006]

  8. FairPairs Example • Original: 1 2 3 4 5 6 7 8 9 10 • FairPairs scheme 1: (1 2) (3 4) (5 6) (7 8) (9 10) • Swap: 2 1 3 4 6 5 8 7 9 10 • FairPairs scheme 2: 1 (2 3) (4 5) (6 7) (8 9) 10 • Swap: 1 2 3 5 4 7 6 9 8 10 • Randomly choose pairing scheme • Randomly swap each intra-pair ordering independently [Radlinski & Joachims, AAAI 2006]
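The randomization is simple to implement. A minimal sketch in Python (the function name and interface are illustrative, not from the paper):

    import random

    def fair_pairs(ranking, rng=random):
        # FairPairs randomization [Radlinski & Joachims, AAAI 2006]:
        # randomly pick one of two pairing schemes, then flip each
        # adjacent pair independently with probability 1/2, so each
        # result in a pair appears on top equally often in expectation.
        result = list(ranking)
        # Scheme 1 pairs positions (0,1), (2,3), ...; scheme 2 leaves
        # the first result fixed and pairs (1,2), (3,4), ...
        start = rng.choice([0, 1])
        for i in range(start, len(result) - 1, 2):
            if rng.random() < 0.5:
                result[i], result[i + 1] = result[i + 1], result[i]
        return result

    # E.g., fair_pairs(list(range(1, 11))) might return
    # [2, 1, 3, 4, 6, 5, 8, 7, 9, 10].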

  9. Interpreting FairPairs Clicks • If result B receives more clicks than result A across both intra-pair orderings, conclude B > A • Clicks indicate pairwise preference (relative quality)

  10. Thought Experiment • Two results A & B • Equally relevant for some query • Ranked adjacently in search results • AB and BA shown equally often (FairPairs) • A has an attractive title. B does not. • Who gets more clicks, A or B?

  11. Click Data • Ran FairPairs randomization on a portion of Google US web search traffic • 8/1/2009 to 8/20/2009 • 439,246 clicks collected

  12. Human Judged Ratings • Sampled a subset of 1150 FairPairs • Asked human raters to explicitly judge which result of the pair is more relevant • 5 judgments for each pair • Raters must navigate to the landing page, so ratings reflect the actual document rather than its summary

  13. Measuring Attractiveness • Relative measure of attractiveness: difference in the number of bolded query terms in the title & abstract between the two results of a pair • Example: the bottom result has +2 bolded terms in the title • And +2 bolded terms in the abstract • (A sketch of this feature computation follows.)
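A minimal sketch of how such a relative bolding feature might be computed, assuming snippets arrive as HTML with matched query terms wrapped in <b> or <em> tags (field names and helper functions are illustrative, not from the paper):

    import re

    def bolded_term_count(snippet_html):
        # Search engines typically wrap matched query terms in
        # <b>...</b> or <em>...</em>, so counting those spans
        # approximates the number of bolded query terms.
        return len(re.findall(r"<b>.*?</b>|<em>.*?</em>", snippet_html))

    def bolding_difference(top, bottom):
        # Relative attractiveness features for a FairPairs result
        # pair, from the bottom result's perspective:
        # (title difference, abstract difference) in bolded-term counts.
        return (
            bolded_term_count(bottom["title"]) - bolded_term_count(top["title"]),
            bolded_term_count(bottom["abstract"]) - bolded_term_count(top["abstract"]),
        )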

  14. Measuring Attractiveness • Clearly, query/title similarity is informative • Good results should have titles that strongly match the query • But would blindly counting clicks cause us to over-value query/title similarity?

  15. Rated Clicks Model
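The slide presents the model as an equation. A plausible reconstruction in LaTeX, assuming a logistic regression over the judged relevance difference and the two bolding differences (the exact parameterization on the slide may differ):

    % r   = judged relevance difference (bottom minus top)
    % x_t = bolded-term difference in the title
    % x_a = bolded-term difference in the abstract
    P(\text{click on bottom result}) =
      \frac{1}{1 + \exp\!\left( -(\alpha + \beta_r r + \beta_t x_t + \beta_a x_a) \right)}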

  16. Null Hypothesis • Title & abstract bolding have zero effect • Position and relative (judged) quality are the only factors affecting click probability
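In terms of the model sketched above, the null hypothesis pins the bolding coefficients to zero:

    H_0 :\; \beta_t = \beta_a = 0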

  17. Fitted Model

  18. Leveraging All Clicks • Previous model required human judgments • We need to calibrate against relative quality • How to do this on all 400,000+ clicks? • Make independence assumptions!

  19. Intuition • Virtually all search engines predict rankings using many attributes (or features). • Query/title similarity is only one component. • Example: a document with low query/title similarity might achieve high ranking due to very relevant body text.

  20. Example • 1st feature: query/title similarity • 2nd feature: query/body similarity • (Figure: pairwise orderings of example documents under the two features.)

  21. Example • (Figure continued: a document can rank highly overall despite low query/title similarity.)

  22. Assumption • Take pairs of adjacent documents at random • Collect relative relevance ratings (human-rated preferences) • These preferences should be independent of the title bolding difference • Can be checked using a statistical model

  23. Rated Agreement Model
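Again shown as an equation on the slide. A plausible form, assuming a logistic model of the probability that raters prefer the bottom result given only the bolding differences (notation mirrors the sketch above):

    P(\text{raters prefer bottom result}) =
      \frac{1}{1 + \exp\!\left( -(\alpha' + \gamma_t x_t + \gamma_a x_a) \right)}
    % The independence assumption holds if \gamma_t \approx \gamma_a \approx 0.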

  24. Fitted Model • Assumption approximately satisfied for query/title similarity

  25. Title Bias Effect (All Clicks) • The bars would be equal if clicks were unbiased

  26. Evaluation Metrics & Optimization • Pairwise preferences are common for evaluation • E.g., maximize FairPairs agreement • Goal: maximize pairwise relevance agreement • Want this to be aligned with click agreement • Danger: might conclude the current system is undervaluing query/title similarity • Remedy: down-weight clicks on results with more title bolding • E.g., weight clicks by exp(-wᵀx), where x holds the bolding features (sketch below)
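A minimal sketch of such re-weighting in Python; the exp(-wᵀx) form is from the slide, but the coefficient values here are made up for illustration:

    import math

    def click_weight(bolding_features, w):
        # Down-weight a click in proportion to how eye-catching the
        # result's summary was, via exp(-w^T x).
        #   bolding_features: e.g. (title bolding diff, abstract bolding diff)
        #   w: bias coefficients, e.g. taken from a fitted click model
        return math.exp(-sum(wi * xi for wi, xi in zip(w, bolding_features)))

    # A click on a result with +2 bolded title terms counts for less
    # than a click on an otherwise identical, plainer result.
    print(click_weight((2, 0), w=(0.3, 0.1)))  # ~0.55
    print(click_weight((0, 0), w=(0.3, 0.1)))  # 1.0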

  27. Directions to Explore • Other ways to measure summary attractiveness • Use other summary content • Other forms of presentation bias • Anything that draws people’s attention • Ways to interpret and adjust for bias • More accurate ways to quantify bias • More accurate evaluation metrics

  28. Extra Slides

  29. Fitted Model (All Clicks)
