Explore the challenges of handling ads of unknown quality in search advertising while optimizing revenue for the search engine. Learn about the advertisement problem, the multi-armed bandit problem, and the budgeted multi-armed multi-bandit problem (BMMP).
Handling Advertisements of Unknown Quality in Search Advertising
Sandeep Pandey and Christopher Olston (CMU and Yahoo! Research)
Sponsored Search
• How does it work?
  • Search engine displays ads next to search results
  • Advertisers pay the search engine per click
• Who benefits from it?
  • Main source of funding for search engines
  • Information flow from advertisers to users
Sponsored Search
• Click-through rate (CTR): given an ad and a query, CTR = probability that the ad receives a click
• Optimal policy to maximize the search engine's revenue: display the ads of highest (CTR x bid) value
[Figure: results page showing sponsored search results next to the search query results]
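When the CTRs are known, this optimal policy is a plain sort. A minimal sketch in Python (the ad records and field names are hypothetical, not from the paper):

```python
def select_ads(ads, C):
    """Display the C ads with the highest expected revenue, CTR x bid."""
    return sorted(ads, key=lambda ad: ad["ctr"] * ad["bid"], reverse=True)[:C]

ads = [
    {"name": "a1", "ctr": 0.10, "bid": 0.50},  # expected revenue 0.050
    {"name": "a2", "ctr": 0.02, "bid": 2.00},  # expected revenue 0.040
    {"name": "a3", "ctr": 0.05, "bid": 1.50},  # expected revenue 0.075
]
print([ad["name"] for ad in select_ads(ads, C=2)])  # -> ['a3', 'a1']
```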
Challenges in Sponsored Search
[Figure: feedback cycle connecting "show ads", "record clicks", "refine CTR estimates", and "earn revenue"]
• Problem: CTRs initially unknown
  • estimating CTRs requires going around the circle
• Exploration/Exploitation Tradeoff:
  • explore ads to estimate CTRs
  • exploit known high-CTR ads to maximize revenue
The Advertisement Problem
[Figure: advertisers A1-A3 with daily budgets d1-d3 submit ads (e.g. a1,1, a2,1, a1,3, a3,2) for query phrases Q1-Q3]
• Problem:
  • Advertiser Ai submits ad ai,j for query phrase Qj
  • User clicks on ai,j -> Ai pays bi,j (the "bid value")
  • Queries arrive one after another
  • Select ads to show for each query, in an online fashion
• Constraints:
  • Show at most C ads per query
  • Advertisers have daily budgets: Ai pays at most di
• Goal: Maximize the search engine's revenue (see the data-model sketch below)
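A minimal data model of this setup (class and field names are mine, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: str    # A_i
    query_phrase: str  # Q_j
    bid: float         # b_i,j: price paid per click

@dataclass
class Advertiser:
    name: str
    budget: float      # d_i: daily cap on total payments
    spent: float = 0.0

    def can_afford(self, bid: float) -> bool:
        # An ad stays eligible only while its advertiser can cover the bid.
        return self.spent + bid <= self.budget
```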
Our Approach
• Unbudgeted Advertisement Problem
  • Isomorphic to the multi-armed bandit problem
• Budgeted Advertisement Problem
  • Similar to the bandit problem, but with additional budget constraints that span arms
  • Introduce the Budgeted Multi-armed Multi-bandit Problem (BMMP)
Unbudgeted Advertisement Problem as a Multi-armed Bandit Problem
[Figure: slot-machine arms with payoff probabilities p1, p2, p3]
• Bandit: classical example of online learning under the explore/exploit tradeoff
• K arms. Arm i has an associated reward ri and an unknown payoff probability pi
• Pull C arms at each time instant to maximize the reward accrued over time
• Isomorphism: query phrase <-> bandit instance; ads <-> arms; CTR <-> payoff probability; bid <-> reward
Policy for the Unbudgeted Problem
• Policy "MIX" (adapted from [Auer et al., ML'02])
• When query phrase Qj arrives:
  • Compute the priority pi,j of each ad ai,j, where pi,j = (ei,j + sqrt(2 ln nj / ni,j)) . bi,j
    • ei,j is the MLE of the CTR value of ai,j
    • bi,j is the price or bid value of ad ai,j
    • ni,j : # times ad ai,j has been shown in the past
    • nj : # times query phrase Qj has been answered
  • Display the C highest-priority ads (sketched below)
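A sketch of the MIX priority rule, directly transcribing the formula above (ad records are dicts with hypothetical keys; giving never-shown ads infinite priority is my assumption):

```python
import math

def mix_priority(ad, n_j):
    """p_i,j = (e_i,j + sqrt(2 ln n_j / n_i,j)) * b_i,j."""
    if ad["shown"] == 0:
        return float("inf")  # assumption: explore never-shown ads first
    e = ad["clicks"] / ad["shown"]  # e_i,j: MLE of the CTR
    bonus = math.sqrt(2.0 * math.log(n_j) / ad["shown"])
    return (e + bonus) * ad["bid"]

def answer_query(ads, n_j, C):
    """When query phrase Q_j arrives, display the C highest-priority ads."""
    return sorted(ads, key=lambda a: mix_priority(a, n_j), reverse=True)[:C]
```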
Budgeted Multi-armed Multi-bandit Problem (BMMP)
• Finite set of bandit instances; each instance has a finite number of arms
• Each arm has an associated type
• Each type Ti has a budget di
  • Upper limit on the total amount of reward that can be generated by the arms of type Ti
• An external actor invokes a bandit instance at each time instant
  • The policy must choose C arms of the invoked instance
Meta-Policy for BMMP
• Input: a BMMP instance and a policy POL for the conventional multi-armed bandit problem
• Output: the following policy BPOL (sketched below)
  • Run POL in parallel for each bandit instance Bi; call that run POLi
  • Whenever Bi is invoked:
    • Discard arm(s) with depleted budgets
    • If one or more arms were discarded, restart POLi
    • Let POLi decide which of the remaining arms to activate
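A sketch of BPOL under assumed interfaces: make_pol builds a fresh POL run over a set of arms, arms expose remaining_budget(), and a POL run exposes choose(C). These hooks are mine, not the paper's:

```python
class BPOL:
    """Meta-policy: one POL run per bandit instance, restarted on budget depletion."""

    def __init__(self, make_pol, C):
        self.make_pol = make_pol  # factory: list of arms -> fresh POL run
        self.runs = {}            # bandit instance -> (live arm set, POL run)
        self.C = C

    def on_invoke(self, instance, arms):
        # Discard arms whose budget is depleted.
        alive = frozenset(a for a in arms if a.remaining_budget() > 0)
        prev = self.runs.get(instance)
        # Restart POL_i whenever the surviving arm set has shrunk.
        if prev is None or prev[0] != alive:
            self.runs[instance] = (alive, self.make_pol(list(alive)))
        # Let POL_i decide which C of the remaining arms to activate.
        return self.runs[instance][1].choose(self.C)
```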
Performance Guarantee of BPOL
• OPT = algorithm that knows in advance:
  • the full sequence of bandit invocations
  • the payoff probabilities
• Claim: bpol(N) >= opt(N)/2 - O(f(N))
  • bpol(N): total expected reward of the BPOL policy after N bandit invocations
  • opt(N): total expected reward of OPT
  • f(N): regret of POL after N invocations of the regular bandit problem
Proof of the Performance Guarantee
• Divide the time instants into 3 categories:
  • 1 : BPOL chooses an arm of higher expected reward than OPT
    • opt1(N) <= bpol1(N)
  • 2 : BPOL chooses an arm of lower expected reward because OPT's arm has run out of budget
    • opt2(N) <= bpol2(N) + (#types . max reward)
  • 3 : otherwise
    • opt3(N) = O(f(N))
• Claim (follows from the bounds above; arithmetic spelled out below):
  • opt(N) <= bpol(N) + bpol(N) + O(1) + O(f(N))
  • bpol(N) >= opt(N)/2 - O(f(N))
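Making the combination step explicit (r_max denotes the maximum reward of a single time instant, so #types . r_max is a constant independent of N):

```latex
% Sum the per-category bounds and use bpol_1(N) <= bpol(N), bpol_2(N) <= bpol(N):
\begin{align*}
\mathrm{opt}(N) &= \mathrm{opt}_1(N) + \mathrm{opt}_2(N) + \mathrm{opt}_3(N) \\
  &\le \mathrm{bpol}_1(N) + \mathrm{bpol}_2(N)
       + \underbrace{\#\mathrm{types} \cdot r_{\max}}_{O(1)} + O(f(N)) \\
  &\le 2\,\mathrm{bpol}(N) + O(1) + O(f(N)).
\end{align*}
% Rearranging gives bpol(N) >= opt(N)/2 - O(f(N)).
```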
Advertisement Policies
• BMIX : output of our generic BPOL policy when given MIX as input
• BMIX-E : replace sqrt(2 ln nj / ni,j) in priority pi,j by sqrt(min(0.25, V(ni,j, nj)) . ln nj / ni,j), where V(ni,j, nj) = ei,j . (1 - ei,j) + sqrt(2 ln nj / ni,j)
  • Suggested in [Auer et al., ML'02]
  • Purpose: aggressive exploitation
• BMIX-T : replace bi,j in priority pi,j by bi,j . throttle(di'), where throttle(di') = 1 - e^(-di'/di) and di' is the remaining budget of advertiser Ai
  • Suggested in [Mehta et al., FOCS'05]
  • Purpose: delay the depletion of advertisers' budgets
• BMIX-ET : with both the E and T modifications (both sketched below)
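Sketches of the two modifications. Note that reading V(ni,j, nj) as the UCB1-Tuned variance bound ei,j(1 - ei,j) + sqrt(2 ln nj / ni,j) is my reconstruction of the slide's formula:

```python
import math

def exploration_term_E(e_ij, n_ij, n_j):
    """BMIX-E: variance-aware confidence radius, capped at the maximum
    Bernoulli variance 0.25 (UCB1-Tuned style; reconstruction, see above)."""
    v = e_ij * (1.0 - e_ij) + math.sqrt(2.0 * math.log(n_j) / n_ij)
    return math.sqrt(min(0.25, v) * math.log(n_j) / n_ij)

def throttled_bid_T(b_ij, remaining, daily):
    """BMIX-T: damp the bid as advertiser A_i's remaining budget d_i'
    shrinks, delaying budget depletion."""
    return b_ij * (1.0 - math.exp(-remaining / daily))
```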
Experiments
• Simulations over real data
• Data:
  • 85,000 query phrases from the Yahoo! query log
  • Yahoo! ads with daily budget constraints
  • CTRs drawn from Yahoo!'s CTR distribution
• Simulated user clicks using the CTR values
• Time horizon = multiple days
  • Policies carried over their CTR estimates from one day to the next
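The shape of such a simulation, as a toy loop reusing the answer_query sketch above (structure and names are mine; budget accounting is omitted for brevity):

```python
import random
from collections import defaultdict

def simulate_day(queries, ads_by_phrase, C=2):
    """One simulated day; the click/shown statistics stored in the ad
    records persist across days, carrying the CTR estimates over."""
    revenue = 0.0
    n = defaultdict(int)  # n_j: # times query phrase Q_j has been answered
    for q in queries:     # queries arrive one after another
        n[q] += 1
        for ad in answer_query(ads_by_phrase[q], n[q], C):
            ad["shown"] += 1
            if random.random() < ad["true_ctr"]:  # simulated user click
                ad["clicks"] += 1
                revenue += ad["bid"]              # advertiser pays the bid
    return revenue
```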
Results
• GREEDY : select ads with the highest current reward estimate (ei,j . bi,j)
  • Does not explore; only exploits (see the sketch below)
[Chart: revenue of the policies; revenue values scaled for confidentiality reasons]
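For contrast with MIX, the GREEDY baseline simply drops the confidence term (same hypothetical ad records as in the earlier sketches):

```python
def greedy_answer_query(ads, C):
    """GREEDY: rank by the current estimate e_i,j * b_i,j; no exploration."""
    def estimate(a):
        return (a["clicks"] / a["shown"] if a["shown"] else 0.0) * a["bid"]
    return sorted(ads, key=estimate, reverse=True)[:C]
```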
Conclusion
• Search advertisement problem
  • Exploration/exploitation tradeoff
  • Model as a multi-armed bandit
• Introduced a new bandit variant
  • Budgeted multi-armed multi-bandit problem (BMMP)
• New policy for BMMP with a performance guarantee
• In the paper:
  • Variable set of ads (ads come and go)
  • Prior CTR estimates