Model-Based Multi-Armed Bandits for Optimizing Ad Impressions Through Taxonomy Exploration

Bandits for Taxonomies: A Model-based Approach
Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, Vanja Josifovski

The Content Match Problem
[Figure: advertisers, ads DB, ads shown on a webpage, user click]
• Ad impression: showing an ad to a user
• Ad click: a user click leads to revenue for the ad server and the content provider
• The Content Match Problem: match ads to pages to maximize clicks
• Maximizing the number of clicks means:
  • for each webpage, finding the ad with the best click-through rate (CTR),
  • without wasting too many impressions on learning it.

Online Learning
• Maximizing clicks requires:
  • Dimensionality reduction
  • Exploration
  • Exploitation
• Exploration and exploitation must occur together
• Online learning is needed, since the system must continuously generate revenue

Taxonomies for dimensionality reduction
[Figure: page/ad taxonomy with a Root node and children Apparel, Computers, Travel]
• Already exist
• Actively maintained
• Existing classifiers map pages and ads to taxonomy nodes
• Learn the matching from page nodes to ad nodes → dimensionality reduction

Online Learning
• Maximizing clicks requires:
  • Dimensionality reduction → Taxonomy
  • Exploration → ?
  • Exploitation → ?
• Can taxonomies help in explore/exploit as well?

Outline
• Problem
• Background: Multi-armed bandits
• Proposed Multi-level Policy
• Experiments
• Related Work
• Conclusions

Background: Bandits
[Figure: bandit “arms” with unknown payoff probabilities $p_1$, $p_2$, $p_3$]
• Pull arms sequentially so as to maximize the total expected reward
• Estimate payoff probabilities $p_i$
• Bias the estimation process towards better arms

Background: Bandits
[Figure: Webpage 1, Webpage 2, Webpage 3, each with its own set of ads; bandit “arms” = ads]
• ~$10^9$ pages
• ~$10^6$ ads

Background: Bandits
[Figure: matrix with webpages as rows and ads as columns; labels “One bandit” and “Unknown CTR”]
Content Match =
• A matrix
• Each row is a bandit
• Each cell has an unknown CTR

Background: Bandits
[Figure: three arms labeled Priority 1, Priority 2, Priority 3]
• Bandit Policy
  • Assign a priority to each arm
  • “Pull” the arm with the max priority, and observe the reward (Allocation)
  • Update priorities (Estimation)
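The assign/pull/update loop above can be sketched with a standard index policy. The slides do not say which priority function is used, so UCB1 stands in here as an assumption; the class name `UCB1`, the `simulate` helper, and the CTR values are all hypothetical.

```python
import math
import random


class UCB1:
    """Priority-based bandit policy: priority = mean reward + exploration bonus."""

    def __init__(self, n_arms):
        self.pulls = [0] * n_arms      # number of times each arm was pulled
        self.rewards = [0.0] * n_arms  # total reward collected per arm

    def priority(self, arm):
        if self.pulls[arm] == 0:
            return float("inf")  # force at least one pull of every arm
        total = sum(self.pulls)
        mean = self.rewards[arm] / self.pulls[arm]
        bonus = math.sqrt(2.0 * math.log(total) / self.pulls[arm])
        return mean + bonus

    def select(self):
        # Allocation: pull the arm with the highest priority.
        return max(range(len(self.pulls)), key=self.priority)

    def update(self, arm, reward):
        # Estimation: fold the observed reward into the arm's statistics.
        self.pulls[arm] += 1
        self.rewards[arm] += reward


def simulate(true_ctrs, n_rounds, seed=0):
    """Run the policy against made-up click probabilities (illustration only)."""
    rng = random.Random(seed)
    policy = UCB1(len(true_ctrs))
    for _ in range(n_rounds):
        arm = policy.select()
        reward = 1.0 if rng.random() < true_ctrs[arm] else 0.0
        policy.update(arm, reward)
    return policy
```

Calling `simulate([0.1, 0.5, 0.9], 2000)` returns a policy whose `pulls` list shows how impressions were spread across the three arms.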

Background: Bandits
• Why not simply apply a bandit policy directly to our problem?
  • Convergence is too slow: ~$10^9$ bandits, with ~$10^6$ arms per bandit
  • Additional structure is available that can help: taxonomies

Multi-level Policy
[Figure: matrix of webpage classes (rows) by ad classes (columns); ad child classes are grouped under the ad parent classes Apparel, Computers, Travel, and the rows are grouped the same way; labels “One bandit” and “Block”]
• Consider only two levels
• Key idea: CTRs in a block are homogeneous

Multi-level Policy
• CTRs in a block are homogeneous
  • Used in allocation (picking an ad for each new page)
  • Used in estimation (updating priorities after each observation)

Multi-level Policy (Allocation)
[Figure: a page classifier maps the incoming page onto the matrix; ad parent classes A, C, T; “?” marks the ad still to be chosen]
• Classify webpage → page class, parent page class
• Run bandit on ad parent classes → pick one ad parent class
• Run bandit among cells → pick one ad class
• In general, continue from root to leaf → final ad
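A minimal sketch of the two-level allocation steps above, under stated assumptions: UCB1 stands in for the unspecified bandit policy, the coarse bandit is keyed by the parent page class and the fine bandit by the (page class, ad parent class) pair, and the taxonomy entries, class names, and CTR values are made up for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical two-level ad taxonomy (parent class -> child classes).
AD_TAXONOMY = {
    "Apparel": ["Shoes", "Jackets"],
    "Computers": ["Laptops", "Printers"],
    "Travel": ["Flights", "Hotels"],
}


class ArmStats:
    """Pull and click counts for the arms of one bandit."""

    def __init__(self):
        self.pulls = defaultdict(int)
        self.clicks = defaultdict(int)

    def pick(self, arms):
        """UCB1 stand-in: return the arm with the highest priority."""
        total = sum(self.pulls[a] for a in arms)

        def priority(arm):
            n = self.pulls[arm]
            if n == 0:
                return float("inf")  # try every arm once
            return self.clicks[arm] / n + math.sqrt(2.0 * math.log(total) / n)

        return max(arms, key=priority)

    def record(self, arm, click):
        self.pulls[arm] += 1
        self.clicks[arm] += click


class TwoLevelPolicy:
    def __init__(self, taxonomy):
        self.taxonomy = taxonomy
        # One coarse bandit per parent page class, over ad parent classes.
        self.parent_bandits = defaultdict(ArmStats)
        # One fine bandit per (page class, ad parent class), over ad child classes.
        self.child_bandits = defaultdict(ArmStats)

    def select(self, page_class, page_parent):
        ad_parent = self.parent_bandits[page_parent].pick(list(self.taxonomy))
        children = self.taxonomy[ad_parent]
        ad_child = self.child_bandits[(page_class, ad_parent)].pick(children)
        return ad_parent, ad_child

    def update(self, page_class, page_parent, ad_parent, ad_child, click):
        # The same observation updates both the aggregated and the fine level.
        self.parent_bandits[page_parent].record(ad_parent, click)
        self.child_bandits[(page_class, ad_parent)].record(ad_child, click)


def simulate(true_ctr, page_class, page_parent, n_rounds, seed=0):
    """Serve one page class repeatedly against made-up child-class CTRs."""
    rng = random.Random(seed)
    policy = TwoLevelPolicy(AD_TAXONOMY)
    for _ in range(n_rounds):
        ad_parent, ad_child = policy.select(page_class, page_parent)
        click = 1 if rng.random() < true_ctr[ad_child] else 0
        policy.update(page_class, page_parent, ad_parent, ad_child, click)
    return policy
```

The coarse bandit has only three arms here, so it can settle on a good ad parent class after few impressions; the fine bandit then only has to separate the child classes inside that block.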

Multi-level Policy (Allocation)
[Figure: page classifier and matrix with ad parent classes A, C, T; the chosen ad is marked]
• Bandits at higher levels
  • use aggregated information
  • have fewer bandit arms
  • quickly figure out the best ad parent class

Multi-level Policy (Estimation)
• CTRs in a block are homogeneous
  • Observations from one cell also give information about others in the block
  • How can we model this dependence?

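Multi-level Policy (Estimation)
• Shrinkage Model
  $S_{\text{cell}} \mid \text{CTR}_{\text{cell}} \sim \text{Bin}(N_{\text{cell}}, \text{CTR}_{\text{cell}})$
  $\text{CTR}_{\text{cell}} \sim \text{Beta}(\text{Params}_{\text{block}})$
  where $S_{\text{cell}}$ = # clicks in cell and $N_{\text{cell}}$ = # impressions in cell
• All cells in a block come from the same distribution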

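Multi-level Policy (Estimation)
• Intuitively, this leads to shrinkage of cell CTRs towards block CTRs:
  $E[\text{CTR}] = \alpha \cdot \text{Prior}_{\text{block}} + (1 - \alpha) \cdot S_{\text{cell}} / N_{\text{cell}}$
  • $E[\text{CTR}]$: estimated CTR
  • $\text{Prior}_{\text{block}}$: Beta prior (“block CTR”)
  • $S_{\text{cell}} / N_{\text{cell}}$: observed CTR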
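The weight α has a closed form under the Beta-binomial model. Assuming the block prior is written as Beta(a, b) (the slides only say "Params" per block, so this parameterization and the function names below are assumptions), the posterior mean of a cell's CTR is (a + S) / (a + b + N), which equals the weighted average above with α = (a + b) / (a + b + N) and block CTR a / (a + b). How a and b are fitted per block is not covered here.

```python
def shrunk_ctr(clicks, impressions, prior_a, prior_b):
    """Posterior mean CTR of a cell under an assumed Beta(prior_a, prior_b) block prior."""
    return (prior_a + clicks) / (prior_a + prior_b + impressions)


def shrinkage_weight(impressions, prior_a, prior_b):
    """Weight alpha placed on the block prior mean; it shrinks as impressions grow."""
    return (prior_a + prior_b) / (prior_a + prior_b + impressions)


def block_ctr(prior_a, prior_b):
    """Mean of the Beta prior, i.e. the "block CTR"."""
    return prior_a / (prior_a + prior_b)
```

For example, a cell with 10 clicks in 100 impressions under a Beta(2, 98) prior has observed CTR 0.10, block CTR 0.02, α = 0.5, and a shrunk estimate of 0.06. A cell with no impressions falls back entirely on the block CTR.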

Experiments
Taxonomy structure:
• Depth 0: Root
• Depth 1: 20 nodes
• Depth 2: 221 nodes
• …
• Depth 7: ~7000 leaves
• We use these 2 levels (depths 1 and 2)

Experiments
• Data collected over a 1-day period
• Collected from only one server, under some other ad-matching rules (not our bandit)
• ~229M impressions
• CTR values have been linearly transformed for purposes of confidentiality

Experiments (Multi-level Policy)
[Figure: clicks vs. number of pulls]
• Multi-level gives much higher #clicks

Experiments (Multi-level Policy)
[Figure: mean-squared error vs. number of pulls]
• Multi-level gives much better Mean-Squared Error → it has learnt more from its explorations

Experiments (Shrinkage)
[Figure: two panels, clicks vs. number of pulls and mean-squared error vs. number of pulls, each with and without shrinkage]
• Shrinkage → improved Mean-Squared Error, but no gain in #clicks

Related Work
• Typical multi-armed bandit problems
  • Do not consider dependencies
  • Very few arms
• Bandits with side information
  • Cannot handle dependencies among ads
• General MDP solvers
  • Do not use the structure of the bandit problem
  • Emphasis on learning the transition matrix, which is random in our problem

Conclusions
• Taxonomies exist for many datasets
• They can be used for
  • Dimensionality reduction
  • Multi-level bandit policy → higher #clicks
  • Better estimation via shrinkage models → better MSE
