
When Machine Learning Meets the Web


Presentation Transcript


  1. Chao Liu, Internet Services Research Center, Microsoft Research-Redmond. When Machine Learning Meets the Web

  2. Outline • Motivation & Challenges • Background on Distributed Computing • Standard ML on MapReduce • Classification: Naïve Bayes • Clustering: Nonnegative Matrix Factorization • Modeling: EM Algorithm • Customized ML on MapReduce • Click Modeling • Behavior Targeting • Conclusions

  3. Motivation & Challenges • Data on the Web • Scale: terabyte-to-petabyte data • Around 20TB of log data per day from Bing • Dynamics: evolving data streams • Click data streams with evolving/emerging topics • Applications: non-traditional ML tasks • Predicting clicks & ads

  4. Outline • Motivation & Challenges • Background on Distributed Computing • Standard ML on MapReduce • Classification: Naïve Bayes • Clustering: Nonnegative Matrix Factorization • Modeling: EM Algorithm • Customized ML on MapReduce • Click Modeling • Behavior Targeting • Conclusions

  5. Parallel vs. Distributed Computing • Parallel computing • All processors have access to a shared memory, which can be used to exchange information between processors • Distributed computing • Each processor has its own private memory (distributed memory), communicating over the network • Message passing • MapReduce

  6. MPI vs. MapReduce • MPI is for task parallelism • Suitable for CPU-intensive jobs • Fine-grained communication control, powerful computation model • MapReduce is for data parallelism • Suitable for data-intensive jobs • A restricted computation model

  7. Word Counting on MapReduce • The web corpus is stored as (docId, doc) pairs spread over multiple machines, each running a Mapper • Mapper: for each word w in a doc, emit (w, 1) • The intermediate (key, value) pairs, e.g., (w1,1), (w2,1), (w3,1), are aggregated by word: (w1,<1,1,1>), (w2,<1,1>), (w3,<1,1,1>) • The Reducer is copied to each machine and run over the intermediate data locally, producing the counts (w1, 3), (w2, 2), (w3, 3)
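
The word-count dataflow above is the canonical MapReduce example: the mapper emits (w, 1) per word, the framework groups the pairs by key, and the reducer sums each group. A minimal single-process sketch in Python that mimics this dataflow (the function names and the in-memory shuffle are illustrative, not part of the slides):

```python
from collections import defaultdict

def mapper(doc_id, doc):
    """For each word w in the document, emit (w, 1)."""
    for word in doc.split():
        yield word, 1

def reducer(word, counts):
    """Sum the 1's emitted for this word."""
    yield word, sum(counts)

def run_word_count(corpus):
    """Simulate the shuffle: group intermediate pairs by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, doc in corpus.items():
        for word, one in mapper(doc_id, doc):
            grouped[word].append(one)
    return dict(pair for word, counts in grouped.items()
                for pair in reducer(word, counts))

print(run_word_count({"d1": "a rose is a rose", "d2": "a daisy"}))
# {'a': 3, 'rose': 2, 'is': 1, 'daisy': 1}
```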

  8. Machine Learning on MapReduce • The big picture: not omnipotent, but good enough

  9. Outline • Motivation & Challenges • Background on Distributed Computing • Standard ML on MapReduce • Classification: Naïve Bayes • Clustering: Nonnegative Matrix Factorization • Modeling: EM Algorithm • Customized ML on MapReduce • Click Modeling • Behavior Targeting • Conclusions

  10. Classification: Naïve Bayes • P(C|X) ∝ P(C) P(X|C) = P(C) ∏j P(Xj|C) • Mapper: for each training example (x(i), y(i)), emit (j, xj(i), y(i)) for every feature j • Reduce on y(i): count labels to estimate P(C) • Reduce on j: count feature values per class to estimate P(Xj|C)
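
As a rough sketch of the two reduce steps above for categorical features (the slide fixes only the emitted tuples and the reduce keys; the names and the in-memory grouping below are illustrative):

```python
from collections import Counter, defaultdict

def nb_mapper(x, y):
    """Emit (j, x_j, y) for every feature j of the training example (x, y)."""
    for j, xj in enumerate(x):
        yield j, xj, y

def train_naive_bayes(examples):
    """examples: iterable of (feature_vector, label) pairs."""
    class_counts = Counter()                # reduce on y: counts behind P(C)
    feature_counts = defaultdict(Counter)   # reduce on j: counts behind P(Xj|C)
    for x, y in examples:
        class_counts[y] += 1
        for j, xj, y_ in nb_mapper(x, y):
            feature_counts[j][(xj, y_)] += 1
    n = sum(class_counts.values())
    prior = {c: cnt / n for c, cnt in class_counts.items()}
    likelihood = {j: {key: cnt / class_counts[key[1]] for key, cnt in cj.items()}
                  for j, cj in feature_counts.items()}
    return prior, likelihood

prior, likelihood = train_naive_bayes([((1, 0), "spam"), ((1, 1), "ham"),
                                       ((0, 1), "ham")])
# prior["ham"] == 2/3; likelihood[1][(1, "ham")] == 1.0
```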

  11. Clustering: Nonnegative Matrix Factorization [Liu et al., WWW2010] • Effective tool to uncover latent relationships in nonnegative matrices, with many applications [Berry et al., 2007, Sra & Dhillon, 2006] • Interpretable dimensionality reduction [Lee & Seung, 1999] • Document clustering [Shahnaz et al., 2006, Xu et al., 2006] • Challenge: Can we scale NMF to million-by-million matrices?

  12. NMF Algorithm [Lee & Seung, 2000]
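
For reference, these are the standard multiplicative updates from Lee & Seung (2000) for approximating A ≈ WH under the Frobenius-norm objective; the products and divisions in the fractions are element-wise, matching the H = H .* X ./ Y step the later slides distribute:

```latex
% Lee & Seung (2000) multiplicative updates for A \approx WH
% (Frobenius-norm objective); \odot and the fractions are element-wise.
H \leftarrow H \odot \frac{W^{\top} A}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{A H^{\top}}{W H H^{\top}}
```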

  13. Distributed NMF • Data partition: A, W and H are partitioned across machines • [Diagram: the partitioned matrices distributed over the cluster]

  14. Computing DNMF: The Big Picture

  15. [Dataflow diagram: one DNMF iteration as five map stages (Map-I … Map-V) and their reduce stages (Reduce-I … Reduce-V)]

  16. X = WᵀA • [Dataflow diagram: computed by Map-I/Reduce-I and Map-II/Reduce-II]

  17. Y = WᵀWH • [Dataflow diagram: computed by Map-III/Reduce-III and Map-IV]

  18. H = H .* X ./ Y (element-wise) • [Dataflow diagram: computed by Map-V/Reduce-V]
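
Slides 16-18 together form one H-update of the multiplicative algorithm, split into matrix products that can each be computed by map/reduce jobs. A single-machine NumPy sketch of the same three steps (the distributed partitioning of A, W and H is omitted; the small epsilon guarding the division is my addition):

```python
import numpy as np

def dnmf_h_update(A, W, H, eps=1e-9):
    """One H-update of multiplicative NMF, in the three steps of slides 16-18."""
    X = W.T @ A                # slide 16: X = W^T A   (Map-I/II, Reduce-I/II)
    Y = W.T @ W @ H            # slide 17: Y = W^T W H (Map-III/IV, Reduce-III)
    return H * X / (Y + eps)   # slide 18: H = H .* X ./ Y (Map-V, Reduce-V)

rng = np.random.default_rng(0)
A = rng.random((100, 80))
W, H = rng.random((100, 5)), rng.random((5, 80))
H = dnmf_h_update(A, W, H)
```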

  19. [Recap of the full DNMF dataflow: Map-I … Map-V and Reduce-I … Reduce-V]

  20. Scalability w.r.t. Matrix Size • 3 hours per iteration; 20 iterations take around 20 × 3 × 0.72 ≈ 43 hours • Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values

  21. General EM on MapReduce • Map: evaluate the posterior of the latent variables for each record and compute its sufficient statistics (E-step) • Reduce: aggregate the statistics and update the model parameters (M-step)
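
A hedged illustration of this map/reduce split of EM, using a two-component 1-D Gaussian mixture as a stand-in model: the per-record E-step plays the role of the map, and the aggregation plus parameter update plays the role of the reduce. The mixture example and all names are my choice, not the slide's:

```python
import math

def e_step_map(x, params):
    """Map (E-step): responsibilities and per-record sufficient statistics."""
    dens = [pi * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
            for pi, mu, var in params]
    z = sum(dens)
    return [(d / z, d / z * x, d / z * x * x) for d in dens]  # (r, r*x, r*x^2)

def m_step_reduce(totals, n):
    """Reduce (M-step): aggregate statistics and re-estimate (pi, mu, var)."""
    new_params = []
    for r, rx, rx2 in totals:
        mu = rx / r
        new_params.append((r / n, mu, rx2 / r - mu ** 2))
    return new_params

def em_iteration(data, params):
    k = len(params)
    totals = [(0.0, 0.0, 0.0)] * k
    for x in data:                           # "shuffle": sum statistics over records
        stats = e_step_map(x, params)
        totals = [tuple(a + b for a, b in zip(totals[j], stats[j])) for j in range(k)]
    return m_step_reduce(totals, len(data))

params = [(0.5, 0.0, 1.0), (0.5, 3.0, 1.0)]  # (pi, mu, var) for two components
data = [0.1, 0.3, 0.2, 2.8, 3.1, 2.9]
for _ in range(10):
    params = em_iteration(data, params)
```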

  22. Outline • Motivation & Challenges • Background on Distributed Computing • Standard ML on MapReduce • Classification: Naïve Bayes • Clustering: Nonnegative Matrix Factorization • Modeling: EM Algorithm • Customized ML on MapReduce • Click Modeling • Behavior Targeting • Conclusions

  23. Click Modeling: Motivation • Clicks are good… • Are these two clicks equally “good”? • Non-clicks may have excuses: • Not relevant • Not examined

  24. Eye-tracking User Study

  25. Bayesian Browsing Model [Liu et al., KDD2009] • [Graphical model: a query returns URL1–URL4; position i has a snippet relevance variable Si, an examination variable Ei, and a click variable Ci (the observed click-throughs)]

  26. Dependencies in BBM • [Dependency diagram over the Si, Ei and Ci: the examination Ei depends on the position of the preceding click before i, and together with the relevance Si it determines the click Ci]

  27. Model Inference • Ultimate goal: the relevance posterior p(R | C1:n) given the observed clicks • Observation: conditional independence

  28. P(C|S) by the Chain Rule • Likelihood of a search instance • From S to R:

  29. Putting Things Together • Posterior re-organized by the Rj's, in terms of two kinds of counts: • how many times dj was not clicked when it was at position (r + d) and the preceding click was at position r • how many times dj was clicked
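
A sketch of the factorized posterior that these two counts suggest, writing Nj for the number of clicks on dj, Nj,r,d for the number of times dj was skipped at position r + d with the preceding click at r, and γr,d for the corresponding examination parameter (the notation is assumed, not taken from the slide):

```latex
% N_j: clicks on d_j;  N_{j,r,d}: times d_j was skipped at position r+d with
% the preceding click at r;  \gamma_{r,d}: examination parameter (notation assumed).
p(R_j \mid C_{1:n}) \;\propto\;
  R_j^{\,N_j} \prod_{r,d} \bigl(1 - \gamma_{r,d}\, R_j\bigr)^{N_{j,r,d}}
```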

  30. What p(R|C1:n) Tells Us • Exact inference: the joint posterior is available in closed form • The joint posterior factorizes, so the Rj's are mutually independent • At most M(M+1)/2 + 1 numbers fully characterize each posterior • Count vector: Nj together with the Nj,r,d counts

  31. An Example • [Worked example: from a single click sequence, compute the count vector for R4, i.e., N4 and the N4,r,d counts]

  32. LearnBBM on MapReduce • Map: emit((q,u), idx) • Reduce: construct the count vector
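
A minimal Python sketch of this map/reduce pair, using the encoding suggested by the next slide in which idx = 0 marks a click and a positive idx names the skipped (r, d) slot; the shuffle is simulated in memory and all names are illustrative:

```python
from collections import Counter, defaultdict

def bbm_mapper(session):
    """Map over one search session: emit ((query, url), idx) per impression.

    idx = 0 marks a click; a positive idx names the (r, d) slot of a skipped
    impression. Computing idx from the click positions is omitted here, and
    this encoding is an assumption rather than something the slide fixes.
    """
    for query, url, idx in session:
        yield (query, url), idx

def bbm_reducer(key, indices):
    """Reduce: build the count vector for one (query, url) pair by tallying idx."""
    return key, Counter(indices)

def learn_bbm(sessions):
    grouped = defaultdict(list)
    for session in sessions:                 # "shuffle": group emissions by (q, u)
        for key, idx in bbm_mapper(session):
            grouped[key].append(idx)
    return dict(bbm_reducer(key, idxs) for key, idxs in grouped.items())

counts = learn_bbm([[("q", "U1", 0), ("q", "U3", 0), ("q", "U1", 1)],
                    [("q", "U2", 4), ("q", "U4", 7), ("q", "U4", 0)]])
# counts[("q", "U1")] == Counter({0: 1, 1: 1})
```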

  33. Example on MapReduce • Map outputs from three sessions: (U1, 0), (U2, 4), (U3, 0); (U1, 1), (U3, 0), (U4, 7); (U1, 1), (U3, 0), (U4, 0) • Reduce outputs: (U1, 0, 1, 1), (U2, 4), (U3, 0, 0, 0), (U4, 0, 7)

  34. Petabyte-Scale Experiment • Setup: 8 weeks of data, 8 jobs; job k uses the first k weeks of data • Experiment platform: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [Chaiken et al., VLDB'08]

  35. Scalability of BBM • Increasing computation load: more queries, more URLs, more impressions • Near-constant elapsed time: about 3 hours • Scans 265 terabytes of data • Full posteriors for 1.15 billion (query, url) pairs • [Chart: elapsed time on SCOPE vs. computation load]

  36. Large-scale Behavior Targeting [Ye et al., KDD2009] • Behavior targeting • Ad serving based on users’ historical behaviors • Complementary to sponsored Ads and content Ads

  37. Problem Setting • Goal • Given ads in a certain category, locate qualified users based on their past behaviors • Data • Users are identified by cookies • Past behavior, profiled as a vector x, includes ad clicks, ad views, page views, search queries, clicks, etc. • Challenges: • Scale: e.g., 9TB of ad data with 500B entries in Aug '08 • Sparsity: e.g., the CTR of automotive display ads is 0.05% • Dynamics: user behavior changes over time

  38. Learning: Linear Poisson Model • CTR = ClickCnt/ViewCnt • A model to predict expected click count • A model to predict expected view count • Linear Poisson model • MLE on w
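
As a hedged sketch of what a linear Poisson model of this kind looks like: take yi to be the observed click (or view) count of user i, xi the nonnegative behavior features, and w ≥ 0 the weights; the multiplicative step shown is a generic MLE update for this model, not necessarily the exact one used in the paper:

```latex
% y_i: observed click (or view) count;  x_i: nonnegative behavior features;
% w >= 0: weights.  Generic multiplicative MLE update for the linear Poisson model.
y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i = w^{\top} x_i, \qquad
w_j \leftarrow w_j \,
  \frac{\sum_i x_{ij}\, y_i / (w^{\top} x_i)}{\sum_i x_{ij}}
```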

  39. Implementation on MapReduce • Learning • Map: compute the per-user statistics • Reduce: update w • Prediction: apply the learned model to each user's feature vector
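
A minimal sketch of how that learning step might be split across map and reduce, following the multiplicative update sketched above; the per-user statistics, sparse feature encoding and names are illustrative:

```python
from collections import defaultdict

def bt_mapper(x, y, w):
    """Map over one user: emit per-feature pieces of the multiplicative update.

    x: sparse behavior features {feature: count}, y: observed clicks (or views),
    w: current model. Emits (feature, (numerator_part, denominator_part)).
    """
    rate = sum(w.get(j, 0.0) * xj for j, xj in x.items()) or 1e-9
    for j, xj in x.items():
        yield j, (xj * y / rate, xj)

def bt_reducer(j, parts, w):
    """Reduce per feature: aggregate the pieces and update w[j] multiplicatively."""
    num = sum(p[0] for p in parts)
    den = sum(p[1] for p in parts) or 1e-9
    return j, w.get(j, 1.0) * num / den

def bt_iteration(users, w):
    grouped = defaultdict(list)
    for x, y in users:                       # "shuffle": group emissions by feature
        for j, part in bt_mapper(x, y, w):
            grouped[j].append(part)
    return dict(bt_reducer(j, parts, w) for j, parts in grouped.items())

w = {"auto_page_view": 1.0, "auto_ad_click": 1.0}
users = [({"auto_page_view": 3.0}, 1.0),
         ({"auto_ad_click": 2.0, "auto_page_view": 1.0}, 0.0)]
for _ in range(5):
    w = bt_iteration(users, w)
```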

  40. Outline • Motivation & Challenges • Background on Distributed Computing • Standard ML on MapReduce • Classification: Naïve Bayes • Clustering: Nonnegative Matrix Factorization • Modeling: EM Algorithm • Customized ML on MapReduce • Click Modeling • Behavior Targeting • Conclusions

  41. Conclusions • Challenges imposed by Web data • Scalability of standard algorithms • Application-driven customized algorithms • The capability to consume huge amounts of data outweighs algorithm sophistication • Simple counting is no less powerful than sophisticated algorithms when data is abundant or even infinite • MapReduce: a restricted computation model • Not omnipotent but powerful enough • The things we want to do turn out to be things we can do

  42. Q&A Thank You! SEWM‘10 Keynote, Chengdu, China
