
Which Algorithms Really Matter?


Presentation Transcript


  1. Which Algorithms Really Matter?

  2. Me, Us • Ted Dunning, Chief Application Architect, MapR • Committer and PMC member: Mahout, ZooKeeper, Drill • Bought the beer at the first HUG • MapR • Distributes more open source components for Hadoop • Adds major technology for performance, HA, industry-standard APIs • Info • Hash tag: #mapr • See also: @ApacheMahout, @ApacheDrill, @ted_dunning and @mapR

  3. Topic For Today • What is important? What is not? • Why? • What is the difference from academic research? • Some examples

  4. What is Important? • Deployable • Robust • Transparent • Skillset and mindset matched? • Proportionate

  5. What is Important? • Deployable • Clever prototypes don’t count if they can’t be standardized • Robust • Transparent • Skillset and mindset matched? • Proportionate

  6. What is Important? • Deployable • Clever prototypes don’t count • Robust • Mishandling is common • Transparent • Will degradation be obvious? • Skillset and mindset matched? • Proportionate

  7. What is Important? • Deployable • Clever prototypes don’t count • Robust • Mishandling is common • Transparent • Will degradation be obvious? • Skillset and mindset matched? • How long will your fancy data scientist enjoy doing standard ops tasks? • Proportionate • Where is the highest value per minute of effort?

  8. Academic Goals vs Pragmatics • Academic goals • Reproducible • Isolate theoretically important aspects • Work on novel problems • Pragmatics • Highest net value • Available data is constantly changing • Diligence and consistency have larger impact than cleverness • Many systems feed themselves; exploration and exploitation are both important • Engineering constraints on budget and schedule

  9. Example 1: Making Recommendations Better

  10. Recommendation Advances • What are the most important algorithmic advances in recommendations over the last 10 years? • Cooccurrence analysis? • Matrix completion via factorization? • Latent factor log-linear models? • Temporal dynamics?

  11. The Winner – None of the Above • What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering 2. Anti-flood

  12. The Real Issues • Exploration • Diversity • Speed • Not the last fraction of a percent

  13. Result Dithering • Dithering is used to re-order recommendation results • Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better

  14. Result Dithering • Dithering is used to re-order recommendation results • Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change”

  15. Simple Dithering Algorithm • Generate a synthetic score from log rank plus Gaussian noise • Pick the noise scale ε to provide the desired level of mixing • Typically a modest ε works; the examples that follow use ε = 0.5 and ε = log 2 • Oh… use floor(t/T) as the random seed so the ordering stays stable within each time window T (see the sketch below)
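A minimal sketch of this dithering scheme in Python, assuming the synthetic score is log(rank) plus N(0, ε) noise and that results are re-ranked once per time window T; the exact score form and window length are not spelled out on the slide:

import math
import random
import time

def dither(ranked_items, epsilon=0.5, period_seconds=600):
    """Re-rank results by a synthetic score: log(rank) plus Gaussian noise.

    Seeding with floor(t / T) keeps the ordering stable for one window of
    T seconds, so a user who reloads the page sees the same list.
    """
    rng = random.Random(int(time.time() // period_seconds))
    scored = [(math.log(rank + 1) + rng.gauss(0.0, epsilon), item)
              for rank, item in enumerate(ranked_items)]
    scored.sort(key=lambda pair: pair[0])  # smaller synthetic score = closer to the top
    return [item for _, item in scored]

# With epsilon around log 2, adjacent results swap often and items from
# the second page occasionally surface onto the first page.
print(dither(list("ABCDEFGHIJ"), epsilon=math.log(2)))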

  16. Example … ε = 0.5

  17. Example … ε = log 2 ≈ 0.69

  18. Exploring The Second Page

  19. Lesson 1: Exploration is good

  20. Example 2: Bayesian Bandits

  21. Bayesian Bandits • Based on Thompson sampling • Very general sequential test • Near optimal regret • Trade-off exploration and exploitation • Possibly best known solution for exploration/exploitation • Incredibly simple

  22. Thompson Sampling • Select each alternative (arm) according to the probability that it is the best • That probability can be computed from the posterior • But I promised a simple answer

  23. Thompson Sampling – Take 2 • Sample θ from the posterior • Pick i to maximize the expected reward under the sampled θ • Record the result from using i and update the posterior
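A compact sketch of this loop for the common case of Bernoulli rewards with Beta priors; the reward model and the Beta(1, 1) prior are assumptions here, since the slides do not fix one:

import random

class BernoulliThompson:
    """Thompson sampling with a Beta(1, 1) prior on each arm's success rate."""

    def __init__(self, n_arms):
        self.wins = [1] * n_arms    # Beta alpha parameters
        self.losses = [1] * n_arms  # Beta beta parameters

    def choose(self):
        # Sample a plausible success rate for every arm, play the best sample.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1

# Two ads with true click-through rates of 3% and 5%: traffic quickly
# concentrates on the better ad while the worse one still gets explored.
bandit = BernoulliThompson(2)
rates = [0.03, 0.05]
for _ in range(10000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < rates[arm])
print(bandit.wins, bandit.losses)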

  24. Fast Convergence

  25. Thompson Sampling on Ads • An Empirical Evaluation of Thompson Sampling, Chapelle and Li, 2011

  26. Bayesian Bandits versus Result Dithering • Many useful systems are difficult to frame in fully Bayesian form • Thompson sampling cannot be applied without posterior sampling • Can still do useful exploration with dithering • But better to use Thompson sampling if possible

  27. Lesson 2: Exploration is pretty easy to do and pays big benefits.

  28. Example 3: On-line Clustering

  29. The Problem • K-means clustering is useful for feature extraction or compression • At scale and at high dimension, the desirable number of clusters increases • Very large number of clusters may require more passes through the data • Super-linear scaling is generally infeasible

  30. The Solution • Sketch-based algorithms produce a sketch of the data • Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution • The size of the sketch grows very slowly with increasing data size • Many operations such as clustering are well behaved on sketches • References: Fast and Accurate k-means for Large Datasets (Michael Shindler, Alex Wong, Adam Meyerson); Revisiting k-means: New Algorithms via Bayesian Nonparametrics (Brian Kulis, Michael Jordan)

  31. An Example

  32. An Example

  33. The Cluster Proximity Features • Every point can be described by the nearest cluster • 4.3 bits per point in this case • Significant error that can be decreased (to a point) by increasing the number of clusters • Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities) • Error is negligible • Unwinds the data into a simple representation • Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases error by sqrt(n))
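A small sketch of how a point could be turned into these features, assuming roughly 20 centroids (so about log2 20 ≈ 4.3 bits for the nearest index) and keeping the two nearest clusters plus their proximities; the centroid count and feature layout here are illustrative, not taken from the slides:

import numpy as np

def proximity_features(point, centroids, k=2):
    """Describe a point by its k nearest centroids and the distances to them."""
    d = np.linalg.norm(centroids - point, axis=1)
    nearest = np.argsort(d)[:k]          # indices of the k closest clusters
    return nearest.tolist(), d[nearest]  # ~log2(len(centroids)) bits per index

rng = np.random.RandomState(0)
centroids = rng.randn(20, 5)             # a sketch of 20 centroids in 5-d
point = rng.randn(5)
idx, dist = proximity_features(point, centroids)
print(idx, dist)                         # two nearest cluster ids and the two proximities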

  34. Diagonalized Cluster Proximity

  35. Lots of Clusters Are Fine

  36. Typical k-means Failure • Selecting two seeds inside the same true cluster cannot be fixed by Lloyd's algorithm • Result is that the two neighboring clusters get glued together

  37. Streaming k-means Ideas • By using a sketch with lots (k log N) of centroids, we avoid pathological cases • We still get a very good result if the sketch is created • in one pass • with approximate search • In fact, adaptive dp-means works just fine • In the end, the sketch can be used for clustering or …
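A rough one-pass sketch of the adaptive dp-means idea, assuming a simple threshold-raising schedule and exact nearest-centroid search; the real streaming k-means also uses approximate search to keep each step cheap:

import numpy as np

def streaming_sketch(points, threshold=1.0, max_centroids=1000, growth=1.5):
    """One pass over the data: each point joins its nearest centroid if it is
    within the current threshold, otherwise it starts a new weighted centroid.
    When the sketch grows too large, the threshold is raised and the existing
    centroids are folded back into a smaller sketch."""
    centroids = []  # list of (vector, weight) pairs

    def absorb(p, w):
        if centroids:
            dists = [np.linalg.norm(c - p) for c, _ in centroids]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                c, cw = centroids[j]
                centroids[j] = ((c * cw + p * w) / (cw + w), cw + w)
                return
        centroids.append((np.array(p, dtype=float), float(w)))

    for p in points:
        absorb(np.asarray(p, dtype=float), 1.0)
        if len(centroids) > max_centroids:
            old = centroids[:]
            centroids.clear()
            threshold *= growth
            for c, w in old:
                absorb(c, w)
    return centroids  # many weighted centroids approximating the distribution

# Example: sketch 10,000 2-d points drawn around a few centres.
rng = np.random.RandomState(0)
data = np.concatenate([rng.randn(2500, 2) * 0.2 + centre
                       for centre in ([0, 0], [5, 5], [0, 5], [5, 0])])
sketch = streaming_sketch(data, threshold=0.5, max_centroids=200)
print(len(sketch), "weighted centroids summarize", len(data), "points")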

  38. Lesson 3: Sketches make big data small.

  39. Example 4: Search Abuse

  40. Recommendations • Alice got an apple and a puppy • Charles got a bicycle

  41. Recommendations • Alice got an apple and a puppy • Bob got an apple • Charles got a bicycle

  42. Recommendations • What else would Bob like?

  43. Log Files [diagram of log entries recording the interactions of Alice, Bob and Charles]

  44. History Matrix: Users by Items [users (Alice, Bob, Charles) as rows, items as columns, with a check for each item a user has interacted with]

  45. Co-occurrence Matrix: Items by Items • How do you tell which co-occurrences are useful? [items as rows and columns, counts of how often each pair of items appears in the same user's history]
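A toy sketch of these steps, building the co-occurrence counts from the user histories above and scoring each pair. The transcript does not name the test for which co-occurrences are useful; the log-likelihood ratio (LLR) score used in Mahout's item-based recommenders is assumed here:

import math
from itertools import combinations

# Toy history from the earlier slides: who interacted with what.
history = {
    "Alice":   {"apple", "puppy"},
    "Bob":     {"apple"},
    "Charles": {"bicycle"},
}

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 co-occurrence contingency table."""
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def score_pairs(history):
    """Count co-occurrences for every item pair and score them with LLR."""
    users = list(history.values())
    items = sorted(set().union(*users))
    scores = {}
    for a, b in combinations(items, 2):
        k11 = sum(1 for u in users if a in u and b in u)      # both
        k12 = sum(1 for u in users if a in u and b not in u)  # a only
        k21 = sum(1 for u in users if b in u and a not in u)  # b only
        k22 = len(users) - k11 - k12 - k21                    # neither
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

# Pairs whose co-occurrence count k11 is anomalously high become indicators.
print(score_pairs(history))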

  46. Co-occurrence Binary Matrix [the co-occurrence counts reduced to binary entries]

  47. Indicator Matrix: Anomalous Co-Occurrence • Result: the marked row will be added to the indicator field in the item document…

  48. Indicator Matrix • That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine • id: t4 • title: puppy • desc: The sweetest little puppy ever. • keywords: puppy, dog, pet • indicators: (t1) • Note: data for the indicator field is added directly to the metadata for a document in the Solr index. You don't need to create a separate index for the indicators.
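A sketch of how that document and a recommendation query might look against Solr, using its standard JSON update and select endpoints; the core name, URL, field names and the example history items are assumptions for illustration:

import requests

SOLR = "http://localhost:8983/solr/items"  # assumed Solr core

# The item document from the slide, with the indicators riding along as a normal field.
doc = {
    "id": "t4",
    "title": "puppy",
    "desc": "The sweetest little puppy ever.",
    "keywords": ["puppy", "dog", "pet"],
    "indicators": ["t1"],
}

# Index the document (standard Solr JSON update endpoint).
requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()

# Recommending is then just a search: query the indicators field with the
# items from the user's recent history.
recent_history = ["t1", "t7"]
resp = requests.get(
    f"{SOLR}/select",
    params={"q": "indicators:(%s)" % " OR ".join(recent_history)},
)
print(resp.json()["response"]["docs"])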

  49. Internals of the Recommender Engine

  50. Internals of the Recommender Engine
