Building and Improving Products with Hadoop Matthew Rathbone

Presentation Transcript


  1. Building and Improving Products with Hadoop Matthew Rathbone

  2. What is Foursquare • Foursquare helps you explore the world around you. • Meet up with friends, discover new places, and save money using your phone. • 4 billion check-ins • 35 million users • 50 million points of interest (POI) • 150 employees • 1 TB+ of data a day

  3. First, a story http://www.flickr.com/photos/shannonpatrick17

  4. The Right Tool for the Job • Nginx – Serving static files • Perl – Regular expressions • XML – Frustrating people • Hadoop (MapReduce) – Counting
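
Editor's sketch (not from the talk): to make "Hadoop is for counting" concrete, here is the MapReduce counting pattern written in plain Python, with no Hadoop involved. The function names (`map_phase`, `reduce_phase`, `count_words`) and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tip):
    """Mapper: emit a (word, 1) pair for every word in one tip."""
    for word in tip.lower().split():
        yield word, 1


def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_words(tips):
    """Run the mapper over every tip, then reduce all emitted pairs."""
    return reduce_phase(chain.from_iterable(map_phase(t) for t in tips))
```

On a real cluster the mappers and reducers run on many machines and the framework handles the grouping step; the shape of the computation is the same.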

  5. Counting – what is it good for http://www.flickr.com/photos/blaahhi/

  6. Statistically Improbable Phrases

  7. SIPS use cases • menu extraction • sentiment analysis • venue ratings • specific recommendations • search indexing • pricing data • facility information

  8. How is SIPS built? Basically lots of counting.

  9. SIPS • Tokenize data with a language model (into n-grams) • Built using tips, shouts, menu items, likes, etc. • Apply a TF-IDF algorithm (term frequency, inverse document frequency) • Global phrase count • Local phrase count (in a venue) • Some filtering and ranking • Re-compute & deploy nightly
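
Editor's sketch (not from the talk): the steps on this slide can be approximated in a few lines of Python. This is a toy under stated assumptions, not Foursquare's pipeline: a whitespace tokenizer stands in for the language model, the "global" count is taken to be the number of venues containing a phrase (the usual document frequency), and the names `ngrams` and `sips` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Split text on whitespace and return all 1..max_n word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def sips(venue_tips, top_k=3):
    """Rank each venue's phrases by TF-IDF and keep the top_k."""
    # Local phrase count: how often each phrase appears in a venue's tips.
    local = {venue: Counter(p for tip in tips for p in ngrams(tip))
             for venue, tips in venue_tips.items()}
    # Global count (assumed): number of venues that contain the phrase.
    venues_with = Counter(p for counts in local.values() for p in counts)
    n_venues = len(local)
    ranked = {}
    for venue, counts in local.items():
        total = sum(counts.values())
        scores = {p: (c / total) * math.log(n_venues / venues_with[p])
                  for p, c in counts.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked[venue] = sorted(scores, key=lambda p: (-scores[p], p))[:top_k]
    return ranked
```

A phrase that appears at every venue (such as "great") scores zero, while a phrase concentrated at one venue rises to the top, which is the behavior the slide describes as filtering and ranking.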

  10. Why use Hadoop? http://www.flickr.com/photos/dbrekke/

  11. SIPS – Without Hadoop • Potential Problems • Database Query Throttling • Venues are out of sync • Altering the algorithm could take forever to populate for all venues • Where would you store the results? • What about debug data? • Does it scale to 10x, 100x? • What about other, similar workflows?

  12. SIPS – Hadoop Benefits • Quick Deployment • Modular & Reusable • Arbitrarily complex combination of many datasets • Every step of the workflow creates value

  13. Apple Store - Downtown San Francisco • 1 tip mentions "haircuts": “Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac” • Search for "haircuts" in "san francisco" → Apple store??? • Fixed by looking at % of tips and overall frequency
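
Editor's sketch (not from the talk): one way to read "looking at % of tips and overall frequency" is as a pair of thresholds. The function name `keep_phrase`, both threshold values, and the exact rule are assumptions; the slide does not say how the fix was implemented.

```python
def keep_phrase(venue_mentions, venue_tips, overall_mentions, overall_tips,
                min_share=0.05, min_lift=2.0):
    """Decide whether a phrase should be indexed for a venue.

    Keeps the phrase only if a large enough share of the venue's tips
    mention it AND that share is well above the phrase's overall rate.
    min_share and min_lift are hypothetical tuning knobs.
    """
    if venue_tips == 0 or overall_tips == 0:
        return False
    venue_share = venue_mentions / venue_tips
    overall_share = overall_mentions / overall_tips
    return venue_share >= min_share and venue_share >= min_lift * overall_share
```

With invented numbers: a store where 1 of 500 tips mentions "haircuts" is rejected, while a salon where 40 of 100 tips mention it is kept.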

  14. Data & Modularity

  15. Actually, it’s a bit more complicated http://www.flickr.com/photos/bfishadow

  16. These benefits require infrastructure

  17. Dependency Management • Many options • Oozie (Apache) • Azkaban (LinkedIn) • Luigi (Spotify, we <3 this) • Hamake (Codeminders) • Chronos (Airbnb)
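
Editor's sketch (not from the talk): all of the tools listed solve the same core problem, running jobs only after the jobs they depend on have finished. The snippet below shows that idea with Python's standard-library `graphlib` (Python 3.9+). It does not use the API of Luigi, Oozie, or any other tool named on the slide, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, jobs):
    """Run every job after all of its prerequisites.

    dependencies maps a job name to the set of jobs it depends on;
    jobs maps a job name to a zero-argument callable.
    Returns the order in which the jobs were run.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        jobs[name]()
    return order
```

For a nightly pipeline shaped like ingest, tokenize, count, deploy, each step would be declared with the previous one as its dependency. Real schedulers add retries, scheduling, and skipping of already-completed steps on top of this ordering.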

  18. Database / Log Ingestion • Sqoop • Mongo-Hadoop • Kafka • Flume • Scribe • etc

  19. MapReduce Friendly Datastore • A few obvious ones: • HBase • Cassandra • Voldemort • We built our own; it’s very similar to Voldemort and uses the HFile API

  20. Getting started without all that stuff

  21. Components you likely don’t have

  22. The best way to start • Don’t use Hadoop* • *but pretend you do

  23. Other reasons to not use Hadoop • Your idea might not be very good • Hadoop will slow you down to start with • You don’t have enough infrastructure yet • build it when you need it • V1 might not be that complex • V1 could be a spreadsheet

  24. SIPS • Version 1 • Off the shelf language model • A subset of Venues & Tips • Did not use MapReduce • Did not push to production at all

  25. SIPS • Version 2 • Started building our own language model • Rewritten as a MapReduce job • Manually loaded data to production • Filters for English data only • Tweak, improve, etc.

  26. SIPS • Version 3 • Incorporated more data sources into our language model • Deployment to KV store (auto) • Incorporated lots of debug output • Language pipeline also feeds sentiment analysis • Now we’re in the perfect place to iterate & improve

  27. …to explore data

  28. In Summary • Hadoop is good for counting, so use it for counting • Move quickly whenever possible and don’t worry about automation • Bring in new production services as you need them • Freedom!

  29. matthew@foursquare.com @rathboma Bonus: http://hadoopweekly.com from my colleague, Joe Crobak (presenting later!) Thanks!