
Presentation Transcript


  1. Hadoop @ … and other stuff

  2. Who Am I? Staff engineer at LinkedIn. I'm this guy! • http://www.linkedin.com/in/richardbhpark • rpark@linkedin.com

  3. Hadoop… what is it good for? • Products directly influenced by Hadoop • Products indirectly influenced by Hadoop • Additionally, about 50% of cluster use is for business analytics

  4. A long long time ago (or 2009) • 40 million members • Apache Hadoop 0.19 • 20-node cluster • Machines built from Fry's parts ("pizza boxes") • PYMK (People You May Know) in 3 days!

  5. Now-ish • Over 5,000 nodes • 6 clusters (1 production, 1 dev, 2 ETL, 2 test) • Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish) • Security turned on • About 900 users • 15-20K Hadoop job submissions a day • PYMK in < 12 hours!

  6. Current Setup • Use Avro (mostly) • Dev/ad hoc cluster: used for development and testing of workflows, and for analytic queries • Prod clusters: data that will appear on our website; only reviewed workflows • ETL clusters: walled off

  7. Three Common Problems: getting data in, processing the data, getting data out. [Diagram: Hadoop cluster (not to scale) with Data In → Process Data → Data Out]

  8. Data In

  9. Databases (c. 2009-2010) • Originally pulled directly through JDBC from a backup DB • Pulled deltas when available and merged • Data arrives extra late (wait for replication to the replicas) • Large data pulls affected by daily locks • Very manual: schema, connections, repairs • No deltas meant no Sqoop • Costly (Oracle) [Diagram: Live Sites → Offline Data copies → DWH and Hadoop; 24 hr and 5-12 hr latencies]

  10. Databases (Present) • Commit logs/deltas from production • Copied directly to HDFS • Converted/merged to Avro • Schema is inferred [Diagram: Live Sites → Hadoop, < 12 hr]
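
For illustration, a minimal sketch of the "copied directly to HDFS" step using the standard Hadoop FileSystem API; the paths and file names below are made up, not LinkedIn's actual layout.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hedged sketch: land a database commit-log delta in HDFS ahead of the Avro merge. */
public class DeltaUploader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical paths: one day's delta file goes into a dated landing directory.
        fs.copyFromLocalFile(
                new Path("/export/deltas/member-2013-06-05.log"),
                new Path("/data/databases/member/deltas/2013/06/05/"));
        fs.close();
    }
}
```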

  11. Databases (Future, 2014?) • Diffs sent directly to Hadoop • Avro format • Lazily merged • Explicit schema [Diagram: Datastores → Databus → Hadoop, < 15 min]

  12. Webtrack (c. 2009-2011) • Flat files (XML) • Pulled from every server periodically, grouped and gzipped • Uploaded into Hadoop • Failures nearly untraceable [Diagram: servers → NAS → ? → NAS → Hadoop; I seriously don't know how many hops and copies]

  13. Webtrack (Present) • Apache Kafka!! Yay! • Avro in, Avro out • Automatic pulls into Hadoop • Auditing [Diagram: many Kafka brokers → Hadoop; 5-10 mins end to end]

  14. Apache Kafka • LinkedIn Events • Service metrics • Use schema registry • Compact data (md5) • Auto register • Validate schema • Get latest schema • Migrating to Kafka 0.8 • Replication
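
For illustration, a minimal sketch of the "compact data (md5)" idea: fingerprint the Avro schema with MD5 and prepend it to each payload so a consumer can look the schema up in the registry. The class name and details are illustrative, not LinkedIn's actual registry code.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

/** Hedged sketch: prepend an MD5 schema fingerprint to each Avro-encoded payload. */
public class Md5AvroEncoder {
    public static byte[] encode(Schema schema, GenericRecord record) throws Exception {
        // MD5 of the schema text acts as the registry key for this schema version.
        byte[] schemaId = MessageDigest.getInstance("MD5")
                .digest(schema.toString(false).getBytes("UTF-8"));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(schemaId);                       // 16-byte header identifies the schema
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();                  // header + Avro-encoded record
    }
}
```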

  15. Apache Kafka + Hadoop = Camus • Avro only • Uses zookeeper • Discover new topics • Find all brokers • Find all partitions • Mappers pull from Kafka • Keeps offsets in HDFS • Partitions into hour • Counts incoming events
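
For illustration, a small sketch of the "partitions into hour" step: mapping an event timestamp to an hourly HDFS directory. The base directory and topic name are made up.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hedged sketch: derive the hourly HDFS partition path for a Camus-style pull. */
public class HourlyPartitioner {
    private static final SimpleDateFormat HOUR_FORMAT =
            new SimpleDateFormat("yyyy/MM/dd/HH");
    static { HOUR_FORMAT.setTimeZone(TimeZone.getTimeZone("UTC")); }

    // e.g. ("/data/tracking", "PageViewEvent", t) -> /data/tracking/PageViewEvent/2013/06/05/17
    public static String partitionPath(String baseDir, String topic, long eventTimeMs) {
        return baseDir + "/" + topic + "/" + HOUR_FORMAT.format(new Date(eventTimeMs));
    }
}
```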

  16. Kafka Auditing • Use Kafka to audit itself • Tool to audit and alert • Compare counts • Kafka 0.8?

  17. Lessons We Learned • Avoid lots of small files • Automation with auditing = sleep for me • Group similar data = smaller/faster • Spend time writing to spend less time reading • Convert to binary, partition, compress • Future: adaptive replication (higher for new data, lower for old), metadata store (HCatalog), columnar store (ORC? Parquet?)
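
For illustration, a minimal sketch of "convert to binary, partition, compress" using Avro container files with a deflate codec; the schema and field names are invented.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Hedged sketch: write events as compressed binary Avro instead of many small text files. */
public class CompressedAvroWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-field event schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
          + "{\"name\":\"memberId\",\"type\":\"long\"},"
          + "{\"name\":\"pageKey\",\"type\":\"string\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6));   // binary + compressed on disk
        writer.create(schema, new File("page_view.avro"));

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("memberId", 12345L);
        rec.put("pageKey", "profile");
        writer.append(rec);
        writer.close();
    }
}
```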

  18. Processing Data

  19. Pure Java • Time consuming to write jobs • Little code re-use • Shoot yourself in the face • Only used when necessary • Performance • Memory • Lots of libraries to help (boilerplate stuff)
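
For a sense of the boilerplate involved, a minimal pure-Java MapReduce job: a hypothetical per-member event count, assuming tab-separated input with the member ID in the first field. It is a sketch, not one of LinkedIn's actual jobs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hedged sketch: count events per member ID with plain MapReduce. */
public class EventCount {
    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assume the first tab-separated field is the member ID.
            ctx.write(new Text(line.toString().split("\t")[0]), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "event-count");   // Hadoop 1.x-era idiom
        job.setJarByClass(EventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```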

  20. Little Piggy (Apache Pig) • Mainly a pigsty (Pig 0.11) • Used by data products • Transparent • Good performance, tunable • UDFs, DataFu • Tuples and bags? WTF
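
Since Pig UDFs are written in Java, here is a tiny hedged example of one, which also shows what a bag is (an unordered collection of tuples). The UDF itself is invented for illustration, not part of DataFu.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/** Hedged sketch of a Pig UDF: return the number of tuples in a bag. */
public class BagSize extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        DataBag bag = (DataBag) input.get(0);   // a bag holds zero or more tuples
        return bag.size();
    }
}
```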

  21. Hive • Hive 0.11 • Only for ad hoc queries • Biz ops, PMs, analysts • Hard to tune • Easy to use • Lots of adoption • ETL data in external tables :/ • Hive Server 2 for JDBC [Image: disturbing mascot]
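
For illustration, a minimal ad hoc query through the HiveServer2 JDBC driver; the host, user, table, and columns are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Hedged sketch: an ad hoc query against HiveServer2 over JDBC. */
public class HiveAdhocQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "rpark", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page_key, COUNT(*) FROM page_view_event GROUP BY page_key")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```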

  22. Future in Processing • Giraph • Impala, Shark/Spark… etc. • Tez • Crunch • Other? • Say no to streaming

  23. Workflows

  24. Azkaban • Run hadoop jobs in order • Run regular schedules • Be notified on failures • Understand how flows are executed • View execution history • Easy to use

  25. Azkaban @ LinkedIn • Used in LinkedIn since early 2009 • Powers all our Hadoop data products • Been using 2.0+ since late 2012 • 2.0 and 2.1 quietly released early 2013

  26. Azkaban @ LinkedIn • One Azkaban instance per cluster • 6 clusters total • 900 Users • 1500 projects • 10,000 flows • 2500 flow executing per day • 6500 jobs executing per day

  27. Azkaban (before) Engineer-designed UI...

  28. Azkaban 2.0

  29. Azkaban Features • Schedules DAGs for executions • Web UI • Simple job files to create dependencies • Authorization/Authentication • Project Isolation • Extensible through plugins (works with any version of Hadoop) • Prison for dark wizards
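
The "simple job files" are plain property files; a hedged sketch of a two-job flow where the second job depends on the first (job names and commands are invented):

```properties
# start.job
type=command
command=echo "load data"

# report.job -- runs only after start succeeds
type=command
command=echo "build report"
dependencies=start
```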

  30. Azkaban - Upload • Zip Job files, jars, project files

  31. Azkaban - Execute

  32. Azkaban - Schedule

  33. Azkaban - Viewer Plugins • HDFS Browser • Reportal

  34. Future Azkaban Work • Higher availability • Generic Triggering/Actions • Embedded graphs • Conditional branching • Admin client

  35. Data Out

  36. Voldemort • Distributed key-value store • Based on Amazon Dynamo • Pluggable • Open source

  37. Voldemort Read-Only • Filesystem store for RO • Create data files and index on Hadoop • Copy data to Voldemort • Swap

  38. Voldemort + Hadoop • Transfers are parallel • Transfer records in bulk • Ability to Roll back • Simple, operationally low maintenance • Why not Hbase, Cassandra? • Legacy, and no compelling reason to change • Simplicity is nice • Real answer: I don’t know. It works, we’re happy.
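
For illustration, a minimal sketch of reading a Hadoop-built read-only store from the serving side with the Voldemort client; the bootstrap URL, store name, and key are made up.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

/** Hedged sketch: look up a value in a read-only store built on Hadoop and swapped in. */
public class PymkLookup {
    public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));
        // "pymk" stands in for a store whose data and index files were generated by a Hadoop job.
        StoreClient<String, String> client = factory.getStoreClient("pymk");
        System.out.println(client.getValue("member:12345"));
        factory.close();
    }
}
```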

  39. Apache Kafka • Reverse the flow • Messages produced by Hadoop • Consumer upstream takes action • Used for emails, r/w store updates, where Voldemort doesn't make sense, etc.
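
For illustration, a minimal sketch of the "messages produced by Hadoop" side. It uses today's Kafka producer API rather than the 0.7/0.8-era one the talk describes, and the broker, topic, and payload are made up.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Hedged sketch: a Hadoop-side job emitting a message for an upstream consumer to act on. */
public class EmailTriggerProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // An upstream consumer of this topic would send the actual email.
            producer.send(new ProducerRecord<>("email-trigger", "member:12345",
                    "{\"template\":\"pymk_digest\"}"));
        }
    }
}
```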

  40. Nearing the End

  41. Misc Hadoop at LinkedIn • Believe in optimization • File size, task count and utilization • Reviews, culture • Strict limits • Quotas on size/file count • 10K task limit • Use capacity scheduler • Default queue with 15m limit • marathon queue for others

  42. We do a lot with little… • 50-60% cluster utilization • Or about 5x more work than some other companies • Every job is reviewed for production • Teaches good practices • Scheduled to optimize utilization • Prevents future headaches • These keep our team size small • Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x • Hadoop team grew 5x (to 5 people)

  43. More info • Our data site: data.linkedin.com • Kafka: kafka.apache.org • Azkaban: azkaban.github.io/azkaban2 • Voldemort: project-voldemort.com

  44. The End
