
Building LinkedIn’s Real-time Data Pipeline



Presentation Transcript


  1. Building LinkedIn’s Real-time Data Pipeline Jay Kreps

  2. What is a data pipeline?

  3. What data is there? • Database data • Activity data • Page Views, Ad Impressions, etc • Messaging • JMS, AMQP, etc • Application and System Metrics • rrdtool, Graphite, etc • Logs • syslog, log4j, etc

  4. Data Systems at LinkedIn • Search • Social Graph • Recommendations • Live Storage • Hadoop • Data Warehouse • Monitoring Systems • …

  5. Problem: Data Integration

  6. Point-to-Point Pipelines

  7. Centralized Pipeline

  8. How have companies solved this problem?

  9. The Enterprise Data Warehouse

  10. Problems • Data warehouse is a batch system • Central team that cleans all data? • One person’s cleaning… • Relational mapping is non-trivial…

  11. My Experience

  12. LinkedIn’s Pipeline

  13. LinkedIn Circa 2010 • Messaging: ActiveMQ • User Activity: In-house log aggregation • Logging: Splunk • Metrics: JMX => Zenoss • Database data: Databus, custom ETL

  14. 2010 User Activity Data Flow

  15. Problems • Fragility • Multi-hour delay • Coverage • Labor intensive • Slow • Does it work?

  16. Four Ideas • Central commit log for all data • Push data cleanliness upstream • O(1) ETL • Evidence-based correctness

  17. Four Ideas • Central commit log for all data • Push data cleanliness upstream • O(1) ETL • Evidence-based correctness

  18. What kind of infrastructure is needed?

  19. Very confused • Messaging (JMS, AMQP, …) • Log aggregation • CEP, Streaming

  20. First Attempt: Don’t reinvent the wheel!

  21. Problems With Messaging Systems • Persistence is an afterthought • Ad hoc distribution • Odd semantics • Featuritis

  22. Second Attempt: Reinvent the wheel!

  23. Idea: Central, Distributed Commit Log

  24. What is a commit log?
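The slide's answer: a commit log is an append-only, totally ordered sequence of records in which writers append at the end and readers track their own position (offset). A minimal in-memory sketch, with illustrative names that are not Kafka's actual API:

```python
class CommitLog:
    """Append-only, totally ordered sequence of records.
    History is never mutated; each reader tracks its own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Read up to max_records starting at offset.
        Readers advance independently, so many consumers can share one log."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for event in ["page_view", "ad_impression", "connection_added"]:
    log.append(event)

# Two independent consumers at different offsets see the same ordered history.
assert log.read(0) == ["page_view", "ad_impression", "connection_added"]
assert log.read(1) == ["ad_impression", "connection_added"]
```

Because the log is the single ordered record of what happened, every downstream system can replay it from its own offset, which is what makes it usable as the central pipeline.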

  25. Data Flow

  26. Apache Kafka

  27. Some Terminology • Producers send messages to Brokers • Consumers read messages from Brokers • Messages are sent to a Topic • Each topic is broken into one or more ordered partitions of messages

  28. APIs • send(String topic, String key, Message message) • Iterator<Message>
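The terminology and API from the two slides above can be sketched together: producers send to a topic, each topic is split into ordered partitions, and messages with the same key land in the same partition so per-key ordering is preserved. This is a toy in-memory model with hypothetical names, not the real Kafka client:

```python
from collections import defaultdict


class Broker:
    """Toy broker: each topic is a fixed number of ordered partitions.
    Illustrative sketch only; the real Kafka broker is a distributed service."""

    def __init__(self, partitions_per_topic=2):
        self.partitions_per_topic = partitions_per_topic
        # topic -> list of partitions, each partition an ordered list of messages
        self.topics = defaultdict(
            lambda: [[] for _ in range(self.partitions_per_topic)]
        )

    def send(self, topic, key, message):
        """Producer API: same key -> same partition -> per-key ordering holds."""
        partition = hash(key) % self.partitions_per_topic
        self.topics[topic][partition].append(message)

    def consume(self, topic, partition):
        """Consumer API: an iterator over one partition's messages, in order."""
        return iter(self.topics[topic][partition])


broker = Broker()
broker.send("PageView", key="user-1", message="view /home")
broker.send("PageView", key="user-1", message="view /jobs")

partition = hash("user-1") % 2
assert list(broker.consume("PageView", partition)) == ["view /home", "view /jobs"]
```

The key-to-partition mapping is the design choice that lets the broker scale horizontally while still guaranteeing order where it matters (per key), rather than globally.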

  29. Distribution

  30. Performance • 50MB/sec writes • 110MB/sec reads

  31. Performance

  32. Performance Tricks • Batching • Producer • Broker • Consumer • Avoid large in-memory structures • Pagecache friendly • Avoid data copying • sendfile • Batch Compression
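The producer-side batching trick above can be sketched as: buffer messages and ship them as one request when the batch fills, amortizing per-request overhead. All names here are hypothetical, not Kafka's client API:

```python
class BatchingProducer:
    """Buffer messages and send them in one request per batch
    (illustrative sketch of producer-side batching)."""

    def __init__(self, send_batch, batch_size=100):
        self.send_batch = send_batch  # callable taking a list of messages
        self.batch_size = batch_size
        self.buffer = []

    def send(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Ship the buffered messages as a single request."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []


requests = []  # stands in for network requests to the broker
producer = BatchingProducer(requests.append, batch_size=3)
for i in range(7):
    producer.send(i)
producer.flush()

# 7 messages became 3 requests instead of 7.
assert requests == [[0, 1, 2], [3, 4, 5], [6]]
```

The same idea applies on the broker and consumer sides, and batches also compress better than single messages, which is the "batch compression" bullet above.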

  33. Kafka Replication • In 0.8 release • Messages are highly available • No centralized master

  34. Kafka Info http://incubator.apache.org/kafka

  35. Usage at LinkedIn • 10 billion messages/day • Sustained peak: • 172,000 messages/second written • 950,000 messages/second read • 367 topics • 40 real-time consumers • Many ad hoc consumers • 10k connections/colo • 9.5TB log retained • End-to-end delivery time: 10 seconds (avg)

  36. Datacenters

  37. Four Ideas • Central commit log for all data • Push data cleanliness upstream • O(1) ETL • Evidence-based correctness

  38. Problem • Hundreds of message types • Thousands of fields • What do they all mean? • What happens when they change?

  39. Make activity data part of the domain model

  40. Schema free?

  41. LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float)

  42. Schemas • Structure can be exploited • Performance • Size • Compatibility • Need a formal contract

  43. Avro Schema • Avro – data definition and schema • Central repository of all schemas • Reader always uses same schema as writer • Programmatic compatibility model
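In this setup each event type is defined once as an Avro schema and checked into the central repository. An illustrative example (the record name, namespace, and fields are hypothetical, not LinkedIn's actual schemas):

```json
{
  "type": "record",
  "name": "PageViewEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "pageKey", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Adding an optional field with a default, as with referrer here, keeps old readers working against new data: that is the kind of change the programmatic compatibility model can verify before a schema ships.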

  44. Workflow • Check in schema • Code review • Ship

  45. Four Ideas • Central commit log for all data • Push data cleanliness upstream • O(1) ETL • Evidence-based correctness

  46. Hadoop Data Load • Map/Reduce job does data load • One job loads all events • Hive registration done automatically • Schema changes handled transparently • ~5 minute lag on average to HDFS
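The "one job loads all events" idea above is what makes the ETL O(1): the load job is generic over topics, so adding a new event type adds no new pipeline code. A sketch under that assumption, with all names hypothetical (stand-ins for the Kafka fetch and HDFS write):

```python
def load_all_topics(list_topics, fetch_events, write_partition):
    """O(1) ETL sketch: one generic job handles every topic,
    so a new event type requires no new load code."""
    for topic in list_topics():
        events = fetch_events(topic)               # pull new messages for this topic
        write_partition(f"/data/{topic}", events)  # e.g. one HDFS path per topic


# Fakes standing in for the message broker and HDFS:
written = {}
load_all_topics(
    list_topics=lambda: ["PageView", "AdImpression"],
    fetch_events=lambda topic: [f"{topic}-event"],
    write_partition=written.__setitem__,
)

assert written == {
    "/data/PageView": ["PageView-event"],
    "/data/AdImpression": ["AdImpression-event"],
}
```

With per-event schemas available centrally, the same generic job can also register each dataset in Hive and absorb schema changes, as the slide describes.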

  47. Four Ideas • Central commit log for all data • Push data cleanliness upstream • O(1) ETL • Evidence-based correctness
