Demystifying Systems for Interactive and Real-time Analytics

Demystifying Systems for Interactive and Real-time Analytics. The BigFrame Team. Duke University, Hong Kong Polytechnic University, and HP Labs. Analytics System Landscape. Streaming. Dataflow. MapReduce. Graph. Multi-tenant. MPP DB. Array DB. Columnar. Mixed. Text Analytics.


Presentation Transcript


  1. Demystifying Systems for Interactive and Real-time Analytics. The BigFrame Team. Duke University, Hong Kong Polytechnic University, and HP Labs.

  2. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics.

  3. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: SQL Server Parallel Data Warehouse, Teradata, Gamma, DB2 PE, Aster, Greenplum, Netezza.

  4. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: HP Vertica, Redshift, Vectorwise, ParAccel.

  5. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: Hadoop, Hive, HadoopDB, Mahout, Pig, Tenzing.

  6. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: Dremel, Dryad, Spark, SCOPE, Drill, Impala, Stinger.

  7. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: Bigtable, HBase, HANA, Megastore, Cassandra, Spanner, Druid, Splunk.

  8. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: Storm, Streambase, ElasticSearch, Pregel, GraphLab, GraphX, Cloudera Search, Cassovary, SciDB, MadLINQ, HAMA, Solr.

  9. Analytics System Landscape. Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Systems shown: Mesos, Serengeti, YARN, Cloud platforms.

  10. What does this mean for Big Data Practitioners? Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics.

  11. Gives them a lot of power! From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html

  12. Even the mighty may need a little help

  13. Challenges for Practitioners. App Developers, Data Scientists: "Which system to use for the app that I am developing?" Factors to weigh: • Features (e.g., graph data) • Performance (e.g., claims like "System A is 50x faster than B") • Resource efficiency • Growth and scalability • Multi-tenancy

  14. Challenges for Practitioners. App Developers, Data Scientists: "Which system to use for the app that I am developing?" "Different parts of my app have different requirements." System Admins: "Compose 'best of breed' systems OR use a 'one size fits all' system?" "Managing many systems is hard!"

  15. Challenges for Practitioners. App Developers, Data Scientists: "Which system to use for the app that I am developing?" "Different parts of my app have different requirements." System Admins: "Managing many systems is hard!" CIO: "Total Cost of Ownership (TCO)?"

  16. Numbers make decisions easier

  17. Need benchmarks

  18. One Approach: categorize systems, then develop a benchmark per system category.

  19. Useful, But … Categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics. Existing benchmarks shown: Terasort, HiBench, Linear Road, TPC-H / TPC-DS, Graph 500, PageRank, RDF Benchmarks, Star Schema Benchmark, DFSIO, SWIM, Counting triangles, GridMix, MulTe, SS-DB, MapReduce vs. Parallel DB / Hive Benchmark (in HiBench) / Berkeley Big Data Benchmark, Information Extraction Benchmark, Yahoo Cloud Serving Benchmark (YCSB), YCSB Variants, CH-benchCHmark.

  20. Problem #1: May Miss the Big Picture

  21. Problem #1: May Miss the Big Picture. Per-category benchmarks cannot capture the complexities and end-to-end behavior of big data applications and deployments: (i) Bottlenecks (ii) Data conversion, transfer, & loading overheads (iii) Storage costs & other parts of the data life-cycle (iv) Resource management challenges (v) Total Cost of Ownership (TCO)

  22. Problem #2: Benchmark vs. Benchmark Generator. "Give a man a fish and you will feed him for a day. Give him fishing gear and you will feed him for life." -- Anonymous

  23. BigFrame: A Benchmark Generator for Big Data Analytics

  24. How a user uses BigFrame. The user supplies a bigif (benchmark input format) through the BigFrame Interface; the Benchmark Generator produces a bspec (benchmark specification); the Benchmark Driver for the System Under Test runs the benchmark on the System Under Test (shown: Hive, MapReduce, HBase) and returns the results.

  25. bspec: Benchmark Specification. Four parts, laid out along a time axis against the System Under Test (shown: Hive, MapReduce, HBase): 1. Data for initial load 2. Data refresh pattern 3. Query streams 4. Evaluation metrics
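
The four parts of a bspec could be held in a small record like the one below. This is only a sketch: the slides name the four parts but not their representation, so the class names, field names, metric names, and the (time offset, data set) encoding of a refresh event are all assumptions. The table names come from the application domain shown later in the deck.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueryStream:
    """A named sequence of queries (hypothetical shape)."""
    name: str
    queries: List[str]


@dataclass
class BenchmarkSpec:
    """Hypothetical container for the four parts of a bspec."""
    initial_load: List[str]                   # 1. data for initial load
    refresh_pattern: List[Tuple[int, str]]    # 2. (seconds from start, data set)
    query_streams: List[QueryStream]          # 3. query streams
    metrics: List[str]                        # 4. evaluation metrics

    def timeline(self) -> List[Tuple[int, str]]:
        """Return the refresh events ordered by time offset."""
        return sorted(self.refresh_pattern)


spec = BenchmarkSpec(
    initial_load=["web_sales", "item", "promotion"],
    refresh_pattern=[(600, "tweets"), (300, "web_sales")],
    query_streams=[QueryStream("s1", ["q1", "q2"])],
    metrics=["latency", "throughput"],  # assumed metric names
)
```

A benchmark driver would read `spec.timeline()` to decide when to apply each refresh while the query streams run.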

  26. What does the user (want to) specify? The bigif (benchmark input format), supplied through the BigFrame Interface.

  27. The 3Vs: Volume, Variety, Velocity. Overlaid on the landscape categories: Streaming, Dataflow, MapReduce, Graph, Multi-tenant, MPP DB, Array DB, Columnar, Mixed, Text Analytics.

  28. bigif: BigFrame's Input Format. Data Volume: small, medium, large. Data Variety: relational, text, array, graph. Data Velocity: at rest, slow, fast. Query Volume: query concurrency & classes. Query Variety: micro, macro. Query Velocity: exploratory, continuous.
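
As an illustration of the six bigif dimensions, the sketch below models them as Python enums plus one record type. The slides do not show bigif's concrete syntax, so every identifier here (`BigifInput`, `concurrent_streams`, and so on) is hypothetical; only the dimension names and their values are taken from the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class DataVolume(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


class DataVariety(Enum):
    RELATIONAL = "relational"
    TEXT = "text"
    ARRAY = "array"
    GRAPH = "graph"


class DataVelocity(Enum):
    AT_REST = "at rest"
    SLOW = "slow"
    FAST = "fast"


class QueryVariety(Enum):
    MICRO = "micro"
    MACRO = "macro"


class QueryVelocity(Enum):
    EXPLORATORY = "exploratory"
    CONTINUOUS = "continuous"


@dataclass(frozen=True)
class BigifInput:
    """Hypothetical record for one point in the bigif space."""
    data_volume: DataVolume
    data_variety: FrozenSet[DataVariety]
    data_velocity: DataVelocity
    query_variety: QueryVariety
    query_velocity: QueryVelocity
    concurrent_streams: int = 1  # Query Volume, modeled here as concurrency only

    def __post_init__(self):
        # Validate only; a frozen dataclass cannot reassign fields here.
        if not self.data_variety:
            raise ValueError("at least one data type is required")
        if self.concurrent_streams < 1:
            raise ValueError("concurrent_streams must be >= 1")


# Large, fast relational + text data with continuous, application-level queries.
example = BigifInput(
    data_volume=DataVolume.LARGE,
    data_variety=frozenset({DataVariety.RELATIONAL, DataVariety.TEXT}),
    data_velocity=DataVelocity.FAST,
    query_variety=QueryVariety.MACRO,
    query_velocity=QueryVelocity.CONTINUOUS,
)
```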

  29. Benchmark Generation. The Benchmark Generator turns a bigif (benchmark input format) into a bspec (benchmark specification): 1. Initial data to load 2. Data refresh pattern 3. Query streams 4. Evaluation metrics. bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}. Benchmark generation can be addressed as a search problem within a rich application domain.
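
To make the "discrete space" and "search" ideas concrete, the sketch below enumerates the points spanned by five of the six dimensions and then picks queries from a tiny catalog that fit a chosen point. It is a naive filter standing in for whatever search BigFrame actually performs, which the slides do not describe. The catalog entries and function names are invented for illustration, and Query Volume is left out because concurrency is open-ended.

```python
from itertools import combinations, product

DATA_VOLUME = ["small", "medium", "large"]
DATA_TYPES = ["relational", "text", "array", "graph"]
DATA_VELOCITY = ["at rest", "slow", "fast"]
QUERY_VARIETY = ["micro", "macro"]
QUERY_VELOCITY = ["exploratory", "continuous"]


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def enumerate_points():
    """List all points of the space (Query Volume excluded)."""
    return list(product(DATA_VOLUME, nonempty_subsets(DATA_TYPES),
                        DATA_VELOCITY, QUERY_VARIETY, QUERY_VELOCITY))


# Hypothetical catalog of queries available in the application domain.
CATALOG = [
    {"name": "sales_by_item", "needs": {"relational"}, "variety": "micro"},
    {"name": "tweet_sentiment", "needs": {"text"}, "variety": "micro"},
    {"name": "promotion_impact", "needs": {"relational", "text"}, "variety": "macro"},
    {"name": "influencer_ranking", "needs": {"graph", "text"}, "variety": "macro"},
]


def select_queries(data_variety, query_variety):
    """Return catalog queries whose data needs fit the requested variety."""
    wanted = set(data_variety)
    return [q["name"] for q in CATALOG
            if q["needs"] <= wanted and q["variety"] == query_variety]


points = enumerate_points()
complex_bi = select_queries({"relational", "text"}, "macro")
```

With 3 volumes, 15 non-empty combinations of data types, 3 velocities, 2 query varieties, and 2 query velocities, `points` holds 540 entries, so even this small space is too large to cover with hand-written benchmarks.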

  30. Application Domain Modeled Currently: e-commerce sales, promotions, and recommendations; social media sentiment & influence. Benchmark generation can be addressed as a search problem within a rich application domain.

  31. Application Domain Modeled Currently. Entities shown: Tweets, Item, Web_sales, Customer, Relationships, Promotion.

  32. Application Domain Modeled Currently. Entities shown: Web_sales, Promotion, Item.

  33. Application Domain Modeled Currently

  34. Benchmark Generation. The Benchmark Generator turns a bigif (benchmark input format) into a bspec (benchmark specification): 1. Initial data to load 2. Data refresh pattern 3. Query streams 4. Evaluation metrics. bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}. BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain.

  35. Use Cases of BigFrame

  36. Use Case I: Exploratory BI. Data Variety = {Relational}; Query Variety = Micro. • Large volumes of relational data • Mostly aggregation and few joins • Can Spark's performance match that of an MPP DB? BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries.

  37. Use Case II: Complex BI. Data Variety = {Relational, Text}; Query Variety = Macro (application-focused instead of micro-benchmarking). • Large volumes of relational data • Even larger volumes of text data • Combined analytics. BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets.

  38. Use Case III: Dashboards. Data Velocity = Fast; Query Velocity = Continuous (as opposed to Exploratory). • Large volume and velocity of relational and text data • Continuously-updated dashboards. BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh.
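
A continuous query whose result changes upon data refresh can be pictured as below: a running top-items count that is updated with each refresh batch. This is an illustrative toy, not BigFrame's generated workload; the class name, the `item` field, and the top-3 cutoff are assumptions.

```python
from collections import Counter


class ContinuousTopItems:
    """Keeps per-item sales counts up to date across refresh batches."""

    def __init__(self, k=3):
        self.k = k
        self.counts = Counter()

    def on_refresh(self, batch):
        """Fold one refresh batch into the counts and return the current top k."""
        self.counts.update(row["item"] for row in batch)
        return self.counts.most_common(self.k)


dashboard = ContinuousTopItems()
first = dashboard.on_refresh([{"item": "a"}, {"item": "a"}, {"item": "b"}])
second = dashboard.on_refresh([{"item": "b"}, {"item": "b"}])
```

Here the leading item flips from `a` to `b` after the second refresh, which is the behavior a dashboard benchmark needs to exercise.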

  39. Use Case IV: Does One Size Fit All? Data Variety = {Relational, Text, Graph}; Query Variety = Macro. • A growing set of applications has to process relational, text, & graph data • Compose "best of breed" systems or use a "one size fits all" system? BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics.

  40. Use Case V: Multi-tenancy and SLAs. Specified through the Query Volume dimension. • Big data deployments are increasingly multi-tenant and need to meet SLAs. BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming).
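
A specified number of concurrent query streams with class labels might be built as in the sketch below. The three labels come from the slide; the round-robin assignment of labels to streams, the `LabeledStream` type, and the query names are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List

CLASSES = ("Batch", "Interactive", "Streaming")


@dataclass
class LabeledStream:
    stream_id: int
    query_class: str
    queries: List[str]


def make_streams(n_streams: int,
                 queries_by_class: Dict[str, List[str]]) -> List[LabeledStream]:
    """Create n_streams streams, assigning class labels round-robin."""
    if n_streams < 1:
        raise ValueError("n_streams must be >= 1")
    return [
        LabeledStream(i, label, list(queries_by_class.get(label, [])))
        for i, label in zip(range(n_streams), cycle(CLASSES))
    ]


streams = make_streams(4, {
    "Batch": ["nightly_rollup"],
    "Interactive": ["sales_by_item"],
    "Streaming": ["tweet_sentiment"],
})
```

A driver could then attach a different SLA target to each `query_class` when it reports results.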

  41. Working with the Community • First release of BigFrame planned for August 2013 • With feedback from benchmark developers (BigBench) • Open-source with extensibility APIs • Benchmark Drivers for more systems • Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking) • Instantiate the BigFrame pipeline for more app domains

  42. Take Away • “Benchmarks shape a field (for better or worse) …” -- David Patterson, Univ. of California, Berkeley • Benchmarks meet different needs for different people: end customers, application developers, system designers, system administrators, researchers, CIOs • BigFrame helps users generate benchmarks that best meet their needs
