
BigBench: Big Data Benchmark Proposal

Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen


Presentation Transcript


  1. BigBench: Big Data Benchmark Proposal Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen

  2. BigBench • End-to-end benchmark • Based on a product retailer • Focus on • Parallel DBMS • MapReduce (MR) engines • Initial work presented at 1st WBDB, San Jose • Collaboration between industry and academia • Teradata, University of Toronto, InfoSizing, Oracle • Full spec at 3rd WBDB, Xi'an, China

  3. Outline • Data Model • Variety, Volume, Velocity • Data Generator • PDGF for structured data • Enhancement: semi-structured and text generation • Workload specification • Main driver: retail big data analytics • Covers: data sources, declarative and procedural processing, and machine learning algorithms • Metrics • Evaluation • Done on Teradata Aster

  4. Data Model • Big Data is not about size only • 3 V’s • Variety • Different types of data • Volume • Huge size • Velocity • Continuous updates

  5. Data Model (Variety)

  6. Data Model (continued) • Volume • Based on scale factor • Similar to TPC-DS scaling • Weblogs & product reviews also scaled • Velocity • Periodic refreshes for all data • Different velocity for different areas • V_structured < V_unstructured < V_semistructured • Queries run with refresh

  7. Data Generator • "Parallel Data Generation Framework" (PDGF) • For the structured part of the model • Scale factor similar to TPC-DS • Extensions to PDGF • Web logs • Product reviews • Web logs: retail customers/guests visiting the site • Web logs similar to Apache web server logs • Coupled with structured data • Product reviews: customers and guest users • Algorithm based on Markov chain • Real data set as sample input • Coupled with structured data
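
The web log extension described on this slide can be illustrated with a minimal Python sketch. It is not PDGF code: the function name `generate_weblog`, the `guest_ratio` parameter, the `/item?sk=` URL layout, and the fixed start date are all assumptions made for illustration. It emits lines in the Apache Common Log Format and couples them to the structured data by drawing customer and item surrogate keys from lists that would come from the structured tables.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_weblog(customer_sks, item_sks, n_lines, guest_ratio=0.3, seed=42):
    """Return n_lines Apache-style log lines (hypothetical layout).

    customer_sks / item_sks: surrogate keys taken from the structured data,
    so every click refers to an existing customer (or a guest) and item.
    """
    rng = random.Random(seed)
    clock = datetime(2013, 1, 1, tzinfo=timezone.utc)  # arbitrary start time
    lines = []
    for _ in range(n_lines):
        clock += timedelta(seconds=rng.randint(1, 60))
        # Guests have no customer key; Apache logs use "-" for a missing user.
        user = "-" if rng.random() < guest_ratio else str(rng.choice(customer_sks))
        item = rng.choice(item_sks)
        host = "10.0.{}.{}".format(rng.randint(0, 255), rng.randint(1, 254))
        stamp = clock.strftime("%d/%b/%Y:%H:%M:%S %z")
        size = rng.randint(200, 5000)
        lines.append(
            '{} - {} [{}] "GET /item?sk={} HTTP/1.1" 200 {}'.format(
                host, user, stamp, item, size
            )
        )
    return lines


if __name__ == "__main__":
    for line in generate_weblog([1, 2, 3], [10, 20], n_lines=3):
        print(line)
```

Seeding the random generator keeps the output repeatable, which matters for a benchmark where every run at a given scale factor should see the same data.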

  8. Data Generator: TextGen • Input • Real review data • Category information • Rating • Review text • Initial setup • Parse text • Produce tokens • Correlation between tokens • Build repository by category • API for data generation • Integrated with PDGF • Based on category ID
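
The setup steps on this slide (parse, tokenize, record token correlations, build a per-category repository, generate by category ID) can be sketched as a first-order Markov chain. This is an illustrative assumption about how such a generator could work, not the actual TextGen implementation; `build_repository` and `generate_review` are hypothetical names, and whitespace tokenization stands in for whatever parsing the real tool uses.

```python
import random
from collections import defaultdict


def build_repository(reviews):
    """Build one first-order Markov model per category.

    reviews: iterable of (category_id, review_text) pairs.
    Returns {category_id: {token: [tokens observed directly after it]}}.
    """
    repository = defaultdict(lambda: defaultdict(list))
    for category_id, text in reviews:
        tokens = text.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            repository[category_id][current].append(following)
    # Freeze into plain dicts so lookups never create empty entries.
    return {cat: dict(chain) for cat, chain in repository.items()}


def generate_review(repository, category_id, length=8, seed=0):
    """Random walk over the chain of the given category."""
    rng = random.Random(seed)
    chain = repository[category_id]
    token = rng.choice(sorted(chain))  # sorted for a repeatable start token
    words = [token]
    while len(words) < length:
        followers = chain.get(token)
        if not followers:  # token was only ever seen at the end of a review
            break
        token = rng.choice(followers)
        words.append(token)
    return " ".join(words)


if __name__ == "__main__":
    sample = [
        (1, "great phone with a great screen"),
        (1, "the screen is bright"),
        (2, "the book is long"),
    ]
    repo = build_repository(sample)
    print(generate_review(repo, category_id=1))
```

Storing followers as a list with repeats means frequent transitions are picked more often, which is a simple way to preserve the token correlations of the input reviews.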

  9. Workload (continued) • Workload Queries • 30 queries • Specified in English • No required syntax • Business functions (adapted from a McKinsey report) • Marketing • Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences • Merchandising • Assortment optimization, pricing optimization • Operations • Performance transparency, product return analysis • Supply chain • Inventory management • Reporting (customers and products)

  10. Workload (continued): Technical Functions • Data source dimension • Structured • Semi-structured • Unstructured • Processing type dimension • Declarative (SQL, HQL) • Procedural • Mix of both • Analytic technique dimension • Statistical analysis: correlation analysis, time-series, regression • Data mining: classification, clustering, association mining, pattern analysis and text analysis • Simple reporting: ad hoc queries not covered above

  11. Workload (continued): Business Categories Query Breakdown

  12. Workload (continued): Technical Dimensions Breakdown

  13. Technical Dimensions Breakdown (continued)

  14. Metrics • Future Work • Initial thoughts • Focus on loading and type of processing • MR engines good at loading • DBMS good at SQL • Metric = • Applicable to single and multi-streams

  15. Evaluation • BigBench proof of concept • DBMS • Typically data loaded into tables • Possibly parsing weblogs to get schema • Reviews captured as VARCHAR or BLOB fields • Queries run using SQL + UDF • MR engine • Data can be loaded on HDFS • MR, HQL, Pig Latin can be used • DBMS & MR engine • DBMS with Hadoop connectors • Data can be placed and split among both • Processing can also be split between the two
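
The "parsing weblogs to get schema" step for the DBMS case might look like the following Python sketch. It assumes the logs follow the Apache Common Log Format; the column names in `LOG_PATTERN` and the function `parse_log_line` are hypothetical and are not taken from the BigBench specification.

```python
import re

# One named group per target column (hypothetical schema).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Turn one log line into a dict of typed columns, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # Apache writes "-" when no bytes were sent.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


if __name__ == "__main__":
    sample = '10.0.0.1 - 42 [01/Jan/2013:00:00:07 +0000] "GET /item?sk=10 HTTP/1.1" 200 512'
    print(parse_log_line(sample))
```

Each returned dict maps directly onto one row of a weblogs table, so the parsed output can be bulk-loaded with the loader of whichever DBMS is under test.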

  16. Evaluation (continued) • Teradata Aster • MPP database • Discovery platform • Supports SQL-MR • System • Aster 5.0 • 8 node cluster • 200 GB of data

  17. Evaluation (continued) • Data generation • DSDGen produced the structured part • PDGF+ produced the semi-structured and unstructured parts • Data loaded into tables • Parsed weblogs into Weblogs table • Product reviews table • Queries • SQL-MR syntax

  18. Evaluation (continued) • Example query • Product category affinity analysis • Computes the probability of browsing products from a category after customers viewed items from another category • One form of market basket analysis • Business case • Marketing • Cross-selling • Type of source • Structured (web sales) • Processing type • Mix of declarative and procedural • Analytics type • Data mining

  19. Evaluation (continued)
      SELECT category_cd1 AS category1_cd,
             category_cd2 AS category2_cd,
             COUNT(*) AS cnt
      FROM basket_generator(
             ON (SELECT i.i_category_id AS category_cd,
                        s.ws_bill_customer_sk AS customer_id
                 FROM web_sales s
                 INNER JOIN item i ON s.ws_item_sk = i.i_item_sk)
             PARTITION BY customer_id
             BASKET_ITEM('category_cd')
             ITEM_SET_MAX(500)
           )
      GROUP BY 1, 2
      ORDER BY 1, 3, 2;
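
For readers without access to Aster, the logic of the query can be approximated in plain Python. This is a sketch under assumptions: it treats each customer's distinct categories as one basket and counts unordered category pairs, which may differ from the exact semantics of `basket_generator`; the function name `category_affinity` is hypothetical, and the `ITEM_SET_MAX` limit is not modeled.

```python
from collections import Counter, defaultdict
from itertools import combinations


def category_affinity(sales):
    """Count how many customers bought from both categories of each pair.

    sales: iterable of (customer_id, category_id) rows, i.e. the result of
    joining web sales with the item table.
    """
    # PARTITION BY customer_id: collect the distinct categories per customer.
    baskets = defaultdict(set)
    for customer_id, category_id in sales:
        baskets[customer_id].add(category_id)

    # Emit every unordered category pair per basket, then GROUP BY and COUNT.
    counts = Counter()
    for categories in baskets.values():
        for pair in combinations(sorted(categories), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    rows = [(1, "A"), (1, "B"), (1, "A"), (2, "A"), (2, "B"), (2, "C"), (3, "C")]
    for (cat1, cat2), cnt in sorted(category_affinity(rows).items()):
        print(cat1, cat2, cnt)
```

Dividing a pair count by the number of customers who bought from the first category would turn these counts into the conditional probability mentioned in the query description.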

  20. Evaluation (Sample Queries)

  21. Evaluation (Processing Type)

  22. Conclusion • End to end big data benchmark • Data Model • Adding semi-structured and unstructured data • Different velocity for structured, semi-structured and unstructured • Volume based on scale factor • Data Generation • Unstructured data using Markov chain model. • Enhancing PDGF for semi-structured and integrating it with unstructured • Workload • 30 queries based on McKinsey report • Covers processing type, data source and data mining algorithms • Evaluation • Aster SQL-MR

  23. Future Work • Add late binding to web logs • No schema/keys upfront • Finalizing • Data generator • Metric specifications • Downloadable kit • More POCs • Hadoop ecosystem, e.g., Hive
