
Benchmarking Datacenter and Big Data Systems


Presentation Transcript


  1. Benchmarking Datacenter and Big Data Systems Wanling Gao, Zhen Jia, Lei Wang, Yuqing Zhu, Chunjie Luo, Yingjie Shi, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, Bizhu Qiu, Lixin Zhang, Jianfeng Zhan http://prof.ict.ac.cn/jfzhan

  2. Acknowledgements • This work is supported by the Chinese 973 project (Grant No. 2011CB302502), the Hi-Tech Research and Development (863) Program of China (Grant No. 2011AA01A203, No. 2013AA01A213), the NSFC project (Grant No. 60933003, No. 61202075), the BNSF project (Grant No. 4133081), and Huawei funding.

  3. Executive summary • ICTBench: an open-source project on datacenter and big data benchmarking • http://prof.ict.ac.cn/ICTBench • Several case studies using ICTBench

  4. Question One • The gap between industry and academia keeps widening • Code • Data sets

  5. Question Two • Different benchmark requirements • Architecture communities • Simulation is very slow • Small data and code sets • System communities • Large-scale deployment is valuable. • Users need real-world applications • There are three kinds of lies: lies, damn lies, and benchmarks

  6. State-of-Practice Benchmark Suites • PARSEC • SPECweb • SPEC CPU • HPCC • TPC-C • GridMix • YCSB

  7. Why a New Benchmark Suite for Datacenter Computing • No existing benchmark suite covers the diversity of datacenter workloads • State of the art: CloudSuite, which includes only six applications chosen according to their popularity

  8. Why a New Benchmark Suite (Cont’) • Memory Level Parallelism (MLP): the number of simultaneously outstanding cache misses • [Chart: MLP of CloudSuite vs. our benchmark suite, DCBench] (sketched below)
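The deck defines MLP but does not show how it was measured. Below is a minimal sketch, assuming MLP is derived as the ratio of two hardware counter readings; the Intel event names mentioned in the docstring are an assumption, not something stated on the slides.

```python
# Back-of-the-envelope MLP from two counter readings. The event names below
# are an assumption about Intel CPUs; the slides do not say how DCBench
# measured MLP.
def mlp(pending_miss_cycle_sum, cycles_with_outstanding_miss):
    """Average number of simultaneously outstanding cache misses, averaged
    only over cycles in which at least one miss is outstanding.

    pending_miss_cycle_sum       -- e.g. l1d_pend_miss.pending
    cycles_with_outstanding_miss -- e.g. l1d_pend_miss.pending_cycles
    """
    if cycles_with_outstanding_miss == 0:
        return 0.0
    return pending_miss_cycle_sum / cycles_with_outstanding_miss

# Hypothetical counter values: on average ~4 misses in flight.
print(mlp(8_000_000, 2_000_000))  # -> 4.0
```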

  9. Why a New Benchmark Suite (Cont’) • Scale-out performance • [Chart: speedup vs. number of working nodes for the data analysis benchmarks in DCBench and CloudSuite] (speedup defined below)
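For reference, the usual definition of the speedup plotted on such scale-out charts, relative to a single working node. The runtimes below are placeholders, not the measurements behind the slide.

```python
# Scale-out speedup: execution time on one working node divided by time on n nodes.
def speedup(t_one_node, t_n_nodes):
    return t_one_node / t_n_nodes

runtimes = {1: 1000.0, 2: 540.0, 4: 300.0}  # hypothetical seconds per node count
for nodes, t in runtimes.items():
    print(f"{nodes} node(s): speedup = {speedup(runtimes[1], t):.2f}")
```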

  10. Outline • Background and Motivation • Our ICTBench • Case studies

  11. ICTBench Project • ICTBench: three benchmark suites • DCBench: architecture (application, OS, and VM execution) • BigDataBench: system (large-scale big data applications) • CloudRank: cloud benchmarks (distributed workload management), not covered in this talk • Project homepage • http://prof.ict.ac.cn/ICTBench • The source code is available

  12. DCBench • DCBench: typical datacenter workloads • Different from scientific computing, which is FLOPS-oriented • Covers applications in important domains • Search engines, electronic commerce, etc. • Each benchmark = a single application • Purposes • Architecture and small-to-medium system research

  13. BigDataBench • Characterizes big data applications • Does not include data-intensive supercomputing • Synthetic data sets ranging from 10 GB to PB scale • Each benchmark = a single big data application • Purposes • Large-scale system and architecture research

  14. CloudRank • Cloud computing • Elastic resource management • Consolidating different workloads • Cloud benchmarks • Each benchmark = a group of consolidated datacenter workloads • Services / data processing / desktop • Purposes • Capacity planning, system evaluation, and research • Users can customize their own benchmarks

  15. Benchmarking Methodology • Decide and rank the main application domains according to a publicly available metric, e.g., page views and daily visitors • Single out the main applications from those domains

  16. Top Sites on the Web • More details at http://www.alexa.com/topsites/global;0

  17. Benchmarking Methodology • Decide and rank the main application domains according to a publicly available metric, e.g., page views and daily visitors • Single out the main applications from those domains (a ranking sketch follows)
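A small illustration of step one of the methodology: rank application domains by a public popularity metric. The page-view numbers are invented placeholders, not the Alexa figures the authors used.

```python
# Rank application domains by daily page views (hypothetical data).
page_views = {
    "search engine": 3.2e9,
    "social network": 2.9e9,
    "electronic commerce": 1.1e9,
    "media streaming": 0.8e9,
}

ranked = sorted(page_views.items(), key=lambda kv: kv[1], reverse=True)
for rank, (domain, views) in enumerate(ranked, start=1):
    print(f"{rank}. {domain}: {views:.1e} daily page views")
```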

  18. Main Algorithms in Search Engines • Algorithms used in search: PageRank, graph mining, segmentation, feature reduction, grep, statistical counting, vector calculation, sort, recommendation, … • [Derived from Top Sites on the Web] (a PageRank sketch follows)
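PageRank heads this list, so here is a minimal power-iteration sketch of it. This is illustrative only; the suite's workloads run such algorithms on Hadoop-style frameworks, not in pure Python, and the toy graph is made up.

```python
# Minimal PageRank via power iteration. `links` must contain every page as a
# key, mapped to the list of pages it links to.
def pagerank(links, damping=0.85, iters=20):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, outs in links.items():
            if not outs:                      # dangling node: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
            else:
                for dst in outs:
                    new_rank[dst] += damping * rank[src] / len(outs)
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy_web))
```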

  19. Main Algorithms in Search Engines (Nutch) • Merge, sort, vector calculation, PageRank, segmentation, scoring & sort, word grep, word count, classification, decision tree, BFS

  20. Main Algorithms in Social Networks • Algorithms used in social networks: recommendation, clustering, classification, graph mining, grep, feature reduction, statistical counting, vector calculation, sort, … • [Derived from Top Sites on the Web]

  21. Main Algorithms in Electronic Commerce • Algorithms used in electronic commerce: recommendation, association rule mining, warehouse operations, clustering, classification, statistical counting, vector calculation, … • [Derived from Top Sites on the Web]

  22. Overview of DCBench

  23. Overview of DCBench (Cont’)

  24. Methodology of Generating Big Data • Analyze the characteristics of small-scale real-world data, e.g., word frequency, word reuse distance, and word distribution across documents • Expand the small-scale data into big data while preserving those characteristics (a simplified sketch follows)
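A deliberately simplified sketch of the expansion idea: measure the word-frequency distribution of a small real corpus and sample from it to synthesize a larger one. The actual BigDataBench generator also preserves word reuse distance and per-document distributions; this sketch keeps only word frequency, and its names and parameters are illustrative.

```python
# Expand a small seed corpus into a larger synthetic one while preserving
# the seed's word-frequency distribution (frequency only; a simplification).
import random
from collections import Counter

def expand_corpus(seed_text, target_words, words_per_doc=200):
    counts = Counter(seed_text.split())
    words = list(counts)
    weights = [counts[w] for w in words]
    docs, generated = [], 0
    while generated < target_words:
        n = min(words_per_doc, target_words - generated)
        docs.append(" ".join(random.choices(words, weights=weights, k=n)))
        generated += n
    return docs

if __name__ == "__main__":
    seed = "big data benchmark big data system data center benchmark"
    for doc in expand_corpus(seed, target_words=30, words_per_doc=10):
        print(doc)
```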

  25. Workloads in BigDataBench 1.0 Beta • Analysis Workloads • Simple but representative operations • Sort, Grep, Wordcount • Highly recognized algorithms • Naïve Bayes, SVM • Search Engine Service Workloads • Widely deployed services • Nutch Server
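WordCount is the simplest of these analysis workloads; a single-machine map/reduce-style sketch is shown below. The suite itself runs WordCount as a Hadoop job, not in Python.

```python
# Minimal WordCount expressed as map and reduce phases.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    lines = ["sort grep wordcount", "grep wordcount wordcount"]
    print(reduce_phase(map_phase(lines)))
```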

  26. A Variety of Workloads Is Included

  27. Features of Workloads

  28. Content • Background and Motivation • Our ICTBench • Case studies

  29. Use Case 1: Microarchitecture Characterization • Using DCBench • Five-node cluster • One master and four slaves (working nodes) • Each node: [hardware configuration]

  30. Instruction Execution Level • DCBench: • Data analysis workloads execute more application-level (user) instructions • Service workloads have higher percentages of kernel-level instructions • [Chart: instruction breakdown for data analysis and service workloads] (a collection sketch follows)
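A hypothetical way to collect such a user/kernel split with Linux perf, using the :u and :k privilege-level modifiers. The slides do not say which tool produced the breakdown, and running this requires suitable perf_event permissions.

```python
# Sketch: split retired instructions into user and kernel mode with perf.
import subprocess

def user_kernel_split(cmd):
    perf = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions:u,instructions:k", "--"] + cmd,
        capture_output=True, text=True)
    counts = {}
    for line in perf.stderr.splitlines():    # perf stat writes counters to stderr
        fields = line.split(",")
        if len(fields) > 2 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    user, kernel = counts["instructions:u"], counts["instructions:k"]
    total = user + kernel
    return user / total, kernel / total

if __name__ == "__main__":
    print(user_kernel_split(["sort", "/etc/services"]))
```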

  31. Pipeline Stalls • DC workloads suffer severe front-end stalls (i.e., instruction fetch stalls) • Services: more RAT (Register Allocation Table) stalls • Data analysis: more RS (Reservation Station) and ROB (ReOrder Buffer) full stalls

  32. Architecture Block Diagram

  33. Front-End Stall Reasons • For DC workloads, high instruction cache and instruction TLB miss rates make the front end inefficient

  34. MLC Behaviors • DC workloads have more MLC misses than HPC workloads • Data analysis workloads show better locality (fewer L2 cache misses) • [Chart: L2 cache misses for service, data analysis, and HPCC workloads]

  35. LLC Behaviors • LLC is good enough for DC workloads • Most L2 cache misses can be satisfied by LLC

  36. DTLB Behaviors • DC workloads have more DTLB misses than HPC workloads • Most data analysis workloads have fewer DTLB misses • [Chart: DTLB misses for service, data analysis, and HPCC workloads]

  37. Branch Prediction • DC: • Data analysis workloads have quite good branch behavior • Service workloads' branches are hard to predict • [Chart: branch prediction for service, data analysis, and HPCC workloads]

  38. DC Workload Characteristics • Data analysis applications share many inherent characteristics, which place them in a different class from desktop, HPC, traditional server, and scale-out service workloads • More details can be found in our IISWC 2013 paper: Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC-2013)

  39. Use Case 2: Architecture Research • Using BigDataBench 1.0 Beta • Data scale: 10 GB – 2 TB • Hadoop configuration: 1 master and 14 slave nodes

  40. Use Case 2: Architecture Research • Some microarchitectural events tend towards stability once the data volume grows beyond a certain point • Cache and TLB behaviors show different trends with increasing data volume for different workloads • L1I misses per 1000 instructions: increase for Sort, decrease for Grep (computed as sketched below)
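The metric written as "L1I_miss/1000ins" on the slide is misses per kilo-instruction (MPKI); the conversion from raw counter readings is shown below with placeholder values.

```python
# Misses per kilo-instruction (MPKI) from raw counter readings.
def mpki(misses, instructions):
    return misses * 1000.0 / instructions

# Hypothetical counter values for illustration only.
print(mpki(misses=2_400_000, instructions=1_200_000_000))  # -> 2.0
```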

  41. Search Engine Service Experiments • The same phenomenon is observed: microarchitectural events tend towards stability once the index size grows beyond a certain point • Big data imposes challenges on architecture research, since large-scale simulation is time-consuming • Index size: 2 GB – 8 GB • Segment size: 4.4 GB – 17.6 GB

  42. Use Case 3: System Evaluation • Using BigDataBench 1.0 Beta • Data scale: 10 GB – 2 TB • Hadoop configuration: 1 master and 14 slave nodes

  43. System Evaluation • Each workload has a threshold (100 MB – 1 TB); the system is fully loaded once the data volume exceeds that threshold • Sort is an exception: it has an inflexion point (10 GB – 1 TB) after which the data processing rate decreases, because its global data access requirements make I/O and the network the bottleneck • System performance depends on both the application and the data volume (the rate metric is sketched below)
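The data processing rate discussed here is simply input volume divided by job execution time. A sketch with placeholder numbers (not the measured results) shows how an inflexion point would surface in this metric.

```python
# Data processing rate = input volume / execution time (placeholder numbers).
def processing_rate_gb_per_s(input_gb, runtime_s):
    return input_gb / runtime_s

runs = {10: 120.0, 100: 900.0, 1000: 11000.0}  # hypothetical {input GB: seconds}
for gb, secs in sorted(runs.items()):
    print(f"{gb:>5} GB -> {processing_rate_gb_per_s(gb, secs):.3f} GB/s")
```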

  44. Conclusion • ICTBench • DCBench • BigDataBench • CloudRank • An open-source project on datacenter and big data benchmarking • http://prof.ict.ac.cn/ICTBench

  45. Publications • BigDataBench: a Big Data Benchmark Suite from Web Search Engines. Wanling Gao, et al. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with ISCA 2013. • Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC-2013). • Characterizing OS Behavior of Scale-out Data Center Workloads. Chen Zheng, et al. Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013), in conjunction with ISCA 2013. • Characterization of Real Workloads of Web Search Engines. Huafeng Xi, et al. 2011 IEEE International Symposium on Workload Characterization (IISWC-2011). • The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. Zhen Jia, et al. Second Workshop on Big Data Benchmarking (WBDB 2012, India) & Lecture Notes in Computer Science (LNCS). • CloudRank-D: Benchmarking and Ranking Cloud Computing Systems for Data Processing Applications. Chunjie Luo, et al. Front. Comput. Sci. (FCS) 2012, 6(4): 347–362.

  46. Thank you! Any questions?
