Staged Database Systems

Thesis Oral. Stavros Harizopoulos.

Presentation Transcript


  1. Staged Database Systems. Thesis Oral. Stavros Harizopoulos

  2. Database world: a 30,000 ft view • OLTP (Online Transaction Processing): many short-lived requests, e.g. Sarah: “Buy this book” • DSS (Decision Support Systems): few long-running queries, e.g. Jeff: “Which store needs more advertising?” • DB systems fuel most e-applications • Improved performance has an impact on everyday life [Figure: internet clients, DBMS, offload data]

  3. New HW/SW requirements • More capacity, throughput efficiency • CPUs run much faster than they can access data • DSS stress the I/O subsystem • Need to optimize all levels of the memory hierarchy [Figure: CPU-to-memory access cost in cycles, the ’80s vs. today; tick labels 1, 10, 300]

  4. The further, the slower • Keep data close to the CPU • Locality and predictability are key • Overlap memory accesses with computation • Modify algorithms and structures to exhibit more locality • DBMS core design contradicts the above goals

  5. Thread-based execution in DBMS • Queries are handled by a pool of threads • Threads execute independently • No means to exploit common operations • New design to expose locality across threads [Figure: conventional DBMS thread pool with no coordination vs. StagedDB; labels D, C]

  6. Staged Database Systems • Organize system components into stages • No need to change algorithms / structures • High concurrency, locality across requests [Figure: StagedDB, with queries flowing through Stage 1, Stage 2, and Stage 3 of the DBMS]

  7. Thesis “By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests thereby improving performance.”

  8. Summary of main results • STEPS: 56%-96% fewer I-misses; full-system evaluation on Shore • QPipe: 1.2x-2x throughput; full-system evaluation on BerkeleyDB [Figure: memory hierarchy (L1 D/I, L2-L3 D/I, RAM, disks), annotated 20-40% (OLTP) and variable (DSS)]

  9. Contributions and dissemination • Introduced StagedDB design: scheduling algorithms for staged systems • Built novel query engine design: QPipe engine maximizes data and work sharing • Addressed instruction cache in OLTP: STEPS applies to any DBMS with few changes • Venues listed on the slide: CIDR’03, IEEE Data Eng. ’05, CMU-TR’02, SIGMOD’05, ICDE’06 demo sub., CMU-TR’05, HDMS’05, VLDB J. subm., VLDB’04, TODS subm.

  10. Outline • Introduction • QPipe (DSS) • STEPS • Conclusions

  11. Query-centric design of DB engines • Queries are evaluated independently • No means to share across queries • Need new design to exploit common data, instructions, and work across operators

  12. QPipe: operator-centric engine • Conventional: “one query, many operators” • QPipe: “one operator, many queries” • Relational operators become µEngines • Queries break up into tasks and queue up [Figure: conventional vs. QPipe; labels: queue, runtime]
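The "one operator, many queries" idea can be illustrated with a toy model. This is a minimal sketch, not QPipe's code: the `MicroEngine` class, the `run_queries` helper, and the three toy operators are hypothetical names chosen for illustration, and real query plans are trees rather than the linear pipelines assumed here.

```python
from collections import deque


class MicroEngine:
    """Hypothetical stand-in for one relational operator that serves many queries."""

    def __init__(self, name, work):
        self.name = name
        self.work = work        # function applied to each task's input
        self.queue = deque()    # pending tasks: (query_id, data)
        self.served = []        # query ids, in the order they were served

    def enqueue(self, query_id, data):
        self.queue.append((query_id, data))

    def run(self):
        """Drain the queue, so all queued queries run this operator back to back."""
        results = {}
        while self.queue:
            query_id, data = self.queue.popleft()
            results[query_id] = self.work(data)
            self.served.append(query_id)
        return results


def run_queries(queries, engines, order):
    """Break each query into per-operator tasks and queue them at the engines.

    queries: {query_id: (set of operator names in its plan, input data)}
    order:   operator names from the bottom of the plan to the top (an assumption).
    """
    current = {qid: data for qid, (_, data) in queries.items()}
    for op in order:
        engine = engines[op]
        for qid, (plan, _) in queries.items():
            if op in plan:
                engine.enqueue(qid, current[qid])
        current.update(engine.run())
    return current


engines = {
    "scan": MicroEngine("scan", lambda rows: list(rows)),
    "filter": MicroEngine("filter", lambda rows: [r for r in rows if r % 2 == 0]),
    "agg": MicroEngine("agg", sum),
}
queries = {
    "q1": ({"scan", "filter", "agg"}, range(10)),
    "q2": ({"scan", "agg"}, range(5)),
}
results = run_queries(queries, engines, ["scan", "filter", "agg"])
print(results)
print(engines["scan"].served)
```

Each engine sees the tasks of both queries in one batch, which is the property the slide contrasts with a conventional engine where each query walks through all of its own operators.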

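  13. QPipe design [Figure: conventional design (query plans with operators A, J, S executed by a thread pool) next to QPipe (a packet dispatcher feeding the queues (Q) of µEngine-A, µEngine-J, and µEngine-S); both sit on a storage engine serving read and write requests]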

  14. Reusing data & work in QPipe • Detect overlap at run time • Shared pages and intermediate results are simultaneously pipelined to parent nodes [Figure: simultaneous pipelining of Q1 and Q2 through an overlapping operator, overlap shown in red]
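A rough sketch of run-time overlap detection follows. It is an illustration under stated assumptions, not QPipe's implementation: `SharedOperator`, `SPCoordinator`, the string signature used to match operators, and the choice to buffer all earlier output so a late arrival can catch up are all hypothetical simplifications.

```python
class SharedOperator:
    """An in-progress operator whose output is pipelined to every attached query."""

    def __init__(self, signature, source):
        self.signature = signature
        self.source = iter(source)
        self.produced = []     # output so far, kept so late arrivals can catch up (assumption)
        self.consumers = {}    # query id -> list that receives the output
        self.done = False

    def attach(self, query_id):
        # A late query first receives a copy of what was already produced.
        self.consumers[query_id] = list(self.produced)

    def step(self):
        """Produce one tuple and hand it to all attached queries at once."""
        try:
            item = next(self.source)
        except StopIteration:
            self.done = True
            return False
        self.produced.append(item)
        for out in self.consumers.values():
            out.append(item)
        return True


class SPCoordinator:
    """Detects, at submission time, that an identical operator is already running."""

    def __init__(self):
        self.in_progress = {}
        self.evaluations = 0   # how many operators were actually started

    def submit(self, query_id, signature, make_source):
        op = self.in_progress.get(signature)
        if op is None or op.done:
            op = SharedOperator(signature, make_source())
            self.in_progress[signature] = op
            self.evaluations += 1
        op.attach(query_id)
        return op


coord = SPCoordinator()
op1 = coord.submit("Q1", "scan(LINEITEM)", lambda: range(5))
op1.step()
op1.step()                                                     # Q1 is two tuples in
op2 = coord.submit("Q2", "scan(LINEITEM)", lambda: range(5))   # overlap detected
while op1.step():
    pass
print(coord.evaluations, op1.consumers)
```

In this toy version the scan runs once and both queries end up with the full output. The later slides on order-sensitive scans suggest that real sharing is more constrained than this buffering shortcut implies.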

  15. Mechanisms for sharing • Multi-query optimization: not used in practice • Materialized views: requires workload knowledge • Buffer pool management: opportunistic • Shared scans (RedBrick, Teradata, SQL Server): limited use • QPipe complements the above approaches

  16. Experimental setup • QPipe prototype: built on top of BerkeleyDB, 7,000 lines of C++; shared-memory buffers, native OS threads • Platform: 2GHz Pentium 4, 2GB RAM, 4 SCSI disks • Benchmarks: TPC-H (4GB)

  17. Sharing order-sensitive scans • Two clients send query at different intervals • QPipe performs 2 separate joins [Figure: TPC-H Query 4 plans for Q1 and Q2, each with operators A, S, and M-J over I scans of ORDERS and LINEITEM; operators are marked order-insensitive or order-sensitive]

  18. Sharing order-sensitive scans • Two clients send query at different intervals • QPipe performs 2 separate joins [Plot: total response time (sec) vs. time difference between arrivals]

  19. TPC-H workload • Clients use pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster • ...while maintaining low response times [Plot: throughput (queries/hr) vs. number of clients]

  20. QPipe: conclusions • DB engines evaluate queries independently • Limited existing mechanisms for sharing • QPipe requires few code changes • SP (simultaneous pipelining) is a simple yet powerful technique • Allows dynamic sharing of data and work • Other benefits (not described here) • I-cache, D-cache performance • Efficiently execute MQO plans

  21. Outline • Introduction • QPipe • STEPS (OLTP) • Conclusions

  22. Online Transaction Processing • High-end servers, non I/O bound • L1-I stalls are 20-40% of execution time • Instruction caches cannot grow • Need solution for instruction cache-residency [Plot: cache size (10KB to 10MB) vs. year introduced (’96 to ’04), comparing max on-chip L2/L3 cache with L1-I sizes for various CPUs]

  23. Related work • Hardware and compiler approaches • Increased block size, stream buffer [Ranganathan98] • Code layout optimizations [Ramirez01] • Database software approaches • Instruction cache for DSS [Padmanabhan01] [Zhou04] • Instruction cache for OLTP: Challenging!

  24. STEPS for cache-resident code • STEPS: Synchronized Transactions through Explicit Processor Scheduling • Keep thread model, insert sync points: multiplex execution, reuse instructions • Microbenchmark: eliminate 96% of L1-I misses • TPC-C: eliminate 2/3 of misses, 1.4 speedup [Figure: a transaction as a sequence of operations (Begin, Select, Update, Insert, Delete, Commit), with threads running S, U, and D operations; annotation: still larger than I-cache]

  25. I-cache aware context-switching [Figure: threads 1 and 2 each run select() as code segments s1-s7. Without STEPS, each thread runs s1-s7 in turn and every segment misses in the instruction cache. With STEPS, a context-switch (CTX) point is placed after s1-s3, which fit in the I-cache: thread 1 misses on s1-s3 and thread 2 then hits on s1-s3; thread 1 misses on s4-s7 and thread 2 then hits on s4-s7]
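The hit/miss pattern on this slide can be reproduced with a toy cache model. The sketch below is only an illustration: it assumes a fully associative LRU cache that holds four code segments, treats s1-s7 as equally sized units, and places the single CTX point after s3. None of these parameters come from the talk, and the function names are made up for the example.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses for a sequence of code-segment names in an LRU cache."""
    cache = OrderedDict()
    misses = 0
    for seg in trace:
        if seg in cache:
            cache.move_to_end(seg)         # hit: mark as most recently used
        else:
            misses += 1
            cache[seg] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used segment
    return misses


SEGMENTS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
CHUNKS = [SEGMENTS[:3], SEGMENTS[3:]]      # assumed CTX point after s3
CAPACITY = 4                               # assumed cache size, in segments


def trace_without_steps(n_threads):
    """Each thread runs the whole of select() before the next thread starts."""
    return [seg for _ in range(n_threads) for seg in SEGMENTS]


def trace_with_steps(n_threads):
    """All threads run one chunk, switch at the CTX point, then run the next chunk."""
    return [seg for chunk in CHUNKS for _ in range(n_threads) for seg in chunk]


misses_baseline = count_misses(trace_without_steps(2), CAPACITY)
misses_steps = count_misses(trace_with_steps(2), CAPACITY)
print(misses_baseline, misses_steps)  # 14 7
```

With two threads the model gives 14 misses without the context switch and 7 with it, matching the M and H marks in the figure: only the first thread of the group pays for loading each chunk.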

  26. Placing CTX calls in source • Evaluation: comparable performance to manual placement, while being more conservative [Figure: AutoSTEPS tool pipeline: DBMS binary, valgrind, instruction mem. refs, STEPS simulation, mem. addresses for CTX, gdb, source lines to insert CTX (e.g. file1.c:30, file2.c:40)]

  27. Experimental setup (1st part) • Implemented on top of Shore • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • Microbenchmark • Index fetch, in-memory index • Fast CTX for both systems, warm cache

  28. Microbenchmark: L1-I misses • STEPS eliminates 92-96% of misses for additional threads [Plot: L1-I cache misses (1K to 4K) vs. concurrent threads (1 to 10), AthlonXP]

  29. L1-I misses & speedup • STEPS achieves max performance for 6-10 threads • No need for larger thread groups [Plots: miss reduction (40% to 100%) and speedup (1.1 to 1.4) vs. concurrent threads (10 to 80), AthlonXP]

  30. Challenges in full-system operation So far: • Threads are interested in same Op • Uninterrupted flow • No thread scheduler Full-system requirements • High concurrency on similar Ops • Handle exceptions • Disk I/O, locks, latches, abort • Co-exist with system threads • Deadlock detection, buffer pool housekeeping

  31. System design • Fast CTX through fixed scheduling • Repair thread structures at exceptions • Modify only thread package [Figure: transactions (Xactions) queued at Op X, Op Y, and Op Z, each behind a STEPS wrapper; labels: execution team, stray thread, to other Op]

  32. Experimental setup (2nd part) • AMD AthlonXP • 64KB L1-I + 64KB L1-D, 256KB L2 • TPC-C (wholesale parts supplier) • 2GB RAM, 2 disks • 10-30 Warehouses (1-3GB), 100-300 users • Zero think time, in-memory, lazy commits

  33. One transaction: payment • STEPS outperforms baseline system • 1.4 speedup, 65% fewer L1-I misses [Plot: normalized count (cycles, L1-I misses) vs. number of users]

  34. Mix of four transactions • Xaction mix reduces team size • Still, 56% fewer L1-I misses [Plot: normalized count (cycles, L1-I misses) vs. number of users]

  35. STEPS: conclusions • STEPS can handle full OLTP workloads • Significant improvements in TPC-C • 65% fewer L1-I misses • 1.2-1.4 speedup • STEPS minimizes both capacity and conflict misses without increasing I-cache size or associativity

  36. StagedDB: future work • Promising platform for Chip-Multiprocessors • DBMS suffer from CPU-to-CPU cache misses • StagedDB allows work to follow data -- not the other way around! • Resource scheduling • Stages cluster requests for DB locks, I/O • Potential for deeper, more effective scheduling

  37. Conclusions • New hardware, new requirements • Server core design remains the same • Need new design to fit modern hardware • StagedDB: optimizes all memory hierarchy levels; promising design for future installations

  38. The speaker would like to thank: his academic advisor, Anastassia Ailamaki; his thesis committee members, Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker; and his coauthors, Kun Gao, Vladislav Shkapenyuk, and Ryan Williams. Thank you.

  39. QPipe backup

  40. A µEngine in detail • Tuple batching: I-cache • Query grouping: I&D-cache • Cited work: Padmanabhan01 (ICDE), Zhou04 (SIGMOD), Harizopoulos04 (VLDB), Zhou03 (VLDB) [Figure: a µEngine containing the relational operator code, a queue, main routine, parameters, a scheduling thread, busy threads, and free threads, with simultaneous pipelining]

  41. Simultaneous Pipelining in QPipe [Figure: join operators for Q1 and Q2, without SP and with SP. With SP, the SP coordinator handles the second query in four numbered steps; labels: attach, copy, pipeline, COMPLETE, read, write]

  42. Sharing data & work across queries • Query 1: “Find average age of students enrolled in both class A and class B” [Figure: query plans for Queries 1-3 built from A, max, min, M-J, and S operators over TABLE A and TABLE B, marking a data sharing opportunity and a work sharing opportunity]

  43. Sharing opportunities at run time • Q1 executes operator R • Q2 arrives with R in its plan [Figure: result production for R in Q1 and in Q2, without SP (separate reads and writes) and with SP (SP coordinator, pipeline); label: Q2 sharing potential]

  44. TPC-H workload • Clients use pool of 8 TPC-H queries • QPipe reuses large scans, runs up to 2x faster • ...while maintaining low response times [Plots: throughput (queries/hr) vs. number of clients; average response time vs. think time (sec)]

  45. STEPS backup

  46. Smaller L1-I cache • STEPS outperforms Shore even on smaller caches (PIII) • 62-64% fewer mispredicted branches on both CPUs [Plot: normalized count (20% to 120%, one bar labeled 209%) of cycles, instr. stalls (cycles), L1-I misses, L1-D misses, branches, br. mispred., and br. missed BTB; 10 threads; AthlonXP and Pentium III]

  47. SimFlex: L1-I misses • STEPS eliminates all capacity misses (16, 32KB caches) • Up to 89% overall miss reduction (upper limit is 90%) [Plot: L1-I cache misses (2K to 10K) by associativity (direct, 2-way, 4-way, 8-way, full); AthlonXP, 10 threads, 64b cache block]

  48. One Xaction: payment • STEPS outperforms Shore • 1.4 speedup, 65% fewer L1-I misses • 48% fewer mispredicted branches [Plot: normalized count (20% to 100%) of cycles, L1-I misses, L2-I misses, L1-D misses, L2-D misses, and mispredicted branches vs. number of warehouses]

  49. Mix of four Xactions • Xaction mix reduces average team size (4.3 in 10W) • Still, STEPS has 56% fewer L1-I misses (out of 77% max) [Plot: normalized count (20% to 125%, one bar labeled 121%) of cycles, L1-I misses, L2-I misses, L1-D misses, L2-D misses, and mispredicted branches vs. number of warehouses]
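The two ceilings quoted in the backup slides (a 90% upper limit with 10 threads, and 77% max with an average team size of 4.3) are both consistent with a simple bound. This is offered as a reading of the numbers, not as a formula taken from the talk: if a team of N threads shares a single pass of instruction fetches that would otherwise be repeated N times, at most a fraction 1 - 1/N of the misses can be removed. The symbol R_max is introduced here for illustration only.

```latex
\[
  R_{\max}(N) = 1 - \frac{1}{N}, \qquad
  R_{\max}(10) = 0.90, \qquad
  R_{\max}(4.3) \approx 0.77
\]
```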
