This paper discusses advances in parallel querying with non-dedicated computers, focusing on the challenges and solutions surrounding partitioning, scalability, and load balancing. It critiques traditional static partitioning approaches and proposes a shift to dynamic work allocation methods that leverage a Data In The Network (DITN) architecture. By optimizing query processing without strict partitioning, the paper presents novel techniques for handling heterogeneous nodes and enhancing fault tolerance, ultimately aiming for improved performance and resource utilization in relational databases.
Parallel Querying with Non-Dedicated Nodes
Vijayshankar Raman, Wei Han, Inderpal Narang
IBM Almaden Research Center
Properties of a relational database
• Ease of schema evolution
• Declarative querying
• Transparent scalability (does not quite work)
[Figure: base-table partitions L1..L3, O1..O3, Sa..Sc spread across nodes]
Today: partitioning is the basis for parallelism
• Static partitioning (on the base tables)
• Dynamic partitioning via exchange operators
• Claim: partitioning does not handle non-dedicated nodes well
Problems of partitioning
• Hard to scale incrementally
  • Data must be re-partitioned
  • Disk and CPU must be scaled together
  • DBA must ensure partition-CPU affinity
• Homogeneity assumptions
  • Same plan runs on each node
  • Identical software needed on all nodes
• Susceptible to load variations, node failures / stalls, …
  • Response time is dictated by the speed of the slowest processor
• Bad for transient compute resources
  • E.g., we want the ability to interrupt query work with higher-priority local work
GOAL: a more graceful scale-out solution
Sacrifice partitioning for scalability
• Avoid initial partitioning
• No exchange
New means for work allocation in the absence of partitioning
• Handles heterogeneity and load variations better
Two design features
• Data In The Network (DITN)
  • Shared files on high-speed networks (e.g., SAN)
• Intra-fragment parallelism
  • Send SQL fragments to heterogeneous join processors: each performs the same join, over a different subset of the cross-product space
  • Easy fault tolerance
  • Can use heterogeneous nodes -- whatever is available at that time
Outline
• Motivation
• DITN design
• Experimental results
• Summary
DITN architecture
• Find idle coprocessors P1, P2, P3, P4, P5, P6
• Prepare O, L, C
• Logically divide O x L x C into work units Wi
• In parallel, run SQL queries for the Wi at the Pi
• Property: SPJAG(O x L x C) = AG(∪i SPJAG(Wi))
Restrictions (will return to this at the end)
• Pi cannot use indexes at the information integrator
• Isolation issues
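To make the intra-fragment idea concrete, here is a minimal sketch (not from the paper) of how a coordinator could cut O x L x C into grid work units and emit one SQL fragment per unit. The table and column names, the rowid-range predicates, and the 2 x 2 x 1 grid are illustrative assumptions.

```python
# Sketch: divide O x L x C into grid work units by cutting each table into
# contiguous row ranges, and emit one SQL fragment per work unit.
# Table/column names and the rowid ranges are assumptions for illustration.
from itertools import product

JOIN_SQL = """
SELECT o_orderpriority, SUM(l_extendedprice)
FROM orders, lineitem, customer
WHERE o_orderkey = l_orderkey AND o_custkey = c_custkey
  AND o_rowid BETWEEN {o_lo} AND {o_hi}
  AND l_rowid BETWEEN {l_lo} AND {l_hi}
  AND c_rowid BETWEEN {c_lo} AND {c_hi}
GROUP BY o_orderpriority
"""

def ranges(table_size, cuts):
    """Split [0, table_size) into `cuts` contiguous (lo, hi) row ranges."""
    step = table_size // cuts
    return [(i * step, table_size - 1 if i == cuts - 1 else (i + 1) * step - 1)
            for i in range(cuts)]

def work_units(sizes, cuts):
    """Cross the per-table ranges: every cell of the grid is one work unit."""
    o_r, l_r, c_r = (ranges(s, c) for s, c in zip(sizes, cuts))
    for (o_lo, o_hi), (l_lo, l_hi), (c_lo, c_hi) in product(o_r, l_r, c_r):
        yield JOIN_SQL.format(o_lo=o_lo, o_hi=o_hi, l_lo=l_lo, l_hi=l_hi,
                              c_lo=c_lo, c_hi=c_hi)

# One fragment per idle coprocessor; the coordinator aggregates the union of results.
fragments = list(work_units(sizes=(1_500_000, 6_000_000, 150_000), cuts=(2, 2, 1)))
```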
Why Data In The Network
• Observation: network bandwidth >> query operator bandwidth
  • Network bandwidth: Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
  • The interconnect transfers data faster than query operators can process it
• But exploiting this fast interconnect via SQL is tricky
  • E.g., ODBC scan: 10x slower than a local scan
• Instead, keep temp files in a shared storage system (e.g., SAN-FS)
  • Allows exploitation of the full network bandwidth
• Immediate benefits
  • Fast data transfer
  • DBMS doesn't have to worry about disks, I/O parallelism, parallel scans, etc.
  • Independent scaling of CPU and I/O
Work allocation without partitioning
• For each join, we now have to join the off-diagonal rectangles as well
• Minimize response time = max(RT of each work unit) = max over all (i, j) of JoinCost(|Li|, |Oj|)
• How to optimize the work allocation?
  • ~ cut the join hyper-rectangle into n pieces so as to minimize the maximum perimeter
  • Simplification: assume the join is cut into a grid
  • Choices: number of cuts on each table, size of each cut, allocation of work units to processors
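As a rough illustration of the "minimize the largest work unit" objective, the sketch below (mine, not the paper's optimizer) enumerates grid shapes for a two-table join under an assumed linear JoinCost; the paper does not specify its cost model, so the function and the table sizes are assumptions.

```python
# Sketch of the grid-cut search under an assumed symmetric, monotonic JoinCost
# (here: sum of input sizes, i.e. the "perimeter" of a work unit).
def join_cost(*sizes):
    return sum(sizes)  # assumed cost model, for illustration only

def best_grid(table_sizes, p):
    """Choose cuts (p1, p2) with p1*p2 <= p minimizing the largest work unit."""
    best = None
    for p1 in range(1, p + 1):
        for p2 in range(1, p // p1 + 1):
            cost = join_cost(table_sizes[0] / p1, table_sizes[1] / p2)
            if best is None or cost < best[0]:
                best = (cost, (p1, p2))
    return best

# E.g. Lineitem x Orders on 10 nodes (row counts are illustrative):
print(best_grid((6_000_000, 1_500_000), 10))
```

With these TPC-H-like sizes the search picks 5 cuts on Lineitem and 2 on Orders, the allocation quoted on the next slide.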
Allocation to homogeneous processors
• Theorem: for a monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
  • So allocation is done into rectangles of size |T1|/p1, |T2|/p2, …, |Tn|/pn
• Theorem: for a symmetric JoinCost, RT is minimized when |T1|/p1 = |T2|/p2 = … = |Tn|/pn
  • E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
• Note: cutting each table into the same number of partitions (as is usually done) is sub-optimal
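Worked instance of the symmetric-cost theorem (assuming TPC-H scale-factor-1 sizes of roughly |L| = 6,000,000 and |O| = 1,500,000 rows, which the slides do not state): equal quotients require |L|/p1 = |O|/p2 with p1 * p2 = 10, i.e. p1 = 4 * p2, so p2 = sqrt(10/4) ≈ 1.6 and p1 ≈ 6.3. The best integer grid is p1 = 5, p2 = 2, giving work units of 1.2M x 0.75M rows, which is why the slide cuts Lineitem into 5 parts and Orders into 2; a 10 x 1 cut would instead give 0.6M x 1.5M work units with a larger maximum cost.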
Allocation to heterogeneous co-processors
• Response time of the query: RT = max(RT of each work unit)
  • Choose the size of each work unit, and the allocation of work units to co-processors, so as to minimize RT
• Like a bin-packing problem
  • Solve for the number of cuts on each table, assuming homogeneity
  • Then solve a linear program to find the optimal size of each cut
  • Have to make some approximations to avoid an integer program (see paper)
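The paper solves this with a linear program; as a simpler hedged stand-in, the sketch below just sizes the cuts along one dimension in proportion to each co-processor's relative speed so that all work units finish at roughly the same time. The speed numbers and table size are illustrative assumptions.

```python
# Simplified sketch of heterogeneity-aware cut sizing (not the paper's LP):
# make each coprocessor's cut proportional to its relative speed.
def proportional_cuts(table_size, speeds):
    total = sum(speeds)
    cuts, start = [], 0
    for s in speeds:
        width = round(table_size * s / total)
        cuts.append((start, start + width))
        start += width
    lo, _ = cuts[-1]
    cuts[-1] = (lo, table_size)  # absorb rounding error into the last cut
    return cuts

# 2 fast nodes and 4 slow nodes sharing a 6M-row Lineitem dimension:
print(proportional_cuts(6_000_000, speeds=[3, 3, 1, 1, 1, 1]))
```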
Failure/stall resiliency by work-unit reassignment
Without tuple shipping between plans, failure handling is easy
• If co-processors A, B, C have finished by time X, and co-processor D has not finished by time X(1+f)
  • Take D's work unit and assign it to the fastest among A, B, C (say A)
  • When either D or A returns, close the cursor on the other
• Can generalize to a work-stealing scheme
  • E.g., with 10 coprocessors, assign each 1/20th of the cross-product space
  • When a coprocessor returns with a result, assign it more work
  • Tradeoff: finer work allocation => more flexible work-stealing, but more redundant work
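A minimal sketch of the timeout rule above, assuming the coordinator records a monotonic-clock finish time per node; the function and data-structure names are mine, not the paper's.

```python
# Sketch of the X(1+f) timeout rule for work-unit reassignment.
import time

def maybe_reassign(query_start, finished, outstanding, f=0.5):
    """finished: {node: finish_time}; outstanding: {node: work_unit}.
    All times are time.monotonic() readings taken by the coordinator."""
    if not finished or not outstanding:
        return []                                   # nothing to compare against
    slowest_done = max(t - query_start for t in finished.values())
    if time.monotonic() - query_start <= slowest_done * (1 + f):
        return []                                   # stragglers still within budget
    fastest = min(finished, key=finished.get)       # earliest finisher, e.g. node A
    # Re-issue every overdue work unit on the fastest finished node; whichever
    # copy (original or re-issued) returns first wins, the other cursor is closed.
    return [(unit, fastest) for unit in outstanding.values()]
```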
Analysis: what do we lose by not partitioning?
• Say a join of L x O x C (TPC-H) with 12 processors: 12 = p1·p2·p3
• RT without partitioning ~ JoinCost(|L|/p1, |O|/p2, |C|/p3)
• RT with partitioning ~ JoinCost(|L|/(p1·p2·p3), |O|/(p1·p2·p3), |C|/(p1·p2·p3))
• At p1=6, p2=2, p3=1, the loss in CPU speedup is JoinCost(|L|/6, |O|/2, |C|) ~ 2·JoinCost(|L|/12, |O|/12, |C|/12)
• Note: I/O speedup is unaffected
Optimization: selective clustering
• Can close the gap with partitioning further
  • Sort the largest tables of the join (e.g., L and O) on their join column
  • Now the loss is JoinCost(|L|/12, |O|/12, |C|) / JoinCost(|L|/12, |O|/12, |C|/12)
  • Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
Lightweight join processor
• Work allocation via query fragments => co-processors can be heterogeneous
  • Need not have a full DBMS; a join processor is enough
  • E.g., a screen saver for join processing
• We use a trimmed-down version of Apache Derby
  • Parses CSV files
  • Predicates, projections, sort-merge joins, aggregates, group by
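To illustrate what such a stripped-down join processor has to do (a toy sketch, not Derby code), here is a sort-merge join over two CSV work-unit files; the file names and key columns are assumptions.

```python
# Toy join processor: read two CSV inputs, sort each on its join column,
# then sort-merge join them (inner join, duplicates handled on both sides).
import csv

def read_sorted(path, key_col):
    with open(path, newline="") as fh:
        return sorted(csv.reader(fh), key=lambda row: row[key_col])

def sort_merge_join(left, right, lkey, rkey):
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][lkey], right[j][rkey]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j0 = j                       # remember the start of the matching run
            while j < len(right) and right[j][rkey] == lk:
                yield left[i] + right[j]
                j += 1
            i += 1
            j = j0                       # rescan the run for the next left row

# Assumed work-unit files with the join key in the first column:
orders = read_sorted("orders_workunit.csv", key_col=0)      # o_orderkey
lineitem = read_sorted("lineitem_workunit.csv", key_col=0)  # l_orderkey
for row in sort_merge_join(orders, lineitem, 0, 0):
    pass  # apply predicates, projections, and aggregation here
```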
Outline
• Motivation
• DITN design
• Experimental results
• Summary
Performance degradation due to not partitioning
• At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
• DITN2PART has very little slowdown
  • But needs total clustering
• Slow-down oscillates due to the discreteness of work allocation
[Charts: SOL and SOLCNR join queries]
Failure/stall resiliency by work-unit reassignment
• Query: Orders x Lineitem, group by o_orderpriority; 5 co-processors
• Impose a high load on one co-processor as soon as the query begins
• At 60% load (50% wait), DITN times out and switches to an alternative co-processor
[Chart: PBP and DITN2PART shown for comparison]
Importance of asymmetric allocation
• Initially 2 fast nodes; then add 4 slow nodes
• With symmetric allocation, adding slow nodes can slow down the system
[Chart: contrast between DITN-symmetric and DITN-asymmetric]
Danger of tying a partition to a CPU
• Repeated execution of Orders x Lineitem
• Impose a 75% CPU load on one of the 5 co-processors during the 3rd iteration
• PBP continues to use this slow node throughout
• DITN switches to another node after two iterations
Related work
• Parallel query processing: Gamma, XPRS, many commercial systems
  • Mostly shared-nothing
  • Shared-disk: IBM Sysplex
    • Queries done via tuple shipping between co-processors
  • Oracle
    • Shared disk, but hash joins done via partitioning (static/dynamic)
• Mariposa: similar query-fragment-level work allocation
• Load balancing: Exchange, Flux, River, skew avoidance in hash joins
• Fault-tolerant exchange (FLUX)
• Polar*, OGSA-DQP
• Distributed eddies
• Query execution on P2P systems
Summary and future work
• Partitioning-based parallelism does not handle non-dedicated nodes well
• Proposal: avoid partitioning
  • Share data via the storage system
  • Intra-fragment parallelism instead of exchange
  • Careful work allocation to optimize response time
• Promising initial results: only about 2x slowdown with 10 nodes
Open questions
• Index scans: want shared reads without latching
• Isolation: DITN: uncommitted read; DITN2PART: read-only
• Scaling to large numbers of nodes
• Multi-query optimization to reuse shared temp tables
Backup Slides