This paper discusses advances in parallel querying with non-dedicated computers, focusing on the challenges and solutions surrounding partitioning, scalability, and load balancing. It critiques traditional static partitioning approaches and proposes a shift to dynamic work allocation methods that leverage a Data In The Network (DITN) architecture. By optimizing query processing without strict partitioning, the paper presents novel techniques for handling heterogeneous nodes and enhancing fault tolerance, ultimately aiming for improved performance and resource utilization in relational databases.
Parallel Querying with Non-Dedicated Nodes
Vijayshankar Raman, Wei Han, Inderpal Narang
IBM Almaden Research Center
Properties of a relational database
• Ease of schema evolution
• Declarative querying
• Transparent scalability (does not quite work)
[Figure: base-table partitions L1..L3, O1..O3, Sa..Sc spread across nodes]
Today: partitioning is the basis for parallelism
• Static partitioning (on the base tables)
• Dynamic partitioning via exchange operators
• Claim: partitioning does not handle non-dedicated nodes well
Problems of partitioning
• Hard to scale incrementally
  • Data must be re-partitioned
  • Disk and CPU must be scaled together
  • DBA must ensure partition-CPU affinity
• Homogeneity assumptions
  • Same plan runs on each node
  • Identical software needed on all nodes
• Susceptible to load variations, node failures / stalls, …
  • Response time is dictated by the speed of the slowest processor
• Bad for transient compute resources
  • E.g., we want the ability to interrupt query work with higher-priority local work
GOAL: a more graceful scale-out solution
Sacrifice partitioning for scalability
• Avoid initial partitioning
• No exchange
New means for work allocation in the absence of partitioning
• Handles heterogeneity and load variations better
Two design features
• Data In The Network (DITN)
  • Shared files on high-speed networks (e.g., SAN)
• Intra-fragment parallelism
  • Send SQL fragments to heterogeneous join processors: each performs the same join, over a different subset of the cross-product space
  • Easy fault tolerance
  • Can use heterogeneous nodes -- whatever is available at that time
Outline
• Motivation
• DITN design
• Experimental results
• Summary
DITN architecture
• Find idle coprocessors P1, P2, P3, P4, P5, P6
• Prepare O, L, C
• Logically divide O x L x C into work units Wi
• In parallel, run SQL queries for the Wi at the Pi
• Property: SPJAG(O x L x C) = AG(∪i SPJAG(Wi))
Restrictions (will return to this at the end)
• Pi cannot use indexes at the information integrator
• Isolation issues
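To make the intra-fragment idea concrete, here is a minimal sketch (not from the paper) of how a coordinator could cut O x L x C into grid work units and emit one SQL fragment per unit. The table and column names, the rowid-range predicates, and the 2 x 2 x 1 grid are illustrative assumptions.

```python
# Sketch: divide O x L x C into grid work units by cutting each table into
# contiguous row ranges, and emit one SQL fragment per work unit.
# Table/column names and the rowid ranges are assumptions for illustration.
from itertools import product

JOIN_SQL = """
SELECT o_orderpriority, SUM(l_extendedprice)
FROM orders, lineitem, customer
WHERE o_orderkey = l_orderkey AND o_custkey = c_custkey
  AND o_rowid BETWEEN {o_lo} AND {o_hi}
  AND l_rowid BETWEEN {l_lo} AND {l_hi}
  AND c_rowid BETWEEN {c_lo} AND {c_hi}
GROUP BY o_orderpriority
"""

def ranges(table_size, cuts):
    """Split [0, table_size) into `cuts` contiguous (lo, hi) row ranges."""
    step = table_size // cuts
    return [(i * step, table_size - 1 if i == cuts - 1 else (i + 1) * step - 1)
            for i in range(cuts)]

def work_units(sizes, cuts):
    """Cross the per-table ranges: every cell of the grid is one work unit."""
    o_r, l_r, c_r = (ranges(s, c) for s, c in zip(sizes, cuts))
    for (o_lo, o_hi), (l_lo, l_hi), (c_lo, c_hi) in product(o_r, l_r, c_r):
        yield JOIN_SQL.format(o_lo=o_lo, o_hi=o_hi, l_lo=l_lo, l_hi=l_hi,
                              c_lo=c_lo, c_hi=c_hi)

# One fragment per idle coprocessor; the coordinator aggregates the union of results.
fragments = list(work_units(sizes=(1_500_000, 6_000_000, 150_000), cuts=(2, 2, 1)))
```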
Why Data In The Network
• Observation: network bandwidth >> query operator bandwidth
  • Network bandwidth: Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
  • The interconnect transfers data faster than query operators can process it
• But exploiting this fast interconnect via SQL is tricky
  • E.g., ODBC scan: 10x slower than a local scan
• Instead, keep temp files in a shared storage system (e.g., SAN-FS)
  • Allows exploitation of the full network bandwidth
• Immediate benefits
  • Fast data transfer
  • DBMS doesn't have to worry about disks, I/O parallelism, parallel scans, etc.
  • Independent scaling of CPU and I/O
Work allocation without partitioning
• For each join, we now have to join the off-diagonal rectangles as well
• Minimize response time = max(RT of each work unit) = max over all (i, j) of JoinCost(|Li|, |Oj|)
• How to optimize the work allocation?
  • ~ cut the join hyper-rectangle into n pieces so as to minimize the maximum perimeter
  • Simplification: assume the join is cut into a grid
  • Choices: number of cuts on each table, size of each cut, allocation of work units to processors
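As a rough illustration of the "minimize the largest work unit" objective, the sketch below (mine, not the paper's optimizer) enumerates grid shapes for a two-table join under an assumed linear JoinCost; the paper does not specify its cost model, so the function and the table sizes are assumptions.

```python
# Sketch of the grid-cut search under an assumed symmetric, monotonic JoinCost
# (here: sum of input sizes, i.e. the "perimeter" of a work unit).
def join_cost(*sizes):
    return sum(sizes)  # assumed cost model, for illustration only

def best_grid(table_sizes, p):
    """Choose cuts (p1, p2) with p1*p2 <= p minimizing the largest work unit."""
    best = None
    for p1 in range(1, p + 1):
        for p2 in range(1, p // p1 + 1):
            cost = join_cost(table_sizes[0] / p1, table_sizes[1] / p2)
            if best is None or cost < best[0]:
                best = (cost, (p1, p2))
    return best

# E.g. Lineitem x Orders on 10 nodes (row counts are illustrative):
print(best_grid((6_000_000, 1_500_000), 10))
```

With these TPC-H-like sizes the search picks 5 cuts on Lineitem and 2 on Orders, the allocation quoted on the next slide.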
Allocation to homogeneous processors
• Theorem: for a monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
  • So allocation is done into rectangles of size |T1|/p1, |T2|/p2, …, |Tn|/pn
• Theorem: for a symmetric JoinCost, RT is minimized when |T1|/p1 = |T2|/p2 = … = |Tn|/pn
  • E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
• Note: cutting each table into the same number of partitions (as is usually done) is sub-optimal
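Worked instance of the symmetric-cost theorem (assuming TPC-H scale-factor-1 sizes of roughly |L| = 6,000,000 and |O| = 1,500,000 rows, which the slides do not state): equal quotients require |L|/p1 = |O|/p2 with p1 * p2 = 10, i.e. p1 = 4 * p2, so p2 = sqrt(10/4) ≈ 1.6 and p1 ≈ 6.3. The best integer grid is p1 = 5, p2 = 2, giving work units of 1.2M x 0.75M rows, which is why the slide cuts Lineitem into 5 parts and Orders into 2; a 10 x 1 cut would instead give 0.6M x 1.5M work units with a larger maximum cost.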
Allocation to heterogeneous co-processors
• Response time of the query: RT = max(RT of each work unit)
  • Choose the size of each work unit, and the allocation of work units to co-processors, so as to minimize RT
• Like a bin-packing problem
  • Solve for the number of cuts on each table, assuming homogeneity
  • Then solve a linear program to find the optimal size of each cut
  • Have to make some approximations to avoid an integer program (see paper)
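The paper solves this with a linear program; as a simpler hedged stand-in, the sketch below just sizes the cuts along one dimension in proportion to each co-processor's relative speed so that all work units finish at roughly the same time. The speed numbers and table size are illustrative assumptions.

```python
# Simplified sketch of heterogeneity-aware cut sizing (not the paper's LP):
# make each coprocessor's cut proportional to its relative speed.
def proportional_cuts(table_size, speeds):
    total = sum(speeds)
    cuts, start = [], 0
    for s in speeds:
        width = round(table_size * s / total)
        cuts.append((start, start + width))
        start += width
    lo, _ = cuts[-1]
    cuts[-1] = (lo, table_size)  # absorb rounding error into the last cut
    return cuts

# 2 fast nodes and 4 slow nodes sharing a 6M-row Lineitem dimension:
print(proportional_cuts(6_000_000, speeds=[3, 3, 1, 1, 1, 1]))
```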
Failure/stall resiliency by work-unit reassignment
Without tuple shipping between plans, failure handling is easy
• If co-processors A, B, C have finished by time X, and co-processor D has not finished by time X(1+f)
  • Take D's work unit and assign it to the fastest among A, B, C (say A)
  • When either D or A returns, close the cursor on the other
• Can generalize to a work-stealing scheme
  • E.g., with 10 coprocessors, assign each 1/20th of the cross-product space
  • When a coprocessor returns with a result, assign it more work
  • Tradeoff: finer work allocation => more flexible work-stealing, but more redundant work
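A minimal sketch of the timeout rule above, assuming the coordinator records a monotonic-clock finish time per node; the function and data-structure names are mine, not the paper's.

```python
# Sketch of the X(1+f) timeout rule for work-unit reassignment.
import time

def maybe_reassign(query_start, finished, outstanding, f=0.5):
    """finished: {node: finish_time}; outstanding: {node: work_unit}.
    All times are time.monotonic() readings taken by the coordinator."""
    if not finished or not outstanding:
        return []                                   # nothing to compare against
    slowest_done = max(t - query_start for t in finished.values())
    if time.monotonic() - query_start <= slowest_done * (1 + f):
        return []                                   # stragglers still within budget
    fastest = min(finished, key=finished.get)       # earliest finisher, e.g. node A
    # Re-issue every overdue work unit on the fastest finished node; whichever
    # copy (original or re-issued) returns first wins, the other cursor is closed.
    return [(unit, fastest) for unit in outstanding.values()]
```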
Analysis: what do we lose by not partitioning?
• Say a join of L x O x C (TPC-H) with 12 processors: 12 = p1·p2·p3
• RT without partitioning ~ JoinCost(|L|/p1, |O|/p2, |C|/p3)
• RT with partitioning ~ JoinCost(|L|/(p1·p2·p3), |O|/(p1·p2·p3), |C|/(p1·p2·p3))
• At p1=6, p2=2, p3=1, the loss in CPU speedup is JoinCost(|L|/6, |O|/2, |C|) ~ 2·JoinCost(|L|/12, |O|/12, |C|/12)
• Note: I/O speedup is unaffected
Optimization: selective clustering
• Can close the gap with partitioning further
  • Sort the largest tables of the join (e.g., L and O) on their join column
  • Now the loss is JoinCost(|L|/12, |O|/12, |C|) / JoinCost(|L|/12, |O|/12, |C|/12)
  • Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
Lightweight join processor
• Work allocation via query fragments => co-processors can be heterogeneous
  • Need not have a full DBMS; a join processor is enough
  • E.g., a screen saver for join processing
• We use a trimmed-down version of Apache Derby
  • Parses CSV files
  • Predicates, projections, sort-merge joins, aggregates, group by
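To illustrate what such a stripped-down join processor has to do (a toy sketch, not Derby code), here is a sort-merge join over two CSV work-unit files; the file names and key columns are assumptions.

```python
# Toy join processor: read two CSV inputs, sort each on its join column,
# then sort-merge join them (inner join, duplicates handled on both sides).
import csv

def read_sorted(path, key_col):
    with open(path, newline="") as fh:
        return sorted(csv.reader(fh), key=lambda row: row[key_col])

def sort_merge_join(left, right, lkey, rkey):
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][lkey], right[j][rkey]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j0 = j                       # remember the start of the matching run
            while j < len(right) and right[j][rkey] == lk:
                yield left[i] + right[j]
                j += 1
            i += 1
            j = j0                       # rescan the run for the next left row

# Assumed work-unit files with the join key in the first column:
orders = read_sorted("orders_workunit.csv", key_col=0)      # o_orderkey
lineitem = read_sorted("lineitem_workunit.csv", key_col=0)  # l_orderkey
for row in sort_merge_join(orders, lineitem, 0, 0):
    pass  # apply predicates, projections, and aggregation here
```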
Outline
• Motivation
• DITN design
• Experimental results
• Summary
Performance degradation due to not partitioning
• At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
• DITN2PART has very little slowdown
  • But needs total clustering
• Slow-down oscillates due to the discreteness of work allocation
[Charts: SOL and SOLCNR join queries]
Failure/stall resiliency by work-unit reassignment
• Query: Orders x Lineitem, group by o_orderpriority; 5 co-processors
• Impose a high load on one co-processor as soon as the query begins
• At 60% load (50% wait), DITN times out and switches to an alternative co-processor
[Chart: PBP and DITN2PART shown for comparison]
Importance of asymmetric allocation
• Initially 2 fast nodes; then add 4 slow nodes
• With symmetric allocation, adding slow nodes can slow down the system
[Chart: contrast between DITN-symmetric and DITN-asymmetric]
Danger of tying a partition to a CPU
• Repeated execution of Orders x Lineitem
• Impose a 75% CPU load on one of the 5 co-processors during the 3rd iteration
• PBP continues to use this slow node throughout
• DITN switches to another node after two iterations
Related work
• Parallel query processing: Gamma, XPRS, many commercial systems
  • Mostly shared-nothing
  • Shared-disk: IBM Sysplex
    • Queries done via tuple shipping between co-processors
  • Oracle
    • Shared disk, but hash joins done via partitioning (static/dynamic)
• Mariposa: similar query-fragment-level work allocation
• Load balancing: Exchange, Flux, River, skew avoidance in hash joins
• Fault-tolerant exchange (FLUX)
• Polar*, OGSA-DQP
• Distributed eddies
• Query execution on P2P systems
Summary and future work
• Partitioning-based parallelism does not handle non-dedicated nodes well
• Proposal: avoid partitioning
  • Share data via the storage system
  • Intra-fragment parallelism instead of exchange
  • Careful work allocation to optimize response time
• Promising initial results: only about 2x slowdown with 10 nodes
Open questions
• Index scans: want shared reads without latching
• Isolation: DITN: uncommitted read; DITN2PART: read-only
• Scaling to large numbers of nodes
• Multi-query optimization to reuse shared temp tables
Backup Slides