NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
*University of Wisconsin - Madison
IBM Almaden Research Center
Hardware is a moving target
• Intel-based and POWER-based; cloud; 2-socket, 4-socket (a), 4-socket (b), 8-socket configurations
• Very difficult to optimize & maintain data management code for every HW platform
• Different degrees of parallelism, numbers of sockets, and memory hierarchies
• Different types of CPUs (SSE, out-of-order vs. in-order, 2- vs. 4- vs. 8-way SMT, …), storage technologies, …
NUMA effects => underutilized RAM bandwidth
[Figure: four sockets (0-3), each with local memory, connected by QPI links; numbered arrows trace accesses that cross sockets]
• Sequential accesses are not the final solution
Use case: data shuffling
Ignoring NUMA leaves performance on the table
• Each of the N threads needs to send data to the N-1 other threads
• Common operation in:
  • Sort-merge join
  • Partitioned aggregation
  • MapReduce (both Map and Reduce shuffle data)
  • Scatter/gather
NUMA-aware data mgmt. operations
There are many different data operations that need similar optimizations
• Tons of work on SMPs & NUMA 1.0
  • Sort-merge join [Albutiu et al., VLDB 2012]: favor sequential accesses over random probes
  • OLTP on HW Islands [Porobic et al., VLDB 2012]: should we treat multisocket multicores as a cluster?
Need for primitives
• Kernels used frequently in data management operations
  • E.g. sorting, hashing, data shuffling, …
• Highly optimized software solutions
  • Similar to BLAS: optimized by skilled devs per new HW platform
• Hardware-based solutions
  • Database machines 2.0 (see the Bionic DBMSs talk this afternoon)
  • A sufficiently important kernel can be burnt into HW
  • Expensive, but orders of magnitude more efficient (performance, energy)
  • Companies like IBM and Oracle can do such vertical engineering
Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
  • Ring shuffling
  • Thread migration
• Evaluation
• Conclusions
Data shuffling & naïve implementation
[Figure: data layout before and after the shuffle]
• Naïve implementation: each thread acts autonomously
    for (thread = 0; thread < N; thread++)
        readMyPartitionFrom(thread);
How bad can that be?
• N threads produce N-1 partitions for all the other threads
• Each thread needs to read its partitions
• N * (N-1) transfers (assuming uniform partition sizes)
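The naïve scheme above can be sketched in a few lines. This is a minimal illustration of the access pattern, not the paper's code; the function names are made up for this sketch.

```python
# Minimal sketch of the naive, uncoordinated shuffle: every thread simply
# walks the source threads in the same fixed order 0..N-1.

def naive_shuffle_schedule(n_threads):
    """For each reader thread, the ordered list of source threads it visits."""
    return {reader: [src for src in range(n_threads) if src != reader]
            for reader in range(n_threads)}

def transfer_count(schedule):
    """Total cross-thread transfers: N readers each fetch N-1 partitions."""
    return sum(len(sources) for sources in schedule.values())

sched = naive_shuffle_schedule(8)
print(transfer_count(sched))  # N * (N-1) = 8 * 7 = 56
```

With uniform partition sizes, the transfer count grows quadratically in the number of threads, which is why the order in which those transfers happen starts to matter.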
Shuffling naively in a NUMA system
[Figure: steps 1, 2, 3, … of naïve uncoordinated shuffling across threads T0-T7; usage of QPI and memory paths stays near the max memory BW of one channel, far below the aggregate BW of all channels]
• BUT we bought 4 memory channels and 6 QPI links
• Need to orchestrate threads/transfers to utilize the rest
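The single-channel bottleneck is easy to see with a small model. Assuming (for illustration) 8 threads on 4 sockets with 2 threads per socket, and threads running roughly in lock-step, at loop index k every reader fetches from thread k, i.e. from one socket's memory:

```python
# Model of the naive loop's per-step traffic. The topology (8 threads,
# 4 sockets, 2 threads per socket) is an assumption for illustration.

CORES_PER_SOCKET = 2

def source_sockets_at_step(n_threads, step):
    """Distinct source sockets touched when all readers are at loop index `step`."""
    return {step // CORES_PER_SOCKET for _reader in range(n_threads)}

# Every step hammers exactly one of the four sockets' memory channels,
# while the other three channels sit idle.
print([len(source_sockets_at_step(8, k)) for k in range(8)])
```

In practice the threads drift out of lock-step, but the contention pattern is similar: the readers cluster on one hot socket at a time instead of spreading over all memory channels.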
Ring shuffling
[Figure: two concentric rings; the inner ring holds partitions s0.p0 … s3.p3, the outer ring holds threads s0.t0 … s3.t1]
• Devise a global schedule that all threads follow
  • Inner ring: partitions ordered by thread number, socket; stationary
  • Outer ring: threads ordered by socket, thread number; rotates
• Can be executed in lock-step or loosely
• Needs:
  • Thread binding & synchronization
  • Control over the location of memory allocations
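One concrete realization of the rotating-ring schedule is the classic "step k: read from thread (t + k) mod N" rotation (the paper's exact ordering by socket and thread number may differ; the 8-thread/4-socket topology is assumed for illustration):

```python
# Sketch of a ring-shuffle schedule: partitions stay put, readers rotate.

from collections import Counter

CORES_PER_SOCKET = 2  # assumed topology: 8 threads on 4 sockets

def ring_schedule(n_threads):
    """steps[k][t] = source thread that reader t visits at step k."""
    return [[(t + k) % n_threads for t in range(n_threads)]
            for k in range(n_threads)]

def channel_load(step_sources):
    """How many readers hit each source socket within one step."""
    return Counter(src // CORES_PER_SOCKET for src in step_sources)

sched = ring_schedule(8)
# Step 0 is all-local reads; every step spreads the 8 readers evenly
# across all 4 source sockets, so all memory channels are busy at once.
print([sorted(channel_load(step).values()) for step in sched])
```

Because each step touches every socket with the same number of readers, the aggregate bandwidth of all memory channels and QPI links can be used instead of one channel at a time.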
Ring shuffling in action
[Figure: steps 1, 2, 3, … of ring shuffling across threads T0-T7; usage of QPI and memory paths approaches the aggregate BW of all channels]
• Orchestrated traffic utilizes the underlying QPI network
Thread migration instead of shuffling
[Figure: bandwidth approaching the aggregate BW of all channels]
• Move the computation to the data instead of shuffling the data
• Converts remote accesses into local memory reads
• Choice of migrating only the thread, or the thread + its state
• But both are very sensitive to the amount of thread state
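The sensitivity to thread state can be captured with a back-of-envelope cost model. All the constants below are assumptions for illustration, not measurements from the paper:

```python
# Toy cost model: shuffling pays a remote-access penalty on the partition
# data; migration ships the thread's state plus a fixed switch overhead,
# then reads the data locally. Numbers are illustrative assumptions.

def cheaper_to_migrate(partition_bytes, state_bytes,
                       remote_penalty=0.5, switch_cost_bytes=4096):
    """True if moving the computation beats moving the data.

    remote_penalty:    extra cost factor of a remote read vs. a local one
    switch_cost_bytes: fixed migration overhead, in byte-equivalents
    """
    shuffle_cost = partition_bytes * remote_penalty   # read the data remotely
    migrate_cost = state_bytes + switch_cost_bytes    # ship state, read locally
    return migrate_cost < shuffle_cost

# Large partition, tiny thread state: migration wins.
print(cheaper_to_migrate(partition_bytes=64 << 20, state_bytes=1 << 10))
# Heavy thread state: shuffling the data is cheaper.
print(cheaper_to_migrate(partition_bytes=1 << 20, state_bytes=8 << 20))
```

The crossover point moves with the actual interconnect and migration costs, but the qualitative conclusion matches the slide: migration only pays off when the thread's state is small relative to the data it would otherwise pull remotely.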
Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
• Evaluation
• Conclusions
Shuffling benchmark - peak bandwidth
[Figure: peak shuffling bandwidth; ~4x improvement on the 4-socket machine, 3x on the 8-socket machine]
• IBM x3850: 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI
• 2x IBM x3850: 8 sockets x 8 cores, Intel X7650 Nehalem-EX
Exploiting ring shuffling in joins
[Figure: sort-merge-based join performance; small overall improvement because execution is dominated by the sort phase]
• Implemented the join algorithm of Albutiu et al.
Shuffling vs. migration for aggregation
[Figure: partitioning-based aggregation; shows the potential of thread migration when the thread state is small]
Conclusions
• Hardware is a moving target
• Need for primitives for data management operations
  • Highly optimized SW or HW implementations: a BLAS for DBMSs
• Data shuffling can be up to 3x faster if NUMA-aware
  • Needs binding of memory allocations, thread scheduling, …
• Potential of thread migration
• Improved overall performance of optimized joins and aggregations
• Continue investigating primitives, their implementation and exploitation
Looking for motivated summer interns! [email ipandis@us.ibm.com]
Questions???
Shuffling data - scalability
[Figure: scalability results on the IBM x3850, 4 sockets x 8 cores, fully connected QPI]
Shuffling vs. migration for aggregation - breakdown
[Figure: time breakdown for partitioning-based aggregation]
Naïve vs. ring shuffling
[Figure: iterations 1, 2, 3, … comparing naïve uncoordinated shuffling with coordinated (ring) shuffling across threads T0-T7; usage of QPI and memory paths]