This paper presents a hardware-based solution for job queue management in manycore architectures, addressing scalability challenges associated with increasing core counts. By introducing IsoNet, a lightweight micro-network that facilitates dynamic load balancing and fault tolerance, the study shows how job queuing can be optimized to minimize conflicts and reduce overhead in multithreading scenarios. Experimental evaluations demonstrate the effectiveness of IsoNet in managing job distribution across numerous processing elements in OpenMP environments, showcasing significant improvements in performance and fault resilience.
Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments • Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim • Presented by Junghee Lee
Introduction • Manycore systems • Number of cores is increasing • Challenges in scalability • Memory • Power consumption • Cache coherence protocol • Load balancing
Contents • Introduction • Background • Programming models • Motivation • IsoNet • Fault-tolerance • Evaluation • Conclusion
Programming Models • Parallel programming models • MPI • OpenMP • Fine-grained parallelism • Emerging applications: Recognition, Mining and Synthesis • Execution time of each computation kernel is very short, but it has abundant parallelism • Excessive overhead in multithreading
Job Queuing • Creates jobs instead of threads • One thread per core is created • Thread: a set of instructions and states of execution • Job: a set of data that is processed by a thread • Job queue • Manages the list of jobs • Maintains load balance [Figure: jobs, threads, and CPUs]
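The job-queuing model above can be sketched in software as follows. This is a minimal illustration, not the scheme from the slides: the function name `run_jobs`, the `process` callback, and the use of Python's `queue.Queue` as the shared job queue are all assumptions made for the example.

```python
import queue
import threading


def run_jobs(jobs, num_workers, process):
    """Process all jobs with a fixed pool of threads (one per core in the slides' model).

    `process` is a hypothetical per-job function supplied by the caller.
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        # Each long-lived thread keeps pulling jobs until the queue is empty,
        # so no thread is created or destroyed per job.
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            out = process(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every worker contends for the same queue here, which is the kind of shared structure where conflicts can arise as the number of workers grows.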
Conflicts in Job Queue • Chance of conflicts increases as: • The number of cores increases • The time taken to update the job queue increases • The job queue is accessed more frequently (job is short) • Previous approaches • Distributed queues • Load balance is maintained by job-stealing • The chance of conflicts in one local queue is decreased • Hardware implementation • Time spent on updating the queue is reduced
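The distributed-queue approach with job-stealing can be sketched as a single-threaded simulation. The slide does not say how a victim queue is chosen; picking the longest queue is an assumption made here to keep the example deterministic, and `steal_if_idle` is a hypothetical name.

```python
from collections import deque


def steal_if_idle(queues, idle):
    """If local queue `idle` is empty, steal one job from another queue.

    Returns the index of the victim queue, or None if nothing was stolen.
    Victim selection (longest queue) is an assumption for this sketch.
    """
    if queues[idle]:
        return None  # the core still has local work, so it does not steal
    victim = max(range(len(queues)), key=lambda i: len(queues[i]))
    if not queues[victim]:
        return None  # every queue is empty
    # The thief takes from the left end; an owner would pop from the right,
    # so the two usually touch opposite ends of the deque.
    queues[idle].append(queues[victim].popleft())
    return victim
```

Because each core mostly touches its own queue, contention on any single queue drops, but the stealing itself still costs time when queues run dry.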
Profile of SMVM [Figure: ratio of execution time (0 to 1.0) versus number of cores (4, 8, 16, 32, 64, 128, 256), broken down into conflicts, stealing job, and processing job]
Objectives • Requirements of load balancer • Scalability: conflict-free • Fault-tolerance • The probability of faults increases exponentially as technology scales • Contributions of this paper • Lightweight micro-network for load balancing • Scalable even with more than a thousand cores • Comprehensive fault-tolerance support
Contents • Introduction • Background • IsoNet • Architecture • Implementation • Fault-tolerance • Evaluation • Conclusion
System View [Figure: grid of CPUs, each paired with units labeled I and R]
Microarchitecture of IsoNet Node [Figure: block diagram with job count inputs, MUXes, max selector, min selector, comparators (Comp), switch, job MUX and DEMUX, and dual clock stack]
How It Works [Figure: grid of nodes annotated with values 0, 1, and 2] • Tree-based routing: for fault-tolerance
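The Max Selector, Min Selector, and Switch named in the node microarchitecture suggest a balancing step along the following lines. This is a sketch under stated assumptions, not the hardware's actual behavior: it assumes one job moves per step from the node with the most jobs to the node with the fewest, and the `threshold` parameter and function names are hypothetical.

```python
def balance_step(job_counts, threshold=1):
    """Move one job from the fullest node to the emptiest node.

    Returns (source, destination), or None if the counts are already
    within `threshold` of each other. Both the one-job-per-step rule and
    the threshold are assumptions for this sketch.
    """
    nodes = range(len(job_counts))
    src = max(nodes, key=job_counts.__getitem__)  # role of the max selector
    dst = min(nodes, key=job_counts.__getitem__)  # role of the min selector
    if job_counts[src] - job_counts[dst] <= threshold:
        return None
    job_counts[src] -= 1  # role of the switch: one job leaves the source
    job_counts[dst] += 1  # and arrives at the destination
    return src, dst


def balance(job_counts, max_steps=1000):
    """Repeat balance_step until no move is needed; return the step count."""
    steps = 0
    while steps < max_steps and balance_step(job_counts) is not None:
        steps += 1
    return steps
```

For example, `balance([5, 0, 1, 2])` takes three steps and leaves every node with two jobs. A single global decision per step means no two nodes compete for the same job, which matches the conflict-free goal stated in the objectives.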
Single Cycle Implementation • Estimated critical path delay • 11.38 ns (87.8 MHz) • By Elmore delay model • Single cycle implementation offers low hardware cost [Figure: critical path from source node (Src) through switches (Swt), leaf, intermediate (Int.), and root nodes to destination node (Dest)]
Hardware Cost Estimation • 674.50 × 240 × 4 = 647.52 K • 0.046% of 1.4 B (NVIDIA GTX285)
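The arithmetic on this slide can be checked directly. The slide does not state units for the three factors, so the variable names below are neutral placeholders and are assumptions of this sketch; only the numbers come from the slide.

```python
# Values copied from the slide; the names are placeholders, since the
# slide does not label the individual factors.
per_unit_cost = 674.50
unit_count = 240
multiplier = 4
reference_total = 1.4e9  # the "1.4 B" quoted for the NVIDIA GTX285

total_cost = per_unit_cost * unit_count * multiplier
share_percent = total_cost / reference_total * 100

print(f"total = {total_cost / 1e3:.2f} K")
print(f"share = {share_percent:.3f} %")
```

The product comes to 647.52 K and the share rounds to 0.046%, in agreement with the figures quoted on the slide.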
Contents • Introduction • Background • IsoNet • Fault-tolerance • Transparent mode • Reconfiguration mode • Evaluation • Conclusion
Supporting Fault-Tolerance • Transparent mode • For faulty CPUs • Bypass the corresponding IsoNet node • Reconfiguration mode • For faulty IsoNet node • Operation • When a fault is detected, all IsoNet nodes go into the reconfiguration mode • Reconfigure the topology of IsoNet so that the faulty node is excluded • Assign a new root node if the root node fails
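One way to picture the reconfiguration mode is as rebuilding a routing tree over the healthy nodes only. The slides do not give the algorithm, so everything below is an assumption: the 2D grid layout, the breadth-first tree construction, the rule that picks the smallest healthy coordinate as the new root, and the name `reconfigure`.

```python
from collections import deque


def reconfigure(width, height, faulty, root):
    """Rebuild a tree over healthy grid nodes, excluding faulty ones.

    Returns (root, parent), where `parent` maps each reachable healthy node
    to its parent in the tree (the root maps to None). If the given root is
    faulty, a new root is chosen by a hypothetical rule: the smallest
    healthy (x, y) coordinate.
    """
    healthy = {(x, y) for x in range(width) for y in range(height)} - set(faulty)
    if not healthy:
        return None, {}
    if root not in healthy:
        root = min(healthy)  # assumed root-selection rule, not from the slides
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in healthy and nxt not in parent:
                parent[nxt] = (x, y)  # faulty nodes are never linked in
                frontier.append(nxt)
    return root, parent
```

In a 3×3 grid whose center node is both the root and faulty, this sketch picks (0, 0) as the new root and connects the remaining eight nodes around the failed one.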
Reconfiguration [Figure: grid of nodes annotated with values 0 to 3, with the root node candidate marked]
Contents • Introduction • Background • IsoNet • Fault-tolerance • Evaluation • Experimental setup • Results • Conclusion
Experimental Setup • Simulation framework • Wind River’s Simics full-system simulator • CMP with 4 to 64 x86-compatible cores • Fedora 12 with kernel 2.6.33 • Benchmarks from recognition, mining and synthesis applications • GS: Gauss-Seidel • MMM: Dense Matrix-Matrix Multiply • SVA: Scaled Vector Addition • MVM: Dense Matrix Vector Multiply • SMVM: Sparse Matrix Vector Multiply
Results [Figure: two panels, MMM (6,473 instructions) and SMVM (2,872 instructions), each plotting execution time (10^7 cycles) and speedup versus number of cores (4, 8, 16, 32, 64); series: job stealing, Carbon, IsoNet, IsoNet speedup, Carbon speedup]
Beyond Hundred Cores • MMM (6,473 instructions) [Figure: relative execution time (0 to 1.0) versus number of cores (4, 8, 16, 32, 64, 128, 256, 512, 1024); series: Carbon, IsoNet]
Profile of IsoNet [Figure: ratio of execution time (0 to 1.0) versus number of cores (4, 8, 16, 32, 64), broken down into conflicts, stealing job, and processing job]
Conclusion • Scalability is one of the key challenges in the manycore domain • Scalable load balancing is critical to utilizing a large number of processing elements • This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet • IsoNet also provides comprehensive fault-tolerance support • Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques
Questions? Contact info Junghee Lee junghee.lee@gatech.edu Electrical and Computer Engineering Georgia Institute of Technology