This paper presents an in-depth evaluation of the Explicit Multi-Threading (XMT) architecture designed to optimize on-chip parallelism and improve the performance of parallel programs. With a focus on fine granularity and scalability, the XMT model introduces a computational framework that bridges architectural design and programmability. We detail the prototype environment, compiler optimizations, and experimental methodologies across various applications. Our results demonstrate significant speedups for both regular and irregular parallel applications, highlighting the advantages of explicit parallel programming.
Evaluating Multi-threading in the Prototype XMT Environment
Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin
Department of Computer Science, University of Maryland, College Park
Introduction • 1-billion-transistor chips are around the corner • instruction-level parallelism (ILP) • single-chip multiprocessing (CMP) • simultaneous multi-threading (SMT) • Bridge the gap between on-chip parallelism and the parallel program • XMT: a computational framework that encompasses both architecture design and programmability • programming model targets on-chip parallelism • explicit parallelism • fine granularity • scalability • wider range of applications
XMT (Explicit Multi-Threading)
• Goal: single-task completion time • speedups of the parallel program over the best serial program
• Framework: algorithm → programming → hardware
• Architecture: CMP consisting of SMT units • scalability • high-bandwidth communication • efficient synchronization
• Hardware prefix-sum primitive
• Explicit parallel programming model • targets general applications
(figure: other ways of bridging the gap - multiprogramming, IPC; compiler threading, speculation)
Contributions • Prototype XMT environment • compiler • simulator • Experimental evaluation • wide range of applications (12 benchmarks) • parallel speedups over serial program • parallel applications: scalability to high levels • speedups for less parallel, irregular applications • Compiler optimization • thread coarsening
Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion
XMT Programming Model
• Explicit spawn-join parallel regions:

spawn(nthreads, off);
{
  …
  xfork();
}

• Independence of Order Semantics (IOS) • threads run to completion at their own speed • no busy waiting
• Spawn statement - generates a parallel region
• Prefix-sum operation, for synchronization: ps(base, incr) performs base ← base + incr and returns the initial value of base
• Fork (xfork) - dynamically increments the spawn size
Example - Quicksort
(figure: pivot 5 splits the input into a "left" partition (lower than pivot) and a "right" partition (higher than pivot))

quicksort(input, n) {
  while (…) {
    partition(input, output);
    swap(input, output);
  }
}

partition(input, output, n) {
  int pivot = p, low = 0, high = 0;
  spawn(n, 0);
  {
    int indx;
    if (input[TID] < pivot) {
      indx = ps(low, 1);
      output[indx] = input[TID];
    } else {
      indx = ps(high, 1);
      output[n-1-indx] = input[TID];
    }
  }
  join();
}
Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion
XMT Architecture • Goals • exploit explicit parallelism • simplify hardware • maximize resource usage • decentralize design
XMT Architecture • Thread control units (TCU) • PCs, instruction fetch/decode, local registers • Clusters • multiple TCUs, L1 caches, functional units
XMT Execution Model
• Serial mode: TCU 0 executes the serial code
• spawn → all TCUs execute threads → join → TCU 0 resumes serial code
(figure: Spawn(10, 0) - TCUs 1-5 pick up threads 0-9 in turn; TCUs whose threads finish early wait at the join)
XMT Architecture • Global functions: • banked memory system • specialized global registers (pread, pset) • spawn unit (spawn) • prefix-sum unit (pinc) • parallel prefix-sum requests are efficiently combined
Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion
XMT Compiler • Front End: XPASS, a SUIF pass • XMT-C = C + specialized templates • parallel region → separate procedure • assembly constructs for parallel execution • supports thread coarsening • Back End: GCC • produces SimpleScalar MIPS ISA
Compilation Scheme
• Phase 1: Outlining - the body of the spawn block is moved into a separate procedure:

Main() {
  int global_vars;
  Spawn (nthreads, off);
  { THREAD CODE; }
  join();
  print(data);
}

spawn_0_func() {
  int TID, max_tid;
  TID = TCUID + offset;
  while (TID < max_tid) {
    THREAD CODE;
    TID = get_new_id();
  };
}

• Phase 2: Spawn-function transformation - the spawn site is rewritten as:

TCU-init(nthreads, off);
spawn_setup(nthreads, off);
spawn_0_func();
Spawn-end();

• Phase 3: Templates replaced with XMT assembly:

TCU-init → pread PR0; pread …
spawn_setup → pset PR0; pset …
TID = get_new_id() → pinc PR1 $tid
Spawn-end → halt/suspend
Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion
Experimental Methodology • Simulator • SimpleScalar parameters for instruction latencies • 1, 4, 16, 64, 256 TCUs • Configuration: • 8 TCUs per cluster • 8KB L1 cache • 1MB banked shared L2 cache • Programs rewritten in XMT-C • Speedups of the parallel XMT program compared to the best serial program • parallel applications: scalability to high levels • speedups for less parallel, irregular applications
First Application Set • Computation: regular, mostly array-based, limited synchronization needed
Speedups over serial • speedups scale up to 256 TCUs • limiting factors: memory, overheads, coarsening
Overheads • across problem sizes, overheads are less than 0.01% of total execution time
Overheads • Problem size • Extremely-fine granularity
Thread clustering impact on overheads • problem sizes: 64×64, 128×128, 256×256
Second Application Set • Computation: irregular, unpredictable, synchronization needed
Speedups over serial • Dynamic load balancing • Dynamic forking • Exploiting fine-granularity
Dynamic Load Balancing • dag (initial step of the computation) • 256 nodes, 9679 edges • a single spawn block, 16 TCUs
Conclusion • XMT as a complete environment • Extensive experimental evaluation on a range of applications and computations • speedups scale up to 256 TCUs for parallel applications • better speedups for less parallel applications
Related Work • On-chip: • SMT • CMP • Multiscalar • M-Machine • Raw • Multithreaded architectures: • Tera
Current & Future Work • Compiler optimizations • Enlarge benchmark suite • Detailed simulator
Example - dot product

dot(A, B, lb, ub) {
  int dot = 0;
  for (i = lb; i < ub; i++) {
    dot += A[i]*B[i];
  }
  return dot;
}

int A[N], B[N], global_val = 0;
spawn(nthreads, 0);
{
  int lb = N*TID/nthreads;
  int ub = N*(TID+1)/nthreads;
  int my_part;
  my_part = dot(A, B, lb, ub);
  ps(global_val, my_part);
}
join();
Fork operation. Example - Quick-Sort (2)

int input[N], thread_data[N];
fspawn(1, 0);
{
  /* my_start and my_size are unpacked from thread_data[TID] */
  int my_start = …, my_size = …;
  while (my_size > 1) {
    int pivot = f(my_size);
    int low, high = g(my_start, my_size);
    ser_partition();  /* serial partition into low/high partitions */
    xfork(thread_data, high_partition);  /* new thread handles the high partition */
    my_size = high - my_start;           /* this thread keeps the low partition */
  }
}
join();