Design and Implementation of the CCC Parallel Programming Language
Nai-Wei Lin
Department of Computer Science and Information Engineering
National Chung Cheng University
Outline • Introduction • The CCC programming language • The CCC compiler • Performance evaluation • Conclusions ICS2004
Motivations • Parallelism is the future trend • Programming in parallel is much more difficult than programming in serial • Parallel architectures are very diverse • Parallel programming models are very diverse
Motivations • Design a parallel programming language that uniformly integrates various parallel programming models • Implement a retargetable compiler for this parallel programming language on various parallel architectures
Approaches to Parallelism • Library approach • MPI (Message Passing Interface), pthread • Compiler approach • HPF (High Performance Fortran), HPC++ • Language approach • Occam, Linda, CCC (Chung Cheng C)
Models of Parallel Architectures • Control Model • SIMD: Single Instruction Multiple Data • MIMD: Multiple Instruction Multiple Data • Data Model • Shared-memory • Distributed-memory
Models of Parallel Programming • Concurrency • Control parallelism: simultaneously execute multiple threads of control • Data parallelism: simultaneously execute the same operations on multiple data • Synchronization and communication • Shared variables • Message passing
Granularity of Parallelism • Procedure-level parallelism • Concurrent execution of procedures on multiple processors • Loop-level parallelism • Concurrent execution of iterations of loops on multiple processors • Instruction-level parallelism • Concurrent execution of instructions on a single processor with multiple functional units
The CCC Programming Language • CCC is a simple extension of C and supports both control and data parallelism • A CCC program consists of a set of concurrent and cooperative tasks • Control parallelism runs in MIMD mode and communicates via shared variables and/or message passing • Data parallelism runs in SIMD mode and communicates via shared variables
Tasks in CCC Programs [Figure: control-parallel and data-parallel tasks]
Control Parallelism • Concurrency • task • par and parfor • Synchronization and communication • shared variables – monitors • message passing – channels
Monitors • The monitor construct is a modular and efficient construct for synchronizing shared variables among concurrent tasks • It provides data abstraction, mutual exclusion, and conditional synchronization
An Example - Barber Shop [Figure: three customers, the barber, and the chair]
An Example - Barber Shop
    task::main( )
    {
        monitor Barber_Shop bs;
        int i;

        par {
            barber( bs );
            parfor (i = 0; i < 10; i++)
                customer( bs );
        }
    }
An Example - Barber Shop
    task::barber(monitor Barber_Shop in bs)
    {
        while ( 1 ) {
            bs.get_next_customer( );
            bs.finished_cut( );
        }
    }

    task::customer(monitor Barber_Shop in bs)
    {
        bs.get_haircut( );
    }
An Example - Barber Shop
    monitor Barber_Shop {
        int barber, chair, open;
        cond barber_available, chair_occupied;
        cond door_open, customer_left;

        Barber_Shop( );
        void get_haircut( );
        void get_next_customer( );
        void finished_cut( );
    };
An Example - Barber Shop
    Barber_Shop( )
    {
        barber = 0; chair = 0; open = 0;
    }

    void get_haircut( )
    {
        while (barber == 0) wait(barber_available);
        barber -= 1;                /* take the available barber */
        chair += 1;
        signal(chair_occupied);
        while (open == 0) wait(door_open);
        open -= 1;                  /* leave through the open door */
        signal(customer_left);
    }
An Example - Barber Shop
    void get_next_customer( )
    {
        barber += 1;
        signal(barber_available);
        while (chair == 0) wait(chair_occupied);
        chair -= 1;                 /* the seated customer is now being served */
    }

    void finished_cut( )
    {
        open += 1;
        signal(door_open);
        while (open > 0) wait(customer_left);
    }
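
For comparison, the monitor above can be written by hand with the pthread library, using one mutex for mutual exclusion and one condition variable per `cond`. This is only a sketch of the general technique, not the code the CCC compiler generates. The names `barber_shop`, `run_shop`, `MAX_CUSTOMERS`, and the `served` counter are assumptions introduced for illustration, and the barber stops after a fixed number of cuts (instead of looping forever) so that the program terminates.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_CUSTOMERS 64            /* hypothetical limit for this sketch */

typedef struct {
    pthread_mutex_t lock;           /* the monitor's mutual exclusion */
    pthread_cond_t barber_available, chair_occupied, door_open, customer_left;
    int barber, chair, open;        /* same counters as the monitor */
    int served;                     /* hypothetical: number of completed cuts */
} barber_shop;

static void shop_init(barber_shop *bs) {
    pthread_mutex_init(&bs->lock, NULL);
    pthread_cond_init(&bs->barber_available, NULL);
    pthread_cond_init(&bs->chair_occupied, NULL);
    pthread_cond_init(&bs->door_open, NULL);
    pthread_cond_init(&bs->customer_left, NULL);
    bs->barber = bs->chair = bs->open = bs->served = 0;
}

static void get_haircut(barber_shop *bs) {
    pthread_mutex_lock(&bs->lock);
    while (bs->barber == 0)
        pthread_cond_wait(&bs->barber_available, &bs->lock);
    bs->barber -= 1;                /* take the barber */
    bs->chair += 1;                 /* sit down */
    pthread_cond_signal(&bs->chair_occupied);
    while (bs->open == 0)
        pthread_cond_wait(&bs->door_open, &bs->lock);
    bs->open -= 1;                  /* leave */
    pthread_cond_signal(&bs->customer_left);
    pthread_mutex_unlock(&bs->lock);
}

static void get_next_customer(barber_shop *bs) {
    pthread_mutex_lock(&bs->lock);
    bs->barber += 1;
    pthread_cond_signal(&bs->barber_available);
    while (bs->chair == 0)
        pthread_cond_wait(&bs->chair_occupied, &bs->lock);
    bs->chair -= 1;
    pthread_mutex_unlock(&bs->lock);
}

static void finished_cut(barber_shop *bs) {
    pthread_mutex_lock(&bs->lock);
    bs->open += 1;
    pthread_cond_signal(&bs->door_open);
    while (bs->open > 0)
        pthread_cond_wait(&bs->customer_left, &bs->lock);
    bs->served += 1;
    pthread_mutex_unlock(&bs->lock);
}

typedef struct { barber_shop *bs; int cuts; } barber_args;

static void *barber_task(void *p) {
    barber_args *a = p;
    for (int i = 0; i < a->cuts; i++) {
        get_next_customer(a->bs);
        finished_cut(a->bs);
    }
    return NULL;
}

static void *customer_task(void *p) {
    get_haircut(p);
    return NULL;
}

/* Starts one barber and `customers` customer threads, waits for all of
   them, and returns the number of completed haircuts. */
int run_shop(int customers) {
    assert(customers >= 0 && customers <= MAX_CUSTOMERS);
    barber_shop bs;
    pthread_t barber, customer[MAX_CUSTOMERS];
    shop_init(&bs);
    barber_args args = { &bs, customers };
    pthread_create(&barber, NULL, barber_task, &args);
    for (int i = 0; i < customers; i++)
        pthread_create(&customer[i], NULL, customer_task, &bs);
    pthread_join(barber, NULL);
    for (int i = 0; i < customers; i++)
        pthread_join(customer[i], NULL);
    return bs.served;
}
```

Calling `run_shop(10)` corresponds to the `par`/`parfor` block in `task::main`. Every `wait` sits in a `while` loop over a counter, so a signal sent before the waiter arrives is not lost.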
Channels • The channel construct is a modular and efficient construct for message passing among concurrent tasks • Pipe: one to one • Merger: many to one • Spliter: one to many • Multiplexer: many to many
Channels • Communication structures among parallel tasks are more comprehensive • The specification of communication structures is easier • The implementation of communication structures is more efficient • The static analysis of communication structures is more effective
An Example - Consumer-Producer [Figure: one producer connected through a spliter to three consumers]
An Example - Consumer-Producer
    task::main( )
    {
        spliter int chan;
        int i;

        par {
            producer( chan );
            parfor (i = 0; i < 10; i++)
                consumer( chan );
        }
    }
An Example - Consumer-Producer
    task::producer(spliter in int chan)
    {
        int i;

        for (i = 0; i < 100; i++)
            put(chan, i);
        for (i = 0; i < 10; i++)    /* one END marker per consumer */
            put(chan, END);
    }
An Example - Consumer-Producer
    task::consumer(spliter in int chan)
    {
        int data;

        while ((data = get(chan)) != END)
            process(data);
    }
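
A one-to-many channel like the spliter above can be approximated in plain C with a mutex, two condition variables, and a bounded buffer. The sketch below is an illustration under stated assumptions, not the CCC runtime: the buffer capacity, the `END` value of -1, the `channel` type, and the `run_pipeline` driver are all hypothetical, and `process(data)` is replaced by summing the received values so the result can be checked.

```c
#include <assert.h>
#include <pthread.h>

#define CAPACITY 8                  /* hypothetical buffer size */
#define MAX_CONSUMERS 64            /* hypothetical limit for this sketch */
#define END (-1)                    /* hypothetical end-of-stream marker */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
    int buf[CAPACITY];
    int head, count;                /* circular buffer state */
} channel;

static void chan_init(channel *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
    c->head = c->count = 0;
}

static void put(channel *c, int value) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CAPACITY)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[(c->head + c->count) % CAPACITY] = value;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

static int get(channel *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    int value = c->buf[c->head];
    c->head = (c->head + 1) % CAPACITY;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return value;
}

typedef struct { channel *chan; int items, consumers; } producer_args;
typedef struct { channel *chan; long sum; } consumer_args;

static void *producer(void *p) {
    producer_args *a = p;
    for (int i = 0; i < a->items; i++)
        put(a->chan, i);
    for (int i = 0; i < a->consumers; i++)   /* one END per consumer */
        put(a->chan, END);
    return NULL;
}

static void *consumer(void *p) {
    consumer_args *a = p;
    int data;
    while ((data = get(a->chan)) != END)
        a->sum += data;                      /* stands in for process(data) */
    return NULL;
}

/* Runs one producer and `consumers` consumers; returns the sum of all
   values received by the consumers. */
long run_pipeline(int consumers, int items) {
    assert(consumers >= 1 && consumers <= MAX_CONSUMERS && items >= 0);
    channel chan;
    pthread_t prod, cons[MAX_CONSUMERS];
    consumer_args cargs[MAX_CONSUMERS];
    chan_init(&chan);
    producer_args pargs = { &chan, items, consumers };
    pthread_create(&prod, NULL, producer, &pargs);
    for (int i = 0; i < consumers; i++) {
        cargs[i].chan = &chan;
        cargs[i].sum = 0;
        pthread_create(&cons[i], NULL, consumer, &cargs[i]);
    }
    pthread_join(prod, NULL);
    long total = 0;
    for (int i = 0; i < consumers; i++) {
        pthread_join(cons[i], NULL);
        total += cargs[i].sum;
    }
    return total;
}
```

With 10 consumers and 100 items, as in the slides, each value 0..99 is received by exactly one consumer, so the total is 4950 regardless of how the items are split.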
Data Parallelism • Concurrency • domain – an aggregate of synchronous tasks • Synchronization and communication • domain – variables in global name space
An Example – Matrix Multiplication [Figure: matrix product equation]
An Example – Matrix Multiplication
    domain matrix_op[16] {
        int a[16], b[16], c[16];
        multiply(distribute in int [16:block][16],
                 distribute in int [16][16:block],
                 distribute out int [16:block][16]);
    };
An Example – Matrix Multiplication
    task::main( )
    {
        int A[16][16], B[16][16], C[16][16];
        domain matrix_op m;

        read_array(A);
        read_array(B);
        m.multiply(A, B, C);
        print_array(C);
    }
An Example – Matrix Multiplication
    matrix_op::multiply(A, B, C)
    distribute in int [16:block][16] A;
    distribute in int [16][16:block] B;
    distribute out int [16:block][16] C;
    {
        int i, j;

        a := A;
        b := B;
        for (i = 0; i < 16; i++)
            for (c[i] = 0, j = 0; j < 16; j++)
                c[i] += a[j] * matrix_op[i].b[j];
        C := c;
    }
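
In the domain above, element k appears to hold row k of A in `a` and column k of B in `b`, and computes row k of C by reading the `b` of every other element. A rough shared-memory equivalent, assuming one pthread per row, is sketched below. The names `row_task`, `multiply_row`, and `demo_multiply` are illustrative, and the sketch reads B directly from shared memory instead of distributing it first.

```c
#include <assert.h>
#include <pthread.h>

#define N 16

typedef struct {
    int row;                        /* which row of C this thread computes */
    int (*A)[N];
    int (*B)[N];
    int (*C)[N];
} row_task;

/* Computes C[row][i] = sum over j of A[row][j] * B[j][i] for all i. */
static void *multiply_row(void *p) {
    row_task *t = p;
    for (int i = 0; i < N; i++) {
        int sum = 0;
        for (int j = 0; j < N; j++)
            sum += t->A[t->row][j] * t->B[j][i];
        t->C[t->row][i] = sum;
    }
    return NULL;
}

/* One thread per row of C, mirroring the 16 elements of the domain. */
void multiply(int A[N][N], int B[N][N], int C[N][N]) {
    pthread_t thread[N];
    row_task task[N];
    for (int r = 0; r < N; r++) {
        task[r] = (row_task){ r, A, B, C };
        int rc = pthread_create(&thread[r], NULL, multiply_row, &task[r]);
        assert(rc == 0);
    }
    for (int r = 0; r < N; r++)
        pthread_join(thread[r], NULL);
}

/* Self-check: with B = 2 * identity, the product must equal 2 * A.
   Returns 1 on success and 0 on any mismatch. */
int demo_multiply(void) {
    int A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j) ? 2 : 0;
        }
    multiply(A, B, C);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C[i][j] != 2 * A[i][j])
                return 0;
    return 1;
}
```

Each thread writes a distinct row of C and only reads A and B, so no locking is needed. The join at the end plays the role of the implicit synchronization before `C := c` results are used.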
Platforms for the CCC Compiler • PCs and SMPs • Pthread: shared memory + dynamic thread creation • PC clusters and SMP clusters • Millipede: distributed shared memory + dynamic remote thread creation • The similarities between these two classes of machines enable a retargetable compiler implementation for CCC
Organization of the CCC Programming System [Figure: layered diagram listing, in order: CCC applications; CCC compiler; CCC runtime library; virtual shared memory machine interface; Pthread and Millipede; SMP and SMP cluster]
The CCC Compiler • Tasks → threads • Monitors → mutex locks, read-write locks, and condition variables • Channels → mutex locks and condition variables • Domains → set of synchronous threads • Synchronous execution → barriers
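
To make the last mapping concrete, a reusable barrier can itself be built from a mutex and a condition variable. The sketch below shows one common way to do it (a generation counter) and is not taken from the CCC runtime. The `barrier` type, `run_phases`, and the checking logic are assumptions added so the behavior can be tested.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16              /* hypothetical limit for this sketch */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_arrived;
    int parties;                    /* threads that must arrive */
    int waiting;                    /* threads that have arrived so far */
    unsigned generation;            /* incremented each time the barrier opens */
} barrier;

static void barrier_init(barrier *b, int parties) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_arrived, NULL);
    b->parties = parties;
    b->waiting = 0;
    b->generation = 0;
}

static void barrier_wait(barrier *b) {
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (++b->waiting == b->parties) {
        b->waiting = 0;             /* last arrival resets and opens the barrier */
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        while (gen == b->generation)
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

typedef struct {
    barrier *b;
    int *progress;                  /* progress[i] = last phase written by thread i */
    int id, threads, phases;
    int violations;                 /* times a peer was seen in the wrong phase */
} worker_args;

static void *worker(void *p) {
    worker_args *w = p;
    for (int phase = 1; phase <= w->phases; phase++) {
        w->progress[w->id] = phase;
        barrier_wait(w->b);         /* every thread has written this phase */
        for (int j = 0; j < w->threads; j++)
            if (w->progress[j] != phase)
                w->violations++;
        barrier_wait(w->b);         /* every thread has finished reading */
    }
    return NULL;
}

/* Runs `threads` workers in lock step for `phases` phases and returns
   the total number of observed violations (0 if the barrier works). */
int run_phases(int threads, int phases) {
    assert(threads >= 1 && threads <= MAX_THREADS && phases >= 0);
    barrier b;
    int progress[MAX_THREADS] = {0};
    pthread_t t[MAX_THREADS];
    worker_args w[MAX_THREADS];
    barrier_init(&b, threads);
    for (int i = 0; i < threads; i++) {
        w[i] = (worker_args){ &b, progress, i, threads, phases, 0 };
        pthread_create(&t[i], NULL, worker, &w[i]);
    }
    int total = 0;
    for (int i = 0; i < threads; i++) {
        pthread_join(t[i], NULL);
        total += w[i].violations;
    }
    return total;
}
```

The two barrier calls per phase separate the write step from the read step, which is the kind of lock-step execution a domain of synchronous threads needs.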
Virtual Shared Memory Machine Interface • Processor management • Thread management • Shared memory allocation • Mutex locks • Read-write locks • Condition variables • Barriers
The CCC Runtime Library • The CCC runtime library contains a collection of functions that implements the salient abstractions of CCC on top of the virtual shared memory machine interface
Performance Evaluation • SMPs • Hardware: one SMP machine with four CPUs; each CPU is a 450 MHz Intel Pentium II Xeon with a 512K cache • Software: the OS is Solaris 5.7 and the library is pthread 1.26 • SMP clusters • Hardware: four SMP machines, each with two CPUs; each CPU is a 500 MHz Intel Pentium III with a 512K cache • Software: the OS is Windows 2000 and the library is Millipede 4.0 • Network: 100 Mbps Fast Ethernet
Benchmarks • Matrix multiplication (1024 x 1024) • Warshall’s transitive closure (1024 x 1024) • Airshed simulation (5)
Matrix Multiplication (SMPs) (unit: sec; each entry is time (speedup, efficiency))
                | Sequential | 1thread/1cpu        | 2threads/1cpu       | 4threads/1cpu       | 8threads/1cpu
CCC (1 cpu)     | 287.5      | 295.05 (0.97, 0.97) | 264.24 (1.08, 1.08) | 250.45 (1.14, 1.14) | 275.32 (1.04, 1.04)
Pthread (1 cpu) |            | 292.42 (0.98, 0.98) | 257.45 (1.12, 1.12) | 244.24 (1.17, 1.17) | 266.20 (1.08, 1.08)
CCC (2 cpu)     |            | 152.29 (1.89, 0.94) | 110.54 (2.6, 1.3)   | 98.32 (2.93, 1.46)  | 124.44 (2.31, 1.16)
Pthread (2 cpu) |            | 149.88 (1.91, 0.96) | 105.45 (2.72, 1.36) | 93.56 (3.07, 1.53)  | 119.42 (2.41, 1.20)
CCC (4 cpu)     |            | 76.39 (3.76, 0.94)  | 69.44 (4.14, 1.03)  | 64.44 (4.46, 1.11)  | 73.54 (3.90, 0.98)
Pthread (4 cpu) |            | 74.72 (3.85, 0.96)  | 65.42 (4.39, 1.09)  | 59.44 (4.83, 1.20)  | 69.88 (4.11, 1.02)
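
The parenthesized pairs in these tables are consistent with (speedup, efficiency), where speedup is the sequential time divided by the parallel time and efficiency is speedup divided by the number of CPUs. The small sketch below reproduces one entry; the function names are illustrative and not part of CCC.

```c
#include <assert.h>

/* Speedup relative to the sequential run. */
double speedup(double t_sequential, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_sequential / t_parallel;
}

/* Speedup per CPU; values above 1.0 mean better than linear scaling. */
double efficiency(double t_sequential, double t_parallel, int cpus) {
    assert(cpus > 0);
    return speedup(t_sequential, t_parallel) / cpus;
}
```

For the CCC (2 cpu) entry with one thread per CPU, `speedup(287.5, 152.29)` is about 1.89 and `efficiency(287.5, 152.29, 2)` is about 0.94, matching the table. For the cluster tables the CPU count is machines times CPUs per machine, for example 8 for 4mach x 2cpu.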
Matrix Multiplication (SMP clusters) (unit: sec; each entry is time (speedup, efficiency))
                         | Sequential | 1thread/1cpu         | 2threads/1cpu       | 4threads/1cpu       | 8threads/1cpu
CCC (1mach x 2cpu)       | 470.44     | 253.12 (1.85, 0.929) | 201.23 (2.33, 1.16) | 158.31 (2.97, 1.48) | 234.46 (2.0, 1.0)
Millipede (1mach x 2cpu) |            | 248.11 (1.89, 0.95)  | 196.33 (2.39, 1.19) | 154.22 (3.05, 1.53) | 224.95 (2.09, 1.05)
CCC (2mach x 2cpu)       |            | 136.34 (3.45, 0.86)  | 102.25 (4.6, 1.15)  | 96.25 (4.89, 1.22)  | 148.25 (3.17, 0.79)
Millipede (2mach x 2cpu) |            | 129.33 (3.63, 0.91)  | 96.52 (4.87, 1.22)  | 91.45 (5.14, 1.27)  | 142.45 (3.31, 0.82)
CCC (4mach x 2cpu)       |            | 87.25 (5.39, 0.67)   | 62.33 (7.54, 0.94)  | 80.25 (5.45, 0.73)  | 102.45 (4.67, 0.58)
Millipede (4mach x 2cpu) |            | 78.37 (6.0, 0.75)    | 54.92 (8.56, 1.07)  | 75.98 (5.57, 0.75)  | 95.44 (4.87, 0.61)
Warshall’s Transitive Closure (SMPs) (unit: sec; each entry is time (speedup, efficiency))
                | Sequential | 1thread/1cpu        | 2threads/1cpu       | 4threads/1cpu       | 8threads/1cpu
CCC (1 cpu)     | 150.32     | 152.88 (0.98, 0.98) | 138.44 (1.08, 1.08) | 143.54 (1.05, 1.05) | 154.33 (0.97, 0.97)
Pthread (1 cpu) |            | 151.25 (0.99, 0.99) | 135.45 (1.11, 1.11) | 139.21 (1.07, 1.07) | 152.44 (0.99, 0.99)
CCC (2 cpu)     |            | 83.36 (1.80, 0.90)  | 69.45 (2.16, 1.08)  | 78.54 (1.91, 0.96)  | 98.24 (1.53, 0.77)
Pthread (2 cpu) |            | 79.32 (1.90, 0.95)  | 66.85 (2.25, 1.12)  | 74.24 (2.02, 1.01)  | 93.44 (1.60, 0.80)
CCC (4 cpu)     |            | 49.43 (3.04, 0.76)  | 43.19 (3.48, 0.87)  | 58.44 (2.57, 0.64)  | 77.42 (1.94, 0.49)
Pthread (4 cpu) |            | 44.14 (3.40, 0.85)  | 40.89 (3.68, 0.91)  | 55.23 (2.72, 0.68)  | 74.21 (2.02, 0.51)
Warshall’s Transitive Closure (SMP clusters) (unit: sec; each entry is time (speedup, efficiency))
                         | Sequential | 1thread/1cpu        | 2threads/1cpu       | 4threads/1cpu       | 8threads/1cpu
CCC (1mach x 2cpu)       | 305.35     | 159.24 (1.91, 0.96) | 132.81 (2.29, 1.14) | 102.19 (2.98, 1.49) | 153.90 (1.98, 0.99)
Millipede (1mach x 2cpu) |            | 155.34 (1.96, 0.98) | 125.91 (2.42, 1.21) | 95.29 (3.20, 1.59)  | 144.53 (2.11, 1.56)
CCC (2mach x 2cpu)       |            | 100.03 (3.05, 0.76) | 82.40 (3.70, 0.92)  | 148.97 (2.04, 0.52) | 202.78 (1.50, 0.38)
Millipede (2mach x 2cpu) |            | 88.45 (3.45, 0.86)  | 75.91 (4.02, 1.00)  | 140.28 (2.17, 0.54) | 189.38 (1.61, 0.41)
CCC (4mach x 2cpu)       |            | 60.06 (5.08, 0.64)  | 54.56 (5.59, 0.70)  | 89.68 (3.40, 0.43)  | 138.76 (2.20, 0.27)
Millipede (4mach x 2cpu) |            | 54.05 (5.65, 0.71)  | 47.53 (6.42, 0.80)  | 81.28 (3.75, 0.46)  | 129.96 (2.36, 0.30)
Airshed simulation (SMPs) [Figure: execution time (unit: sec) versus number of threads]
Airshed simulation (SMP clusters) [Figure: execution time (unit: sec) versus number of threads]
Conclusions • A high-level parallel programming language that uniformly integrates • Both control and data parallelism • Both shared variables and message passing • A modular parallel programming language • A retargetable compiler