
Special Course on Computer Architecture


Presentation Transcript


  1. Special Course on Computer Architecture #7: Simulation of Multi-Processors. Hiroki Matsutani and Hideharu Amano

  2. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  3. Multi- and many-core architectures [Figure: number of PEs (caches are not included), from 2 to 256, plotted against year (2004-2011) for multi- and many-core processors, including AMD Opteron, Intel Core, IBM Power7, Fujitsu SPARC64, Sun T1/T2, STI Cell BE, MIT RAW, UT TRIPS (OPN), TILERA TILE64, Intel SCC, ClearSpeed CSX600/CSX700, Intel 80-core, and picoChip PC102/PC205]

  4. Network-on-Chip (NoC) • Interconnection network to connect many cores [Figure: 16-core tile architecture; each tile contains a core and an on-chip router] Special Course on Computer Architecture

  5. On-chip router architecture • A packet is handled in three steps: 1) selecting an output channel (routing), 2) arbitration for the selected output channel (GRANT), and 3) sending the packet to the selected output channel • Routing, arbitration, and switch traversal are performed in a pipelined manner [Figure: 5x5 router with five buffered input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, an arbiter, a 5x5 crossbar, and five output ports] Special Course on Computer Architecture
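A minimal C sketch of the three-step packet handling described above, assuming dimension-order routing and a fixed-priority arbiter; the port names and function names (route_compute, arbitrate) are illustrative and not taken from the slides or from GEMS.

#include <stdio.h>

/* Port indices of a 5x5 router: four mesh directions plus the local core. */
enum { XPLUS, XMINUS, YPLUS, YMINUS, CORE, NUM_PORTS };

/* 1) Routing: select an output channel from the destination coordinates
 *    (simple dimension-order routing: X first, then Y). */
static int route_compute(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return XPLUS;
    if (dst_x < cur_x) return XMINUS;
    if (dst_y > cur_y) return YPLUS;
    if (dst_y < cur_y) return YMINUS;
    return CORE;                       /* arrived: eject to the local core */
}

/* 2) Arbitration: grant one of the requesting inputs access to the output. */
static int arbitrate(const int request[NUM_PORTS])
{
    for (int in = 0; in < NUM_PORTS; in++)
        if (request[in]) return in;    /* fixed-priority arbiter, for brevity */
    return -1;                         /* no request */
}

int main(void)
{
    int out = route_compute(1, 1, 2, 1);  /* packet at tile (1,1) heading to (2,1) */
    int request[NUM_PORTS] = {0};
    request[YMINUS] = 1;                  /* the packet waits in the Y- input FIFO */
    int granted_in = arbitrate(request);
    /* 3) Switch traversal: the granted packet crosses the crossbar. */
    printf("input port %d -> output port %d\n", granted_in, out);
    return 0;
}

In a real router these three steps are pipeline stages, so a new packet can start routing while the previous one traverses the crossbar.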

  6. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  7. Today’s target architecture • Chip multi-processors (CMPs) • Multiple processors (each has private L1 cache) • Shared L2 cache divided into multiple banks (SNUCA) [Figure: processor tile (UltraSPARC core with L1 instruction and data caches) and cache tile (L2 cache bank)] Special Course on Computer Architecture

  8. Today’s target architecture • Chip multi-processors (CMPs) • Multiple processors (each has private L1 cache) • Shared L2 cache divided into multiple banks (SNUCA) • Processors and L2 cache banks are connected via NoC [Figure: processor tiles (UltraSPARC core with L1 instruction and data caches) and cache tiles (L2 cache banks), connected by on-chip routers] Special Course on Computer Architecture
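One point that is implicit in a banked shared L2 (SNUCA) is that every physical address maps statically to exactly one bank. The sketch below assumes a common choice, indexing banks with the low-order bits of the block address; the block size, bank count, and function name are illustrative, not the GEMS configuration.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BITS    6              /* 64-byte cache blocks (assumed) */
#define NUM_L2_BANKS  8              /* one L2 bank per tile (assumed) */

/* Hypothetical static address-to-bank mapping for a banked shared L2. */
static unsigned l2_bank_of(uint64_t paddr)
{
    uint64_t block_addr = paddr >> BLOCK_BITS;
    return (unsigned)(block_addr % NUM_L2_BANKS);   /* low-order block-address bits */
}

int main(void)
{
    uint64_t a = 0x12345680ULL;
    printf("address 0x%llx -> L2 bank %u\n",
           (unsigned long long)a, l2_bank_of(a));
    return 0;
}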

  9. Cache coherence is maintained • Write-back policy • A cache write updates main memory only when the block is evicted • Write-invalidate policy • A cache write invalidates all copies held by the other sharers [Figure: processor tiles, cache tiles, and main memory] Special Course on Computer Architecture

  10. Cache coherence is maintained • A CPU wants to read a block cached at another tile • The CPU sends a read request to the memory controller • The controller forwards the request to the current owner • The owner sends the block to the requestor [Figure: read request flowing from a processor tile to the memory controller and on to the owning cache tile] Special Course on Computer Architecture
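To make the three-hop flow concrete, here is a minimal sketch of a directory-style read: the requestor asks the memory controller, which looks up the current owner and forwards the request, and the owner replies with the block. The directory variable and function names are hypothetical, not the GEMS protocol code.

#include <stdio.h>

#define NO_OWNER (-1)

/* Directory entry for a single block (a real directory keeps one entry per
 * block and also tracks the sharer set). */
static int owner_of_block = NO_OWNER;

static void read_request(int requestor, int block)
{
    if (owner_of_block == NO_OWNER) {
        printf("memory supplies block %d to CPU %d\n", block, requestor);
    } else {
        /* The controller forwards the request; the owner sends the block. */
        printf("forward to owner CPU %d; owner sends block %d to CPU %d\n",
               owner_of_block, block, requestor);
    }
}

int main(void)
{
    owner_of_block = 3;     /* block 0 is currently owned by CPU 3's cache */
    read_request(5, 0);     /* CPU 5 issues a read for block 0 */
    return 0;
}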

  11. Cache coherence: MOESI protocol class • Modified (M): modified (i.e., dirty); valid in only one cache • Owned (O): may or may not be clean; exists in multiple caches but is owned by one cache; the owner is responsible for responding to any requests for the block • Exclusive (E): clean; exists in only one cache • Shared (S): shared by multiple CPUs • Invalid (I) • MOESI protocol class: MSI, MOSI, MESI, MOESI, … • The status of each cache block is represented with M/O/E/S/I Special Course on Computer Architecture
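The five states can be summarized by a few properties of a cached copy: whether it is valid, whether it may be dirty, and whether this cache must respond to requests (the owner). A small sketch following the bullets above; the enum and helper names are illustrative, not the GEMS/SLICC encoding.

#include <stdio.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_t;

static bool is_valid(moesi_t s)          { return s != INVALID; }
static bool may_be_dirty(moesi_t s)      { return s == MODIFIED || s == OWNED; }
static bool may_have_sharers(moesi_t s)  { return s == SHARED   || s == OWNED; }
/* The owner is the single cache responsible for responding to requests. */
static bool is_owner(moesi_t s)          { return s == MODIFIED || s == OWNED; }

int main(void)
{
    moesi_t s = OWNED;
    printf("OWNED: valid=%d may_be_dirty=%d sharers_possible=%d owner=%d\n",
           is_valid(s), may_be_dirty(s), may_have_sharers(s), is_owner(s));
    return 0;
}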

  12. Cache coherence protocols • MSI/MOSI directory protocol • E state is not implemented • S-to-M transitions always update the main memory • MESI directory protocol • O state is not implemented; dirty sharing is not allowed • M-to-S transitions always update the main memory • MOESI directory protocol • MOESI token protocol [Martin ISCA03] • There are as many tokens as CPUs • A CPU holding one or more tokens can read the block • A CPU holding all tokens can modify (write) the block Special Course on Computer Architecture
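The token rule in the last bullets fits in a few lines: with one token per CPU for each block, a cache may read if it holds at least one token and may write only if it holds all of them. A minimal sketch (the names and the 8-CPU count are taken from this lecture's target, not from the token-coherence paper):

#include <stdio.h>
#include <stdbool.h>

#define NUM_TOKENS 8   /* one token per CPU; the simulated target has 8 CPUs */

static bool can_read(int tokens_held)  { return tokens_held >= 1; }
static bool can_write(int tokens_held) { return tokens_held == NUM_TOKENS; }

int main(void)
{
    printf("3 tokens: read=%d write=%d\n", can_read(3), can_write(3));
    printf("8 tokens: read=%d write=%d\n", can_read(8), can_write(8));
    return 0;
}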

  13. MSI Protocol: State transition [Figure: MSI state transition diagrams, showing processor-initiated transitions (CpuRd, CpuWr with BusRd/BusWr) and bus-initiated transitions (BusRd, BusWr with Flush) among the M, S, and I states] • S-to-M transitions flush (update) the main memory. Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).
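As a reading aid for the diagram, here is a minimal C sketch of the MSI transitions (write-back, write-invalidate), split into processor-initiated and bus-initiated events. It follows the textbook protocol summarized above, including the S-to-M memory update; the exact bus-action names in the figure may differ.

#include <stdio.h>

typedef enum { I, S, M } msi_t;

/* Processor-initiated transitions for one cache block. */
static msi_t cpu_access(msi_t st, int is_write)
{
    switch (st) {
    case I:
        printf(is_write ? "BusWr (fetch block for ownership)\n" : "BusRd\n");
        return is_write ? M : S;
    case S:
        if (is_write) {                       /* S -> M: per the slide, update memory */
            printf("BusWr + update main memory\n");
            return M;
        }
        return S;
    case M:
        return M;                             /* read or write hit, no bus traffic */
    }
    return st;
}

/* Bus-initiated (snooped) transitions caused by another CPU's access. */
static msi_t snoop(msi_t st, int other_is_write)
{
    if (st == M)
        printf("Flush dirty block to memory\n");
    if (other_is_write)
        return I;                             /* remote write invalidates our copy */
    return (st == M) ? S : st;                /* remote read: M -> S; S and I unchanged */
}

int main(void)
{
    msi_t st = I;
    st = cpu_access(st, 0);   /* local read : I -> S (BusRd) */
    st = cpu_access(st, 1);   /* local write: S -> M (updates memory per the slide) */
    st = snoop(st, 0);        /* remote read: M -> S with a flush */
    printf("final state: %d (0=I, 1=S, 2=M)\n", st);
    return 0;
}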

  14. MESI Protocol: State transition [Figure: MESI state transition diagrams, showing processor-initiated transitions (CpuRd with BusRd(C)/BusRd(!C), CpuWr with BusWr/BusUpgr) and bus-initiated transitions (BusRd, BusWr, BusUpgr with Flush/FlushOpt) among the M, E, S, and I states] • M-to-S transitions flush (update) the main memory. Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  15. MOESI Protocol: State transition (1/2) [Figure: processor-initiated transitions (CpuRd with BusRd(C)/BusRd(!C), CpuWr with BusWr/BusUpgr) among the M, O, E, S, and I states] Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  16. MOESI Protocol: State transition (2/2) [Figure: bus-initiated transitions (BusRd, BusWr, BusUpgr with Flush/FlushOpt) among the M, O, E, S, and I states] Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  17. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  18. Full-system simulation: GEMS/Simics • Wind River’s Simics • Commercial detailed processor simulator • Univ. of Wisconsin’s GEMS • Cache, memory, and network modules for Simics [Figure: simulated system with processor tiles (UltraSPARC, L1 I & D caches), cache tiles (L2 cache banks), on-chip routers, and main memory] Special Course on Computer Architecture

  19. Full-system simulation: GEMS/Simics • Today’s simulation target • Solaris 9 OS on eight UltraSPARC processors • Parallel application examples: Pi and Integer Sort • Various coherence protocols are supported [Figure: simulated system with processor tiles, cache tiles, on-chip routers, and main memory] Special Course on Computer Architecture

  20. Full-system simulation: GEMS/Simics • Simulation target • Solaris 9 OS on eight UltraSPARC processors • Parallel application example: Integer Sort (IS) • Workflow: compile a parallel program and execute it on the simulated 8-core machine [Figure: Solaris 9 running on the 8-core UltraSPARC target, built from processor tiles, cache tiles, on-chip routers, and main memory] Special Course on Computer Architecture

  21. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    printf("hello world from %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

Hello from all threads

  22. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

/* A, N, and num are assumed to be defined elsewhere (not shown on the slide). */
int main()
{
    int i;
    double start_time, end_time;
    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] * A[i] - 3.0;
    }
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}

  23. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

/* N is assumed to be defined elsewhere (not shown on the slide). */
int main()
{
    int i;
    double s = 0.0;
    double start_time, end_time;
    start_time = omp_get_wtime();
    #pragma omp parallel private(i) reduction(+:s)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
    }
    printf("pi = %lf\n", s);
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
}

  24. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  25. The first step: How to use the simulator • Please pick up your account information • Log in to one of the ICS cluster machines (id = 01…15) ssh -X <username>@cluster<id>.ics.keio.ac.jp • Copy the sample scripts and configuration files cp -r ~matutani/comparch2011/files work cd work Special Course on Computer Architecture

  26. The first step: How to use the simulator • Start Simics ./start_ideal_memory.sh • You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs). Special Course on Computer Architecture

  27. The first step: How to use the simulator • In the target machine, for example, you can check the number of processors as follows. bash-2.05# /usr/sbin/psrinfo -v You will see that there are eight processors Special Course on Computer Architecture

  28. Parallel application: “pi” calculation • You can execute a "pi" calculation program using eight, four, and one threads. bash-2.05# export OMP_NUM_THREADS=8 bash-2.05# ./pi bash-2.05# export OMP_NUM_THREADS=4 bash-2.05# ./pi bash-2.05# export OMP_NUM_THREADS=1 bash-2.05# ./pi Special Course on Computer Architecture

  29. Parallel application: Integer Sort (IS) • You can execute an Integer Sort (IS) program using eight, four, and one threads. bash-2.05# export OMP_NUM_THREADS=8 bash-2.05# ./IS bash-2.05# export OMP_NUM_THREADS=4 bash-2.05# ./IS bash-2.05# export OMP_NUM_THREADS=1 bash-2.05# ./IS Special Course on Computer Architecture

  30. Exercise 1 • Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results. Special Course on Computer Architecture
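One standard way to frame the discussion (background knowledge, not from the slides) is Amdahl's law: if a fraction p of the program is parallelizable and n threads run on enough cores, the ideal speedup is

\[ \mathrm{Speedup}(n) = \frac{T_1}{T_n} = \frac{1}{(1 - p) + p/n} \]

so the execution time falls linearly only if p is close to 1. Note also that the simulated target has eight processors, so 16 threads oversubscribe the cores.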

  31. Coherence protocols: Integer Sort (IS) • The following scripts automatically run the IS program with different cache coherence protocols. ./start_moesi_directory.sh ./start_mesi_directory.sh ./start_msi_mosi_directory.sh ./start_moesi_token.sh • Each simulation takes five to ten minutes. Do not run more than one script at the same time! Special Course on Computer Architecture

  32. Exercise 2 • Report the execution times of the MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token protocols. Discuss the results. For more details about the protocols, see pages 14-19. Special Course on Computer Architecture
