
Special Course on Computer Architecture


Presentation Transcript


  1. Special Course on Computer Architecture #7: Simulation of Multi-Processors. Hiroki Matsutani and Hideharu Amano

  2. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  3. Multi- and many-core architectures [Figure: number of PEs (caches are not included), from 2 to 256, plotted against year (2004-2011) for multi- and many-core processors, including AMD Opteron, Intel Core, IBM Power7, Fujitsu SPARC64, Sun T1/T2, STI Cell BE, MIT RAW, UT TRIPS (OPN), TILERA TILE64, Intel SCC, ClearSpeed CSX600/CSX700, Intel 80-core, and picoChip PC102/PC205]

  4. Network-on-Chip (NoC) • Interconnection network to connect many cores [Figure: 16-core tile architecture; each tile contains a core and an on-chip router] Special Course on Computer Architecture

  5. On-chip router architecture • A packet is handled in three steps: 1) selecting an output channel (routing), 2) arbitration for the selected output channel (GRANT), and 3) sending the packet to the selected output channel • Routing, arbitration, and switch traversal are performed in a pipelined manner [Figure: 5x5 router with five buffered input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, an arbiter, a 5x5 crossbar, and five output ports] Special Course on Computer Architecture
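A minimal C sketch of the three-step packet handling described above, assuming dimension-order routing and a fixed-priority arbiter; the port names and function names (route_compute, arbitrate) are illustrative and not taken from the slides or from GEMS.

#include <stdio.h>

/* Port indices of a 5x5 router: four mesh directions plus the local core. */
enum { XPLUS, XMINUS, YPLUS, YMINUS, CORE, NUM_PORTS };

/* 1) Routing: select an output channel from the destination coordinates
 *    (simple dimension-order routing: X first, then Y). */
static int route_compute(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return XPLUS;
    if (dst_x < cur_x) return XMINUS;
    if (dst_y > cur_y) return YPLUS;
    if (dst_y < cur_y) return YMINUS;
    return CORE;                       /* arrived: eject to the local core */
}

/* 2) Arbitration: grant one of the requesting inputs access to the output. */
static int arbitrate(const int request[NUM_PORTS])
{
    for (int in = 0; in < NUM_PORTS; in++)
        if (request[in]) return in;    /* fixed-priority arbiter, for brevity */
    return -1;                         /* no request */
}

int main(void)
{
    int out = route_compute(1, 1, 2, 1);  /* packet at tile (1,1) heading to (2,1) */
    int request[NUM_PORTS] = {0};
    request[YMINUS] = 1;                  /* the packet waits in the Y- input FIFO */
    int granted_in = arbitrate(request);
    /* 3) Switch traversal: the granted packet crosses the crossbar. */
    printf("input port %d -> output port %d\n", granted_in, out);
    return 0;
}

In a real router these three steps are pipeline stages, so a new packet can start routing while the previous one traverses the crossbar.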

  6. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  7. Today’s target architecture • Chip multi-processors (CMPs) • Multiple processors (each has private L1 cache) • Shared L2 cache divided into multiple banks (SNUCA) [Figure: processor tile (UltraSPARC core with L1 instruction and data caches) and cache tile (L2 cache bank)] Special Course on Computer Architecture

  8. Today’s target architecture • Chip multi-processors (CMPs) • Multiple processors (each has private L1 cache) • Shared L2 cache divided into multiple banks (SNUCA) • Processors and L2 cache banks are connected via NoC [Figure: processor tiles (UltraSPARC core with L1 instruction and data caches) and cache tiles (L2 cache banks), connected by on-chip routers] Special Course on Computer Architecture
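One point that is implicit in a banked shared L2 (SNUCA) is that every physical address maps statically to exactly one bank. The sketch below assumes a common choice, indexing banks with the low-order bits of the block address; the block size, bank count, and function name are illustrative, not the GEMS configuration.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BITS    6              /* 64-byte cache blocks (assumed) */
#define NUM_L2_BANKS  8              /* one L2 bank per tile (assumed) */

/* Hypothetical static address-to-bank mapping for a banked shared L2. */
static unsigned l2_bank_of(uint64_t paddr)
{
    uint64_t block_addr = paddr >> BLOCK_BITS;
    return (unsigned)(block_addr % NUM_L2_BANKS);   /* low-order block-address bits */
}

int main(void)
{
    uint64_t a = 0x12345680ULL;
    printf("address 0x%llx -> L2 bank %u\n",
           (unsigned long long)a, l2_bank_of(a));
    return 0;
}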

  9. Cache coherence is maintained • Write-back policy • A cache write updates main memory only when the block is evicted • Write-invalidate policy • A cache write invalidates all copies held by the other sharers [Figure: processor tiles, cache tiles, and main memory] Special Course on Computer Architecture

  10. Cache coherence is maintained • A CPU wants to read a block cached at another tile • The CPU sends a read request to the memory controller • The controller forwards the request to the current owner • The owner sends the block to the requestor [Figure: read request flowing from a processor tile to the memory controller and on to the owning cache tile] Special Course on Computer Architecture
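To make the three-hop flow concrete, here is a minimal sketch of a directory-style read: the requestor asks the memory controller, which looks up the current owner and forwards the request, and the owner replies with the block. The directory variable and function names are hypothetical, not the GEMS protocol code.

#include <stdio.h>

#define NO_OWNER (-1)

/* Directory entry for a single block (a real directory keeps one entry per
 * block and also tracks the sharer set). */
static int owner_of_block = NO_OWNER;

static void read_request(int requestor, int block)
{
    if (owner_of_block == NO_OWNER) {
        printf("memory supplies block %d to CPU %d\n", block, requestor);
    } else {
        /* The controller forwards the request; the owner sends the block. */
        printf("forward to owner CPU %d; owner sends block %d to CPU %d\n",
               owner_of_block, block, requestor);
    }
}

int main(void)
{
    owner_of_block = 3;     /* block 0 is currently owned by CPU 3's cache */
    read_request(5, 0);     /* CPU 5 issues a read for block 0 */
    return 0;
}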

  11. Cache coherence: MOESI protocol class • Modified (M): modified (i.e., dirty); valid in only one cache • Owned (O): may or may not be clean; exists in multiple caches but is owned by one cache; the owner is responsible for responding to any requests for the block • Exclusive (E): clean; exists in only one cache • Shared (S): shared by multiple CPUs • Invalid (I) • MOESI protocol class: MSI, MOSI, MESI, MOESI, … • The status of each cache block is represented with M/O/E/S/I Special Course on Computer Architecture
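The five states can be summarized by a few properties of a cached copy: whether it is valid, whether it may be dirty, and whether this cache must respond to requests (the owner). A small sketch following the bullets above; the enum and helper names are illustrative, not the GEMS/SLICC encoding.

#include <stdio.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_t;

static bool is_valid(moesi_t s)          { return s != INVALID; }
static bool may_be_dirty(moesi_t s)      { return s == MODIFIED || s == OWNED; }
static bool may_have_sharers(moesi_t s)  { return s == SHARED   || s == OWNED; }
/* The owner is the single cache responsible for responding to requests. */
static bool is_owner(moesi_t s)          { return s == MODIFIED || s == OWNED; }

int main(void)
{
    moesi_t s = OWNED;
    printf("OWNED: valid=%d may_be_dirty=%d sharers_possible=%d owner=%d\n",
           is_valid(s), may_be_dirty(s), may_have_sharers(s), is_owner(s));
    return 0;
}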

  12. Cache coherence protocols • MSI/MOSI directory protocol • E state is not implemented • S-to-M transitions always update the main memory • MESI directory protocol • O state is not implemented; dirty sharing is not allowed • M-to-S transitions always update the main memory • MOESI directory protocol • MOESI token protocol [Martin ISCA03] • There are as many tokens as CPUs • A CPU holding one or more tokens can read the block • A CPU holding all tokens can modify (write) the block Special Course on Computer Architecture
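The token rule in the last bullets fits in a few lines: with one token per CPU for each block, a cache may read if it holds at least one token and may write only if it holds all of them. A minimal sketch (the names and the 8-CPU count are taken from this lecture's target, not from the token-coherence paper):

#include <stdio.h>
#include <stdbool.h>

#define NUM_TOKENS 8   /* one token per CPU; the simulated target has 8 CPUs */

static bool can_read(int tokens_held)  { return tokens_held >= 1; }
static bool can_write(int tokens_held) { return tokens_held == NUM_TOKENS; }

int main(void)
{
    printf("3 tokens: read=%d write=%d\n", can_read(3), can_write(3));
    printf("8 tokens: read=%d write=%d\n", can_read(8), can_write(8));
    return 0;
}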

  13. MSI Protocol: State transition [Figure: MSI state transition diagrams, showing processor-initiated transitions (CpuRd, CpuWr with BusRd/BusWr) and bus-initiated transitions (BusRd, BusWr with Flush) among the M, S, and I states] • S-to-M transitions flush (update) the main memory. Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).
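As a reading aid for the diagram, here is a minimal C sketch of the MSI transitions (write-back, write-invalidate), split into processor-initiated and bus-initiated events. It follows the textbook protocol summarized above, including the S-to-M memory update; the exact bus-action names in the figure may differ.

#include <stdio.h>

typedef enum { I, S, M } msi_t;

/* Processor-initiated transitions for one cache block. */
static msi_t cpu_access(msi_t st, int is_write)
{
    switch (st) {
    case I:
        printf(is_write ? "BusWr (fetch block for ownership)\n" : "BusRd\n");
        return is_write ? M : S;
    case S:
        if (is_write) {                       /* S -> M: per the slide, update memory */
            printf("BusWr + update main memory\n");
            return M;
        }
        return S;
    case M:
        return M;                             /* read or write hit, no bus traffic */
    }
    return st;
}

/* Bus-initiated (snooped) transitions caused by another CPU's access. */
static msi_t snoop(msi_t st, int other_is_write)
{
    if (st == M)
        printf("Flush dirty block to memory\n");
    if (other_is_write)
        return I;                             /* remote write invalidates our copy */
    return (st == M) ? S : st;                /* remote read: M -> S; S and I unchanged */
}

int main(void)
{
    msi_t st = I;
    st = cpu_access(st, 0);   /* local read : I -> S (BusRd) */
    st = cpu_access(st, 1);   /* local write: S -> M (updates memory per the slide) */
    st = snoop(st, 0);        /* remote read: M -> S with a flush */
    printf("final state: %d (0=I, 1=S, 2=M)\n", st);
    return 0;
}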

  14. MESI Protocol: State transition [Figure: MESI state transition diagrams, showing processor-initiated transitions (CpuRd with BusRd(C)/BusRd(!C), CpuWr with BusWr/BusUpgr) and bus-initiated transitions (BusRd, BusWr, BusUpgr with Flush/FlushOpt) among the M, E, S, and I states] • M-to-S transitions flush (update) the main memory. Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  15. MOESI Protocol: State transition (1/2) [Figure: processor-initiated transitions (CpuRd with BusRd(C)/BusRd(!C), CpuWr with BusWr/BusUpgr) among the M, O, E, S, and I states] Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  16. MOESI Protocol: State transition (2/2) [Figure: bus-initiated transitions (BusRd, BusWr, BusUpgr with Flush/FlushOpt) among the M, O, E, S, and I states] Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

  17. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  18. Full-system simulation: GEMS/Simics • Wind River’s Simics • Commercial detailed processor simulator • Univ. of Wisconsin’s GEMS • Cache, memory, and network modules for Simics [Figure: simulated system with processor tiles (UltraSPARC, L1 I & D caches), cache tiles (L2 cache banks), on-chip routers, and main memory] Special Course on Computer Architecture

  19. Full-system simulation: GEMS/Simics • Today’s simulation target • Solaris 9 OS on eight UltraSPARC processors • Parallel application examples: Pi and Integer Sort • Various coherence protocols are supported [Figure: simulated system with processor tiles, cache tiles, on-chip routers, and main memory] Special Course on Computer Architecture

  20. Full-system simulation: GEMS/Simics • Simulation target • Solaris 9 OS on eight UltraSPARC processors • Parallel application example: Integer Sort (IS) • Workflow: compile a parallel program and execute it on the simulated 8-core machine [Figure: Solaris 9 running on the 8-core UltraSPARC target, built from processor tiles, cache tiles, on-chip routers, and main memory] Special Course on Computer Architecture

  21. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    printf("hello world from %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

Hello from all threads

  22. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

/* A, N, and num are assumed to be defined elsewhere (not shown on the slide). */
int main()
{
    int i;
    double start_time, end_time;
    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] * A[i] - 3.0;
    }
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}

  23. Parallel application example: OpenMP

#include <stdio.h>
#include <omp.h>

/* N is assumed to be defined elsewhere (not shown on the slide). */
int main()
{
    int i;
    double s = 0.0;
    double start_time, end_time;
    start_time = omp_get_wtime();
    #pragma omp parallel private(i) reduction(+:s)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
    }
    printf("pi = %lf\n", s);
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
}

  24. Outline: Simulation of Multi-Processors • Background • Recent multi-core and many-core processors • Network-on-Chip • Shared-memory chip multi-processors • Architecture • Coherence protocols • Simulation environment: GEMS/Simics • Exercises [50min] • Performance evaluation of parallel applications • Performance evaluation of coherence protocols Special Course on Computer Architecture

  25. The first step: How to use the simulator • Please pick up your account information • Log in to one of the ICS cluster machines (id = 01…15) ssh -X <username>@cluster<id>.ics.keio.ac.jp • Copy the sample scripts and configuration files cp -r ~matutani/comparch2011/files work cd work Special Course on Computer Architecture

  26. The first step: How to use the simulator • Start Simics ./start_ideal_memory.sh • You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs). Special Course on Computer Architecture

  27. The first step: How to use the simulator • In the target machine, for example, you can check the number of processors as follows. bash-2.05# /usr/sbin/psrinfo -v You will see that there are eight processors Special Course on Computer Architecture

  28. Parallel application: “pi” calculation • You can execute a "pi" calculation program using eight, four, and one threads. bash-2.05# export OMP_NUM_THREADS=8 bash-2.05# ./pi bash-2.05# export OMP_NUM_THREADS=4 bash-2.05# ./pi bash-2.05# export OMP_NUM_THREADS=1 bash-2.05# ./pi Special Course on Computer Architecture

  29. Parallel application: Integer Sort (IS) • You can execute an Integer Sort (IS) program using eight, four, and one threads. bash-2.05# export OMP_NUM_THREADS=8 bash-2.05# ./IS bash-2.05# export OMP_NUM_THREADS=4 bash-2.05# ./IS bash-2.05# export OMP_NUM_THREADS=1 bash-2.05# ./IS Special Course on Computer Architecture

  30. Exercise 1 • Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results. Special Course on Computer Architecture
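One standard way to frame the discussion (background knowledge, not from the slides) is Amdahl's law: if a fraction p of the program is parallelizable and n threads run on enough cores, the ideal speedup is

\[ \mathrm{Speedup}(n) = \frac{T_1}{T_n} = \frac{1}{(1 - p) + p/n} \]

so the execution time falls linearly only if p is close to 1. Note also that the simulated target has eight processors, so 16 threads oversubscribe the cores.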

  31. Coherence protocols: Integer Sort (IS) • The following scripts automatically run the IS program with different cache coherence protocols. ./start_moesi_directory.sh ./start_mesi_directory.sh ./start_msi_mosi_directory.sh ./start_moesi_token.sh • Each simulation takes five to ten minutes. Do not run more than one script at the same time! Special Course on Computer Architecture

  32. Exercise 2 • Report the execution times of the MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token protocols. Discuss the results. For more details about the protocols, see pages 14-19. Special Course on Computer Architecture
