Inter-Processor Parallel Architecture

Presentation Transcript


  1. Inter-Processor Parallel Architecture Course No Lecture No Term

  2. Outline Parallel Architectures • Symmetric multiprocessor architecture (SMP) • Distributed-memory multiprocessor architecture • Clusters • The Grid • The Cloud • Multicore architecture • Simultaneous Multithreaded architecture • Vector Processors • GPUs

  3. Outline • Types of parallelism • Data-level parallelism (DLP) • Task-level parallelism (TLP) • Pipelined parallelism • Issues in Parallelism • Amdahl’s Law • Load balancing • Scalability

  4. What’s a Parallel Computer? Terms widely misused! • Any system with more than one processing unit • Many names • Multiprocessor Systems • Clusters • Supercomputers • Distributed Systems • SMPs • Multicore • May refer to different types of architectures • Not under the realm of parallel computers • OS running multiple processes on a single processor

  5. The 50s and 60s • Time-share machines • Mainframes: UNIVAC 1, IBM 360 image : Wikipedia

  6. Mainframes today • IBM is still selling mainframes today • New line of mainframes code-named “T-Rex” (IBM zEnterprise) • Banks appear to be the biggest clients • Billion-dollar revenue image : The Economist, 2012

  7. The 70s • Seymour Cray: CDC 7600, CRAY 1 • Shared-memory multiprocessors (SMP) • Vector machines • Pipelined architectures image : Wikipedia

  8. The 80s and 90s • Cluster computing and distributed computing • VAXcluster • Intel-based Beowulf cluster • IBM POWER5-based cluster image : Wikipedia

  9. The 90s and 00s The Grid

  10. Grid Computing • Separate computers interconnected by long-haul networks • e.g., Internet connections • work units farmed out, results sent back • Can make use of idle time on PCs • e.g., • seti@home • folding@home • Don’t need any additional hardware • Writing good software is difficult

  11. The 2000s: the Cloud • Really based on the same idea as the grid • Has all the heavyweights participating • Better chance of success!

  12. Parallel Architecture Design Issues Key characteristics for parallel architectures: • Communication mechanism • Memory access • Scalability • Architectural design choices driven by challenges in writing parallel programs • Three main challenges in writing parallel programs are • Safety • Can’t change the semantics • Account for race conditions • Efficiency • Exploit resources, keep processors busy • Scalability • Ensure that performance grows as you add more nodes • Some problems are inherently sequential • Can still extract some parallelism from them
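That last point is usually quantified by Amdahl’s law (listed in the outline): if a fraction f of the work can be parallelized across p processors, the speedup is bounded by

    S(p) = 1 / ((1 - f) + f / p)

so even with unlimited processors the speedup cannot exceed 1 / (1 - f); for example, f = 0.9 caps the speedup at 10x.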

  13. Shared-Memory Multiprocessor (SMP) [diagram: several processors, each with its own cache, connected to shared memory and I/O over an interconnection network (shared bus)] Until ~2006 most servers were set up as SMPs • Single address space shared by all processors • akin to a multi-threaded program • Processors communicate through shared variables in memory • Architecture must provide features to co-ordinate access to shared data • synchronization primitives

  14. Process Synchronization in SMP Architecture Need to coordinate threads working on the same data • Assume 2 threads in the program are executing the function below in parallel • Also assume the work performed in each thread is independent • But the threads operate on the same data set (e.g., foo) • At the assembly level the increment statement involves at least • one load • one add • one store

int foo = 17;

void *thread_func(void *arg) {
    ...
    foo = foo + 1;
    ...
}

  15. Process Synchronization in SMP Architecture • Assume the following order of access • T0 loads foo • T1 loads foo • T0 adds to foo • T1 adds to foo • T0 stores foo • T1 stores foo What is the final value of foo? Problem with concurrent access to shared data

int foo = 17;

void *thread_func(void *arg) {
    ...
    foo = foo + 1;
    ...
}

  16. Process Synchronization in SMP Architecture • Assume the following order of access • T0 loads foo • T0 adds to foo • T1 loads foo • T1 adds to foo • T0 stores foo • T1 stores foo What is the final value of foo? Problem with concurrent access to shared data

int foo = 17;

void *thread_func(void *arg) {
    ...
    foo = foo + 1;
    ...
}

  17. Using a mutex for synchronization At any point only one thread is going to execute this critical section:

pthread_mutex_lock(mutex);
/* code that modifies foo */
foo = foo + 1;
pthread_mutex_unlock(mutex);

Any thread executing the critical section will perform the load, add and store without any intervening operations on foo. To provide support for the locking mechanism we need atomic operations.
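Putting the pieces together, a minimal self-contained version of this example might look as follows (the two-thread setup and the names other than foo are illustrative, not from the slides); compiled with gcc -pthread, it always prints 19:

    #include <pthread.h>
    #include <stdio.h>

    int foo = 17;                                   /* shared data */
    pthread_mutex_t foo_lock = PTHREAD_MUTEX_INITIALIZER;

    void *thread_func(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&foo_lock);              /* enter critical section */
        foo = foo + 1;                              /* load, add, store as one unit */
        pthread_mutex_unlock(&foo_lock);            /* leave critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, thread_func, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("foo = %d\n", foo);                  /* 19 with the mutex in place */
        return 0;
    }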

  18. Process Synchronization in SMP Architecture • Need to be able to coordinate processes working on the same data • At the program level can use semaphores or mutexes to synchronize processes and implement critical sections • Need architectural support to lock shared variables • atomic read-modify-write operations, e.g., ll and sc on MIPS, swap on SPARC • Need architectural support to determine which processor gets access to the lock variable • a single bus provides an arbitration mechanism since the bus is the only path to memory • the processor that gets the bus wins
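As a hedged sketch of what that support looks like from software, C11’s <stdatomic.h> exposes such atomic read-modify-write primitives portably (the compiler lowers them to ll/sc, swap, or a similar instruction on the target ISA); a toy spinlock built on test-and-set, with an illustrative two-thread counter, could look like this:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static int counter = 0;

    /* Spinlock built on an atomic test-and-set: the hardware guarantees
       the read-modify-write on the flag is indivisible. */
    static void spin_lock(void)
    {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;   /* spin until the flag was previously clear */
    }

    static void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock();
            counter++;                           /* protected read-modify-write */
            spin_unlock();
        }
        return NULL;
    }

    int main(void)                               /* compile with: gcc -std=c11 -pthread */
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);       /* 200000 */
        return 0;
    }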

  19. Shared-Memory Multiprocessor (SMP) [same diagram: processors with private caches sharing memory and I/O over a shared bus] What’s a big disadvantage? • Single address space shared by all processors • akin to a multi-threaded program • Processors communicate through shared variables in memory • Architecture must provide features to co-ordinate access to shared data

  20. Types of SMP • SMPs come in two styles • Uniform memory access (UMA) multiprocessors • Any memory access takes the same amount of time • Non-uniform memory access (NUMA) multiprocessors • Memory is divided into banks • Memory latency depends on where the data is located • Programming NUMAs is harder, but the design is easier • NUMAs can scale to larger sizes and have lower latency to local memory, leading to overall improved performance • Most SMPs in use today are NUMA

  21. Distributed Memory Systems [diagram: processors, each with its own cache and memory, connected by an interconnection network] Multiple processors, each with its own address space, connected via the I/O bus Processors share data by explicitly sending and receiving information (message passing) Coordination is built into the message-passing primitives (message send and message receive)
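A minimal message-passing sketch in C using MPI (the de facto API for such systems; the rank roles, tag 0, and the value 42 below are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* explicit send ... */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* ... and matching receive */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Compile with mpicc and run with, e.g., mpirun -np 2; there is no shared variable anywhere, and all coordination happens in the send/receive pair.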

  22. Specialized Interconnection Networks (InfiniBand, Myrinet) • For distributed memory systems the speed of communication between processors is critical • The I/O bus or Ethernet, although viable solutions, don’t provide the necessary performance • latency is important • high throughput is important • Most distributed systems today are implemented with specialized interconnect networks • InfiniBand • Myrinet • Quadrics

  23. Clusters • Clusters are a type of distributed memory system • They are off-the-shelf, whole computers with multiple private address spaces connected using the I/O bus and network switches • lower bandwidth than multiprocessors that use the processor-memory (front-side) bus • lower-speed network links • more conflicts with I/O traffic • Each node has its own OS, limiting the memory available for applications • Improved system availability and expandability • easier to replace a machine without bringing down the whole system • allows rapid, incremental expansion • Economies-of-scale advantages with respect to cost

  24. Interconnection Networks • Bus • Ring • N-cube (N = 3) • 2D mesh • Fully connected On distributed systems processors can be arranged in a variety of topologies; typically, the more connections you have, the better the performance and the higher the cost (the N-cube case is sketched below).
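For instance, in an N-cube each node is linked to the nodes whose binary labels differ from its own in exactly one bit, so every node has N neighbors; a small illustrative helper (not from the slides) that enumerates them:

    #include <stdio.h>

    /* Print the neighbors of `node` in an n-dimensional hypercube:
       flipping each of the n label bits yields one neighbor. */
    void ncube_neighbors(unsigned node, unsigned n)
    {
        for (unsigned d = 0; d < n; d++)
            printf("neighbor across dimension %u: %u\n", d, node ^ (1u << d));
    }

    int main(void)
    {
        ncube_neighbors(5, 3);   /* node 101 in a 3-cube -> 100, 111, 001 */
        return 0;
    }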

  25. SMPs vs. Distributed Systems
SMP: communication happens through shared memory • harder to design and program • not scalable • needs a special OS • programming API : OpenMP • administration cost low
Distributed: need explicit communication • easier to design and program • scalable • can use a regular OS • programming API : MPI • administration cost high

  26. Power Density Heat becoming an unmanageable problem Chart courtesy : Pat Gelsinger, Intel Developer Forum, 2004

  27. The Power Wall • Moore’s law still holds, but turning the extra transistors into a single, faster core is no longer economically feasible • Power dissipation (and associated cooling costs) too high • Solution • Put multiple simplified cores in the same chip area • Less power dissipation => Less heat => Lower cost

  28. Multicore Chips • Shared caches • High-speed on-chip communication Examples: Intel Core 2 Duo, Blue Gene/L, Tilera TILE64

  29. Intel - Nehalem

  30. AMD - Shanghai

  31. CMP Architectural Considerations In a way, each multicore chip is an SMP • memory => cache • processor => core Architectural considerations are the same as for SMPs (e.g., cache coherence protocols) • Scalability • how many cores can we hook up to an L2 cache? • Sharing • how do concurrent threads share data? • through the LLC or memory • Communication • how do threads communicate? • semaphores and locks, using the cache if possible
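One concrete consequence of cores sharing caches under a coherence protocol is false sharing: two threads that write to different variables sitting in the same cache line keep invalidating each other’s copies. A common mitigation is to pad per-thread data out to a full line, sketched below (the 64-byte line size and the loop counts are assumptions for a typical x86 part, not from the slides):

    #include <pthread.h>

    #define CACHE_LINE 64               /* assumed line size; check the target CPU */

    /* Without the padding, the two counters could share one cache line and
       every increment would bounce that line between the cores. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    struct padded_counter counters[2];

    void *count_worker(void *arg)
    {
        int id = *(int *)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].value++;       /* each thread stays on its own line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, count_worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }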

  32. Simultaneous Multithreading (SMT) • Many architectures today support multiple HW threads • SMT uses the resources of a superscalar processor to exploit both ILP and thread-level parallelism • Having more instructions to play with gives the scheduler more scheduling opportunities • No dependences between threads from different programs • Need to rename registers • Intel calls its SMT technology Hyper-Threading • On most machines today, you have SMT on every core • Theoretically, a quad-core machine gives you 8 processors with Hyper-Threading • Logical cores
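From software, the OS simply reports each hardware thread as a logical processor; a quick illustrative check on Linux and most Unix-like systems (not from the slides):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* On a quad-core part with 2-way SMT this typically reports 8. */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors online: %ld\n", logical);
        return 0;
    }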

  33. Multithreaded Example: Sun’s Niagara (UltraSparc T2) Eight fine-grain multithreaded, single-issue, in-order cores [diagram: eight 8-way MT SPARC pipes connected through a crossbar to an 8-way banked L2, shared I/O and functional units, and the memory controllers]

  34. Niagara Integer Pipeline A 6-stage pipeline (Fetch, Thread Select, Decode, Execute, Memory, WB), small and power-efficient [diagram: I$/ITLB with per-thread PC logic and instruction buffers (x8) feed a thread-select mux; thread-select logic reacts to instruction type, cache misses, traps & interrupts, and resource conflicts; register files (x8) feed the ALU/Mul/Shft/Div in Execute; D$/DTLB and store buffers (x8) in Memory; crossbar interface] From MPR, Vol. 18, #9, Sept. 2004

  35. SMT Issues • Processor must duplicate the state hardware for each thread • a separate register file, PC, instruction buffer, and store buffer for each thread • The caches, TLBs, and BHT can be shared • although the miss rates may increase if they are not sized accordingly • The memory can be shared through virtual memory mechanisms • Hardware must support efficient thread context switching

  36. Vector Processors • A vector processor operates on vector registers • Vector registers can hold multiple data items of the same type • usually 64 or 128 words • Need hardware to support operations on these vector registers • Need at least one regular CPU • Vector ALUs are usually pipelined • high pay-off on scientific applications • formed the basis of supercomputers in the 1970s and early 80s

  37. Example Vector Machines

  38. Multimedia SIMD Extensions • Most processors today come with vector extensions • The compiler needs to be able to generate code for these • Examples • Intel : MMX, SSE • AMD : 3DNow! • IBM, Apple : AltiVec (does floating-point as well)
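A hedged sketch of using one of these extensions directly from C via Intel’s SSE intrinsics (the function and array names are illustrative; compilers can also auto-vectorize the equivalent scalar loop):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two float arrays four elements at a time.
       Assumes n is a multiple of 4 to keep the sketch short. */
    void add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 packed floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds in one instruction */
        }
    }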

  39. GPUs • Graphics cards have come a long way since the early days of 8-bit graphics • Most machines today come with a graphics processing unit that is highly powerful and capable of doing general-purpose computation as well • Nvidia is leading the market (GTX, Tesla) • AMD is competing (ATI) • Intel is kind of behind • May change with MICs

  40. GPUs in the System Can access each other’s memory image : P&H 4th Ed

  41. NVIDIA Tesla • Streaming multiprocessor = 8 × streaming processors image : P&H 4th Ed

  42. GPU Architectures • Many processing cores • Cores not as powerful as CPU • Processing is highly data-parallel • Use thread switching to hide memory latency • Less reliance on multi-level caches • Graphics memory is wide and high-bandwidth • Trend toward heterogeneous CPU/GPU systems in HPC • Top 500 • Programming languages/APIs • Compute Unified Device Architecture (CUDA) from NVidia • OpenCL for ATI

  43. The Cell Processor Architecture IBM’s CELL used in PS3 and HPC image : ibm.com

  44. Types of Parallelism • Models of parallelism • Message Passing • Shared-memory • Classification based on decomposition • Data parallelism • Task parallelism • Pipelined parallelism • Classification based on • TLP // thread-level parallelism • ILP // instruction-level parallelism • Other related terms • Massively Parallel • Petascale, Exascale computing • Embarrassingly Parallel

  45. Flynn’s Taxonomy • SISD – single instruction, single data stream • aka uniprocessor • SIMD – single instruction, multiple data streams • single control unit broadcasting operations to multiple datapaths • MISD – multiple instruction, single data • no such machine • MIMD – multiple instructions, multiple data streams • aka multiprocessors (SMPs, clusters)

  46. Data Parallelism [diagram: a data set D split into p chunks of size D/p, one per processor]

  47. Data Parallelism Typically, the same task is run on different parts of the data [diagram: spawn p workers, each operating on a D/p chunk of D, then synchronize]
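The same spawn/synchronize pattern in C with OpenMP, where the runtime splits the iteration space over the data among the threads and joins them at the end of the loop (the array name and size are illustrative):

    /* compile with: gcc -fopenmp */
    #define N 1000000

    double d[N];

    void scale_data(double factor)
    {
        /* fork: iterations are divided among the team of threads;
           join: implicit barrier at the end of the parallel for */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            d[i] *= factor;
    }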

  48. Data Distribution Schemes Figure courtesy : Blaise Barney, LLNL

  49. Example : Data Parallel Code in OpenMP

!$omp parallel do private(i,j)
do j = 1, N
  do i = 1, M
    a(i,j) = 17
  enddo
  b(j) = 17
  c(j) = 17
  d(j) = 17
enddo
!$omp end parallel do

  50. Example : Data Parallel Code in OpenMP

do jj = 1, N, 16
  !$omp parallel do private(i,j)
  do j = jj, min(jj+16-1, N)
    do i = 1, M
      a(i,j) = 17
    enddo
    b(j) = 17
    c(j) = 17
    d(j) = 17
  enddo
  !$omp end parallel do
enddo
