  1. Master Program (Laurea Magistrale) in Computer Science and Networking
  High Performance Computing Systems and Enabling Platforms
  Marco Vanneschi
  4. Shared Memory Parallel Architectures
  4.5. Multithreading, Multiprocessors, and GPUs

  2. Contents
  • Main features of explicit multithreading architectures
  • Relationships with ILP
  • Relationships with multiprocessors and multicores
  • Relationships with network processors
  • GPUs

  3. Basic principle
  • Concurrently execute instructions of different threads of control within a single pipelined processor
  • Notion of thread in this context:
  • NOT "software thread" as in a multithreaded OS,
  • Hardware-firmware supported thread: an independent execution sequence of a (general-purpose or specialized) processor:
  • A process
  • A compiler-generated thread
  • A microinstruction execution sequence
  • A task scheduled to a processor core in a GPU architecture
  • Even an OS thread (e.g. POSIX)
  • In a multithreaded architecture, a thread is the unit of instruction scheduling for ILP
  • Nevertheless, multithreading can be a powerful mechanism for multiprocessing too (i.e., parallelism at process level in multiprocessor architectures)
  • (Unfortunately, there is a lot of confusion around the word "thread", which is used in several contexts, often with very different meanings)

  4. Basic architecture
  • More independent program counters
  • Tagging mechanism (unique identifiers) to distinguish instructions of different threads within the pipeline
  • Efficient mechanisms for thread switching: very efficient context switching, from zero to very few clock cycles
  • Multiple register sets
  • (not always statically allocated)
  [Figure: pipelined processor (IM, DM, IU, pipelined EU) with request queues and per-thread instruction counters, fixed and float register sets. Interleave the execution of instructions of different threads in the same pipeline; try "to fill" the latencies as much as possible.]
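
  A minimal C++ sketch of what "multiple contexts" means in practice; every name and size below is an illustrative assumption, not taken from the slides.

  #include <array>
  #include <cstdint>

  // Illustrative sketch: the state a multithreaded pipeline keeps resident
  // per hardware thread (program counter, register sets, thread tag).
  struct HardwareThreadContext {
      uint64_t program_counter = 0;               // independent PC per thread
      std::array<uint64_t, 32> fixed_registers{}; // general-purpose register set
      std::array<double, 32>  float_registers{};  // floating-point register set
      uint8_t  thread_tag = 0;                    // tags the thread's instructions in the pipeline
      bool     ready = true;                      // false while waiting on a long latency
  };

  // A p-threaded core keeps p contexts resident; switching threads means
  // selecting another ready context, not saving/restoring state to memory.
  struct MultithreadedCore {
      std::array<HardwareThreadContext, 4> contexts; // e.g. p = 4 resident threads
      int current = 0;
  };

  int main() {
      MultithreadedCore core;
      core.contexts[1].ready = false;   // e.g. thread 1 waiting on a remote memory access
      core.current = 0;                 // issue from thread 0 in the meantime
  }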

  5. Basic goal: latency hiding
  • ILP: latencies are sources of performance degradation because of data dependencies
  • Logical dependencies induced by "long" arithmetic instructions
  • Memory accesses caused by cache faults
  • Idea: interleave instructions of different threads to increase distance
  • Multithreading and data-flow: similar principles
  • when implemented in a Von Neumann machine, multithreading requires multiple contexts (program counter, registers, tags),
  • while in a data-flow machine every instruction contains its "context" (i.e., data values),
  • the data-flow idea leads to multithreading when the assembler level is imperative.

  6. Basic goal: latency hiding
  • Multiprocessors: remote memory access latency (interconnection network, conflicts) and software lockout
  • Idea: instead of waiting idle for the remote access to complete, the processor switches to another thread in order to fill the idle times
  • Context switching (for threads) is caused by remote memory accesses too
  • Exploits multiprogramming of a single processor at a much finer grain
  • Context switching for threads is very fast: multiple contexts (program counter, general registers) are present inside the processor itself (they do not have to be loaded from higher memory levels), and no other "administrative" information has to be saved/restored
  • Multithreading is NOT under OS control; instead it is implemented at the firmware level
  • This is compatible with multiprogramming / multiprocessing: process states still exist (context switching for processes is distinct from context switching for threads)

  7. Taxonomy of multithreaded architectures
  [Ungerer, Robic, Silc]: course references
  Instructions in a given clock cycle can be issued from
  • a single thread
  • Interleaved multithreading (IMT)
  • an instruction of another thread is fetched and fed into the execution pipeline (of a scalar processor) at each clock cycle
  • Blocked multithreading (BMT)
  • the instructions of a thread are executed successively (in pipeline on a scalar processor) until an event occurs that may cause latency (e.g. a remote memory access); this event induces a context switch
  • multiple threads
  • Simultaneous multithreading (SMT)
  • instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor, i.e. superscalar instruction issue is combined with the multiple-context approach
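
  A toy C++ routine that contrasts the first two policies (entirely illustrative, not from the course material): IMT picks a different ready thread every cycle, BMT stays on the current thread until a long-latency event; an SMT core would instead issue from several ready threads within the same cycle.

  #include <array>
  #include <cstdio>

  struct IssueState {
      std::array<bool, 4> ready{true, true, true, true};  // hardware threads able to issue
      int current = 0;
  };

  int next_thread_imt(IssueState& s) {
      // IMT: round-robin over ready threads, a different thread every cycle
      for (int i = 1; i <= (int)s.ready.size(); ++i) {
          int t = (s.current + i) % (int)s.ready.size();
          if (s.ready[t]) { s.current = t; return t; }
      }
      return -1;  // no thread ready: pipeline bubble
  }

  int next_thread_bmt(IssueState& s, bool long_latency_event) {
      // BMT: keep issuing from the same thread; switch only when it blocks
      if (!long_latency_event && s.ready[s.current]) return s.current;
      return next_thread_imt(s);
  }

  int main() {
      IssueState imt, bmt;
      for (int cycle = 0; cycle < 6; ++cycle)
          std::printf("cycle %d: IMT issues thread %d, BMT issues thread %d\n",
                      cycle, next_thread_imt(imt), next_thread_bmt(bmt, cycle == 3));
  }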

  8. Single-issue processors (scalar: pipelined CPU)
  [Figure: single-issue multithreading approaches, fine grain vs coarse grain; cited examples: Cray MTA, multicores with many "simple" CPUs, network processors.]

  9. Multiple-issue processors (superscalar CPU)
  [Figure: multithreading options for multiple-issue processors: VLIW, IMT VLIW, BMT VLIW, …]
  • b) and/or c):
  • Blue Gene
  • SUN UltraSPARC
  • Intel Xeon Hyperthreading
  • GPU

  10. Simultaneous multithreading vs multicore
  [Figure: a 4-threaded, 8-issue SMT processor compared with a multiprocessor of four 2-issue processors.]

  11. Latency hiding in multiprocessors
  • Try to exploit memory hierarchies at best
  • Local memories, NUMA
  • Cache coherence
  • Communication processor (KP)
  • Interprocess communication latency hiding
  • Additional solution: multithreading
  • Remote memory access latency hiding
  • Fully compatible solutions
  e.g. the KP is multithreaded in order to "fill" KP latencies for remote memory accesses; thus the KP is able to execute more communications concurrently: for each new communication request a new KP thread is executed, and more threads share the KP pipeline, thus increasing KP bandwidth.

  12. Which performance improvements?
  • Improving CPU (CPUs) efficiency (i.e. utilization, e)
  • What about the service/completion time of a single process? Apparently, no direct advantage,
  • but in fact:
  • communication / calculation overlapping through the KP
  • increased KP bandwidth: this leads to an improvement in service time and completion time of parallel programs,
  • and:
  • improvement of ILP performance, thus some improvement of program completion time.
  • Best situation: threads belong to the same process, i.e. a process is further parallelized through threads;
  • meaningful improvement in service/completion time, provided that high parallelism is exploited between threads of the same process.
  • Here we can see the convergence between multithreading and data-flow: exploit data-flow parallelism between threads of the same process (data-flow multithreading).
  • Research issue: multithreading optimizing compilers.

  13. Excess parallelism
  • The idea of multithreading (BMT) for multiprocessors can have another interpretation:
  • Instead of using
  • (1) N single-threaded processors, use
  • (2) N/p processors, each of which is p-threaded
  • (Nothing new wrt the old idea of multiprogramming, except that context switching is no longer a meaningful overhead)
  • Under which conditions are the performances (e.g. completion times) of solutions 1 and 2 comparable? Despite the increasing diffusion of multithreaded architectures, this is still an open research problem.

  14. Excess parallelism
  Rationale:
  • A data-parallel program is designed for N virtual processors,
  • where the virtual processors are chosen with the goal of achieving the maximum parallelism for a "perfect" architecture (e.g., zero communication latency).
  • Its implementation exploits N/p real processors,
  • where p is the partition size of the real-processors solution wrt the one of the virtual-processors solution.
  • In several cases (not always), the order of magnitude of the completion time is not increased:
  • this guarantees that the program "scales" well.
  • Conceptually, we can consider that the real-processors solution exploits "p excess parallelism"
  • actually, the real solution exploits N/p sequential workers;
  • why not N/p parallel workers, each worker with p excess parallelism? i.e. p parallel threads per worker.
  • Example: a map.

  15. Excess parallelism
  From the complexity theory of parallel computations (PRAM), we know that under general conditions excess parallelism does not increase the order of magnitude of the completion time:
  • Context: data-parallel programs executed on a shared memory architecture with a logarithmic network, i.e. base latency = O(log N)
  • "Optimal" parallel algorithms exploit the architecture in such a way that: under-load latency = O(log N)
  • This can also be achieved with N/p processors, each of which with excess parallelism p, provided that p is chosen properly according to the algorithm and the architecture (not greater than O(log N)).
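
  A toy numeric reading of this rule, with illustrative values only (the choice p = log2 N is an assumption made for the example, not a prescription from the course).

  #include <cmath>
  #include <cstdio>

  // Toy numeric reading of the excess-parallelism rule above (illustrative values
  // only): with N virtual processors and base latency O(log N), choosing an excess
  // parallelism p not greater than log2(N) reduces the real machine to N/p
  // p-threaded nodes without changing the order of magnitude of the completion time.
  int main() {
      int N = 1 << 10;                          // 1024 virtual (data-parallel) workers
      int p = (int)std::log2((double)N);        // p = 10, i.e. p = O(log N)
      int real_nodes = N / p;                   // ~102 real p-threaded processors
      std::printf("N = %d, p = %d, real nodes = %d\n", N, p, real_nodes);
  }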

  16. Multithreading and communication: example
  A process working on a stream:
  • while true do
  •   receive (…, a);
  •   b = F (a);
  •   send (…, b)
  Let's assume zero-copy communication: from the performance evaluation viewpoint, the process alternates calculation (latency Tcalc) and communication (latency Tsend). With neither a communication processor nor multithreading:
  Tservice = Tcalc + Tsend
  In order to achieve (i.e. masking the communication latency):
  Tservice = Tcalc
  we can exploit parallelism between calculation and communication in a proper way. This can be done by using a communication processor and/or multithreading, according to various solutions.
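
  A hedged C++ sketch of the overlap this example aims at: receive_msg, F and send_msg below are placeholders, not the course's communication mechanisms; delegating the send to std::async plays the role of the KP (or of a second hardware thread), so that Tsend is hidden behind the next calculation and the service time tends to Tcalc instead of Tcalc + Tsend.

  #include <future>
  #include <vector>

  static std::vector<double> receive_msg() { return std::vector<double>(1024, 1.0); }
  static std::vector<double> F(const std::vector<double>& a) {
      std::vector<double> b(a.size());
      for (size_t i = 0; i < a.size(); ++i) b[i] = 2.0 * a[i];  // dummy calculation (Tcalc)
      return b;
  }
  static void send_msg(std::vector<double>) { /* stand-in for the real send (Tsend) */ }

  int main() {
      std::future<void> pending_send;                   // outstanding communication
      for (int i = 0; i < 100; ++i) {                   // bounded loop for the sketch
          std::vector<double> a = receive_msg();        // next stream element
          std::vector<double> b = F(a);                 // calculation: latency Tcalc
          if (pending_send.valid()) pending_send.get(); // wait for the previous send
          // launch the send asynchronously: its latency overlaps the next calculation
          pending_send = std::async(std::launch::async, send_msg, std::move(b));
      }
      if (pending_send.valid()) pending_send.get();
  }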

  17. Example: behavioural schematizations
  [Figure: timing diagrams. (a) CPU (IP) and KP: the CPU computes (Tcalc) and delegates the send (Tsend) to the KP. (b) CPU multithreaded (BMT) without KP: thread switching between CPU thread 1 and CPU thread 2.]
  In distributed memory machines, a send invocation can cause a thread switch. In a shared memory machine, the real behaviour is: during a send execution, thread switching occurs several times, i.e. each time a remote memory request is made. Because the majority of the send delay is spent in remote memory accesses, equivalently (from the performance evaluation modeling viewpoint) we can assume that thread switching occurs just once, when a send is invoked (approximate equivalent behaviour). In other words, this is the behaviour of the abstract architecture.

  18. Approximate equivalent behaviour in a multithreaded CPU
  [Figure: timing diagrams. Real behaviour: calculation and the send's local code interleave with remote reads/writes; when a remote read is completed, send execution is resumed (a sort of interrupt handling): thread continuation. Equivalent behaviour (cost model, best-case approximation): each thread alternates a calculation phase (Tcalc) with a send phase (Tsend); the "interrupted" thread is resumed: thread continuation.]

  19. Observations
  • Simplified cost model:
  • it takes into account the very high degree of nondeterminism and interleaving that characterizes multithreaded architectures
  • for distributed memory architectures, this cost model is a better approximation
  • Implementation of a thread-suspend / -resume mechanism at the firmware level
  • in addition to all the other pipelining/superscalar synchronizations
  • additional complexity of the hardware-firmware structure

  20. Example: Tcalc ≥ Tsend
  [Figure: timing diagrams (Tcalc, Tsend) for two solutions: (a) CPU (IP) scalar with a scalar, 1-issue multithreaded KP; (b) a 2-issue multithreaded CPU with no KP (CPU thread 1, CPU thread 2).]
  Equivalent service time (Tcalc). Equivalent parallelism degree per node: two real-parallelism threads on the same node correspond to two non-multithreaded nodes. In fact, many hardware resources are duplicated in a 2-issue multithreaded node. In principle, chip area is equivalent (hardware complexity of the same order of magnitude). However, in practice … (see slide 22).

  21. Example: Tcalc < Tsend (Tsend = 4 Tcalc)
  [Figure: timing diagrams for two solutions: (a) CPU (IP) scalar with a 4-issue multithreaded KP (KP threads 1-4); (b) a 5-issue multithreaded CPU with no KP.]
  Equivalent service time (Tcalc). Equivalent parallelism degree per node, provided that memory bandwidth > (block) accesses per memory clock cycle. This is the upper bound to the parallelism per node that can be exploited.
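
  The numbers in this example can be reproduced with a one-line cost model (my reading of the slides, with illustrative code only): roughly ceil(Tsend / Tcalc) KP threads are needed to mask the communication latency behind calculation, and a multithreaded CPU without KP needs one more thread for the calculation itself.

  #include <cmath>
  #include <cstdio>

  // Toy calculation of the cost model sketched above (an interpretation, not a
  // formula quoted from the slides): threads needed to mask Tsend behind Tcalc.
  int main() {
      double Tcalc = 1.0, Tsend = 4.0;                   // this slide's case: Tsend = 4 * Tcalc
      int kp_threads  = (int)std::ceil(Tsend / Tcalc);   // 4-issue multithreaded KP
      int cpu_threads = 1 + kp_threads;                  // 5-issue multithreaded CPU, no KP
      std::printf("KP threads = %d, CPU-only threads = %d\n", kp_threads, cpu_threads);
  }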

  22. Observation: KP or not KP?
  The service time and the total parallelism degree per node being equal,
  1) the solution with CPU (IP) + p-threaded KP has the same hardware complexity as
  2) the solution with a (1 + p)-threaded CPU, without KP.
  However, in terms of real cost, solution 1) is cheaper, e.g. it has a simpler hardware-firmware structure (less inter-thread synchronization in the CPU pipelining, lower suspend/resume nesting), thus it has lower power dissipation.
  Moreover, solution 1) can be seen as just another rationale for heterogeneous multicores (main CPU + p cores).

  23. Observation: parallel communications and parallel program optimization
  • A multithreaded CPU, or a CPU + multithreaded KP, is a solution to eliminate / reduce potential bottlenecks in parallel programs, provided that the memory bandwidth is adequate.
  • Example: a farm program where the interarrival time TA < Tsend (see the toy calculation after this slide)
  • the Emitter could be a bottleneck (Temitter = Tsend)
  • Example: a data-parallel program where the Scatter functionality could be a bottleneck
  • Scatter service time > TA
  • In both cases, the bottlenecks prevent exploiting the ideal parallelism solution (Tcalc / TA workers).
  • In both cases, parallelization of communications eliminates / reduces the bottlenecks.
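
  A toy calculation of the farm-emitter case; the numbers and the simple Tsend / p model are assumptions made for illustration, not taken from the slides.

  #include <cmath>
  #include <cstdio>

  // Illustration only: if TA < Tsend the emitter is a bottleneck; letting p sends
  // proceed concurrently (multithreaded CPU, or CPU + multithreaded KP) brings its
  // effective service time to about Tsend / p, so p >= Tsend / TA removes the
  // bottleneck, memory bandwidth permitting.
  int main() {
      double TA = 0.5, Tsend = 2.0;             // illustrative values only
      int p = (int)std::ceil(Tsend / TA);       // p = 4 concurrent communications
      std::printf("threads needed in the emitter: %d\n", p);
  }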

  24. Observation: importance of advanced mechanisms for interprocess communication
  • Multithreading = parallelism exploitation and management (context switching) at the firmware level
  • Efficient mechanisms for interprocess communication are needed
  • User level
  • Zero-copy
  • Example: multiple processing on target variables is allowed by the zero-copy communication
  [Figure: a channel with multiple target variables (VTG); multiple threads are fired by multiple messages.]

  25. Network processors and multithreading
  • Network processors apply multithreading to bridge latencies during (remote) memory accesses
  • Blocked multithreading (BMT)
  • Multithreading applied to cores that perform the data traffic handling
  • Hard real-time events (i.e., deadlines should never be missed)
  • Specific instruction scheduling during multithreaded execution
  • Examples:
  • Intel IXP
  • IBM PowerNP

  26. IBM Wire-Speed Processor (WSP)
  • Heterogeneous architecture
  • 16 general-purpose multithreaded cores: PowerPC, 2.3 GHz
  • SMT, 4 simultaneous threads/core
  • 16 KB L1 instruction cache, 16 KB L1 data cache (8-way set associative), 64-byte cache blocks
  • MMU: 512 entries, variable page size
  • 4 L2 caches (2 MB), each L2 cache shared by 4 cores
  • Domain-specific co-processors (accelerators)
  • targeted toward networking applications: packet processing, security, pattern matching, compression, XML
  • custom hardware-firmware components for optimizations
  • networking interconnect: four 10-Gb/s links
  • Internal interconnection structure: partial crossbar
  • similar to a 4-ring structure, 16-byte links

  27. WSP
  [Figure: WSP overview, from the IBM Journal of Research and Development, Jan/Feb 2010, pp. 3:1–3:11.]

  28. WSP

  29. WSP
  Advanced features for programmability + portability + performance:
  • Uniform addressability: uniform virtual address space
  • Every CPU core, accelerator and I/O unit has a separate MMU
  • Shared memory: NUMA architecture, including accelerators and I/O units (heterogeneous NUMA)
  • Coherent (snooping) and noncoherent caching support, also for accelerators and I/O
  • Result: accelerators and I/O are not special entities to be controlled through specialized mechanisms; instead they exploit the same mechanisms as the CPU cores
  • full process-virtualization of co-processors and I/O
  • Special instructions for locking and core-coprocessor synchronization
  • Load and Reserve, Store Conditional
  • Initiate Coprocessor
  • Special instructions for thread synchronization
  • wait, resume

  30. GPUs
  • Currently, another application of the multithreading paradigm is present in GPUs (Graphics Processing Units) and their attempt to become "general" machines
  • GPUs are SIMD machines
  • In this context, threads are execution instances of data-parallel tasks (data-parallel workers)
  • Both the SMT and the multiprocessor + SMT paradigms are applied

  31. SIMD architecture
  • SIMD (Single Instruction Stream, Multiple Data Stream)
  • Data-parallel (DP) paradigm at the firmware-assembler level
  [Figure: an Instruction Unit (IU) with instruction & data memory drives the Execution Units (EUs), the DP processor cores (workers), each with a local memory, connected by an interconnection structure (e.g. a 2-3 dimension mesh, k-ary n-cube). DP vector instructions (map, stencil); instruction issue by multicast; data distribution (scatter, multicast) and collection (gather, reduce).]
  • Example: the IU controls the partitioning of a float vector into the local memories (scatter), and issues a "vector_float_addition" request to all EUs (a functional sketch follows below)
  • Pipelined processing IU-EU, pipelined EUs
  • Extension: partitioning of the EUs into disjoint subsets for DP multiprocessing (MIMD + SIMD)
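
  A functional C++ model of the example above: host threads play the role of the EUs, each working on its own partition of the scattered vector; this only mimics the SIMD behaviour, it is not how a real IU/EU pipeline is implemented.

  #include <thread>
  #include <vector>

  int main() {
      const int num_eu = 4, n = 1024;
      std::vector<float> a(n), b(n), c(n);
      for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 1.0f; }  // input vectors

      std::vector<std::thread> eus;
      const int chunk = n / num_eu;                 // scatter: one partition per EU's local memory
      for (int w = 0; w < num_eu; ++w) {
          eus.emplace_back([&, w] {                 // multicast of the same DP instruction to all EUs
              for (int i = w * chunk; i < (w + 1) * chunk; ++i)
                  c[i] = a[i] + b[i];               // "vector_float_addition" on the local partition
          });
      }
      for (auto& t : eus) t.join();                 // implicit barrier before the next DP instruction
  }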

  32. SIMD: parallel, high-performance co-processor
  [Figure: a host system (CPU, Memory Management Unit, caches, main memory, DMA bus, I/O bus, interrupt arbiter) connected to a SIMD co-processor.]
  • SIMD cannot be general-purpose.
  • I/O bandwidth and latency for data transfer between the Host and the SIMD co-processor could be critical.
  • Challenge: proper utilization of central processors and peripheral SIMD co-processors for designing high-performance parallel programs.

  33. GPU parallel programs
  • From specialized coprocessors for real-time, high-quality 3D graphics rendering (shaders) to programmable data-parallel coprocessors
  • Generality vs performance? Programmability?
  • Stream-based SIMD Computing: replication of stream tasks (shader code) and partitioning of the data domain onto processor cores (EUs)
  • Thread: execution instance of a stream task scheduled to a processor core (EU) for execution
  • NOT to be confused with a software thread in a multithreaded OS
  • same meaning of thread as in multithreaded architectures.

  34. Example of GPU: AMD

  35. AMD GPU
  RV770
  • The EU is organized into 10 partitions
  • Each EU partition contains 16 EUs
  • Each EU is a 5-issue SMT multithreaded superscalar (VLIW) pipelined processor
  • Ideal exploitation: an 800-processor machine
  • Internal EU operators include scalar arithmetic operations, as well as float operations: sin, cos, logarithm, sqrt, etc.
  RV870
  • 20 EU partitions

  36. Nvidia GPU: GeForce GTX - Fermi
  • MIMD multiprocessor of 10 SIMT processors
  • Each SIMT is a SIMD architecture with 3 - 16 EU partitions, 8 EUs (CUDA cores) per partition

  37. GPU: programming model
  • Current tools (e.g. CUDA) are too elementary and low-level
  • Serious problems of programmability
  • The programmer is in charge of managing
  • data-parallelism at the architectural level
  • memory and I/O
  • multithreading
  • communication
  • load balancing
  • Trend (?)
  • High-level programming model (structured parallel programming?) with structured and/or compiler-based cooperation between the Host (possibly MIMD) and the SIMD coprocessors.
