
Master Program (Laurea Magistrale) in Computer Science and Networking



  1. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi • Instruction Level Parallelism • 2.1. Pipelined CPU • 2.2. More parallelism, data-flow model

  2. Improving performance through parallelism Sequential Computer • The sequential programming assumption • User's applications exploit the sequential paradigm: technology pull or technology push? • Sequential cooperation P – MMU – Cache – Main Memory – … • Request-response cooperation: low efficiency, limited performance • Caching: performance improvement through latency reduction (memory access time), even with sequential cooperation • Further, big improvement: increasing CPU bandwidth through CPU-internal parallelism • Instruction Level Parallelism (ILP): several instructions executed in parallel • Parallelism is hidden from the programmer • The sequential programming assumption still holds • Parallelism is exploited at the firmware level (parallelization of the firmware interpreter of the assembler machine) • Compiler optimizations of the sequential program, exploiting the firmware parallelism at best

  3. ILP: which parallelism paradigm? All forms are feasible: • Pipelined implementation of the firmware interpreter • Replication of some parts of the firmware interpreter • Data-flow ordering of instructions • Vectorized instructions • Vectorized implementation of some parts of the firmware interpreter Parallelism paradigms / parallelism forms • Stream parallelism • Pipeline • Farm • Data-flow • … • Data parallelism • Map • Stencil • … • Stream + data parallelism • Map / stencil operating on streams • …

  4. Pipeline paradigm in general Module operating on stream M • x : y = F(x) = F2(F1(F0(x))) • Tseq = TF = TF0 + TF1 + TF2 (Figure: module M on a stream of x producing a stream of y, and its pipeline version with stages M0, M1, M2 computing F0, F1, F2; parallelism degree n = 3.) T0 = TF0 + Tcom, T1 = TF1 + Tcom, T2 = TF2 + Tcom • Service time: Tpipe = max(T0, T1, …, Tn-1); with parallelism degree n, the ideal service time is Tpipe-id = Tseq / n • Efficiency: e = Tpipe-id / Tpipe = Tseq / (n · max(T0, T1, …, Tn-1)) • If TF0 = TF1 = … = TFn-1 = t : Tpipe = t + Tcom (balanced pipeline) • If Tcom = 0 : Tpipe = Tpipe-id = Tseq / n (fully overlapped communications) For balanced, fully overlapped pipeline structures: processing bandwidth Bpipe = Bpipe-id = n · Bseq, efficiency e = 1
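As a concrete illustration (not part of the slides), a minimal C sketch that evaluates Tpipe, bandwidth and efficiency from the stage times; the numeric values are invented:

    #include <stdio.h>

    /* Tpipe = max(T0, ..., Tn-1), per the definitions above */
    static double pipe_service_time(const double T[], int n) {
        double Tmax = T[0];
        for (int i = 1; i < n; i++)
            if (T[i] > Tmax) Tmax = T[i];
        return Tmax;
    }

    int main(void) {
        double T[] = {2.0, 2.0, 2.0};   /* balanced stages: TFi + Tcom */
        int n = 3;
        double Tseq = 6.0;              /* TF0 + TF1 + TF2 */
        double Tpipe = pipe_service_time(T, n);
        double e = (Tseq / n) / Tpipe;  /* e = Tpipe-id / Tpipe */
        printf("Tpipe = %g, Bpipe = %g, e = %g\n", Tpipe, 1.0 / Tpipe, e);
        return 0;
    }

For this balanced, fully overlapped example the sketch prints e = 1, in agreement with the slide.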

  5. Pipeline latency Module operating on stream M • x : y = F(x) = F2(F1(F0(x))) • Latency: Lseq = Tseq = TF = TF0 + TF1 + TF2 (Figure: the same three-stage pipeline M0, M1, M2 on streams of x and y.) L0 = TF0 + Lcom, L1 = TF1 + Lcom, L2 = TF2 + Lcom • Pipeline latency: Lpipe = L0 + L1 + … + Ln-1 ≥ Lseq • Bandwidth improvement at the expense of increased latency wrt the sequential implementation.

  6. Pipeline completion time (Figure: filling transient period, steady state period, emptying transient period of a pipeline.) • m = stream length (= 2 in the figure) • n = parallelism degree (= 2 in the figure) • T(n) = service time with parallelism degree n • Completion time: Tc(n) = (n – 1 + m) · T(n), i.e. the filling transient plus m service periods • If m >> n : Tc(n) ≈ m · T(n) This is the fundamental relation between completion time and service time, valid for "long streams" (e.g. the instruction sequence of a program …)
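A one-line check of this relation in C (the formula Tc(n) = (n − 1 + m)·T(n) is the one given above):

    #include <stdio.h>

    /* Tc(n) = (n - 1 + m) * T(n): filling transient plus m service periods */
    static double completion_time(long m, int n, double T) {
        return (n - 1 + m) * T;
    }

    int main(void) {
        /* for m = 10^6 >> n = 4, Tc is essentially m * T(n) */
        printf("Tc = %g\n", completion_time(1000000L, 4, 2.0));
        return 0;
    }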

  7. Pipeline with internal state • Computation with internal state: for each stream element, the output value depends also on an internal state variable, and the internal state is updated. • Partitioned internal state: if disjoint partitions of the state variable can be recognized, encapsulated into distinct stages, and each stage operates on its own state partition only, then the presence of internal state has no impact on service time. • However, if state partitions are related to each other, i.e. if there is a STATE CONSISTENCY PROBLEM, then some degradation of the service time exists. (Figure: stages … Mi-1, Mi, Mi+1 … with state partitions Si-1, Si, Si+1.)

  8. Pipelining and loop unfolding k-unfolding M :: int A[m]; int s, s0; { < receive s0 from input_stream >; s = s0; for (i = 0; i < m; i++) s = F(s, A[i]); < send s onto output_stream > } is transformed, k times, into M :: int A[m]; int s, s0; { < receive s0 from input_stream >; s = s0; for (i = 0; i < m/k; i++) s = F(s, A[i]); for (i = m/k; i < 2m/k; i++) s = F(s, A[i]); … for (i = (k-1)m/k; i < m; i++) s = F(s, A[i]); < send s onto output_stream > } i.e. a k-stage balanced pipeline with A partitioned (k partitions); a concrete sketch follows this slide. • Formally, this transformation implements a form of stream data-parallelism with stencil: • simple stencil (linear chain), • optimal service time, • but increased latency compared to non-pipelined data-parallel structures.
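A concrete C sketch of the k-unfolding (k = 4, with F instantiated as integer addition; all values invented). Each of the k partial loops would become one pipeline stage owning its own partition of A:

    #include <stdio.h>
    #define M 16
    #define K 4

    static int F(int s, int a) { return s + a; }   /* illustrative F */

    int main(void) {
        int A[M], s = 0;
        for (int i = 0; i < M; i++) A[i] = i;
        /* the K partial loops; stage j would own A[j*M/K .. (j+1)*M/K) */
        for (int j = 0; j < K; j++)
            for (int i = j * (M / K); i < (j + 1) * (M / K); i++)
                s = F(s, A[i]);
        printf("s = %d\n", s);   /* 0 + 1 + ... + 15 = 120 */
        return 0;
    }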

  9. From sequential CPU … (Figure: CPU chip containing P, MMU, cache C and the interrupt arbiter; Memory Interface (MINF) towards main memory M; I/O.)

  10. … to pipelined CPU Parallelization of the firmware interpreter: while (true) do { instruction fetch (IC); instruction decoding; (possible) operand fetch; execution; (possible) result writing; IC update; < interrupt handling: firmware phase > } • Generate stream: the instruction stream • Recognize pipeline stages: corresponding to interpreter phases • Load balance of stages • Pipeline with internal state: problems of state consistency, "impure" pipeline, impact on performance
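The interpreter loop above can be made concrete with a toy C sketch (entirely illustrative: the opcodes, program encoding and register file are invented, and interrupt handling is only marked by a comment):

    #include <stdio.h>

    enum { ADDI, GOTO, HALT };                /* invented opcodes */
    typedef struct { int cop, a, b; } instr_t;

    int main(void) {
        instr_t prog[] = {{ADDI, 0, 5}, {ADDI, 0, 7}, {GOTO, 4, 0},
                          {HALT, 0, 0}, {HALT, 0, 0}};
        int RG[8] = {0}, IC = 0, running = 1;
        while (running) {                     /* while (true) do */
            instr_t in = prog[IC];            /* instruction fetch (IC) */
            switch (in.cop) {                 /* instruction decoding */
            case ADDI: RG[in.a] += in.b; IC++; break; /* execute, IC update */
            case GOTO: IC = in.a; break;              /* IC update only */
            default:   running = 0; break;
            }
            /* < interrupt handling: firmware phase > would occur here */
        }
        printf("RG[0] = %d\n", RG[0]);        /* 12 */
        return 0;
    }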

  11. Basic principle of Pipelined CPU: simplified view M becomes: Instruction Memory (IM), Instruction Unit (IU), Data Memory (DM), Execution Unit (EU). (Figure: instructions 1, 2, 3 flowing through the IM, IU, DM, EU stages over time t.) • IM: instruction stream generation, consecutive instruction addresses • IU: instruction decode, data addresses generation • DM: data accesses • EU: execution phase In principle, a continuous stream of instructions feeds the pipelined execution of the firmware interpreter; each pipeline stage corresponds to an interpreter phase.

  12. Pipelined CPU: an example (Figure: CPU chip with the MMUI–TAB-CI–CI–IU instruction path, the MMUD–TAB-CD–CD data path, EU with RG, IC and IC1, RG1, MINF towards M, I/O bus and interrupt arbiter; primary cache only is shown for simplicity.) • IM: Instruction Memory • MMUI: Instruction Memory MMU • CI: Instruction Cache • TAB-CI: Instruction Cache Relocation • DM: Data Memory • MMUD: Data Memory MMU • CD: Data Cache • TAB-CD: Data Cache Relocation • IU: Instruction preparation Unit; possibly parallel • EU: instruction Execution Unit; possibly parallel/pipelined • RG: General Registers, primary copy • RG1: General Registers, secondary copy • IC: program counter, primary copy • IC1: program counter, secondary copy

  13. Main tasks of the CPU units (simplified) • IM: generates the instruction stream at consecutive addresses (IC1); if a new value IC1 is received from IU, re-starts the stream generation from this value • IM service time = 1τ + Tcom = 2τ (Ttr = 0) • IU: receives an instruction from IM; if valid then, according to the instruction class: • if arithmetic instruction: sends the instruction to EU • if LOAD: sends the instruction to EU, computes the operand address on updated registers, sends a reading request to DM • if STORE: sends the instruction to EU, computes the operand address on updated registers, sends a writing request to DM (address and data) • if IF: verifies the branch condition on general registers; if true then updates IC with the target address and sends the IC value to IM, otherwise increments IC • if GOTO: updates IC with the target address and sends the IC value to IM • IU service time = 1τ + Tcom = 2τ (Ttr = 0)

  14. Main tasks of the CPU units (simplified) • DM: receives a request from IU and executes it; if a reading operation, then sends the read value to EU • DM service time = 1τ + Tcom = 2τ (Ttr = 0) • EU: receives a request from IU and executes it: • if arithmetic instruction: executes the corresponding operation, stores the result into the destination register, and sends it to IU • if LOAD: waits for the operand from DM, stores it into the destination register, and sends it to IU • EU service time = 1τ + Tcom = 2τ (Ttr = 0); latency may be 1τ or greater

  15. AMD Opteron X4 (Barcelona) (figure only)

  16. Pipeline abstract architecture • The CPU architecture "as seen" by the compiler • It captures the essential features for defining an effective cost model of the CPU • First assumption: pure RISC (e.g. D-RISC) • all stages have the same service time: well balanced: t = Tid • Ttr = 0; if t ≥ 2τ, then Tcom = 0 • for very fine grain operations, all communications are fully overlapped to internal calculations (Figure: abstract pipeline IM (IC1) → IU (IC, RG1) → DM / EU (RG).) The compiler can easily simulate the program execution on this abstract architecture, or an analytical cost model can be derived.

  17. Example 1 1. LOAD R1, R0, R4 2. ADD R4, R5, R6 3. … Execution simulation: to determine the service time T and efficiency e. (Figure: timing diagram; instructions 1, 2, 3 flow through IM, IU, DM, EU with asynchronous communications, one instruction completing every t = 2τ.) Instruction sequences of this kind are able to fully exploit the pipelined CPU. In fact, in this case the CPU behaves as a "pure" pipeline. No performance degradation. "Skipping" the DM stage has no impact on service time, in this example.

  18. Peak performance • For t ≥ 2τ, Tcom = 0 • Assume: t = 2τ for all processing units • including EU: arithmetic operations have service time 2τ • floating point arithmetic: pipeline implementation (e.g., 4 or 8 stages …) • for integer arithmetic: latency 2τ (e.g., hardware multiply/divide) • however, this latency does not hold for floating point arithmetic, e.g. 4 or 8 times greater • CPU service time: T = Tid = t • Performance: Pid = 1/Tid = 1/(2τ) = fclock/2 instructions/sec • e.g. fclock = 4 GHz, Pid = 2 GIPS • Meaning: an instruction is fetched every 2τ • i.e., a new instruction is processed every 2τ • from the evaluation viewpoint, the burden between assembler and firmware is vanishing! • marginal advantage of firmware implementation: microinstruction parallelism • Further improvements (e.g., superscalar) will lead to Pid = w/τ = w · fclock, w ≥ 1

  19. Performance degradations • For some programs, the CPU is not able to operate as a "pure" pipeline • Consistency problems on replicated copies: • RG in EU, RG1 in IU: associated synchronization mechanism • IC in IU, IC1 in IM: associated synchronization mechanism • In order to guarantee consistency through synchronization, some "bubbles" are introduced in the pipeline flow • increased service time • Consistency of RG, RG1: LOGICAL DEPENDENCIES ON GENERAL REGISTERS ("data hazards") • Consistency of IC, IC1: BRANCH DEGRADATIONS ("control hazards" or "branch hazards")

  20. Example 2: branch degradation 1. LOAD R1, R0, R4 2. ADD R4, R5, R6 3. GOTO L 4. … (Figure: timing diagram; instruction 4 is fetched by IM while the GOTO is resolved in IU, then discarded.) "Bubble": IM performs a wasted fetch (instruction 4), instruction 4 must be discarded by IU, about one time interval t is wasted and it propagates through the pipeline. Abstract architecture: exactly one time interval t is wasted (approximate evaluation of the concrete architecture).

  21. Example 3: logical dependency 1. LOAD R1, R0, R4 2. ADD R4, R5, R6 3. STORE R2, R0, R6 (Figure: timing diagram; IU stalls on instruction 3 until EU completes instruction 2.) Logical dependency, induced by instruction 2 on instruction 3; distance k = 1. Instruction 3, when processed by IU, needs the updated value of the RG[6] contents. Thus, IU must wait for the completion of execution of instruction 2 in EU. In this case, two time intervals t are wasted (approximate view of the abstract architecture). The impact of execution latency is significant (in this case, DM is involved because of the LOAD instruction). If an arithmetic instruction were in place of the LOAD, the bubble would shrink to one time interval t.

  22. Pipelined D-RISC CPU: implementation issues Assume a D-RISC machine. Verify the feasibility of the abstract architecture characteristic: all stages have the same service time = 2τ. (Figure: the CPU organization of slide 12.) • RG–RG1 consistency: integer semaphore mechanism associated to RG1. • IC–IC1 consistency: unique identifier associated to the instructions from IM, in order to accept or to discard them in IU.

  23. RG1 consistency • Each General Register RG1[i] has associated a non-negative integer semaphore S[i] • For each arithmetic/LOAD instruction having RG[i] as destination, IU increments S[i] • Each time RG1[i] is updated with a value sent from EU, IU decrements S[i] • For each instruction that needs RG[i] to be read by IU (address components for LOAD, address components and source value for STORE, condition applied to registers for IF, address in register for GOTO), IU waits until the condition (S[i] = 0) holds
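A minimal C sketch of this semaphore discipline (the data structures and function names are invented for illustration; the real mechanism is firmware inside IU):

    #include <assert.h>
    #define NREG 64
    static int S[NREG];                /* one non-negative counter per RG1[i] */

    /* IU, on issuing to EU an arithmetic/LOAD instruction with destination Rc */
    static void on_issue_dest(int c)     { S[c]++; }

    /* IU, on receiving (regaddr, val) from EU (RG1[regaddr] updated) */
    static void on_update_from_eu(int i) { S[i]--; }

    /* IU must wait until S[i] == 0 before reading RG[i] (LOAD/STORE address
       components, STORE source value, IF condition registers, GOTO target) */
    static int rg_is_consistent(int i)   { return S[i] == 0; }

    int main(void) {
        on_issue_dest(6);               /* e.g. ADD R4, R5, R6 issued to EU */
        assert(!rg_is_consistent(6));   /* a STORE reading R6 must wait */
        on_update_from_eu(6);           /* EU sends back the new RG[6] */
        assert(rg_is_consistent(6));    /* IU may proceed */
        return 0;
    }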

  24. IC1 consistency • Each time IU executes a branch or jump, IC is updated with the new target address, and this value is sent to IM • As soon as it is received, IM writes the new value into IC1 • Every instruction sent to IU is accompanied by the IC1 value • For each received instruction, IU compares the value of IC with the received value of IC1: the instruction is valid if they are equal, otherwise the instruction is discarded • Notice that IC (IC1) acts as a unique identifier. Alternatively, a shorter identifier can be generated by IU and sent to IM.
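A corresponding sketch of the validity check (types and values invented):

    #include <stdio.h>

    typedef struct { unsigned instr; unsigned ic1; } fetched_t;

    static unsigned IC = 0;   /* IU's program counter */

    /* IU, for each (INSTR, IC1) pair received from IM */
    static int is_valid(fetched_t f) { return f.ic1 == IC; }

    int main(void) {
        fetched_t f = { 0xA1, 0 };   /* instruction fetched with IC1 = 0 */
        printf("%s\n", is_valid(f) ? "valid" : "discarded");   /* valid */
        IC = 100;  /* a branch was taken: IC updated and sent to IM */
        printf("%s\n", is_valid(f) ? "valid" : "discarded");   /* discarded */
        return 0;
    }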

  25. Detailed tasks of the CPU units • IM • MMUI: translates IC1 into the physical address IA of the instruction, sends (IA, IC1) to TAB-CI, and increments IC1 • when a new message is received from IU, MMUI updates IC1 • TAB-CI: translates IA into the cache address CIA of the instruction, or generates a cache fault condition (MISS), and sends (CIA, IC1, MISS) to CI • CI: if MISS is false, reads the instruction INSTR at address CIA and sends (INSTR, IC1) to IU; otherwise performs the block transfer, then reads the instruction INSTR at address CIA and sends (INSTR, IC1) to IU

  26. Detailed tasks of the CPU units • IU • when receiving (regaddr, val) from EU: writes val into RG[regaddr] and decrements S[regaddr] or • when receiving (INSTR, IC1): if IC = IC1 the instruction is valid and IU goes on, otherwise it discards the instruction • if type = arithmetic (COP, Ra, Rb, Rc): sends the instruction to EU, increments S[c] • if type = LOAD (COP, Ra, Rb, Rc): waits for updated values of RG[a] and RG[b], sends the request (read, RG[a] + RG[b]) to DM, increments S[c] • if type = STORE (COP, Ra, Rb, Rc): waits for updated values of RG[a], RG[b] and RG[c], sends the request (write, RG[a] + RG[b], RG[c]) to DM • if type = IF (COP, Ra, Rb, OFFSET): waits for updated values of RG[a] and RG[b]; if the branch condition is true then IC = IC + OFFSET and sends the new IC to IM, otherwise IC = IC + 1 • if type = GOTO (COP, OFFSET): IC = IC + OFFSET and sends the new IC to IM • if type = GOTO (COP, Ra): waits for the updated value of RG[a], IC = RG[a] and sends the new IC to IM

  27. Detailed tasks of the CPU units • DM • MMUD: receives (op, log_addr, data) from IU, translates log_addr into phys_addr, sends (op, phys_addr, data) to TAB-CD • TAB-CD: translates phys_addr into the cache address CDA, or generates a cache fault condition (MISS), and sends (op, CDA, data, MISS) to CD • CD: if MISS is false, then according to op: • reads VAL at address CDA and sends VAL to EU, • or writes data at address CDA; if MISS is true: performs the block transfer, then completes the requested operation as before • EU • receives the instruction (COP, Ra, Rb, Rc) from IU • if type = arithmetic (COP, Ra, Rb, Rc): executes RG[c] = COP(RG[a], RG[b]), sends the new RG[c] to IU • if type = LOAD (COP, Ra, Rb, Rc): waits for VAL from DM, RG[c] = VAL, sends VAL to IU

  28. Cost model • To be evaluated on the abstract architecture: good approximation • Branch degradation: λ = branch probability; T = λ · 2t + (1 – λ) · t = (1 + λ) · t • Logical dependencies degradation: T = (1 + λ) · t + Δ, where Δ = average delay time in IU processing • Efficiency: e = Tid / T = t / ((1 + λ) · t + Δ)
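A small C sketch of this cost model; the efficiency formula e = t/T used below follows from Tid = t as stated above, and the numeric example is the one of the next slide:

    #include <stdio.h>

    /* T = (1 + lambda) * t + Delta */
    static double service_time(double lambda, double Delta, double t) {
        return (1.0 + lambda) * t + Delta;
    }

    int main(void) {
        double t = 1.0;   /* in units of t = 2 tau */
        double T = service_time(1.0 / 3.0, 0.0, t);     /* slide-29 example */
        printf("T = %.3f t, e = %.3f\n", T, t / T);     /* T = 4/3 t, e = 0.75 */
        return 0;
    }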

  29. Example • 1. LOAD R1, R0, R4 • 2. ADD R4, R5, R6 • 3. GOTO L • … • Δ = 0 • λ = 1/3 • Hence T = (1 + 1/3) · t = (4/3) · t and e = t/T = 3/4

  30. Logical dependencies degradation • Formally, Δ is derived from the response time RQ in a client-server system • Server: the DM-EU subsystem • A logical queue of instructions is formed at the input of the DM-EU server • Clients: logical instances of the IM-IU subsystem, corresponding to the queued instructions sent to DM-EU • Let d = average probability of logical dependencies; then Δ = d · RQ • How to evaluate RQ?

  31. Client – server systems in general (at any level) (Figure: clients C1 … Cn, queue Q, server S.) • Assume a request-response behaviour: • a client sends a request to the server, • then waits for the response from the server. • The service time of each client is delayed by the RESPONSE TIME of the server. • Modeling and evaluation as a Queueing System. Ci :: initialize x … while (true) do { < send x to S >; < receive y from S >; x = G(y, …) } S :: while (true) do { < receive x from any C among C1, …, Cn >; y = F(x, …); < send y to C > }

  32. Client-server queueing modeling • Tcl: average service time of the generic client • TG: ideal average service time of the generic client • RQ: average response time of the server • ρ: server queue utilization factor • TS: average service time of the server • LS: average latency time of the server • σS: other service time distribution parameters (e.g., variance of the service time of the server) • TA: average interarrival time to the server • n: number of clients (assumed identical) (Figure: the system of equations linking these quantities.) System of equations to be solved through analytical or numerical methods. One and only one real solution exists satisfying the constraint ρ < 1. This methodology will be applied several times during the course.
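Since the actual system of equations is shown only as a figure, here is an illustrative C sketch of the numerical solution method; an M/M/1-like waiting-time formula WQ = ρ·TS/(1 − ρ) and all parameter values are ASSUMED, purely to show the fixed-point technique:

    #include <stdio.h>

    int main(void) {
        double TG = 4.0, TS = 1.0, LS = 1.0;   /* invented parameter values */
        int n = 8;                             /* number of identical clients */
        double Tcl = TG + LS;                  /* initial guess */
        for (int it = 0; it < 1000; it++) {
            double TA = Tcl / n;               /* interarrival time to server */
            double r  = TS / TA;               /* utilization factor */
            if (r >= 1.0) r = 0.999;           /* keep the iterate in r < 1 */
            double WQ = r * TS / (1.0 - r);    /* ASSUMED M/M/1 waiting time */
            double RQ = WQ + LS;               /* server response time */
            Tcl = 0.5 * Tcl + 0.5 * (TG + RQ); /* damped client equation */
        }
        /* converges to the unique solution with r < 1 (here Tcl ~ 9.7) */
        printf("Tcl = %.3f, rho = %.3f\n", Tcl, TS / (Tcl / n));
        return 0;
    }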

  33. Typical results: server utilization degree (= ρ) (figure only)

  34. Typical results: client efficiency (= TG/Tcl) (figure only)

  35. Parallel server: the impact of service time and of latency • A more significant way to express the Response Time, especially for a parallel server: RQ = WQ(ρ, …) + LS • WQ: average Waiting Time in queue (queue of client requests): WQ depends on the utilization factor ρ, thus on the service time only (not on latency) • LS is the latency of the server • Separation of the impact of service time and of latency • All the parallelism paradigms aim to reduce the service time • Some parallelism paradigms are able to reduce the latency too • Data-parallel, data-flow, … while other paradigms increase the latency • Pipeline, farm, … • Overall effect: in which cases does the impact of service time (latency) dominate over the impact of latency (service time)? • For each server-parallelization problem, the relative impact of service time and latency must be carefully studied in order to find a satisfactory trade-off.

  36. Pipeline CPU: approximate evaluation of Δ • For a D-RISC machine • Evaluation on the abstract architecture • Let: • k = distance of a logical dependency • dk = probability of a logical dependency with distance k • NQ = average number of instructions waiting in the DM-EU subsystem • It can be shown that: Δ = Σk dk · (NQk – k + 1) · t, where the summation is extended to non-negative terms only. • NQ = 2 if the instruction inducing the logical dependency is the last instruction of a sequence containing all EU-executable instructions and at least one LOAD, otherwise NQ = 1.
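A C sketch of this formula, checked against the example of slide 41 (the array layout, indexed by distance k, is invented):

    #include <stdio.h>

    /* Delta = sum over k of dk * (NQk - k + 1) * t, non-negative terms only */
    static double delta(const double d[], const int NQ[], int kmax, double t) {
        double D = 0.0;
        for (int k = 1; k <= kmax; k++) {
            double term = (NQ[k] - k + 1) * t;  /* WQk + LS = (NQk - k)t + t */
            if (term > 0.0) D += d[k] * term;   /* discard negative terms */
        }
        return D;
    }

    int main(void) {
        /* slide-41 example: one dependency, k = 1, d1 = 1/3, NQ = 2 */
        double d[3] = {0.0, 1.0 / 3.0, 0.0};
        int NQ[3]   = {0, 2, 0};
        double t = 1.0, D = delta(d, NQ, 2, t);
        printf("Delta = %.3f t, T = %.3f t, e = %.3f\n",
               D, 1.0 + D, 1.0 / (1.0 + D));   /* 2/3, 5/3, 0.6 */
        return 0;
    }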

  37. Δ formula for D-RISC: proof • Δ = d1·Δ1 + d2·Δ2 + …, extended to non-negative terms only • dk = probability of a logical dependency with distance k • Δk = RQk = WQk + LS • WQk = average number of queued instructions that (with probability dk) precede the instruction inducing the logical dependency • LS = latency of EU in D-RISC; with hardware-implemented arithmetic operations on integers, LS = t • Δk = WQk + t • Problem: find a general expression for WQk • The number of distinct situations that can occur in a D-RISC abstract architecture is very limited • Thus, the proof can be done by enumeration. • Let us individuate the distinct situations.

  38. Δ formula for D-RISC: proof • Case A (distance k = 1, probability d1): 1. ADD …, …, R1 2. STORE …, …, R1 WQ = 0: when 2 is in IU, 1 is in EU (current service). Δ = d1 (WQ + LS) = d1 t • Case B (distance k = 2, probability d2): 1. ADD …, …, R1 2. SUB …, …, R2 3. STORE …, …, R1 WQ = –1t: when 3 is in IU, 1 has already been executed. Δ = d2 (WQ + LS) = 0 • Case C (distance k = 1, probability d1): 1. LOAD …, …, … 2. ADD …, …, R1 3. STORE …, …, R1 WQ = 1t: when 3 is in IU, 1 is in EU (current service) and 2 is waiting in queue. Δ = d1 (WQ + LS) = 2 d1 t • Case D (distance k = 2, probability d2): 1. LOAD …, …, … 2. ADD …, …, R1 3. SUB …, …, R2 4. STORE …, …, R1 WQ = 0: when 4 is in IU, 2 is in EU (current service). Δ = d2 (WQ + LS) = d2 t In all other cases Δ = 0, in particular for k > 2.

  39. Δ formula for D-RISC: proof • WQk = number of queued instructions preceding the instruction that induces the logical dependency • Maximum value of WQk = 1t: • the instruction currently executed in EU has been delayed by a previous LOAD, and the instruction inducing the logical dependency is still in queue • Otherwise: WQk = 0 or WQk = –1t • WQk = 0: the instruction currently executed in EU induces the logical dependency • WQk = 0: the instruction currently executed in EU has been delayed by a previous LOAD, and it is the instruction inducing the logical dependency • WQk = –1t: the instruction inducing the logical dependency has already been executed • To be proven: WQk = (NQk – k) · t • NQk = 2 if the instruction inducing the logical dependency is the last instruction of a sequence containing all EU-executable instructions and at least one LOAD, • otherwise NQk = 1.

  40. Δ formula for D-RISC: proof • Case A (k = 1, d1): ADD …, …, R1; STORE …, …, R1. NQ = 1, WQ = (NQ – k)t = 0; Δ = d1 (WQ + LS) = d1 t • Case B (k = 2, d2): ADD …, …, R1; SUB …, …, R2; STORE …, …, R1. NQ = 1, WQ = (NQ – k)t = –1t; Δ = d2 (WQ + LS) = 0 • Case C (k = 1, d1): LOAD …, …, …; ADD …, …, R1; STORE …, …, R1. NQ = 2, WQ = (NQ – k)t = 1t; Δ = d1 (WQ + LS) = 2 d1 t • Case D (k = 2, d2): LOAD …, …, …; ADD …, …, R1; SUB …, …, R2; STORE …, …, R1. NQ = 2, WQ = (NQ – k)t = 0; Δ = d2 (WQ + LS) = d2 t In all other cases Δ = 0: k > 2, or k = 2 and NQ = 1.

  41. Example 1. LOAD R1, R0, R4 2. ADD R4, R5, R6 3. STORE R2, R0, R6 λ = 0. One logical dependency induced by instruction 2 on instruction 3: distance k = 1, probability d1 = 1/3, NQ = 2. Hence Δ = d1 (NQ – k + 1) t = (2/3) t, T = t + Δ = (5/3) t, e = t/T = 3/5.

  42. Example 1. L : LOAD R1, R0, R4 2. LOAD R2, R0, R5 3. ADD R4, R5, R4 4. STORE R3, R0, R4 5. INCR R0 6. IF < R0, R6, L 7. END Two logical dependencies with the same distance (= 1): the one (induced by 3 on 4) with NQ = 2, the other (induced by 5 on 6) with NQ = 1. With λ = 1/6, Δ = (1/6) · 2t + (1/6) · t = t/2, hence T = (7/6) t + t/2 = (5/3) t and e = 3/5.

  43. Note: intermixed/nested logical dependencies (Figure: two instruction sequences, 4. 5. 6. 7. and 5. 6. 7., with nested dependency arcs.) • The logical dependency induced by 5 on 7 has no effect. • The logical dependency induced by 4 on 7 has no effect.

  44. Cache faults • All the performance evaluations must be corrected taking into account also the effects of cache faults (Instruction Cache, Data Cache) on completion time. • The added penalty for block transfer (data): • Worst case: exactly the same penalty studied for sequential CPUs (see Cache Prerequisites), • Best case: no penalty, fully overlapped to instruction pipelining.

  45. Compiler optimizations • Minimize the impact of branch degradations: • DELAYED BRANCH = try to fill the branch bubbles with useful instructions • Annotated instructions • Minimize the impact of logical dependencies degradations: • CODE MOTION = try to increase the distance of logical dependencies • Preserve the program semantics, i.e. apply only transformations that satisfy the Bernstein conditions for correct parallelization.

  46. Bernstein conditions Given the sequential computation: R1 = f1(D1) ; R2 = f2(D2) it can be transformed into the equivalent parallel computation: R1 = f1(D1) ∥ R2 = f2(D2) if all the following conditions hold: • R1 ∩ D2 = ∅ (true dependency) • R1 ∩ R2 = ∅ (output dependency) • D1 ∩ R2 = ∅ (antidependency) Notice: for the parallelism inside a microinstruction (see Firmware Prerequisites), only the first and second conditions are sufficient (synchronous model of computation). E.g. A + B → C ; D + E → A is equivalent to: A + B → C , D + E → A
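A small C sketch of the Bernstein test, with read/write sets encoded as bitmasks (the encoding is invented), reproducing the example above:

    #include <stdio.h>

    typedef unsigned set_t;   /* bit i set <=> variable i belongs to the set */

    /* all three conditions, for the general (asynchronous) case */
    static int bernstein(set_t D1, set_t R1, set_t D2, set_t R2) {
        return !(R1 & D2) && !(R1 & R2) && !(D1 & R2);
    }

    int main(void) {
        /* A = bit 0, B = bit 1, C = bit 2, D = bit 3, E = bit 4 */
        set_t D1 = 0x03, R1 = 0x04;   /* A + B -> C: reads {A,B}, writes {C} */
        set_t D2 = 0x18, R2 = 0x01;   /* D + E -> A: reads {D,E}, writes {A} */
        printf("parallel (general): %d\n", bernstein(D1, R1, D2, R2));  /* 0 */
        /* synchronous microinstruction model: first two conditions only */
        printf("parallel (synchronous): %d\n",
               !(R1 & D2) && !(R1 & R2));                               /* 1 */
        return 0;
    }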

  47. Delayed branch: example 1. L : LOAD R1, R0, R4 2. LOAD R2, R0, R5 3. ADD R4, R5, R4 4. STORE R3, R0, R4 5. INCR R0 6. IF < R0, R6, L 7. END becomes, with λ = 0 after the transformation: 1. LOAD R1, R0, R4 2. L : LOAD R2, R0, R5 3. ADD R4, R5, R4 4. STORE R3, R0, R4 5. INCR R0 6. IF < R0, R6, L, delayed_branch 7. LOAD R1, R0, R4 8. END (previous version: e = 3/5; with the delay slot filled, T = (3/2) t and e = 2/3)

  48. Increase logical dependency distance: example 1. LOAD R1, R0, R4 2. L : LOAD R2, R0, R5 3. ADD R4, R5, R4 4. STORE R3, R0, R4 5. INCR R0 6. IF < R0, R6, L, delayed_branch 7. LOAD R1, R0, R4 8. END becomes, after code motion (INCR moved before STORE): 1. LOAD R1, R0, R4 2. L : LOAD R2, R0, R5 3. ADD R4, R5, R4 4. INCR R0 5. STORE R3, R0, R4 6. IF < R0, R6, L, delayed_branch 7. LOAD R1, R0, R4 8. END For this transformation, the compiler initializes register R3 at the value Base_C – 1, where Base_C is the base address of the result array. d1 = 1/6 (previous version: e = 2/3) Tc = 6 · N · T = 8 · N · t, i.e. T = (4/3) t and e = 3/4

  49. Exploit both optimizations jointly 1. LOAD R1, R0, R4 2. L : LOAD R2, R0, R5 3. ADD R4, R5, R4 4. STORE R3, R0, R4 5. INCR R0 6. IF < R0, R6, L, delayed_branch 7. LOAD R1, R0, R4 8. END becomes: 1. L : LOAD R1, R0, R4 2. LOAD R2, R0, R5 3. INCR R0 4. ADD R4, R5, R4 5. IF < R0, R6, L, delayed_branch 6. STORE R3, R0, R4 7. END (previous version: e = 3/4) Tc = 6 · N · T = 7 · N · t, i.e. T = (7/6) t and e = 6/7

  50. Exercises • Compile, with optimizations, and evaluate the completion time of a program that computes the matrix-vector product (operands: int A[M][M], int B[M]; result: int C[M], where M = 10^4) on a D-RISC Pipeline machine. The evaluation must include the cache fault effects, assuming a Data Cache of 64K words capacity, 8-word blocks, operating on demand. • Consider the curves of slides 33 and 34. Explain their shapes in a qualitative way (e.g. why e tends asymptotically to an upper bound value as TG tends to infinity, …).
