Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
3. Run-time Support of Interprocess Communications


Presentation Transcript


1. Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
3. Run-time Support of Interprocess Communications

2. From the Course Objectives
• Provide a solid knowledge framework of concepts and techniques in high-performance computer architecture
• Organization and structure of enabling platforms based on parallel architectures
• Support to parallel programming models and software development tools
• Performance evaluation (cost models)
• Methodology for studying existing and future systems
• Technology: state-of-the-art and trends
• Parallel processors
• Multiprocessors
• Multicore / manycore / … / GPU
• Shared vs distributed memory architectures
• Programming models and their support
• General-purpose vs dedicated platforms

3. HPC enabling platforms
[Diagram: several CPUs connected to main memory and/or cache levels]
• Shared memory multiprocessors
• Various types (SMP, NUMA, …)
• From simple to very sophisticated memory organizations
• Impact on the programming model and/or process/threads run-time support

4. HPC enabling platforms: shared and distributed memory architectures
[Diagram: Instruction Level Parallelism within each CPU (pipeline, superscalar, multithreading, …); a shared memory multiprocessor (CPUs and memory M); a distributed memory multicomputer (PC cluster, Farm, Data Centre, …) connected by a “limited degree” (“one-to-few”) interconnection network]

5. Course Big Picture
• Applications: developed through user-friendly tools; independent from the process concept; architecture independent. Compiled / interpreted into:
• Processes: parallel program as a collection of cooperating processes (message passing and/or shared variables). Compiled / interpreted into a program executable by:
• Assembler / Firmware: uniprocessor (Instruction Level Parallelism), shared memory multiprocessor (SMP, NUMA, …), distributed memory (Cluster, MPP, …)
• Architecture 1 … Architecture m: run-time support to process cooperation, distinct and different for each architecture
• Hardware

6. Interprocess communication model
• The parallel architecture will be studied in terms of:
• Firmware architecture and operating system impact
• Cost model / performance evaluation
• Run-time support to process cooperation / concurrency mechanisms
• Without losing generality, we will assume that the intermediate version of a parallel program follows the message-passing model.

7. Message passing model
[Diagram: Source process → Channel of type T → Destination process (variables of type T)]
send (channel_identifier, message_value)
receive (channel_identifier, target_variable)
• send, receive primitives: commands of a concurrent language
• Final effect of a communication: target variable = message value
• Communication channels of a parallel program are identified by unique names.
• Typed communication channels: T = Type (message) = Type (target variable)
• For each channel, the message size (length) L is known statically
• Known asynchrony degree: constant ≥ 0 (= 0: synchronous channel)
• Communication forms:
• Symmetric (one-to-one), asymmetric (many-to-one)
• Constant or variable names of channels
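To fix these conventions concretely, here is a minimal C sketch of the primitives' interface; all names (channel_t, ch_send, ch_recv) are hypothetical, chosen only to mirror the semantics listed above.

#include <stddef.h>

/* Hypothetical C view of a typed channel; names are illustrative only. */
typedef struct {
    int    id;                /* unique channel name                      */
    size_t msg_size;          /* L = sizeof(T), known statically          */
    int    asynchrony_degree; /* constant k >= 0; k = 0 means synchronous */
} channel_t;

/* Final effect of a communication: target variable = message value. */
void ch_send(channel_t *ch, const void *message_value);
void ch_recv(channel_t *ch, void *target_variable);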

8. Symmetric and asymmetric channels
[Diagram: A → B over a symmetric channel; C and D → a single destination over an asymmetric channel]
Symmetric and asymmetric channels can be synchronous or asynchronous.

9. Example: farm parallel program

parallel EMITTER, WORKER[n], COLLECTOR;
channel new_task[n];

EMITTER ::
  channel in input_stream (asynchrony_degree), worker_free (1);
  channel out var scheduled;
  item1 x;
  while true do {
    receive (input_stream, x);
    receive (worker_free, scheduled);
    send (scheduled, x)
  }

WORKER[i] ::
  channel in new_task[i] (1);
  channel out worker_free, result;
  item1 x, y;
  send (worker_free, new_task[i]);
  while true do {
    receive (new_task[i], x);
    send (worker_free, new_task[i]);
    y = F(x);
    send (result, y)
  }

COLLECTOR ::
  channel in result (1);
  channel out output_stream;
  item1 y;
  while true do {
    receive (result, y);
    send (output_stream, y)
  }

• This program makes use of:
• symmetric channels (e.g. input_stream, output_stream)
• asymmetric channels (e.g. worker_free, result)
• constant-name channels (e.g. input_stream)
• variable-name channels (e.g. scheduled)
• array of processes (WORKER[n])
• array of channels (new_task[n])
• In the study of run-time support, the focus will be on the basic case: symmetric, constant-name, scalar channels.

10. Cost models and abstract architectures
• Performance parameters and cost models
• for each level, a cost model to evaluate the system performance properties
• Service time, bandwidth, efficiency, scalability, latency, response time, …, mean time between failures, …, power consumption, …
• Static vs dynamic techniques for performance optimization
• the importance of compiler technology
• abstract architecture vs physical/concrete architecture
• abstract architecture: a simplified view of the concrete one, able to describe the essential performance properties
• relationship between the abstract architecture and the cost model
• in order to perform optimizations, the compiler “sees” the abstract architecture
• often, the compiler simulates the execution on the abstract architecture

11. Example of abstract architecture
[Diagram: Process Graph for the parallel program = Abstract Architecture Graph (same topology); processing nodes over a fully interconnected (all-to-all) interconnection structure]
• Processing Node = (CPU, memory hierarchy, I/O)
• Same characteristics of the concrete architecture node
• Parallel program allocation onto the Abstract Architecture: one process per node
• Interprocess communication channels: one-to-one correspondence with the Abstract Architecture interconnection network channels

12. Cost model for interprocess communication
[Diagram: Source process → Channel of type T → Destination process (variables of type T)]
send (channel_identifier, message_value)
receive (channel_identifier, target_variable)
Tsend = Tsetup + L · Ttransm
• Tsend = average latency of interprocess communication
• delay needed for copying a message_value into the target_variable
• L = message length
• Tsetup, Ttransm: known parameters, evaluated for the concrete architecture
• Moreover, the cost model must include the characteristics of the possible overlapping of communication and internal calculation

13. Parameters Tsetup, Ttransm evaluated as functions of several characteristics of the concrete architecture: memory access time, interconnection network routing and flow-control strategies, CPU cost model, and so on.
[Diagram: same shared and distributed memory architectures as slide 4]

14. Abstract architectures and cost models
• Applications: developed through user-friendly tools; independent from the process concept; architecture independent. Compiled / interpreted into:
• Processes: parallel program as a collection of cooperating processes (message passing and/or shared variables). Compiled / interpreted into a program executable by:
• Assembler / Firmware: uniprocessor (Instruction Level Parallelism), shared memory multiprocessor (SMP, NUMA, …), distributed memory (Cluster, MPP, …)
• Architecture 1 … Architecture m: run-time support to process cooperation, distinct and different for each architecture; abstract architecture and associated cost models for the different concrete architectures: Ti = fi (a, b, c, d, …), Tj = fj (a, b, c, d, …)
• Hardware

15. Run-time support: shared variables, uniprocessor version
[Diagram: Source process → Channel of type T → Destination process (variables of type T)]
send (channel_identifier, message_value)
receive (channel_identifier, target_variable)
Run-time support implemented through shared variables. Initial case: uniprocessor. The run-time support for shared memory multiprocessors and distributed memory multicomputers will be derived through modifications of the uniprocessor version.

16. Run-time support: shared variables, uniprocessor version
[Diagram: Source process → Channel of type T → Destination process (variables of type T)]
send (channel_identifier, message_value)
receive (channel_identifier, target_variable)
Basic data structure of the run-time support for send-receive: the CHANNEL DESCRIPTOR
• sender_wait: bool (if true: sender process is in WAIT state)
• receiver_wait: bool (if true: receiver process is in WAIT state)
• buffer: FIFO queue of N = k + 1 positions of type T (k = asynchrony degree)
• message_length: integer
• sender_PCB_ref: reference to Process Control Block of sender process
• receiver_PCB_ref: reference to Process Control Block of receiver process
The channel descriptor is shared by the Sender process and the Receiver process: it belongs to the virtual memories (addressing spaces) of both processes.
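As a concrete reference, one possible C rendering of this descriptor (a sketch: names and example values such as pcb_t, K, and MSG_SIZE are illustrative assumptions, not part of the course material):

#include <stdbool.h>
#include <stddef.h>

#define K 4                     /* asynchrony degree (example value)      */
#define N (K + 1)               /* buffer positions                       */
#define MSG_SIZE 64             /* sizeof(T), known statically (example)  */

typedef struct pcb pcb_t;       /* Process Control Block (opaque here)    */

typedef struct {
    bool   sender_wait;         /* true: sender is in WAIT state          */
    bool   receiver_wait;       /* true: receiver is in WAIT state        */
    size_t message_length;      /* L                                      */
    char   buffer[N][MSG_SIZE]; /* FIFO queue of N = k + 1 positions of T */
    int    head, tail, size;    /* FIFO queue management                  */
    pcb_t *sender_PCB_ref;      /* reference to sender PCB                */
    pcb_t *receiver_PCB_ref;    /* reference to receiver PCB              */
} channel_descriptor_t;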

17. Run-time support variables
• Channel Descriptor: shared
• Sender PCB: shared
• Receiver PCB: shared
• Ready List: shared, list of PCBs of processes in READY state
• Channel Table: private, maps channel identifiers onto channel descriptor addresses
• Note: the Sender process and the Receiver process have distinct logical addresses for the Channel Descriptor in their respective logical addressing spaces.

18. Shared objects for run-time support
[Diagram: shared objects — Ready List (queue of PCBs), PCBA, PCBB, PCBC, …, Channel Descriptors CH1, CH2, CH3, … — accessible to Process A and Process B, each of which also has a private channel table (TAB_CHA, TAB_CHB) and its own local variables]

19. First version of send-receive support

send (ch_id, msg_address) ::
  CH address = TAB_CH (ch_id);
  put message value into CH_buffer (msg_address);
  if receiver_wait then { receiver_wait = false; wake_up partner process (receiver_PCB_ref) };
  if buffer_full then { sender_wait = true; process transition into WAIT state: context switching }

receive (ch_id, vtg_address) ::
  CH address = TAB_CH (ch_id);
  if buffer_empty then { receiver_wait = true; process transition into WAIT state: context switching }
  else {
    get message value from CH buffer and assign it to target variable (vtg_address);
    if sender_wait then { sender_wait = false; wake_up partner process (sender_PCB_ref) }
  }
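A hedged C sketch of this first version, building on the channel_descriptor_t sketched above. The helper routines (disable_interrupts, enable_interrupts, context_switch_to_wait, wake_up, fifo_put, fifo_get) are assumed to be supplied by the kernel-level run-time and are not defined in the course text; on a uniprocessor, indivisibility is obtained by executing with disabled interrupts.

/* Assumed kernel-supplied helpers (prototypes only): */
void disable_interrupts(void);
void enable_interrupts(void);
void context_switch_to_wait(void);  /* WAIT transition + context switch */
void wake_up(pcb_t *pcb);           /* put the PCB into the Ready List  */
void fifo_put(channel_descriptor_t *ch, const void *msg);
void fifo_get(channel_descriptor_t *ch, void *vtg);

extern channel_descriptor_t *TAB_CH[];  /* private channel table (assumed) */

void send(int ch_id, const void *msg) {
    channel_descriptor_t *ch = TAB_CH[ch_id];
    disable_interrupts();                    /* indivisibility            */
    fifo_put(ch, msg);                       /* first message copy        */
    if (ch->receiver_wait) {                 /* partner blocked in receive */
        ch->receiver_wait = false;
        wake_up(ch->receiver_PCB_ref);
    }
    if (ch->size == N) {                     /* asynchrony exhausted      */
        ch->sender_wait = true;
        context_switch_to_wait();            /* resumes at return address */
    }
    enable_interrupts();
}

void receive(int ch_id, void *vtg) {
    channel_descriptor_t *ch = TAB_CH[ch_id];
    disable_interrupts();
    while (ch->size == 0) {                  /* buffer empty              */
        ch->receiver_wait = true;
        context_switch_to_wait();            /* re-checked on resumption,
                                                as in the notes below     */
    }
    fifo_get(ch, vtg);                       /* second message copy       */
    if (ch->sender_wait) {
        ch->sender_wait = false;
        wake_up(ch->sender_PCB_ref);
    }
    enable_interrupts();
}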

20. Notes
• send, receive are indivisible procedures
• in uniprocessor machines: executed with disabled interrupts
• the send, receive procedures provide:
• pure communication between the Sender and Receiver processes, and
• low level scheduling of the Sender and Receiver processes (process state transitions, processor management)
• wake_up procedure: put the PCB into the Ready List
• Process transition into the WAIT state (context switching):
• send procedure: Sender process continuation = the procedure return address
• receive procedure: Receiver process continuation = the procedure address itself (i.e., the receive procedure is re-executed when the Receiver process is resumed)
• In this implementation, a communication implies two message copies (from message variable into CH buffer, from CH buffer into target variable)
• assuming the most efficient implementation model: entirely in user space;
• if implemented in kernel space: (several) additional copies are necessary for user-to-kernel and kernel-to-user parameter passing.

21. Cost model: communication latency
Communication latency: the latency needed to copy a message into the corresponding target variable.
Send latency: Tsend = Tsetup1 + L · Ttransm1
Receive latency: Treceive = Tsetup2 + L · Ttransm2
With good approximation: Tsetup1 = Tsetup2 = Tsetup, Ttransm1 = Ttransm2 = Ttransm
Communication latency: Lcom = Tsend + Treceive = 2 (Tsetup + L · Ttransm)
L · Ttransm: actions for the message copies into / from CH_Buffer; all the other actions are independent of the message length and are accounted for in Tsetup.
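A small worked example of the formula; the numbers are illustrative only, taken from inside the uniprocessor ranges quoted on slide 28 (Tsetup = 500 clock cycles, Ttransm = 50 cycles per word, L = 16 words):

#include <stdio.h>

int main(void) {
    double t_setup = 500.0;                  /* cycles (assumed value)     */
    double t_transm = 50.0;                  /* cycles per word (assumed)  */
    double L = 16.0;                         /* message length in words    */
    double t_send = t_setup + L * t_transm;  /* 500 + 800 = 1300 cycles    */
    double l_com = 2.0 * t_send;             /* two copies: 2600 cycles    */
    printf("Tsend = %.0f cycles, Lcom = %.0f cycles\n", t_send, l_com);
    return 0;
}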

22. “Zero-copy” communication support
• In user space, we can reduce the number of message copies per communication to the minimum: just one copy (“zero-copy” stands for “zero additional copies”)
• Direct copy of the message value into the target variable, “skipping” the channel buffer
• The target variable becomes a shared variable (statically or dynamically shared)
• The channel descriptor doesn’t contain a buffer queue of message values: it contains all the synchronization and scheduling information, and references to target variables
• Easy implementation of zero-copy communication for:
• synchronous channels,
• asynchronous channels when the destination process is WAITing;
• in any case, Receiver process continuation = return address of the receive procedure (as in the send procedure).

23. Zero-copy asynchronous communication
• FIFO queue of target variables implemented in the destination process addressing space
• i.e., the Receiver process refers to a different copy of the target variable after every receive execution: the Receiver is compiled in this way;
• in a k-asynchronous communication, there is a queue of (k + 1) target variables which are statically allocated in the Receiver addressing space;
• these target variables are shared with the Sender process (thus, they are allocated – statically or dynamically – in the Sender process addressing space too).
• Channel descriptor:
• WAIT: bool (if true, the Sender or Receiver process is WAITing)
• message length: integer
• Buffer: FIFO queue of (reference to target variable, validity bit)
• PCB_ref: reference to the PCB of the WAITing process (if any)
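Again as an illustrative C sketch, reusing pcb_t and K from the earlier sketches; all names are assumptions:

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void *vtg_ref;               /* reference to one target variable copy  */
    bool  validity;              /* validity bit, used for mutual exclusion */
} q_vtg_entry_t;

typedef struct {
    bool          wait;          /* Sender or Receiver is WAITing          */
    size_t        message_length;
    q_vtg_entry_t buffer[K + 1]; /* FIFO queue Q_VTG of (reference, bit)   */
    int           ins, ext;      /* insertion / extraction pointers        */
    int           size;          /* current size of the queue              */
    pcb_t        *PCB_ref;       /* PCB of the WAITing process (if any)    */
} zc_channel_descriptor_t;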

24. Zero-copy implementation
Principle: the Sender copies the message into an instance of the target variable (in FIFO order), on condition that it is not currently used by the Receiver (validity bit used for mutual exclusion).
[Diagram: Channel Descriptor (WAIT, L, PCB_ref) with Buffer Q_VTG — a FIFO queue of (reference to target variable, validity bit) entries, with current size, insertion pointer and extraction pointer — referring to the array VTG[0] … VTG[k] of k + 1 target variable copies in the destination process addressing space]

25. Zero-copy implementation
• Send: copies the message into the target variable referred by the CH_Buffer Insertion Pointer, on condition that the validity bit is set, and modifies the CH_Buffer state accordingly.
• Receive: if the CH_Buffer is not empty, resets the validity bit of the CH_Buffer position referred by the CH_Buffer Extraction Pointer, and modifies the CH_Buffer state accordingly
• at this point, the Receiver process can utilize the copy of the target variable referred by the CH_Buffer Extraction Pointer directly (i.e., without copying it); when the target variable utilization is completed, the validity bit is set again in the CH_Buffer position.
During the target variable utilization by the Receiver, its value cannot be modified by the Sender (validity bit mechanism). Alternative solution for Receiver process compilation: loop unfolding of the Receiver loop (receive – utilize) and explicit synchronization. A sketch of the receive side follows.
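A hedged C sketch of this receive / utilize / release protocol, under the same assumptions as the earlier sketches (kernel-supplied helpers, interrupt-based indivisibility); the helper current_PCB and all other names are illustrative:

pcb_t *current_PCB(void);   /* assumed kernel helper: calling process's PCB */

/* Returns a reference to the next target-variable copy; the Receiver
   uses it directly (no copy) and calls zc_release when finished. */
q_vtg_entry_t *zc_receive(zc_channel_descriptor_t *ch) {
    disable_interrupts();
    if (ch->size == 0) {                   /* buffer empty: WAIT           */
        ch->wait = true;
        ch->PCB_ref = current_PCB();
        context_switch_to_wait();
    }
    q_vtg_entry_t *e = &ch->buffer[ch->ext];
    e->validity = false;                   /* reserved for utilization     */
    ch->ext = (ch->ext + 1) % (K + 1);     /* advance extraction pointer   */
    ch->size--;
    if (ch->wait) {                        /* Sender blocked on full buffer */
        ch->wait = false;
        wake_up(ch->PCB_ref);
    }
    enable_interrupts();
    return e;
}

/* Utilization completed: the Sender may reuse this position. */
void zc_release(zc_channel_descriptor_t *ch, q_vtg_entry_t *e) {
    disable_interrupts();
    e->validity = true;                    /* validity bit set again       */
    if (ch->wait) {                        /* Sender blocked on this bit   */
        ch->wait = false;
        wake_up(ch->PCB_ref);
    }
    enable_interrupts();
}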

26. Zero-copy implementation

send (ch_id, msg_address) ::
  CH address = TAB_CH (ch_id);
  if CH_Buffer [Insertion_Pointer].validity_bit = 0 then {
    wait = true;
    copy reference to Sender_PCB into CH.PCB_ref;
    process transition into WAIT state: context switching };
  copy message value into the target variable referred by CH_Buffer [Insertion_Pointer].reference_to_target_variable;
  modify CH_Buffer.Insertion_Pointer and CH_Buffer_Current_Size;
  if wait then { wait = false; wake_up partner process (CH.PCB_ref) };
  if buffer_full then {
    wait = true;
    copy reference to Sender_PCB into CH.PCB_ref;
    process transition into WAIT state: context switching }

27. Exercise
Write an equivalent representation of the following benchmark, using zero-copy communication on a k-asynchronous channel:

Receiver ::
  int A[N]; int v;
  channel in ch (k);
  for (i = 0; i < N; i++) {
    receive (ch, v);
    A[i] = A[i] + v
  }

The answer includes the definition of the receive procedure, and the Receiver code before and after the receive procedure for a correct utilization of zero-copy communication (send procedure: see the previous page).
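As a hint of the expected shape (a sketch only, not the official solution), the Receiver loop can be rewritten around the zc_receive / zc_release pair sketched on the previous pages, using the returned reference directly instead of a private copy of v:

#define N_ITEMS 1024   /* the N of the benchmark (example value) */

void receiver(zc_channel_descriptor_t *ch) {
    int A[N_ITEMS] = {0};
    for (int i = 0; i < N_ITEMS; i++) {
        q_vtg_entry_t *e = zc_receive(ch);  /* validity bit reset     */
        int *v = (int *)e->vtg_ref;         /* direct use, no copy    */
        A[i] = A[i] + *v;                   /* utilization            */
        zc_release(ch, e);                  /* validity bit set again */
    }
}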

28. Cost model for zero-copy communication
• Zero-copy implementation:
• the receive latency is negligible compared to the send latency: the Receiver doesn’t copy the message into the target variable;
• the message copy into the target variable is almost entirely paid by the Sender;
• the receive primitive overhead is limited to relatively few operations for synchronization.
• With good approximation, the latency of an interprocess communication is the send latency: Lcom = Tsend = Tsetup + L · Ttransm
• Typical values for uniprocessor systems:
• Tsetup = (10^2 – 10^3) τ
• Ttransm = (10^1 – 10^2) τ
• One or more orders of magnitude increase for parallel architectures, according to the characteristics of memory hierarchy, interconnection network, etc.

29. Zero-copy trade-off
• Trade-off of the zero-copy mechanism:
• receive latency is not paid,
• but some additional waiting phases can be introduced in the Sender behaviour (validity bit).
• Strategy:
• increase the asynchrony degree in order to reduce the probability that the Sender finds the validity bit still reset during a send
• parallel programs with good load balance are able to exploit this feature.

30. Statically vs dynamically shared objects
• In order to reduce the addressing space size
• e.g., the Sender addressing space as far as target variables are concerned (and also to improve generality of use and protection), we can provide mechanisms for the DYNAMIC allocation of VIRTUAL memories of processes
• e.g., the Sender acquires the target variable dynamically (with “copy” rights only) in its own addressing space, and releases the target variable as soon as the message copy has been done.
• A general mechanism, called CAPABILITY-based ADDRESSING, exists for this purpose:
• e.g., the Receiver passes to the Sender the Capability on the target variable, i.e. a “ticket” to acquire the target variable dynamically in a protected manner. The Sender loses the ticket after the strictly needed utilization.
• Capabilities are a mechanism to implement the References to indirectly shared objects (references to target variables, PCBs, … in the Channel Descriptor).

31. Indirectly-referred shared data structures
• The so called “shared pointers” problem
• Example:
• the Channel Descriptor is a shared object
• it is referred directly by the Sender and Receiver process, each process with its own logical address.
• Target variables (VTG), or PCBs (sender PCB, receiver PCB), are shared objects
• They are referred indirectly: the Channel Descriptor contains references (pointers) to VTGs and PCBs. Thus, these references (pointers) are shared objects themselves.
• PROBLEM: how to represent them?
• An apparently simple solution is: shared references (pointers) are physical addresses. Despite its popularity, many problems arise with this solution.
• If we wish to use logical addresses, the PROBLEM exists.
• Other example: pointed PCBs in the Ready List. Other notable examples will be met in parallel architectures.

32. Indirectly-referred shared data structures
Static solutions:
• Coinciding logical addresses: all the processes sharing an indirectly-referred shared data structure have the same logical address for that structure
• the shared pointer is equal to the coinciding logical address
• feasible, however there are problems of addressing space fragmentation
• Distinct logical addresses: each process has its own logical address for an indirectly-referred shared data structure
• more general solution, no fragmentation problems
• the shared pointer is a unique identifier of the indirectly-referred shared object
• each process transforms the identifier into its own logical address (private table)

33. Indirectly-referred shared data structures
Example of the static solutions for PCBs in the send-receive support:
• Coinciding logical addresses: all the PCBs have the same logical address in all the addressing spaces
• the CH fields Sender_PCB_ref and Receiver_PCB_ref contain the coinciding logical addresses of the sender PCB and receiver PCB
• Distinct logical addresses:
• the sender PCB and receiver PCB have unique identifiers (determined at compile time): Sender_PCB_id and Receiver_PCB_id
• these identifiers are contained in the CH fields Sender_PCB_ref and Receiver_PCB_ref
• when the sender wishes to wake up the receiver, it gets the Receiver_PCB_id from the CH field Receiver_PCB_ref, transforms it into a private logical address using a private table, and thus can refer to the receiver PCB (a sketch follows)
• analogous: for the manipulation of the sender PCB by the receiver process.
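A minimal C sketch of the distinct-logical-addresses technique (the table layout and names are assumptions for illustration; pcb_t and wake_up are reused from the earlier sketches):

#define MAX_SHARED_OBJECTS 256

/* Per-process private table: unique object identifier -> the logical
   address of that object in THIS process's addressing space. */
static pcb_t *private_PCB_table[MAX_SHARED_OBJECTS];

/* Executed by the sender inside the send run-time support. */
void wake_up_receiver(int receiver_PCB_id) {
    pcb_t *receiver_PCB = private_PCB_table[receiver_PCB_id]; /* translation */
    wake_up(receiver_PCB);         /* put the receiver PCB into the Ready List */
}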

34. Capabilities
Dynamic solution to the shared pointers problem:
• The indirectly-referred shared objects are not statically allocated in all the addressing spaces
• They are allocated / deallocated dynamically in the addressing spaces of the processes that need to use them
• Example:
• the Receiver PCB is statically allocated in the receiver addressing space only;
• when the Sender process wishes to wake up the receiver:
• it acquires dynamically the Receiver PCB into its addressing space, and
• manipulates such PCB properly,
• finally releases the Receiver PCB again, deallocating it from its addressing space.

35. Capabilities
Capability implementation:
• The shared pointer is an information item (capability) that enables the acquiring process to allocate space and to refer to the indirectly-referred shared object by means of logical addresses
• Formally, a capability is a couple (object identifier, access rights) and a mechanism to generate the object reference
• it is related to the concept of Protected Addressing Space
• In practice, the implementation of a shared pointer is the entry of the Address Relocation Table relative to the shared object
• this entry is added to the Relocation Table of the acquiring process, thus allowing this process to determine the logical address of the shared object.
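A hedged C sketch of this mechanism; all structures and the naive free-entry policy are illustrative assumptions (a real relocation table would be managed by the memory-management firmware/kernel):

#include <stdint.h>

typedef struct {
    uint64_t object_id;   /* identifier of the shared object              */
    unsigned rights;      /* access rights (e.g. copy-only)               */
    uint64_t frame;       /* relocation entry payload: physical frame     */
} capability_t;

#define TABLE_SIZE 1024
static capability_t relocation_table[TABLE_SIZE]; /* this process's table */
static int first_free = 0;                        /* naive free-entry policy */

/* Acquire: copy the capability into a free Relocation Table position;
   that position IS the logical page identifier of the object. */
int cap_acquire(capability_t cap) {
    int logical_page = first_free++;
    relocation_table[logical_page] = cap;
    return logical_page;          /* logical address now determinable */
}

/* Release: deallocate the object from this addressing space. */
void cap_release(int logical_page) {
    relocation_table[logical_page].rights = 0;  /* invalidate the entry */
}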

36. Capabilities: example of capability passing
X: A’s page to be allocated dynamically in B’s space (e.g. PCBA). S: shared structure through which the X-capability is passed from A to B (e.g. Channel Descriptor, field Sender_PCB_ref).
[Figs. 1–2: starting from the virtual memories MVA, MVB and the relocation tables TABRILA, TABRILB, A copies the X’s Relocation Table entry (CAP-X, at logical page identifier ipl-AX) into the shared structure S.]
[Figs. 3–4: B copies the X-capability into a free position of its Relocation Table. The logical page identifier of X in B’s space (ipl-BX) is equal to the Relocation Table position: thus the logical address of X in B’s space is now determined.]

37. Capabilities
• Several advantages:
• minimized addressing spaces,
• increased protection,
• solution of object allocation problems that are very difficult / impossible to be solved statically (notable cases will be met in parallel architecture run-time support)
• at low overhead
• few copies of capabilities, few operations to determine the logical address
• comparable (or less) with respect to the Distinct Logical Address technique and the Physical Address technique.
• An efficient technique for the implementation of new/malloc mechanisms in a concurrent context.
• An efficient alternative to kernel space solutions.
