Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda, David R. O’Hallaron
Carnegie Mellon University
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh

Generalized Data Transfers
[Figure: data items A, B, C in sending node memory; locations D, E, F in receiving node memory]

Address Relations
[Figure: sending node memory holding A, B, C; receiving node memory holding D, E, F]
Example: {(A,F),(B,D),(C,E)}
R = {(x,y) | data item at address x on sender is copied to address y on receiver}

Send/Recv Implementation
[Figure: message assembly on the sending node (A, B, C), message contents, data transfer, and message disassembly on the receiving node (D, E, F), driven by the relation {(A,F), (B,D), (C,E)}]
(also put and get communication models)
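The assembly and disassembly steps on this slide can be sketched in C as a gather on the sender and a scatter on the receiver. This is only an illustration, not the authors' library: the names `addr_pair`, `assemble`, and `disassemble` are hypothetical, and addresses are modeled as element indices into arrays of doubles rather than raw pointers.

```c
#include <assert.h>
#include <stddef.h>

/* One element of the address relation: sender index x, receiver index y.
 * Using element indices instead of raw addresses is an assumption. */
typedef struct { size_t x, y; } addr_pair;

/* Sender side: gather the items named by the relation into a
 * contiguous message buffer. Returns the number of items packed. */
size_t assemble(const double *send_mem, const addr_pair *rel,
                size_t n, double *msg)
{
    assert(send_mem && rel && msg);
    for (size_t i = 0; i < n; i++)
        msg[i] = send_mem[rel[i].x];
    return n;
}

/* Receiver side: scatter the message contents to the receiver
 * indices named by the same relation, in the same order. */
void disassemble(double *recv_mem, const addr_pair *rel,
                 size_t n, const double *msg)
{
    assert(recv_mem && rel && msg);
    for (size_t i = 0; i < n; i++)
        recv_mem[rel[i].y] = msg[i];
}
```

In a real send/recv setting the sender would need only the x values and the receiver only the y values; both sides share one array here to keep the sketch short.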

Storing Address Relations
Compute Address Relation - “Inspector” (Done Once)
while not done
  compute_address_pair(x,y)
  store_address_pair(x,y)
end while
Assemble Message - “Executor” (Repeated Many Times)
while not done
  get_address_pair(x,y)
  buffer[i++]=data[x]
end while
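A minimal C sketch of the inspector/executor split, assuming the relation being stored is B=TRANSPOSE(A) on a single n x n row-major array. The function names `inspect_transpose` and `execute_assemble` are hypothetical stand-ins for the slide's `compute_address_pair`/`store_address_pair` and `get_address_pair` loops.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { size_t x, y; } addr_pair;

/* Inspector: run once. Computes and stores every (x,y) pair for
 * B = TRANSPOSE(A) on an n x n row-major array (an assumed example).
 * Returns a malloc'ed array of n*n pairs, or NULL on failure. */
addr_pair *inspect_transpose(size_t n)
{
    addr_pair *rel = malloc(n * n * sizeof *rel);
    if (!rel)
        return NULL;
    size_t k = 0;
    for (size_t r = 0; r < n; r++)
        for (size_t c = 0; c < n; c++) {
            rel[k].x = r * n + c;   /* element (r,c) of A */
            rel[k].y = c * n + r;   /* element (c,r) of B */
            k++;
        }
    return rel;
}

/* Executor: run on every iteration. Walks the stored pairs and
 * packs the sender's data into the message buffer. */
void execute_assemble(const double *data, const addr_pair *rel,
                      size_t count, double *buffer)
{
    assert(data && rel && buffer);
    size_t i = 0;
    for (size_t k = 0; k < count; k++)
        buffer[i++] = data[rel[k].x];
}
```

The cost of `inspect_transpose` is paid once, while `execute_assemble` contains only loads and stores plus the fetch of each stored pair.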

Inspector/Executor [Saltz, et al.]
do i=1,1000
  call Work()
  call COPY()
  call Work()
enddo
[Figure: timelines for iterations i=1, i=2, i=3 comparing in-line computation with inspector/executor; the inspector appears only at i=1, the executor in every iteration]

Context: Array Assignments
dim A(N,N),B(N,N)
do i=1,1000
  call Work(A)
  call Work(B)
end
Abstraction: B=A (Array A assigned to Array B)
We concentrate on B=A and B=TRANSPOSE(A)
More general forms exist

Distributed Arrays
Regular block-cyclic distributions as in High Performance Fortran (HPF)
Examples: (*,CYCLIC), (*,BLOCK), (*,CYCLIC(k))
[Figure: a distribution, the elements processor 0 owns, and the local array on processor 0]
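For one distributed dimension, the owner and local index of a global index under CYCLIC(k) can be sketched as below. This uses the usual 0-based formulas for block-cyclic distributions; the function names are hypothetical, and BLOCK is treated as CYCLIC(k) with k = ceil(n/p) and CYCLIC as CYCLIC(1).

```c
#include <assert.h>
#include <stddef.h>

/* Processor that owns global index i (0-based) when a dimension is
 * distributed CYCLIC(k) over p processors. */
size_t cyclic_k_owner(size_t i, size_t k, size_t p)
{
    assert(k > 0 && p > 0);
    return (i / k) % p;
}

/* Position of global index i inside its owner's local array. */
size_t cyclic_k_local(size_t i, size_t k, size_t p)
{
    assert(k > 0 && p > 0);
    return (i / (k * p)) * k + i % k;
}

/* Block size that makes CYCLIC(k) behave like BLOCK for n elements
 * on p processors: k = ceil(n / p). */
size_t block_size(size_t n, size_t p)
{
    assert(p > 0);
    return (n + p - 1) / p;
}
```

For example, with 8 elements on 2 processors under CYCLIC(2), processor 0 holds global indices 0, 1, 4, 5, so global index 5 has local index 3.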

Representative Assignments
[Figure: the four assignments, shown as distributions (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*); labels: Data, Transpose]

Representing Address Relations
• General Purpose
• Space Efficiency
• Hardware Limited Performance
• In-line expansion

AAPAIR: Simple Representation
Relation: {(A,F),(B,D),(C,E)}
Encoding: (A,F), (B,D), (C,E)
Simple sequence of pointer pairs
PROBLEM: Space Efficiency
PROBLEM: Performance

AABLK: Run-length Encoding
Relation: {(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}
Encoding: (A,F,2), (B,D,2), (C,E,2)
Sequence of pointer, pointer, length triples
PROBLEM: Strided Access
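A sketch of how a copy loop might consume pointer, pointer, length triples. The struct and function names are hypothetical, addresses are element indices, and the sender and receiver memories are two local arrays so the copy can be shown without a message in between.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One run: len consecutive items starting at sender index x are
 * copied to len consecutive receiver indices starting at y. */
typedef struct { size_t x, y, len; } aablk_run;

/* Copies every run from src to dst; returns the items copied.
 * src and dst must not overlap (they model two different nodes). */
size_t aablk_copy(const double *src, double *dst,
                  const aablk_run *runs, size_t nruns)
{
    assert(src && dst && runs);
    size_t copied = 0;
    for (size_t r = 0; r < nruns; r++) {
        memcpy(dst + runs[r].y, src + runs[r].x,
               runs[r].len * sizeof *src);
        copied += runs[r].len;
    }
    return copied;
}
```

Each run costs one triple regardless of its length, which is why this encoding helps contiguous data but not strided access, where every run has length 1.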

DMRLE: Handling Strides
Relation: {(A,F),(B,E),(C,D)}
B-A = C-B = g
E-F = D-E = h
Encoding: (A,F,1), (g,h,2)
Sequence of offset, offset, length triples
PROBLEM: Repeated Strides
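One plausible reading of the offset, offset, length triples is that each triple advances the current sender and receiver positions by its two offsets, `len` times, copying one item after each advance. The sketch below implements that reading, which is an assumption and not confirmed by the slide; names are hypothetical and positions start at index 0.

```c
#include <assert.h>
#include <stddef.h>

/* One run: repeat len times { x += dx; y += dy; copy one item }. */
typedef struct { ptrdiff_t dx, dy; size_t len; } dmrle_run;

/* Decodes the runs and copies src[x] to dst[y] for each decoded pair.
 * Returns the number of items copied. */
size_t dmrle_copy(const double *src, double *dst,
                  const dmrle_run *runs, size_t nruns)
{
    assert(src && dst && runs);
    ptrdiff_t x = 0, y = 0;   /* current positions, relative to base */
    size_t copied = 0;
    for (size_t r = 0; r < nruns; r++)
        for (size_t i = 0; i < runs[r].len; i++) {
            x += runs[r].dx;
            y += runs[r].dy;
            dst[y] = src[x];
            copied++;
        }
    return copied;
}
```

Under this reading the example relation needs two triples: (A,F,1) to reach the first pair, then (g,h,2) for the two strided steps.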

DMRLEC: Repeated Strides
Relation: {(A,F),(B,E),(C,D), (A’,F’),(B’,E’),(C’,D’)}
B-A = C-B = B’-A’ = C’-B’ = g
E-F = D-E = E’-F’ = D’-E’ = h
A’-C = u and F’-D = v
Table: 0: (A,F,1); 1: (g,h,2); 2: (u,v,1)
Index sequence: 0 1 2 1
Sequence of indices into table of offset, offset, length triples
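The table-plus-index-sequence idea can be sketched as follows, reusing the same assumed meaning of a triple as in DMRLE (advance both positions by the offsets, `len` times, copying after each advance). All names are hypothetical, and `size_t` indices are used for simplicity; a narrower index type would be the natural choice when space matters.

```c
#include <assert.h>
#include <stddef.h>

/* Table entry: repeat len times { x += dx; y += dy; copy one item }. */
typedef struct { ptrdiff_t dx, dy; size_t len; } dmrle_run;

/* Decodes a sequence of table indices and copies src[x] to dst[y]
 * for every decoded pair. Returns the number of items copied. */
size_t dmrlec_copy(const double *src, double *dst,
                   const dmrle_run *table, size_t ntable,
                   const size_t *seq, size_t nseq)
{
    assert(src && dst && table && seq);
    ptrdiff_t x = 0, y = 0;
    size_t copied = 0;
    for (size_t s = 0; s < nseq; s++) {
        assert(seq[s] < ntable);          /* index must name a table entry */
        const dmrle_run *run = &table[seq[s]];
        for (size_t i = 0; i < run->len; i++) {
            x += run->dx;
            y += run->dy;
            dst[y] = src[x];
            copied++;
        }
    }
    return copied;
}
```

With the slide's table and the sequence 0 1 2 1, entry 1 is stored once but used for both strided groups, so repeated strides cost one index each instead of one triple each.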

Address Relation Storage Costs

Copying & Superscalar Plateau
[Figure: instructions issued at time t; alternating load and store instructions, each followed by stalls, leave free issue slots; figure labels: n, p]
Plateau = np = 2*3 = 6
Plateau: maximum number of non load/store instructions before copy bandwidth suffers

Paragon: No Superscalar Plateau

Pentium 90: Clear Plateau

DEC 3K/400a: Complex Plateau

Measurement Details
• Portable library written in C
• Four representative assignments
• 512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
• Six machines
• Assembly and disassembly rates

Measurement Testcases
[Figure: the four testcase assignments, shown as distributions (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*); labels: Data, Transpose]

Performance: DEC 3K/400a

Performance: IBM 250 (PPC601)

Performance: IBM SP2 (PWR2)

Performance: Paragon

Performance: Pentium 90

Performance: Pentium 133

Conclusions
• Exploit “Superscalar Plateau” using compact address relation encodings
• Cheap enough even for scalar machines
• Generalized data transfer with hardware-limited throughput
• Many possible applications

Copying with Address Relations
[Figure: block diagram; components: Data Items, Copy Engine, Data Items, Sender Data Addresses, Receiver Data Addresses, Address Relation Decoder, Address Relation Addresses, Address Relation Data]

A Simple Copy Engine
[Figure: block diagram; a Comm. System between two sides, each with Data, a Copy Engine, and a Decoder fed by Address Relation Addresses and Address Relation Data; the decoders produce Sender Data Adx and Receiver Data Adx]
