Lecture on High Performance Processor Architecture (CS05162)

DLP Architecture Case Study: Stream Processor
Xu Guang (xuguang5@mail.ustc.edu.cn)
Fall 2007
Department of Computer Science and Technology, University of Science and Technology of China

Discussion Outline • Motivation • Related work • Imagine • Conclusion • TPA-PD • Future work CS of USTC

Motivation • VLSI technology • More ALUs; computation is relatively cheap • Keeping them fed is hard • The problem is bandwidth • Energy • Delay

Motivation • Data-level parallel (DLP) applications • Media applications • Real-time graphics • Signal processing • Video processing • Scientific computing • Application characteristics • Dense computation • Parallelism • Locality

Motivation • Application characteristics • Poorly matched to conventional architectures • Cache • Instruction-level parallelism • Few arithmetic units • Well matched to modern VLSI technology • Lots (100s to 1000s) of ALUs fit on a single chip • Communication bandwidth is the scarce resource

Related work • Vector • Large set of temporary values • DSP • Signal register file • GPU • Special purpose • General purpose processor • ILP • Cache

Related work • Imagine • Prototype HW • 2002 prototype processor • 256 mm² die in 150 nm, 21M transistors • Collaboration with TI ASIC • SW based on StreamC/KernelC • Stream scheduler • Communication scheduler • Merrimac • Stream supercomputer • For scientific applications • No prototype

Related work • Cell • Prototype HW • 221 mm² in 90 nm, 234M transistors • AMD & ATI • Stream processor • NUDT • Fei Teng 64 • USTC • ACSA • TPA

Programming model • Stream • Organized as a sequence of records • Simple • Complex • Ordered • Finite-length • Vs. array • Used in order

Stream type • Basic stream: an array of records • Derived stream: a reference to a subset of records in a basic stream • stream<type> name = basic-stream(start, end, dataDependence, access pattern);

Stream type • Sequential access pattern: y = x(start, end) • Strided access pattern: y = x(start, end, dataDependence, stride) • Indexed access pattern: y = x(start, end, dataDependence, indexStream)
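
The three access patterns for derived streams can be illustrated with a small Python sketch. This is only a model under stated assumptions: the range is taken as half-open [start, end), the index stream is read as offsets from start, and the dataDependence argument is left out because the slides do not define it. The function names are hypothetical and are not StreamC syntax.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sequential(basic: Sequence[T], start: int, end: int) -> List[T]:
    """Derived stream of records start .. end-1 (half-open range is an assumption)."""
    return list(basic[start:end])


def strided(basic: Sequence[T], start: int, end: int, stride: int) -> List[T]:
    """Derived stream of every stride-th record in [start, end)."""
    return list(basic[start:end:stride])


def indexed(basic: Sequence[T], start: int, end: int,
            index_stream: Sequence[int]) -> List[T]:
    """Derived stream of record start+i for each i in index_stream (assumed semantics)."""
    derived = []
    for offset in index_stream:
        position = start + offset
        if not start <= position < end:
            raise IndexError(f"offset {offset} falls outside [start, end)")
        derived.append(basic[position])
    return derived


x = list(range(10, 20))            # basic stream of ten records
y_seq = sequential(x, 2, 6)
y_str = strided(x, 0, 10, 3)
y_idx = indexed(x, 2, 8, [3, 0, 3])
```

In each case the derived stream is a view description (range plus pattern) over the same basic stream; the sketch copies the records only to keep the example short.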

Programming model • Stream program • A kernel is a program that performs the same set of operations on each input stream record and produces one or more output streams • Express a computation as streams flowing through kernels • Represent applications as a set of computation kernels that consume and produce data streams

Programming example • Stereo depth extractor application • Figure: Image 0 and Image 1 (input data) each pass through two convolve kernels; a SAD kernel combines the two results into the Depth Map (output data) • Operations within a kernel operate on local data • Streams expose data parallelism
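
The kernel graph of the stereo depth extractor can be sketched with Python generators, so that each intermediate stream flows straight from producer to consumer. This is a toy one-dimensional stand-in, not the real application: the filter, the per-element absolute difference used in place of a full SAD window search, and all function names are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def convolve(stream: Iterable[float], taps: List[float]) -> Iterator[float]:
    """Toy 1-D filter kernel: one output per full window of len(taps) inputs."""
    window: List[float] = []
    for sample in stream:
        window.append(sample)
        if len(window) > len(taps):
            window.pop(0)
        if len(window) == len(taps):
            # Taps are applied oldest-sample-first; data stays local to the kernel.
            yield sum(w * t for w, t in zip(window, taps))


def sad(left: Iterable[float], right: Iterable[float]) -> Iterator[float]:
    """Simplified SAD kernel: absolute difference of matching records."""
    for a, b in zip(left, right):
        yield abs(a - b)


def depth_pipeline(image0: List[float], image1: List[float],
                   taps: List[float]) -> List[float]:
    """Image 0 and Image 1 each pass through two convolve kernels, then SAD."""
    left = convolve(convolve(image0, taps), taps)
    right = convolve(convolve(image1, taps), taps)
    return list(sad(left, right))


depth_map = depth_pipeline([1, 2, 3, 4], [0, 0, 0, 4], taps=[1, 1])
```

Because the kernels are chained generators, no intermediate stream is ever stored as a whole, which mirrors the producer-consumer flow shown in the diagram.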

Programming example • Vector add
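
The code for the vector add example did not survive in this copy of the slides. As a rough illustration only, a stream program for it might look like the Python sketch below: a kernel that handles one record at a time and a stream-level driver that runs it over the input streams. The names `run_kernel` and `vadd` are hypothetical and do not reproduce the StreamC/KernelC source.

```python
from typing import Callable, List, Sequence


def run_kernel(kernel: Callable[..., float],
               *inputs: Sequence[float]) -> List[float]:
    """Stream level: apply the same kernel to each record of the input streams."""
    return [kernel(*records) for records in zip(*inputs)]


def vadd(a: float, b: float) -> float:
    """Kernel level: the work done for one pair of input records."""
    return a + b


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = run_kernel(vadd, a, b)   # output stream
```

The split mirrors the two programming levels on the slides: the kernel never sees the whole stream, and the stream level never looks inside a record.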

Why Organize an Application This Way? • Expose parallelism at three levels • ILP within kernels • DLP across stream elements • TLP across sub-streams and across kernels • Keeps ‘easy’ parallelism easy • Expose locality in two ways • Within a kernel: kernel locality • Between kernels: producer-consumer locality • Put another way, stream programs make communication explicit

Overall Imagine block diagram • Figure: Imagine Stream Processor, comprising the Stream Controller, Microcontroller, ALU Clusters 0 to 7, Stream Register File, Streaming Memory System (four SDRAM channels), and Network Interface, connected to the Host Processor and the Network

Instructions of Imagine • Stream level

Instructions of Imagine • Kernel level • Integer/floating-point arithmetic • Bitwise logic and comparison • Data permutation • Stream in/out • Loop control • Operate on packed data • Like short vectors (SIMD)

Architecture of Imagine: Host Interface • All interactions between the host processor and the Imagine core occur via the Host Interface • Stream instructions can be loaded onto Imagine • Several status words can be read from Imagine • Individual data words can be read from Imagine • Entire data streams can be transferred to/from Imagine

Architecture of Imagine: Stream Controller • Responsibilities • Handles the data flow between, and the control of, all modules on the chip • Controls which stream instructions to issue and when they execute • Parameters • 32-entry instruction queue

Architecture of Imagine: SRF • Responsibilities • Source and destination for all memory operations • Source and sink of data for the arithmetic clusters and the network router • Parameters • 8 banks, 128 KB • Software controlled • SDR • SCR • Reads/writes a block at a time

Architecture of Imagine: Memory System • Responsibilities • Load data from memory to SRF • Store data from SRF to memory • Parameters • 4 memory banks • 2 address generators • MSCR

Architecture of Imagine: Microcontroller • Responsibilities • Passing parameters from/to the host interface • Loading microprograms into its microcode store • Controlling the execution of the microprograms on the arithmetic clusters • Parameters • 1024×VLIW • 32 × 32-bit register file (UCRF) • 2 × 1-bit register file (UCONDRF)

Architecture of Imagine: Cluster • Responsibilities • 8 clusters perform identical operations in parallel • Controlled by the microcontroller • Parameters • Each cluster has • 3 adders, 2 multipliers, 1 divider • 1 scratchpad • 1 jukebox and 1 valid unit • 1 comm unit • Internal data path width of 32 bits • Each functional unit has its own local register files (LRF) • All functional units accept 32-bit inputs and produce 32-bit results • For floating-point operations, the units use the IEEE floating-point format
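
As a quick check on the cluster parameters, the sketch below counts the arithmetic units across the chip and models the lockstep behaviour of the eight clusters. It is a toy model under stated assumptions: each batch of eight consecutive records is spread one per cluster, and a single Python callable stands in for the whole kernel. The names are hypothetical.

```python
from typing import Callable, List

NUM_CLUSTERS = 8
UNITS_PER_CLUSTER = {"adder": 3, "multiplier": 2, "divider": 1}


def total_arithmetic_units() -> int:
    """Arithmetic units on the chip: clusters times units per cluster."""
    return NUM_CLUSTERS * sum(UNITS_PER_CLUSTER.values())


def run_simd(op: Callable[[int], int], stream: List[int]) -> List[int]:
    """Apply one operation to every record, eight records per lockstep step."""
    if len(stream) % NUM_CLUSTERS != 0:
        raise ValueError("toy model: stream length must be a multiple of 8")
    out: List[int] = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]     # one record per cluster
        out.extend(op(record) for record in batch)   # same op in every cluster
    return out


units = total_arithmetic_units()
doubled = run_simd(lambda record: record * 2, list(range(16)))
```

With 3 adders, 2 multipliers, and 1 divider in each of 8 clusters, the count comes to 48 arithmetic units.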

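Architecture of Imagine: Cluster • Figure: three adders (+), two multipliers (*), one divider (/), and a communication unit (CU), each with local register files, joined by a cross point with ports from the SRF, to the SRF, and to the intercluster network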

Streams expose kernel locality missed by vectors • Stream • Traverse operations first • All operations for one record, then the next record • Smaller working set of temporary values • Store and access whole records as a unit • Spatial locality of memory references • Vector • Traverse records first • All records for one operation, then the next operation • Large set of temporary values • Group like elements of records into vectors • Read one word of each record at a time
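
The difference in traversal order can be made concrete with a two-operation computation, out = a * b + c. The sketch below evaluates it both ways and reports how many temporary values are live at once. It is a simplified model written for this comparison; the function names are hypothetical.

```python
from typing import List, Tuple


def stream_order(a: List[int], b: List[int], c: List[int]) -> Tuple[List[int], int]:
    """All operations for one record, then the next record."""
    out: List[int] = []
    peak_temporaries = 0
    for x, y, z in zip(a, b, c):
        product = x * y                 # the only live temporary
        peak_temporaries = max(peak_temporaries, 1)
        out.append(product + z)
    return out, peak_temporaries


def vector_order(a: List[int], b: List[int], c: List[int]) -> Tuple[List[int], int]:
    """All records for one operation, then the next operation."""
    products = [x * y for x, y in zip(a, b)]         # one temporary per record
    out = [p + z for p, z in zip(products, c)]
    return out, len(products)


a, b, c = [1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1]
stream_result, stream_temps = stream_order(a, b, c)
vector_result, vector_temps = vector_order(a, b, c)
```

Both orders give the same results, but the stream order keeps one temporary alive at a time while the vector order needs one per record, which is the "large set of temporary values" the slide refers to.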

Mapping App to Imagine • Mapping

Mapping App to Imagine • Compile • Stream level

Mapping App to Imagine • Kernel level

Mapping App to Imagine • Before

Mapping App to Imagine • Stream instruction • Memop: load stream input from memory to SRF • Memop: load stream ucode from memory to SRF

Mapping App to Imagine • SRF data

Mapping App to Imagine • Stream instruction • Load ucode: fetch ucode from SRF to microcode store

Mapping App to Imagine • Stream instruction • Cluster op: execute ucode vadd

Mapping App to Imagine • Stream instruction • Memop: store stream output from SRF to memory
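
The sequence of stream instructions used to run vadd can be replayed on a toy model of the chip. The sketch below is only an illustration: memory, the SRF, and the microcode store are plain dictionaries, the microcode is a Python callable, and the class and method names are hypothetical rather than Imagine's actual instruction names.

```python
from typing import Any, Callable, Dict, List


class ToyImagine:
    """Toy model: off-chip memory, SRF, and microcode store as dictionaries."""

    def __init__(self, memory: Dict[str, Any]) -> None:
        self.memory = dict(memory)
        self.srf: Dict[str, Any] = {}
        self.microcode_store: Dict[str, Callable[..., Any]] = {}
        self.trace: List[str] = []

    def memop_load(self, name: str) -> None:
        self.srf[name] = self.memory[name]            # memory -> SRF
        self.trace.append(f"memop load {name}")

    def load_ucode(self, name: str) -> None:
        self.microcode_store[name] = self.srf[name]   # SRF -> microcode store
        self.trace.append(f"load ucode {name}")

    def cluster_op(self, ucode: str, inputs: List[str], output: str) -> None:
        kernel = self.microcode_store[ucode]
        streams = [self.srf[name] for name in inputs]
        self.srf[output] = [kernel(*records) for records in zip(*streams)]
        self.trace.append(f"cluster op {ucode}")

    def memop_store(self, name: str) -> None:
        self.memory[name] = self.srf[name]            # SRF -> memory
        self.trace.append(f"memop store {name}")


chip = ToyImagine({"a": [1, 2, 3], "b": [4, 5, 6], "vadd": lambda x, y: x + y})
chip.memop_load("a")                   # stream input to SRF
chip.memop_load("b")
chip.memop_load("vadd")                # ucode travels through the SRF as a stream
chip.load_ucode("vadd")
chip.cluster_op("vadd", ["a", "b"], "c")
chip.memop_store("c")                  # stream output back to memory
```

Note that the microcode itself is first loaded into the SRF like any other stream and only then moved to the microcode store, matching the order of the slides.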

Performance • Bandwidth hierarchy • Figure: SDRAM to Stream Register File at 2 GB/s; Stream Register File to ALU clusters at 32 GB/s; within the ALU clusters at 544 GB/s

Performance • The bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture
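
The bandwidth figures from the hierarchy diagram (2, 32, and 544 GB/s) can be turned into ratios to show what "fits" means. The sketch below assumes the usual reading of the diagram, with 2 GB/s at the memory level, 32 GB/s at the SRF, and 544 GB/s inside the clusters; the demand numbers in the usage lines are made up for illustration.

```python
from typing import Dict

# Assumed assignment of the three figures to the three levels.
SUPPLY_GBPS: Dict[str, float] = {"memory": 2.0, "srf": 32.0, "cluster": 544.0}


def ratios_to_memory(supply: Dict[str, float]) -> Dict[str, float]:
    """Bandwidth of each level relative to the memory level."""
    return {level: bw / supply["memory"] for level, bw in supply.items()}


def fits(demand: Dict[str, float], supply: Dict[str, float] = SUPPLY_GBPS) -> bool:
    """A program fits if its demand at every level stays within the supply."""
    return all(demand[level] <= supply[level] for level in supply)


ratios = ratios_to_memory(SUPPLY_GBPS)
ok = fits({"memory": 1.5, "srf": 20.0, "cluster": 400.0})        # hypothetical demand
too_hungry = fits({"memory": 3.0, "srf": 20.0, "cluster": 400.0})
```

Under this reading the hierarchy has a 1 : 16 : 272 shape, so a program matches it well when each word fetched from memory is reused roughly 16 times at the SRF level and 272 times inside the clusters.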

Performance • Figure: results for 16-bit kernels, floating-point kernels, 16-bit applications, and floating-point applications

Performance • Power • GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

Conclusion • Performance • Compound stream operations realize >10 GOPS on key applications • Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board) • Power • Three-level register hierarchy gives 2-10 GOPS/W

Conclusion • Disadvantage • Programming model • Rewrite application • Programmers need to know details of hardware

TPA-PD Motivation • Tiled • Wire delay constraints • Difficulty of the centralized structures that dominate today’s designs • Architectural partitioning encourages regularity and re-use • Application • Media applications • Scientific computing • Irregular control and data access

TPA-PD • Instruction set • Stream level • Kernel level • Explicit Data Graph Execution (EDGE) • Block-oriented • Direct target encoding
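
The idea behind direct target encoding can be sketched in a few lines: each instruction in a block names the consumers of its result instead of a destination register, and fires once its operands have arrived. The sketch below illustrates that general EDGE idea only. It is not TPA-PD's actual instruction encoding, and every name in it is hypothetical.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Inst:
    op: Callable[[int, int], int]
    targets: List[Tuple]               # ("inst", index, slot) or ("out", name)
    operands: Dict[int, int] = field(default_factory=dict)


def run_block(block: List[Inst],
              block_inputs: List[Tuple[int, int, int]]) -> Dict[str, int]:
    """Dataflow execution of one block; each operand slot receives one value."""
    outputs: Dict[str, int] = {}
    pending = list(block_inputs)       # (instruction index, operand slot, value)
    while pending:
        index, slot, value = pending.pop()
        inst = block[index]
        inst.operands[slot] = value
        if len(inst.operands) == 2:    # both operands have arrived: fire
            result = inst.op(inst.operands[0], inst.operands[1])
            for target in inst.targets:
                if target[0] == "out":
                    outputs[target[1]] = result
                else:                  # forward the result straight to a consumer
                    pending.append((target[1], target[2], result))
    return outputs


def make_block() -> List[Inst]:
    """Block computing (a + b) * (a - b)."""
    return [
        Inst(operator.add, [("inst", 2, 0)]),
        Inst(operator.sub, [("inst", 2, 1)]),
        Inst(operator.mul, [("out", "r")]),
    ]


a, b = 5, 3
result = run_block(make_block(), [(0, 0, a), (0, 1, b), (1, 0, a), (1, 1, b)])
```

No instruction in the block reads a shared register; the producer knows where its value is needed, which is what allows execution without centralized control structures.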

TPA-PD • No centralized control
