
Lecture on High Performance Processor Architecture ( CS05162 )


Presentation Transcript


  1. Lecture on High Performance Processor Architecture (CS05162) TRIPS: A Polymorphous Tiled Architecture Ren Yongqing renyq@mail.ustc.edu.cn Fall 2007 University of Science and Technology of China Department of Computer Science and Technology

  2. Outline • Background • TRIPS ISA & Architecture Introduction • TRIPS Compilation & Scheduling • TRIPS Polymorphism • TRIPS ASIC Implementation • Problems USTC CS AN Hong

  3. Key Trends • Power • Slowing (stopping?) frequency increases • Wire Delays • Reliability • Reconfigurable

  4. SuperScalar Core

  5. Conventional Microarchitectures

  6. TRIPS Project Goals • Technology scalable processor and memory architectures • Techniques to scale to 35nm and beyond • Enable high clock rates if desired • High design productivity through replication • Good performance across diverse workloads • Exploit instruction, thread, and data level parallelism • Work with standard programming models • Power-efficient instruction level parallelism • Demonstrate via custom hardware prototype • Implement with small design team • Evaluate, identify bottlenecks, tune microarchitecture

  7. Key Features • EDGE ISA • Block-oriented instruction set architecture • Helps reduce bottlenecks and expose ILP • Tiled Microarchitecture • Modular design • No global wires • TRIPS Processor • Distributed processor design • Dataflow graph execution engine • NUCA L2 Cache • Distributed cache design

  8. TRIPS Chip • 2 TRIPS Processors • NUCA L2 Cache • 1 MB, 16 banks • On-Chip Network (OCN) • 2D mesh network • Replaces on-chip bus • Misc Controllers • 2 DDR SDRAM controllers • 2 DMA controllers • External bus controller • C2C network controller

  9. TRIPS Processor • Want an aggressive, general-purpose processor • Up to 16 instructions per cycle • Up to 4 loads and stores per cycle • Up to 64 outstanding L1 data cache misses • Up to 1024 dynamically executed instructions • Up to 4 simultaneous multithreading (SMT) threads • But existing microarchitectures don’t scale well • Structures become large, multi-ported, and slow • Lots of overhead to convert from sequential instruction semantics • Vulnerable to speculation hazards • TRIPS introduces a new microarchitecture and ISA

  10. EDGE ISA • Explicit Data Graph Execution (EDGE) • Block-Oriented • Atomically fetch, execute, and commit whole blocks of instructions • Programs are partitioned into blocks • Each block holds dozens of instructions • Sequential execution semantics at the block level • Dataflow execution semantics inside each block • Direct Target Encoding • Encode instructions so that results go directly to the instruction(s) that will consume them • No need to go through centralized register file and rename logic
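The dataflow semantics above can be sketched in a few lines of Python (an illustrative model we wrote, not the hardware): each instruction names only its consumers (direct target encoding), fires when all its operands have arrived, and forwards its result straight to them with no register file in between.

```python
class Inst:
    def __init__(self, op, n_operands, targets):
        self.op = op              # function computing the result
        self.needed = n_operands  # operand count still outstanding
        self.operands = []
        self.targets = targets    # consumer instruction indices (direct targets)

def execute_block(insts, inputs):
    """inputs: list of (inst_index, value) pairs injected by register reads."""
    ready, results = [], {}
    def deliver(idx, value):
        inst = insts[idx]
        inst.operands.append(value)
        inst.needed -= 1
        if inst.needed == 0:
            ready.append(idx)
    for idx, value in inputs:
        deliver(idx, value)
    while ready:                  # fire instructions whose operands have all arrived
        idx = ready.pop()
        inst = insts[idx]
        results[idx] = inst.op(*inst.operands)
        for t in inst.targets:    # result flows directly to consumers
            deliver(t, results[idx])
    return results

# compute (a + b) * a with a=3, b=4: inst 0 adds and targets inst 1
block = [Inst(lambda x, y: x + y, 2, targets=[1]),
         Inst(lambda x, y: x * y, 2, targets=[])]
out = execute_block(block, [(0, 3), (0, 4), (1, 3)])
print(out)   # {0: 7, 1: 21}
```

Note how no rename logic is needed: the producer-to-consumer edges are explicit in the encoding, which is exactly the point of the bullet above.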

  11. Block Formation • Basic blocks are often too small (just a few instructions) • Predication allows larger hyperblocks to be created • Loop unrolling and function inlining also help • TRIPS blocks can hold up to 128 instructions • Large blocks improve fetch bandwidth and expose ILP • Hard-to-predict branches can sometimes be hidden inside a hyperblock

  12. TRIPS Block Format [Figure: a block as stored, starting at the PC — a 128-byte header chunk followed by 128-byte instruction chunks 0 through 3] • Each block is formed from two to five 128-byte program “chunks” • Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache • The header chunk includes a block header (execution flags plus a store mask) and register read/write instructions • Each instruction chunk holds 32 4-byte instructions (including NOPs) • A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions
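The chunk arithmetic on this slide is easy to check with a short sketch (the constants come from the slide; the function names are ours):

```python
CHUNK_BYTES = 128
INSTS_PER_CHUNK = CHUNK_BYTES // 4          # 32 four-byte instructions

def block_chunks(n_insts):
    """Chunks a block of n_insts regular instructions occupies as stored."""
    assert 1 <= n_insts <= 128
    inst_chunks = -(-n_insts // INSTS_PER_CHUNK)   # ceiling division
    return 1 + inst_chunks                          # plus the header chunk

def icache_bytes(n_insts):
    """Bytes the block occupies in the L1 I-cache (always padded to 5 chunks)."""
    return 5 * CHUNK_BYTES

print(block_chunks(40))    # 40 insts -> header + 2 instruction chunks = 3
print(block_chunks(128))   # maximal block -> all 5 chunks
print(icache_bytes(40))    # 640 bytes in the I-cache regardless of size
```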

  13. Processor Tiles • Partition all major structures into banks, distribute, and interconnect • Execution Tile (E) • 64-entry Instruction Queue bank • Single-issue execute pipeline • Register Tile (R) • 32-entry Register bank (per thread) • Data Tile (D) • 8KB Data Cache bank • LSQ and MHU banks • Instruction Tile (I) • 16KB Instruction Cache bank • Global Control Tile (G) • Tracks up to 8 blocks of instructions • Branch prediction & resolution logic

  14. Grid Processor Tiles and Interfaces [Figure: 5x5 tile array — 1 G-tile, 4 R-tiles, 5 I-tiles, 4 D-tiles, and 16 E-tiles] • GDN: global dispatch network • OPN: operand network • GSN: global status network • GCN: global control network

  15. Mapping TRIPS Blocks to the Microarchitecture [Figure: R-tile[3] holds the architecture register files (threads 0-3) and read/write queues; E-tile[3,3] holds instruction reservation stations organized as 8 frames of 8 slots with OP1/OP2 operand fields; D-tile[3] holds load/store entries indexed by LSID 0-31]

  16. Mapping TRIPS Blocks to the Microarchitecture: Block i mapped into Frame 0 [Figure: the header chunk fills the R-tile read/write queues; instruction chunks 0-3 fill frame 0 of the E-tile reservation stations]

  17. Mapping TRIPS Blocks to the Microarchitecture: Block i+1 mapped into Frame 1 [Figure: the next block's header and instruction chunks fill frame 1, alongside block i in frame 0]

  18. Mapping Target Identifiers to Reservation Stations [Figure: an ISA target identifier encodes type (2 bits), X (2 bits), Y (2 bits), and slot (3 bits); the 3-bit frame is assigned by the G-tile at runtime. Example: Target = 87, OP1 — type 10 (OP1), tile [10,11], slot 101, runtime frame 100]
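A hedged sketch of how such a target identifier might be unpacked: the field widths (type 2 bits, X 2, Y 2, slot 3) come from the slide, but the exact bit ordering and the flat 8-frames-by-8-slots station indexing are our assumptions for illustration.

```python
def decode_target(target9):
    """Split an assumed 9-bit target: [type:2][x:2][y:2][slot:3]."""
    slot = target9 & 0b111           # which of 8 station slots
    y    = (target9 >> 3) & 0b11     # tile row within the 4x4 E-array
    x    = (target9 >> 5) & 0b11     # tile column
    typ  = (target9 >> 7) & 0b11     # operand kind, e.g. OP1 vs OP2
    return typ, x, y, slot

def station_index(target9, frame):
    """Flat reservation-station index at the destination E-tile:
    8 frames x 8 slots per tile (assumed layout); frame comes from the
    G-tile at runtime, not from the instruction encoding."""
    _, _, _, slot = decode_target(target9)
    return frame * 8 + slot

typ, x, y, slot = decode_target(0b10_10_11_101)
# typ=0b10 (OP1), tile (x=2, y=3), slot 5; with runtime frame 4:
print(station_index(0b10_10_11_101, frame=4))   # 4*8 + 5 = 37
```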

  19. Block Fetch • Fetch commands sent to each Instruction Cache bank • The fetch pipeline is from 4 to 11 stages deep • A new block fetch can be initiated every 8 cycles • Instructions are fetched into Instruction Queue banks (chosen by the compiler) • EDGE ISA allows instructions to be fetched out-of-order

  20. Block Execution • Instructions execute (out-of-order) when all of their operands arrive • Intermediate values are sent from instruction to instruction • Register reads and writes access the register banks • Loads and stores access the data cache banks • Branch results go to the global controller • Up to 8 blocks can execute simultaneously

  21. Block Commit • Block completion is detected and reported to the global controller • If no exceptions occurred, the results may be committed • Writes are committed to register files • Stores are committed to cache or memory • Resources are deallocated after a commit acknowledgement

  22. Block Execution Timeline [Figure: fetch, execute, and commit of blocks Bi through Bi+8 pipelined across frames 0-7 over time; execution time is variable] • Execute/commit overlapped across multiple blocks • G-tile manages frames as a circular buffer • D-morph: 1 thread, 8 frames • T-morph: up to 4 threads, 2 frames each
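The circular-buffer frame management described here can be sketched as follows (a simplified model; in-order commit of the oldest block is from the slides, while the class and method names are ours):

```python
from collections import deque

class FrameManager:
    def __init__(self, n_frames=8):
        self.free = deque(range(n_frames))   # frames available for fetch
        self.active = deque()                # (frame, block_id), oldest first

    def allocate(self, block_id):
        """Map a newly fetched block into a frame, or stall if none free."""
        if not self.free:
            return None                      # all 8 frames busy: fetch stalls
        frame = self.free.popleft()
        self.active.append((frame, block_id))
        return frame

    def commit_oldest(self):
        """Blocks commit in program order: only the oldest may retire."""
        frame, block_id = self.active.popleft()
        self.free.append(frame)              # deallocate after commit ack
        return block_id

fm = FrameManager()
for b in range(8):
    fm.allocate(b)                 # fill all 8 frames (D-morph: one thread)
print(fm.allocate(8))              # None: the ninth block must wait
print(fm.commit_oldest())          # 0: the oldest block retires first
print(fm.allocate(8))              # 0: the freed frame is reused
```

T-morph would partition the same eight frames two per thread rather than giving all eight to one thread.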

  23. NUCA L2 Cache • Prototype has 1MB L2 cache divided into sixteen 64KB banks • 4x10 2D mesh topology • Links are 128 bits wide • Each processor can initiate 5 requests per cycle • Requests and replies are wormhole-routed across the network • 4 virtual channels prevent deadlocks • Can sustain over 100 bytes per cycle to the processors
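As an illustration of the banked, routed design (the line-interleaving scheme and line size below are our assumptions, not the prototype's actual mapping):

```python
BANKS = 16
LINE_BYTES = 64          # assumed L2 line size for illustration

def bank_of(addr):
    """Interleave consecutive cache lines across the 16 banks."""
    return (addr // LINE_BYTES) % BANKS

def xy_hops(src, dst):
    """Hop count under dimension-ordered (X-then-Y) routing between
    (x, y) mesh nodes, as in wormhole-routed 2D meshes."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])

print(bank_of(0x0000), bank_of(0x0040))   # adjacent lines land in banks 0 and 1
print(xy_hops((0, 0), (3, 2)))            # 5 hops across the mesh
```

Interleaving spreads a stream of requests over many banks at once, which is how a banked NUCA can sustain far more bandwidth than a single monolithic cache.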

  24. Compiling for TRIPS [Figure: compiler flow — frontend (C, FORTRAN) → inlining, loop unrolling/flattening, scalar optimizations → code generation (Alpha, SPARC, PPC, TRIPS) → TRIPS block formation, block splitting, register allocation, splitting for spill code, peephole, load/store ID assignment, store nullification → scheduling and assembly] Your standard compiler; you’ve seen this before.

  25. Fixed Size Constraint: 128 Instructions • O3: every basic block is a TRIPS block • Simple, but not high performance • O4: hyperblocks as TRIPS blocks [Figure: control-flow graph of basic blocks B1-B7 — 7 TRIPS blocks under O3 versus 1 TRIPS block under O4]

  26. Size Analysis • How big is this block? 3 instructions? 5 instructions? More?

As written:              As actually encoded:
read sp, g1              read sp, g1
movi t3, 1               movi t3, 1
store 8(sp), t3          mov t4, t3
store 384(sp), t3        store 8(sp), t4
                         addi t7, sp, 256
                         store 128(t7), t4

• Immediate instructions have one target, so mov t4, t3 fans the value out to a second consumer • Max immediate is 256, so the 384-byte offset needs addi t7, sp, 256 first

  27. Too Big? Block Splitting • What if the block is too large? • Predicated blocks? Reverse if-convert • Unpredicated (basic blocks)? Insert a branch and label

Original oversized block:      After reverse if-conversion:
L0: read t4, g10               L0: read t4, g10
    addi t3, t4, 16                addi t3, t4, 16
    load t5, 0(t3)                 load t5, 0(t3)
    …                              …
    subi t7, t5, 8                 branch L1
    mult t8, t7, t7                write g11, t5
    store 0(t4), t8            L1: read t4, g10
    store 8(t4), t8                read t5, g11
    branch L2                      subi t7, t5, 8
                                   mult t8, t7, t7
                                   store 0(t4), t8
                                   store 8(t4), t8
                                   branch L2

  28. Register Constraints: Linear Scan Allocator • 128 registers (32 x 4 banks) • Compute liveness over hyperblocks • Ignore local variables • Hyperblocks as large instructions • Total spills: 18 Alpha vs 6 TRIPS • Average spill: 1 store for 2-3 loads
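A minimal linear-scan pass over live intervals, in the textbook style rather than the TRIPS compiler's actual implementation (interval endpoints here are positions in hyperblock order, matching the "hyperblocks as large instructions" bullet):

```python
def linear_scan(intervals, n_regs):
    """intervals: {name: (start, end)}. Returns (assignment, spills)."""
    assignment, spills, active = {}, [], []   # active: (end, name) holding a register
    free = list(range(n_regs))
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # expire intervals that ended before this one starts
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(assignment[n])
        if free:
            assignment[name] = free.pop()
            active.append((end, name))
        else:
            spills.append(name)               # would become a store plus reloads
    return assignment, spills

regs, spilled = linear_scan(
    {"a": (0, 9), "b": (1, 3), "c": (2, 8), "d": (4, 6)}, n_regs=2)
print(spilled)   # with only 2 registers, 'c' overflows and is spilled
```

With 128 registers across 4 banks, spills are rare in practice, which is consistent with the 6-spill TRIPS figure on the slide.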

  29. Block Termination Constraint • Block Termination: constant output per block • Constant number of stores and writes execute • One branch • Simplifies hardware logic for detecting block completion • All writes complete • Write nullification • All stores complete • Store nullification • LSID assignment
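The completion rule can be modeled directly (a simplification we wrote for illustration; the store mask, LSIDs, and nullification messages are from the slides): a block is complete once every architectural write, every store named in the header's store mask, and one branch have been observed, with nullify messages standing in for writes and stores on not-taken predicate paths.

```python
class BlockTracker:
    def __init__(self, n_writes, store_mask):
        self.writes_left = n_writes
        self.stores_left = set(store_mask)   # LSIDs the header promises
        self.branch_seen = False

    def write(self):            self.writes_left -= 1
    def write_null(self):       self.writes_left -= 1   # nullified write still counts
    def store(self, lsid):      self.stores_left.discard(lsid)
    def store_null(self, lsid): self.stores_left.discard(lsid)
    def branch(self):           self.branch_seen = True

    def complete(self):
        return self.writes_left == 0 and not self.stores_left and self.branch_seen

bt = BlockTracker(n_writes=1, store_mask={2, 5})
bt.write(); bt.store(2); bt.branch()
print(bt.complete())      # False: store with LSID 5 still outstanding
bt.store_null(5)          # its predicate path was not taken -> nullify message
print(bt.complete())      # True: constant output per block reached
```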

  30. TRIPS Scheduling Problem [Figure: a hyperblock's dataflow graph (add, sub, shl, ld, sw, cmp, br instructions) placed onto the execution grid between the register file and the data caches] • Place instructions on 4x4x8 grid • Encode placement in target form

  31. Scheduling Algorithms • Heuristic-based list scheduler [PACT 2005] • Greedy, top-down • Prioritizes critical path • Reprioritizes after each placement • Balances functional unit utilization • Accounts for data cache locality • Accounts for register bank locality
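A toy greedy, top-down placement in the spirit of these bullets (far simpler than the PACT 2005 scheduler: the priority is critical-path height and the cost model is just Manhattan distance to producers, with no utilization or cache-locality terms):

```python
def schedule(deps, grid=4):
    """deps: {inst: [producer insts]}. Returns {inst: (x, y)} tile placement."""
    consumers = {i: [] for i in deps}
    for i, ps in deps.items():
        for pr in ps:
            consumers[pr].append(i)
    height = {}
    def h(i):   # critical-path height: longest consumer chain below i
        if i not in height:
            height[i] = 1 + max((h(c) for c in consumers[i]), default=0)
        return height[i]
    order = sorted(deps, key=lambda i: -h(i))   # producers precede consumers in a DAG
    free = [(x, y) for x in range(grid) for y in range(grid)]
    place = {}
    for i in order:
        def cost(t):   # total hop distance from this tile to all producers
            return sum(abs(t[0] - place[pr][0]) + abs(t[1] - place[pr][1])
                       for pr in deps[i])
        spot = min(free, key=cost)
        free.remove(spot)
        place[i] = spot
    return place

# diamond dataflow graph: a feeds b and c, which both feed d
p = schedule({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]})
print(p)
```

Sorting by critical-path height keeps the long dependence chain compact on the grid, which is the intuition behind "prioritizes critical path."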

  32. TRIPS Polymorphous: Different Levels of Parallelism • Instruction-level parallelism [Nagarajan et al., MICRO 2001] • Populate large instruction window with useful instructions • Schedule instructions to optimize communication and concurrency • Thread-level parallelism • Partition instruction window among different threads • Reduce contention for instruction and data supply • Data-level parallelism • Provide high density of computational elements • Provide high bandwidth to/from data memory

  33. TRIPS Configurable Resources

  34. Aggregating Reservation Stations: Frames

  35. Extracting ILP: Frames for Speculation

  36. Configuring Frames for TLP

  37. Using Frames for DLP

  38. Configuring Data Memory for DLP

  39. Configuring Data Memory for DLP • Regular data accesses • Subset of L2 cache banks configured as SRF • High bandwidth data channels to SRF • Reduced address communication • Constants saved in reservation stations

  40. Performance Results • ILP: instruction window occupancy • Peak: 4x4x128 array = 2048 instructions • Sustained: 493 for SPEC INT, 1412 for SPEC FP • Bottleneck: branch prediction • TLP: instruction and data supply • Peak: 100% efficiency • Sustained: 87% for two threads, 61% for four threads • DLP: data supply bandwidth • Peak: 16 ops/cycle • Sustained: 6.9 ops/cycle

  41. ASIC Implementation • 130 nm 7LM IBM ASIC process • 335 mm2 die • 47.5 mm x 47.5 mm package • ~170 million transistors • ~600 signal I/Os • ~500 MHz clock frequency • Tape-out: fall 2005 • System bring-up: spring 2006

  42. Functional Area Breakdown

  43. TRIPS Summary • Distributed microarchitecture • Acknowledges and tolerates wire delay • Scalable protocols tailored for distributed components • Tiled microarchitecture • Simplifies scalability • Improves design productivity • Coarse-grained homogeneous approach with polymorphism • ILP: well-partitioned powerful uniprocessor (GPA) • TLP: divide instruction window among different threads • DLP: map reuse of instructions and constants in the grid

  44. Problems • Scalable? • Larger grid, more communication latency • ILP? DLP? TLP? • Multicore? Manycore? • Compatibility • Instruction & block code • Low-efficiency architecture • Instruction buffer • I-Cache bank, GDN • Operand network • LSQ & read/write queue • Polymorphous • How to realize DLP?

  45. Scalable • Larger grid? [Figure: the 5x5 tile array from slide 14 — G-, R-, I-, D-, and E-tiles connected by the GDN, OPN, GSN, and GCN networks]

  46. Scalable • Multicore • Manycore • ?

  47. Compatibility

  48. Low-Efficiency Architecture [Figure: link utilization for SPEC CPU 2000]

  49. Polymorphous • Imagine?
