
Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems

This paper discusses the organization and exploration of a distributed L0 buffer architecture for low energy embedded systems, focusing on instruction memory exploration and software and compiler transformations.


Presentation Transcript


  1. Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems. Murali Jayapala, Francisco Barat, Pieter Op de Beeck, Tom Vander Aa, Geert Deconinck (ESAT/ACCA, K.U.Leuven, Belgium); Francky Catthoor, Henk Corporaal (IMEC, Leuven, Belgium)

  2. Overview • Context: Introduction to the problem • Motivation for L0 Buffer organization and status • Distributed L0 Buffer organization • Instruction Memory Exploration • Software and Compiler Transformation • Conclusions

  3. Context: Low Power Embedded Systems • Embedded systems: battery operated (low energy), 10-50 MOPS/mW, small, low cost, flexible • Multimedia applications: video, audio, wireless; high performance (10-100 GOPS) with real-time constraints

  4. Context: Embedded Systems are Programmable-Processor Based • Power breakdown in embedded processors: • 43% of power in on-chip memory (StrongARM SA110: a 160 MHz 32b 0.5 W CMOS ARM processor) • 40% of power in internal memory (C6x, Texas Instruments Inc.) • 25-30% of power in instruction memory • To address the data memory issues: Data Transfer and Storage Exploration (DTSE) methodology, F. Catthoor et al.

  5. Related Work • Significant power consumption in the instruction memory hierarchy (main memory off-chip, L1 cache on-chip, core) • Compression (code size reduction): • L. Benini et al., “Selective Instruction Compression for Memory Energy Reduction...”, ISLPED 1999 • P. Centoducatte et al., “Compressed Code Execution on DSP Architectures”, ISSS 1999 • T. Ishihara et al., “A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors”, DATE 2000 • Software transformations: • N. D. Zervas et al., “A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications”, ICECS 2001 • S. Parameswaran et al., “I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency”, ICCAD 2001

  6. Overview • Context: Introduction to the problem • Motivation for L0 Buffer organization and status • Distributed L0 Buffer organization • Instruction Memory Exploration • Software and Compiler Transformation • Conclusions

  7. Application Domain: Multimedia Characteristics (1) • High locality: less than 1% of the static instruction count (ICstatic) accounts for most of the dynamic instruction count (ICdynamic) (figure: static vs. dynamic instruction count distributions, 0% to 100%)

  8. Application Domain: Multimedia Characteristics (2) • Within a program, a few basic blocks or instructions take up most of the execution time (ICdynamic) (figure: normalized dynamic vs. normalized static instruction count)

  9. Motivation for an Additional Small Memory • Application domain: high locality in a few basic blocks • But the total size of these high-locality basic blocks is still large • If the L1 cache (on-chip) is made small: • performance degrades: capacity (compulsory) misses • system power increases: off-chip memory / bus activity increases •  A small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance

  10. Related Work (Microarchitecture): Cache Design • N. Jouppi et al., “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers”, ISCA 1990 • Aim: to reduce miss-penalty cycles • Techniques: miss caching, victim caching, stream buffers (figure: a small cache between the core and the L1 cache)

  11. Related Work (Microarchitecture): Cache Design • J. D. Bunda et al., “Instruction-Processing Optimization Techniques for VLSI Microprocessors”, PhD thesis 1993 • Aim: to reduce instruction cache energy • L0 buffer: cache block buffer (1 cache block + 1 tag) • Limitation: block thrashing • J. Kin et al., “Filtering memory references to increase energy efficiency”, IEEE Trans. on Computers, 2000 • Aim: to reduce instruction cache energy • L0 buffer: filter cache, a small regular cache (< 1KB) • L0 access (hit) latency: 1 cycle; L1 access (hit) latency: 2 cycles • Limitation: energy reduced at the expense of performance • 256-byte filter cache: 58% power reduction with 21% performance degradation

  12. Related Work (Architecture): Software-Controlled L0 Buffers • R. S. Bajwa et al., “Instruction Buffering to Reduce Power in Processors for Signal Processing”, IEEE Trans. VLSI Systems, vol 5, no 4, 1997 • L. H. Lee et al. (M-CORE), “Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops”, ISLPED 1999 • L0 buffer: buffer (< 1KB) + local controller (LC); no tags • L0 / L1 access latency: 1 cycle • Used only for specific program segments (innermost loops) • Software control: special instructions (lbon, sbb) to map program segments to the L0 buffer (figure: normal operation fetches from L1; filling copies the loop into the L0 buffer; L0 buffer operation fetches from L0 until termination)

  13. Related Work (Architecture): Software-Controlled L0 Buffers • Assumed architecture: • MIPS 4000 ISA, single-issue processor • L1 cache: 16KB direct mapped • Loop buffer (2KB): depth = 128 instructions, width = 16 bytes • Tools: Simplescalar 2.0, Wattch power estimator • Loops with fewer than 128 instructions were hand-mapped onto the loop buffer

  14. Related Work (Architecture): Software-Controlled L0 Buffers • Advantages: • 50% (avg) energy reduction, with no performance degradation • Software control: enables mapping only selected program segments • Limitations: • Supports only innermost loops (regular basic blocks) • Other frequently executed basic blocks are still fetched from the L1 cache • No support for control constructs within loops • F. Vahid et al. [2001-2002]: hardware support for conditional constructs within loops • Identifies the loop address bounds (preloading the program segment/loop) • Handles sub-routines, conditional constructs, and 1-level nested loops

  15. Related Work (Architecture): Compiler-Controlled L0 Buffers • N. Bellas et al., “Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors”, ISLPED 1998 • Aim: reduce instruction cache energy by letting the compiler assume the role of allocating basic blocks to the L0 buffer • L0 buffer: regular cache (< 1KB; 128 instructions) • Technique: profile, function inlining, identify basic blocks, code layout (basic blocks allocated to the L0 buffer address space) • Advantages: • Automated: a ‘tool’ can do this job • Uses the basic block as the atomic unit of allocation • 60% (avg) energy reduction in the i-mem hierarchy [SPEC95] • Limitation: tag overhead

  16. Loop Buffers: Commercial Processors • RISC DSP Processors • SH-DSP • Decoded instruction buffers • Supports regular loops (no conditional constructs/nested loops) • VLIW Processors • StarCore SC140 • Supports regular and nested loops • Conditional constructs through predication • STMicroelectronics, ST120 • Supports nested loops and loops with conditional constructs

  17. Overview • Context: Introduction to the problem • Motivation for L0 Buffer organization and status • Distributed L0 Buffer organization • Instruction Memory Exploration • Software and Compiler Transformation • Conclusions

  18. Shortcomings • So far: hardware, software, and compiler optimizations to increase the accesses (activity) at the L0 buffers • Bottlenecks to solve in the centralized organization (one L0 buffer + LC feeding all FUs): • L0 buffer organization • Interconnect from the L0 buffer to the datapath • Efficient buffer controller • An organization scalable with an increasing number of FUs

  19. Current Organizations for L0 Buffers • Uncompressed L0 buffer • Buffer: width = issue width (#FUs) • Interconnect: long • LC: simple addressing (counter based) • Ref: Bajwa et al., L. H. Lee et al., F. Vahid et al. • Compressed L0 buffer (with decompressor/dispatch) • Buffer: high storage density (no NOPs), width < issue width (#FUs), but overhead in decompressing • Interconnect: still centralized, long lines • LC: simple addressing (counter based) • Ref: TI (execute packet fetch mechanism)

  20. Current Organizations for L0 Buffers… • Partitioned (sub-banked) L0 buffer • Buffer: smaller memories • Interconnect: still long • LC: simple addressing (counter based), but all banks must be accessed simultaneously, even if some of the FUs are not active • No correlation between partitioning and FUs • Ref: sub-banking • Sub-banked/partitioned L0 buffer with compression (banks + re-organizer) • Buffer: smaller memories, but overhead in the re-organizer • Interconnect: still centralized • LC: complex addressing (needs expensive tags) • Ref: T. Conte et al. [TINKER]

  21. Solution: Distributed Instruction Buffer Organization • A balance of energy consumption between buffers, interconnect, and local controllers is needed • Buffers: sub-banked/partitioned in correlation with FU activation • Interconnect: localized (limited connectivity between FUs and buffers) • Buffer control: stores instructions in each partition, fetches instructions during loop execution, and regulates the accesses to each partition • Instruction cluster: a group of FUs with its own buffer partition • ATC: Address Translation and Control; IROC: Instruction Registers Operation and Control

  22. Distributed L0 Buffer Operation • Similar to the conventional L0 buffer operation • Initiation: special instruction LBON <offset> • Filling: pre-fetching instructions from <start> to <end> • Termination: when the program flow jumps to an address outside the <start> to <end> range (figure: normal operation  filling  distributed L0 operation  termination)
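The three-phase operation just described (initiation via LBON, filling of the <start> to <end> range, termination on an out-of-range jump) can be sketched in Python. The class and method names below are illustrative assumptions, not the paper's interface:

```python
class L0Buffer:
    """Toy model of a software-controlled L0 buffer (conceptual sketch)."""

    def __init__(self):
        self.store = {}                  # address -> instruction
        self.start = self.end = None     # active range; None = normal operation

    def lbon(self, start, offset, l1_fetch):
        """Initiation + filling: prefetch [start, start + offset] from L1."""
        self.start, self.end = start, start + offset
        for addr in range(self.start, self.end + 1):
            self.store[addr] = l1_fetch(addr)

    def fetch(self, pc, l1_fetch):
        """Serve from L0 while pc stays in range; otherwise terminate."""
        if self.start is not None and self.start <= pc <= self.end:
            return self.store[pc], "L0"
        self.start = self.end = None     # termination: back to L1 fetching
        return l1_fetch(pc), "L1"
```

After `lbon(100, 3, ...)`, fetches of addresses 100 to 103 hit the L0 buffer; the first fetch outside that range reverts the model to normal L1 operation.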

  23. The Buffer Operation: An Illustration • LBON <offset> maps the loop below onto the distributed L0 buffers:
for (..) { … if (..) {…} else {…} … }
VLIW schedule (slots: FU1 FU2 FU3 BR):
S: OP11 OP21 OP31 NOP
   NOP  OP22 OP32 BNZ ‘x’
   OP12 NOP  NOP  BR ‘y’
X: OP13 NOP  OP33 NOP       (if block)
Y: OP14 OP23 NOP  BNZ ‘s’   (else block)

  24.-29. The Buffer Operation: An Illustration (animation frames stepping through the same example: the IROC tracks the PC and, per instruction cluster, an IR_USE bit, START_ADDR, END_ADDR and NEW_PC as execution proceeds through the loop, the if block and the else block; only the partitions whose IR_USE bit is set are accessed)
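The storage scheme the illustration walks through (each cluster keeps only its non-NOP operations, with a per-instruction use bit telling the local controller whether its partition holds anything for that cycle) can be sketched as follows. The function and variable names are hypothetical, and the use-bit list is a simplification of the IROC state on the slides:

```python
def partition(schedule):
    """Distribute a VLIW schedule over per-slot L0 partitions, dropping NOPs.

    schedule: list of wide instructions, each a list of per-slot operations.
    Returns (compressed per-slot stores, per-instruction use bits)."""
    n_slots = len(schedule[0])
    stores = [[] for _ in range(n_slots)]  # one compressed buffer per cluster
    use_bits = [[op != "NOP" for op in row] for row in schedule]
    for row in schedule:
        for slot, op in enumerate(row):
            if op != "NOP":
                stores[slot].append(op)    # NOPs are never stored
    return stores, use_bits

# the schedule from the illustration (slots: FU1, FU2, FU3, BR)
schedule = [
    ["OP11", "OP21", "OP31", "NOP"],
    ["NOP",  "OP22", "OP32", "BNZ 'x'"],
    ["OP12", "NOP",  "NOP",  "BR 'y'"],
]
stores, use_bits = partition(schedule)
```

Here `stores[0]` holds only `OP11` and `OP12`, and the second wide instruction's use bits `[False, True, True, True]` keep FU1's partition from being accessed at all in that cycle.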

  30. Energy Trade-Offs • Energy = Σi=1..#partitions Ebuffer,i + Σi=1..#partitions ELC,i + Σi=1..#partitions Einterconnect,i • (figure: normalized energy vs. #partitions, from 1 up to #FUs, against a baseline of 1; Σ Ebuffer,i decreases with more partitions while Σ ELC,i and Σ Einterconnect,i increase)
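The trade-off expression above is a plain sum over partitions; a minimal transcription would be:

```python
def total_energy(partitions):
    """Sum of buffer, local-controller and interconnect energy per partition.

    partitions: list of (E_buffer, E_LC, E_interconnect) tuples."""
    return sum(e_buf + e_lc + e_ic for e_buf, e_lc, e_ic in partitions)
```

Sweeping the number of partitions (with made-up per-partition numbers, e.g. `total_energy([(5.0, 1.0, 0.5), (4.0, 1.2, 0.8)])`) and plotting this sum reproduces the shape of the normalized-energy curve on the slide.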

  31. Profile Based Clustering • Min { Energy(clust, dynamic profile, static profile) } • S.T. Σi=1..max_clusters clust(i,j) = 1 for every FU j • where clust(i,j) = 1 if the jth FU is assigned to cluster ‘i’, and 0 otherwise • Inputs: dynamic trace (during loop execution), static trace (loops mapped to L0), energy models (register file) • Instruction cluster: a group of functional units with a separate local controller and an instruction buffer partition • Outputs of instruction clustering: FU grouping; width and depth of the instruction buffers in each partition
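A toy version of this formulation can be written as a greedy search that assigns each FU to exactly one cluster (the constraint above) while lowering a stand-in cost. The cost used below, total cluster activations over the dynamic trace, is an assumed proxy for the paper's register-file energy models:

```python
def cluster_fus(trace, n_clusters):
    """trace: list of activation vectors, one per cycle (1 = FU active).
    Greedily co-locates FUs that tend to be active together."""
    n_fus = len(trace[0])
    assignment = [i % n_clusters for i in range(n_fus)]  # round-robin seed

    def cost(assign):
        # a cluster is charged once per cycle in which any of its FUs fires
        return sum(len({assign[f] for f in range(n_fus) if row[f]})
                   for row in trace)

    best = cost(assignment)
    improved = True
    while improved:
        improved = False
        for f in range(n_fus):
            for c in range(n_clusters):
                if assignment[f] == c:
                    continue
                old = assignment[f]
                assignment[f] = c
                new = cost(assignment)
                if new < best:
                    best, improved = new, True
                else:
                    assignment[f] = old     # revert moves that do not help
    return assignment, best
```

For a trace where FU0 and FU1 always fire together and FU2 fires alone, the search puts FU0 and FU1 in one cluster and FU2 in another.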

  32. Results • Energy = Σi=1..#partitions Ebuffer,i + Σi=1..#partitions ELC,i • Assumptions: • Only the buffers and the controller are modeled (no interconnect as yet) • #FUs in the datapath = 10 • Fixed schedule (activation trace), generated using Trimaran 2.0 • (figure: normalized energy vs. #partitions)

  33. In Comparison With Other Schemes • Results shown for ADPCM: • Uncompressed, centralized L0 buffer • Compressed, centralized L0 buffer (2 additional registers for VL decoding) • Uncompressed, partitioned (sub-banked) L0 buffer, no access regulation (2 partitions) • Clustered, varying width only (3 partitions) • Clustered, varying width and depth (2 partitions) • Compressed clustered, varying both width and depth

  34. Fully Distributed Instruction Memory Hierarchy (figure: main memory (off-chip) feeding multiple on-chip L1 caches; each L1 cluster contains several L0 clusters, i.e. L0 buffers over groups of FUs)

  35. Overview • Context: Introduction to the problem • Motivation for L0 Buffer organization and status • Distributed L0 Buffer organization • Instruction Memory Exploration • Software and Compiler Transformation • Conclusions

  36. Exploration Methodology: What We Have • Flow: application  software transformations  compiler (scheduling)  clustering tool (with energy models)  instruction clusters • Two extremes: optimized for performance (maximum cluster activity) vs. optimized for energy (minimal cluster activity) • Pareto curve generation (energy vs. delay): • For choosing the operating point at run-time • Enables the designer to assess the trade-off between energy and performance

  37. Exploration Methodology: What We Want to Achieve • Flow: application  software transformations  compiler (scheduling & clustering, with energy models)  instruction clusters + schedule • Two extremes: optimized for performance (maximum cluster activity) vs. optimized for energy (minimal cluster activity) • Pareto curve generation (energy vs. delay): • For choosing the operating point at run-time • Enables the designer to assess the trade-off between energy and performance
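The Pareto-curve step in this flow can be sketched as a standard non-dominated filter over (energy, delay) design points; this is generic bookkeeping, not the paper's tool:

```python
def pareto_front(points):
    """Keep the (energy, delay) points not dominated by any other point."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]
```

For example, `pareto_front([(1, 3), (2, 2), (3, 1), (2, 3)])` drops only `(2, 3)`, since `(2, 2)` is at least as good in both energy and delay; the surviving points form the curve from which the run-time operating point is chosen.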

  38. Compiler Scheduling • Compiler scheduling can change the functional unit activity, and hence the clustering result, and hence energy and performance • Example (energy reduction without performance loss): the schedule (OP11 OP12 - | OP13 - OP14) needs all 3 clusters to be active, while the rescheduled (OP11 OP12 OP13 OP14 - -) needs only 2 clusters to be active • Example (energy reduction at the expense of performance loss): the schedule (OP11 OP12 - | OP13 - OP14) followed by (OP21 - OP22 | - OP23 -) costs 2 activations of all 3 clusters, while a serialized schedule costs 2 activations of the 1st cluster but only 1 activation each of the 2nd and 3rd clusters
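The activation counting behind these examples can be made concrete with a small helper. The (fu, op) schedule encoding and the two returned metrics (distinct clusters the schedule touches, and total per-cycle cluster activations) are assumptions for illustration:

```python
def cluster_usage(schedule, fu_to_cluster):
    """schedule: list of cycles, each a list of (fu, op) issues.
    Returns (#distinct clusters used, total per-cycle cluster activations)."""
    used, activations = set(), 0
    for cycle in schedule:
        active = {fu_to_cluster[fu] for fu, _ in cycle}  # clusters firing now
        used |= active
        activations += len(active)
    return len(used), activations

fu_to_cluster = {0: 0, 1: 1, 2: 2}   # one FU per cluster (assumed mapping)
spread = [[(0, "OP11"), (1, "OP12")], [(0, "OP13"), (2, "OP14")]]
packed = [[(0, "OP11"), (1, "OP12")], [(0, "OP13"), (1, "OP14")]]
```

`cluster_usage(spread, fu_to_cluster)` touches all 3 clusters while `packed` touches only 2 at the same cycle count, mirroring the slide's first example.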

  39. Software Transformations • High-level code transformations can also change the clustering result, and hence energy and performance • Loop transformations: • Loop splitting • Loop merging • Loop peeling (for nested loops) • Loop collapsing (nested loops) • Code movement across loops • ...etc. (figure: loop splitting turns one loop into loop 1 and loop 2)
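Loop splitting, the first transformation listed, can be shown at the source level. The arrays and per-element operations are arbitrary examples; the point is that each of the two resulting loops maps a smaller, denser segment onto the instruction clusters than the original fused loop:

```python
# before: one loop, both halves of the body live in the L0 buffer together
def fused(a, b, n):
    for i in range(n):
        a[i] += 1        # half 1 of the body
        b[i] *= 2        # half 2 of the body

# after: two smaller loops; each activates fewer clusters per iteration
def split(a, b, n):
    for i in range(n):
        a[i] += 1
    for i in range(n):
        b[i] *= 2
```

Both versions compute the same result; only the mapping of the loop bodies onto instruction clusters (and hence the activation profile) changes.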

  40. Overview • Context: Introduction to the problem • Motivation for L0 Buffer organization and status • Distributed L0 Buffer organization • Instruction Memory Exploration • Software and Compiler Transformation • Conclusions

  41. Conclusions • L0 buffer organization: • Multimedia applications have high locality in small program segments • An additional small L0 buffer should be used • Current options for L0 buffers are still not energy efficient • A distributed L0 buffer organization should be sought • But the clustering/partitioning should be application specific • L1 cache organization: distributed (?) • Instruction memory exploration: • Software transformations and compiler scheduling can change the clustering results • An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)
