Compiler Issues for Embedded Processors


  1. Compiler Issues for Embedded Processors

  2. Contents • Compiler Design Issues • Problems of Compilers for Embedded Processors • Structure of typical C compiler • Front end • IR optimizer • Back end • Embedded-code optimization • Retargetable compiler

  3. Compiler Design Issues • Compilers are less commonly used for embedded systems than for general-purpose software. • Designers still program many embedded applications in assembly language, despite • Huge programming effort • Far less code portability • Poor maintainability • Why is assembly programming still common? • The reason lies in embedded systems’ high-efficiency requirements.

  4. Problems of Compilers for Embedded Processors • Embedded systems frequently employ application-specific instruction set processors (ASIPs) • ASIPs meet design constraints (e.g., performance, cost, and power consumption) more efficiently than general-purpose processors • Building the required software development tool infrastructure for ASIPs is expensive and time-consuming • This is especially true for efficient C and C++ compiler design, which requires a large amount of resources and expert knowledge • Therefore, C compilers are often unavailable for newly designed ASIPs.

  5. Problems of Compilers for Embedded Processors • Many existing compilers for ASIPs (e.g., DSPs) generate low-quality code. • Compiled code may be several times larger and/or slower than handwritten assembly code. • Such poor code is virtually useless for efficiency reasons.

  6. Problems of Compilers for Embedded Processors • The cause of the poor code quality is the highly specialized architecture of ASIPs, whose instruction sets can be incompatible with high-level languages and traditional compiler technology, because • the instruction set is generally designed primarily from a hardware designer’s viewpoint, and • the architecture is fixed before compiler issues are considered.

  7. Problems of Compilers for Embedded Processors • The problem of compiler unavailability must be solved, because: • Assembly programming will no longer meet short time-to-market requirements • Human programmers are unlikely to keep outperforming compilers as processor architectures become increasingly complex (e.g., deep pipelining, predicated execution, and high parallelism) • Application programs should be written in a machine-independent language (e.g., C) to allow architecture exploration with various cost/performance tradeoffs.

  8. Coarse structure of a typical C compiler • Source code → Front end (scanner, parser, semantic analyzer) → Intermediate representation (IR) → IR optimizer (constant folding, constant propagation, jump optimization, loop-invariant code motion, dead code elimination) → Optimized IR → Back end (code selection, register allocation, scheduling, peephole optimization) → Assembly code

  9. Front end • The front end translates the source program into a machine-independent IR • The IR is stored in a simple format such as three-address code • Each statement is either an assignment with at most three operands, a label, or a jump • The IR serves as a common exchange format between the front end and the subsequent optimization passes, and also forms the back-end input
      Example IR (MIR code):
        L1: i  ← i+1
            t1 ← i+1
            t2 ← p+4
            t3 ← *t2
            p  ← t2
            t4 ← t1 < 10
            *r ← t3
            if t4 goto L1
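      To make "at most three operands" concrete, a three-address IR statement could be represented roughly as follows (a minimal hypothetical C sketch, not the actual Lance/MIR data structure):

        /* One three-address statement: an assignment with at most three
         * operands, a label, or a (conditional) jump. */
        enum ir_kind { IR_ASSIGN, IR_LABEL, IR_JUMP, IR_CJUMP };

        struct ir_stmt {
            enum ir_kind kind;
            const char *dst;     /* result operand, e.g. "t2"          */
            const char *src1;    /* first source operand, e.g. "p"     */
            const char *src2;    /* second source operand, e.g. "4"    */
            char        op;      /* operator for assignments, e.g. '+' */
            const char *target;  /* label name for labels and jumps    */
        };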

  10. Front end • Front end’s main components • Scanner • Recognizes certain character strings in the source code • Groups them into tokens • Parser • Analyzes the syntax according to the underlying source-language grammar • Semantic analyzer • Performs bookkeeping of identifiers, as well as additional correctness checks that the parser cannot perform • Many tools (e.g., lex and yacc) that automate the generation of scanners and parsers are available
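      For illustration, a minimal hand-written scanner in C (a hypothetical sketch; real front ends usually generate this part with a tool such as lex):

        #include <ctype.h>

        /* Token classes produced by the scanner. */
        enum token { TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_EOF };

        /* Groups the characters at *p into the next token, copies its text
         * into 'text', and advances *p past the recognized characters. */
        static enum token next_token(const char **p, char *text) {
            while (isspace((unsigned char)**p)) (*p)++;
            if (**p == '\0') return TOK_EOF;
            char *t = text;
            if (isalpha((unsigned char)**p)) {            /* identifier */
                while (isalnum((unsigned char)**p)) *t++ = *(*p)++;
                *t = '\0';
                return TOK_IDENT;
            }
            if (isdigit((unsigned char)**p)) {            /* number literal */
                while (isdigit((unsigned char)**p)) *t++ = *(*p)++;
                *t = '\0';
                return TOK_NUMBER;
            }
            *t++ = *(*p)++;                               /* one-character operator */
            *t = '\0';
            return TOK_OP;
        }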

  11. IR optimizer • The IR generated for a source program normally contains many redundancies • such as multiple computations of the same value or jump chains, because the front end does not pay much attention to optimization issues • The programmer may also have built redundancies into the source code, which must be removed by subsequent optimization passes

  12. IR optimizer • Constant folding • Replaces compile-time constant expressions with their respective values • Constant propagation • Replaces variables known to carry a constant value with the respective constant • Jump optimization • Simplifies jumps and removes jump chains • Loop-invariant code motion • Moves loop-invariant computations out of the loop body • Dead code elimination • Removes computations whose results are never needed
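      As a hedged before/after sketch of two of these passes on plain C (hypothetical code, not from the slides):

        /* Before: n * 4 is loop-invariant, and 'unused' is never read. */
        int sum_scaled(const int *a, int n) {
            int s = 0, unused = 0;
            for (int i = 0; i < n; i++) {
                int scale = n * 4;        /* recomputed every iteration */
                s += a[i] * scale;
                unused = s + 1;           /* result never needed        */
            }
            return s;
        }

        /* After loop-invariant code motion and dead code elimination: */
        int sum_scaled_opt(const int *a, int n) {
            int s = 0;
            int scale = n * 4;            /* hoisted out of the loop    */
            for (int i = 0; i < n; i++)
                s += a[i] * scale;
            return s;
        }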

  13. ex) Constant Folding • Now the IR optimizer can apply constant folding to replace both constant expressions by constant numbers, thus avoiding expensive computations at program runtime
      C example (an element of array A is assigned a constant):
        void f()
        {
          int A[10];
          A[2] = 3 * 5;
        }
      Unoptimized IR with two compile-time constant expressions (C-like IR notation of the Lance compiler system):
        void f()
        {
          int A[10], t1, t3, *t5;
          char *t2, *t4;
          t1 = 3 * 5;            /* from the source code */
          t4 = (char *) A;
          t3 = 2 * 4;            /* array index 2 times the memory word size */
          t2 = t4 + t3;
          t5 = (int *) t2;
          *t5 = t1;
        }
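      After constant folding, both constant expressions are replaced by their values (a sketch of the resulting IR; the folded form is not shown on the slide):

        void f()
        {
          int A[10], t1, t3, *t5;
          char *t2, *t4;
          t1 = 15;               /* 3 * 5 folded at compile time */
          t4 = (char *) A;
          t3 = 8;                /* 2 * 4 folded at compile time */
          t2 = t4 + t3;
          t5 = (int *) t2;
          *t5 = t1;
        }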

  14. IR optimizer • A good compiler contains many such IR optimization passes. • Some of them are far more complex and require advanced code analysis. • There are strong interactions and mutual dependencies between these passes. • Some optimizations enable further opportunities for other optimizations. • Passes should therefore be applied repeatedly to be most effective.
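      A small hypothetical example of such interaction (not from the slides): constant propagation makes a branch condition constant, jump optimization then removes the test, and dead code elimination removes the now unreachable call:

        static int expensive(int x) { return x + 1; }   /* stand-in for real work */

        int cascade(int a) {
            int debug = 0;         /* constant propagation: debug is 0 everywhere */
            int r = a;
            if (debug)             /* ...so this condition folds to false,        */
                r = expensive(r);  /* ...and this call becomes dead code          */
            return r;              /* after all three passes: simply returns a    */
        }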

  15. Back end • The back end (or code generator) • maps the machine-independent IR into a behaviorally equivalent machine-specific assembly program. • The statement-oriented IR is converted into a more expressive control/dataflow graph representation. • Front-end and IR optimization technologies are quite mature, but the back end is often the most crucial compiler phase for embedded processors.

  16. Major back end passes • Code selection • maps IR statements to assembly instructions • Register allocation • assigns symbolic variables and intermediate results to the physically available machine registers • Scheduling • arranges the generated assembly instructions in time slots • considers inter-instruction dependencies and limited processor resources • Peephole optimization • relatively simple pattern-matching replacement of certain expensive instruction sequences by less expensive ones
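      To make the last pass concrete, here is a minimal peephole rule in C over a toy instruction representation (hypothetical; not how a real back end stores code): a load that immediately follows a store of the same register to the same location is redundant and can be rewritten to a NOP.

        #include <string.h>

        /* Toy instruction: an opcode plus a register and a memory operand. */
        struct insn { char op[8]; char reg[8]; char mem[16]; };

        /* Peephole rule: "STO Rx, M" immediately followed by "LOD Rx, M"
         * makes the load redundant; a later pass would delete the NOPs. */
        static void peephole(struct insn *code, int n) {
            for (int i = 0; i + 1 < n; i++) {
                if (strcmp(code[i].op, "STO") == 0 &&
                    strcmp(code[i + 1].op, "LOD") == 0 &&
                    strcmp(code[i].reg, code[i + 1].reg) == 0 &&
                    strcmp(code[i].mem, code[i + 1].mem) == 0) {
                    strcpy(code[i + 1].op, "NOP");
                }
            }
        }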

  17. Back end passes for embedded processors • Code selection • To achieve good code quality, it must use complex instructions • multiply-accumulate (MAC), load-with-autoincrement, etc. • Or it must use subword-level instructions, which have no counterpart in high-level languages • SIMD and network processor architectures • Register allocation • Must utilize special-purpose register architectures to avoid too many stores and reloads between registers and memory • If the back end uses only traditional code generation techniques, the resulting code quality may be unacceptable

  18. Example: Code selection with MAC instructions • Dataflow graph (DFG) representation of a simple computation • Conventional tree-based code selectors must decompose the DFG into two separate trees (communicating through a temporary variable) and therefore fail to exploit the MAC instructions • Covering all DFG operations with only two MAC instructions requires the code selector to consider the entire DFG
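      A hedged sketch of the kind of computation meant here (the slide’s actual DFG is not reproduced): one product feeds two additions through a temporary variable, so a tree-based selector emits a separate multiply, while a DFG-based selector can cover each statement with one MAC instruction.

        /* Hypothetical example: x = c + a*b and y = d + a*b share the product t. */
        void mac_example(int a, int b, int c, int d, int *x, int *y) {
            int t = a * b;      /* shared temporary splits the DFG into two trees */
            *x = c + t;         /* MAC: x = c + a * b                             */
            *y = d + t;         /* MAC: y = d + a * b                             */
        }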

  19. Example: Register Allocation
      Source program:
        A = B + C * D
        B = A - C * D
      Simple Register Allocation:
        LOD R1, C
        MUL R1, D
        STO R1, Temp1
        LOD R1, B
        ADD R1, Temp1
        STO R1, A
        LOD R1, C
        MUL R1, D
        STO R1, Temp2
        LOD R1, A
        SUB R1, Temp2
        STO R1, B
      Smart Register Allocation:
        LOD R1, C
        MUL R1, D    ; C*D
        LOD R2, B
        ADD R2, R1   ; B+C*D
        STO R2, A
        SUB R2, R1   ; A-C*D
        STO R2, B

  20. Embedded-code optimization • Dedicated code optimization techniques • Single-instruction, multiple-data (SIMD) instructions • Recent multimedia processors use SIMD instructions, which operate at the subword level (e.g., Intel MMX) • Address generation units (AGUs) • Allow address computations in parallel with the regular computations in the central datapath • Good use of AGUs is mandatory for high code quality • Code optimization for low power • In addition to performance and code size, power efficiency is increasingly important • Designs must obey heat dissipation constraints and use battery capacity efficiently in mobile systems
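      As a hedged illustration (hypothetical code, not from the slides): a fixed-point dot-product kernel written so that a DSP compiler can map the pointer updates to AGU post-increment addressing and the multiply-add to a MAC instruction.

        /* On many DSPs, '*x++' / '*h++' map to AGU post-increment addressing
         * (address update in parallel with the central datapath), and the
         * multiply-add in the loop body maps to a MAC instruction. */
        long dot(const short *x, const short *h, int n) {
            long acc = 0;
            for (int i = 0; i < n; i++)
                acc += (long)(*x++) * (*h++);
            return acc;
        }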

  21. Embedded-code optimization • Dedicated code optimization techniques (cont’d) • Code optimization for low power (cont’d) • The compiler can support power savings • Generally, the shorter the program runtime, the less energy is consumed • “Energy-conscious” compilers, armed with an energy model of the target machine, give priority to the lowest energy-consuming instruction sequences • Since a significant portion of energy is spent on memory accesses, another option is to move frequently used blocks of program code or data into an efficient cache or on-chip memory

  22. Retargetable compiler • To support fast compiler design for new processors, and hence architecture exploration, researchers have proposed retargetable compilers • A retargetable compiler can be modified to generate code for different target processors with few changes in its source code.

  23. Example: CHESS/CHECKERS Retargetable Tool Suite • CHESS/CHECKERS • is a retargetable tool suite for flexible embedded processors in electronic systems. • supports both the design and the use of embedded processors. These processors form the heart of many advanced systems in competitive markets like telecom, consumer, or automotive electronics. • is developed and commercialized by Target Compiler Technologies. http://www.retarget.com

  24. Example: CHESS/CHECKERS Retargetable Tool Suite http://www.retarget.com

  25. Example: CHESS/CHECKERS Retargetable Tool Suite

  26. ASIP (Application-Specific Instruction Set Processor) Design

  27. Reference • J.H. Yang et al., “MetaCore: An Application-Specific DSP Development System,” Proc. 1998 Design Automation Conference (DAC), pp. 800-803. • J.H. Yang et al., “MetaCore: An Application-Specific Programmable DSP Development System,” IEEE Trans. VLSI Systems, vol. 8, Apr. 2000, pp. 173-183. • B.W. Kim et al., “MDSP-II: 16-bit DSP with Mobile Communication Accelerator,” IEEE J. Solid-State Circuits, vol. 34, Mar. 1999, pp. 397-404.

  28. Part I: ASIP in general • An ASIP is a compromise between a GPP (general-purpose processor), which can be used anywhere but offers low performance, and a full-custom ASIC, which fits only a specific application but offers very high performance. • GPP, DSP, ASIP, FPGA, ASIC (sea of gates), CBIC (standard-cell-based IC), and full-custom ASIC are listed in order of increasing performance and decreasing adaptability. • Recently, ASICs as well as FPGAs contain processor cores.

  29. Cost, Performance, Programmability, and TTM (Time-to-Market) • ASIP (Application-Specific Instruction set Processor) • An ASIP is a tradeoff between the advantages of a general-purpose processor (flexibility, short development time) and those of an ASIC (fast execution time). (Figure: qualitative comparison of general-purpose processor, ASIP, and ASIC in execution time, rigidity, cost (NRE + chip area, depending on product volume), and development time)

  30. Comparison of Typical Development Time
      • MetaCore (ASIP): chip manufacturer time 20 months (MetaCore development); customer time 3 months (core generation + application code development)
      • General-purpose processor: chip manufacturer time 20 months (core generation); customer time 2 months (application code development)
      • ASIC: 10 months

  31. Issues in ASIP Design • For high execution speed, flexibility, and small chip area: • An optimal selection of micro-architecture & instruction set is required, based on diverse exploration of the design space. • For short design turnaround time: • An efficient means of transforming a higher-level specification into a lower-level implementation is required. • For friendly support of application program development: • Fast development of a suite of supporting software, including a compiler and an ISS (instruction set simulator), is necessary.

  32. Various ASIP Development Systems
      • PEAS-I (Univ. Toyohashi), 1991: selection from a predefined superset: yes; user-defined instructions: no; application programming in C; RISC-like micro-architecture (register-based operation)
      • ASIA (USC), 1993: generates a proper instruction set based on a predefined datapath; application programming in C
      • EPICS (Philips), 1993: selection from a predefined superset: yes; user-defined instructions: no; application programming in assembly; DSP-oriented micro-architecture (memory-based operation)
      • CD2450 (Clarkspur), 1995: selection from a predefined superset: yes; user-defined instructions: no; application programming in assembly
      • MetaCore (KAIST), 1997: selection from a predefined superset: yes; user-defined instructions: yes; application programming in C

  33. Part II: MetaCore System • MetaCore system • ASIP development environment • Reconfigurable fixed-point DSP architecture • Retargetable system software • C compiler, ISS, assembler • Verification with the co-generated compiler and ISS • MDSP-II: a 16-bit DSP targeted for GSM applications.

  34. The Goal of the MetaCore System • Supports an efficient design methodology for ASIPs targeted at the DSP application field • Performance/cost-efficient design through diverse design exploration • Short chip/core design turnaround time through automatic design generation • In-situ generation of application program development tools

  35. Overview: How to Obtain a DSP Core from the MetaCore System • Inputs: instructions (primitive class: add, and, or, sub, ...; optional class: mac, max, min, ...), an architecture template (bus structure, data-path structure, pipeline model), and functional blocks (adder, multiplier, shifter, ...) • Design loop: select instructions, select functional blocks, and select architectural parameters; simulate with benchmark programs; if the result is not acceptable, modify the architecture, add or delete instructions, or add or delete functional blocks • Once the design is accepted: HDL code generation, followed by logic synthesis

  36. System Library & Generator Set: Key Components of the MetaCore System • Inputs: processor specification and benchmark programs • Generator set: compiler generator, ISS generator, and HDL generator, producing a C compiler, an ISS, and synthesizable HDL code • System library: architecture template (bus structure, data-path structure, pipeline model), set of instructions (instruction definitions, related functional blocks), set of functional blocks (parameterized HDL code, I/O port information, gate count) • Flow: simulate and evaluate with the benchmark programs; modify the specification, or add/modify library elements, until the design is accepted

  37. Processor Specification (example) • Specification of target core • defines instruction set & hardware configuration. • is easy for designer to use & modify due to high-level abstraction.
      //Specification of EM1
      (hardware                // hardware configuration
        ACC 1
        AR 4
        pmem 2k, [2047: 0] )
      (def_inst ADD            // instruction set definition
        (operand type2 )
        (ACC <= ACC + S1 )
        (extension sign )
        (flag cvzn )
        (exestage 1 )
      . . . . . .

  38. Benchmark analysis • is necessary for deciding the instruction set. • produces information on • the frequency of each instruction, to obtain a cost-effective instruction set. • frequent sequences of contiguous instructions, which can be reduced to application-specific instructions.
      Original code (the cmp/bgtz/clr/add sequence is a frequent sequence of contiguous instructions):
        abs  a0, ar1    ; a0 = |mem[ar1]|
        clr  a1         ; a1 = 0
        add  a1, ar2    ; a1 = a1 + |mem[ar2]|
        cmp  a1, a0
        bgtz L1         ; if (a1 > a0) pc = L1
        clr  a1         ; a1 = 0
        add  a1, a0     ; a1 = a1 + a0
      L1:
      With an application-specific instruction:
        abs  a0, ar1
        clr  a1
        add  a1, ar2
        max  a1, a0     ; a1 = max(a1, a0)
      L1:
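      A minimal sketch (hypothetical, not the MetaCore tool) of the first kind of benchmark analysis: counting how often each opcode appears in an instruction trace produced by an ISS, one mnemonic per line on stdin.

        #include <stdio.h>
        #include <string.h>

        #define MAX_OPS 256

        int main(void) {
            char name[MAX_OPS][16];
            long count[MAX_OPS] = {0};
            int nops = 0;
            char line[64];

            while (fgets(line, sizeof line, stdin)) {
                line[strcspn(line, " \t\n")] = '\0';      /* keep the mnemonic only */
                if (line[0] == '\0') continue;
                int i;
                for (i = 0; i < nops && strcmp(name[i], line) != 0; i++)
                    ;
                if (i == nops && nops < MAX_OPS) {        /* first time seen */
                    strncpy(name[nops], line, sizeof name[0] - 1);
                    name[nops][sizeof name[0] - 1] = '\0';
                    nops++;
                }
                if (i < nops) count[i]++;
            }
            for (int i = 0; i < nops; i++)
                printf("%-8s %ld\n", name[i], count[i]);  /* instruction usage count */
            return 0;
        }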

  39. HDL Code Generator • Input: processor specification; output: synthesizable HDL code for the target core (program memory, data memory 0/1, AGU, controller with decoder logic, ALU, multiplier, shifter, BMU, register file, peripherals such as timer and SIO) • Macro-block generation: instantiates the parameter variables of each functional block (memory size, address space, bit-width of functional blocks) • Control-path synthesis: generates decoder logic for each pipeline stage • Connectivity synthesis: connects the I/O and control ports of each functional block to buses and control signals

  40. Design Example (MDSP-II) • Target application: GSM (Global System for Mobile communication) • Benchmark programs: C programs, one for each algorithm constituting GSM • Procedure of design refinement: EM0 (initial design containing all predefined instructions) → EM1 (infrequent instructions removed based on instruction usage counts) → EM2 = MDSP-II (final design containing application-specific instructions, obtained by turning frequent sequences of contiguous instructions into new instructions)

  41. Evolution of the MDSP-II Core from the Initial Machine
      • EM0 (initial): 53.0 million clock cycles (for 1 sec. of voice data processing), 18.1K gates
      • EM1 (intermediate): 53.1 million clock cycles, 15.0K gates
      • EM2 (MDSP-II): 27.5 million clock cycles, 19.3K gates
      (Figure: number of clock cycles vs. gate count for EM0, EM1, and EM2 (MDSP-II))

  42. Design Turnaround Time (MDSP-II) • Design progress with MetaCore: application analysis (7 weeks), HDL design and functional simulation (1 week), layout and timing simulation (5 weeks), reaching tape-out in about 3 months • Design turnaround is significantly reduced due to the reduction of HDL design & functional simulation time. • Only hardware blocks for application-specific instructions, if any, need to be designed by the user.

  43. Overview of EM2 (MDSP-II) • 16-bit fixed-point DSP optimized for GSM • 0.6 µm CMOS (TLM), 9.7 mm x 9.8 mm die, 55 MHz @ 5.0 V • MCAU (Mobile Communication Acceleration Unit): functional blocks for the application-specific instructions, including a 16x16 multiplier and a 32-bit adder • DALU (Data Arithmetic Logic Unit): 16x16 multiplier, 16-bit barrel shifter, 32-bit adder, data switch network • PCU (Program Control Unit) • AGU (Address Generation Unit): supports linear, modulo, and bit-reverse addressing modes • PU (Peripheral Unit): serial I/O, timer • On-chip program memory and data memory

  44. Conclusions • MetaCore, an effective ASIP design methodology for DSP, has been proposed. 1) Benchmark-driven, high-level abstraction of the processor specification enables performance/cost-effective design. 2) The generator set with the system library enables short design turnaround time.

  45. Grand Challenges and Opportunities Laid by SoC • Korea SoC Conference, Oct. 23-24, Coex Conference Center

  46. Can the success story be continued?

  47. Can the success story be continued? • In the 1960s, a country whose per capita GNP was under 100 dollars. • Today, a country with global players in major IT fields such as semiconductors, automobiles, and mobile phones. • We should be proud of our success despite all of today’s agony in the NASDAQ and the terrible political situation. However, we also need to understand why we succeeded and how this can be continued.

  48. What is critical for success in SoC Business?

  49. Sometimes the rules change in the middle of the game. • If the demands of the market and of technology (the game rules) had changed only gradually, catching up with Japan’s technology would have looked nearly impossible: a technology gap of more than 50 years. During the Japanese colonial period, an engineering college was the one thing this land did not have (three targets of eradication: language, names, technology).

  50. It’s people, people! • With the development of the Internet and of transportation, people meet more often and products and technologies circulate faster, so TTM (time-to-market) has become the key value. What matters is the ability to bring together diverse resources (designers, IP, tools) in the shortest possible time in a dynamic environment and deliver to the customer; no matter how many machines and systems you have, under such extreme conditions the final differentiation ultimately comes from people.
