Embedded Computer Architecture

Embedded Computer Architecture ASIP Application Specific Instruction-set Processor 5KK73 Bart Mesman and Henk Corporaal

Application domain specific processors (ADSP or ASIP) DSP Programmable CPU Programmable DSP Application domain specific Application specific processor flexibility efficiency Embedded Computer Archtiecture H.Corporaal and B. Mesman

implementation Appl. domain GP ADSP Appl. domain implementation Application domain specific processors (ADSP or ASIP) • takes a well defined application domain as a starting point • exploits characteristics of the domain (computation kernels) • still programmable within the domain • e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc ... performance: clock speed + ILP ILP,DLP, tuning to domain flexible dev. (new apps.) cost effective (high volume) problems - specification manual design, - design time and effort large effort => synthesized cores Embedded Computer Architecture H.Corporaal and B. Mesman

www.adelantetech.com Embedded Computer Architecture H.Corporaal and B. Mesman

OK? more appl.? Design process processor- model e.g. VLIW with shared RFs application(s) instance parameters 3 phases 1. exploration 2. hw design (layout) + processing 3. design appl. sw SW (code generation) HW design Estimations nsec/cycle, area, power/instr Estimations cycles/alg occupation Fast, accurate and early feedback no yes yes no go to phase 2 Embedded Computer Architecture H.Corporaal and B. Mesman

* + * * + * * * * + 0 0 1 2 3 1 2 3 1 3 1 2 * * * * * * 1 1 5 3 4 3 4 4 * + * + + 2 2 3 6 3 6 6 + * + * * * * + * 3 3 7 5 7 8 5 8 8 7 8 * + * * * * * 4 4 9 10 5 9 5 9 5 * + * + 9 10 9 10 ASIP/VLIW architectures: list scheduling Candidate Conflict & Scheduled IPB LIST Priority Comp. Operation * 4 OPB MULT ALU IPB OPB 5 Embedded Computer Architecture H.Corporaal and B. Mesman

x4 x3 x2 x1 x0 Z-1 Z-1 Z-1 Z-1 c4 c3 c2 c1 c0 * * * * * + y Application examples (1)

Application examples (1) 19 instructions per tap!! Embedded Computer Architecture H. Corporaal, and B. Mesman

Application examples (2) Bit level operations: finite field arithmetic 10 instructions!! Very simple in hardware

source register ($2) 27 26 25 23 22 20 srl $13, $2, 20 andi $25, $13, 1 srl $14, $2, 21 andi $24, $14, 6 or $15, $25, $24 srl $13, $2, 22 andi $14, $13, 56 or $25, $15, $14 sll $24, $25, 2 7 6 5 4 3 2 destination register ($24) Application examples (2) Bit level operations : DES example Embedded Computer Architecture H. Corporaal and B. Mesman

18 17 16 13 $5 srl $24, $5, 18 srl $25, $5, 17 xor $8, $24, $25 srl $9, $5, 16 xor xor $10, $8, $9 srl $11, $5, 13 xor $12, $10, $11 andi $13, $12, 1 … 0 ... 1 $13 Application examples (2) Bit level operations : A5 example (GSM encryption)

ASIP/VLIW architectures: feedback Embedded Computer Architecture H.Corporaal and B. Mesman

Implementation Independent Design Database Low power aspects • Estimation area + speed power Mistral2 Estimation Database Architecture Embedded Computer Architecture H.Corporaal and B. Mesman

GSM viterbi decoder : default solution EXU ACTIV AREA POWER alu_1 96% 3469 46196 romctrl_1 48% 39 259 acu_1 26% 327 1209 ipb_1 5% 131 105 opb_1 23% 1804 5801 ctrl 9821 135035 total 15591 188605 • controller responsible for 70% of power consumption • maximum resource-sharing • heavy decision-making : “main” loop with 16 metrics-computations per iteration • EXU-numbers include Registers for local storage 13750 Embedded Computer Architecture H.Corporaal and B. Mesman

GSM viterbi decoder : no loop-folding EXU ACTIV AREA POWER alu_1 92% 3411 45073 romctrl_1 45% 39 255 acu_1 25% 294 1087 ipb_1 5% 107 86 opb_1 22% 1661 5340 ctrl 4919 70087 total 10431 121928 • area down by 33% • power down by 35% • next step: reduce # of program-steps with second ALU 14247 Embedded Computer Architecture H.Corporaal and B. Mesman

GSM viterbi decoder : 2 ALU’s EXU ACTIV AREA POWER alu_1 69% 1797 12248 alu_2 65% 1393 8916 romctrl_1 67% 39 255 acu_1 37% 294 1087 ipb_1 8% 149 119 opb_1 33% 2136 6871 ctrl 8957 87235 total 14766 116731 9739 • cycle count down 30% • area up 42% • power down by 5% • next step: introduce ASU to reduce ALU-load Embedded Computer Architecture H.Corporaal and B. Mesman

GSM viterbi decoder : 1 x ACS-ASU func ACS ( M1, M2, d ) MS, MS8 = begin MS = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi; MS8 = if ( M1- d > M2+d) -> ( M1- d) || ( M2+d) fi; end; = EXU ACTIV AREA POWER alu_1 20% 261 105 acs_asu_1 83% 2382 3816 or_asu_1 10% 611 122 romctrl_1 16% 65 21 acu_1 36% 294 205 ipb_1 20% 107 43 opb_1 11% 163 35 ctrl 1864 3597 total 5747 7944 1930 • cycle count down 5X • power down20X! Embedded Computer Architecture H.Corporaal and B. Mesman

GSM viterbi decoder : 4 x ACS-ASU EXU ACTIV AREA POWER alu_1 94% 243 97 acs_asu_1 95% 1041 420 acs_asu_2 95% 1041 420 acs_asu_3 95% 1041 420 acs_asu_4 95% 1041 420 split_asu_1 47% 90 18 or_asu_1 47% 592 118 romctrl_1 28% 48 6 acu_1 98% 212 85 ipb_1 23% 60 6 opb_1 50% 369 80 ctrl 1306 555 total 7084 2645 425 • cycle count down another 5X • area up 23% • power downanother 3X! Embedded Computer Architecture H.Corporaal and B. Mesman

Implementation Independent Design Database GSM viterbi example : summary Mistral2 72x ! Embedded Computer Architecture H.Corporaal and B. Mesman

OK? OK? more appl.? Discussion: phase 3 processor- model application(s) application(s) SW (code generation) HW design SW (code generation) Freeze processor model no no no yes yes no yes Application software development: constraint driven compilation Exploration phase Embedded Computer Architecture H.Corporaal and B. Mesman

RF2 RF1 RF3 RF4 FU3 FU4 FU1 FU2 flags IR3 IR4 IR1 IR2 Instruction memory Con- trol Embedded Computer Architecture H.Corporaal and B. Mesman

Discussion: problems with VLIWs code size and instruction bandwidth • code compaction = reduce code size after scheduling • possible compaction ratio ? • e.g. p0 = 0.9 and p1 = 0.1 • information content (entropy) = - pi log2 pi = 0.47 • maximum compression factor  2 • control parallelism during scheduling = switch between • different processor models (10% of code = 90% runtime) • architecture • reduce number of control bits for operand addresses • e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only • => use stacks and fifos Embedded Computer Architecture H.Corporaal and B. Mesman

GPU basics Synthetic objects are represented with a bunch of triangles (3d) in a language/library like OpenGL or DirectX plus texture Triangles are represented with 3 vertices A vertex is represented with 4 coordinates with floating-point precision Objects are transformed between coordinate representations Transformations are matrix-vector multiplications 23

GPU DirectX 10 pipeline 24

NVIDIA GeForce 6800 3D Pipeline 25

GeForce 8800 GPU 26 330 Gflops, 128 processors with 4-way SIMD

GPU: Why more general-purpose programmable? All transformations are shading Shading is all matrix-vector multiplications Computational load varies heavily between different sorts of shading Programmable shaders allow dynamic resource allocation between shaders Result: Modern GPUs are serious competitor for general-purpose processors! 27

Fully serial Classical encoding: fetching many nops n n A n n n n n n B n n n n n n n n n n n C n n Mixed serial/parallel n n n n n D n n n n n E n n n n n B A n n C n n F n n n n n n n n n n E n D n n Fully parallel n n n n n n G n F n n n n n n n n n n n n n n H n n n n n n G H A B C D E F G H A B C D E F G H A B C D E F G H A B C D E F G H 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 Velocity encoding Embedded Computer Architecture

Conclusions • ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency). • The methodology is interesting for IP creation. • The key problem is retargetable compilation. • A (distributed) VLIW model is a good compromise between HW and SW. • Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback. • GPUs are ASIPs Embedded Computer Architecture H.Corporaal and B. Mesman

Embedded Computer Architecture