SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)
SoC Subsystem Acceleration using Application-Specific Processors (ASIPs) Markus Willems Product Manager Synopsys
SoC Design • What to do when the performance of your main processor is insufficient? • Go multicore? • Application mapping difficult, resource utilisation unbalanced • Add hardwired accelerators? • Balanced but inflexible
SoC Design • What to do when the performance of your main processor is insufficient? ASIPs: application-specific processors • Anything between a general-purpose µP and a hardwired datapath • Deploy classic hardware tricks (parallelism and customized datapaths) while retaining programmability: hardware efficiency with software programmability
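As a concrete illustration of a customized datapath, the sketch below models a hypothetical saturating multiply-accumulate (MAC) as a single application-specific instruction. The function name and the 16-bit width are illustrative assumptions, not taken from the slides:

```python
def sat_mac(acc, a, b, bits=16):
    """Model of a hypothetical application-specific saturating
    multiply-accumulate instruction: acc + a*b, clamped to the
    signed range of `bits` bits instead of wrapping around."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, acc + a * b))

# One sat_mac replaces a multiply, an add, and two compare/select
# operations on a general-purpose processor.
print(sat_mac(30000, 100, 100))   # 40000 saturates to 32767
```

Collapsing such a multi-operation pattern into one instruction is exactly the "specialization" axis discussed below.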
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
Architectural Optimization Space • ASIP architectural optimization space: Parallelism and Specialization
Architectural Optimization Space • Parallelism • Instruction-level parallelism (ILP): orthogonal instruction set (VLIW), encoded instruction set • Data-level parallelism: vector processing (SIMD) • Task-level parallelism: multicore, multi-threading
Architectural Optimization Space • Specialization • App.-specific data types: integer, fractional, floating-point, bits, complex, vector… • App.-specific instructions • App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect… • App.-spec. data processing: any exotic operator, single or multi-cycle • App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication… relative or absolute, address range, delay slots… • Pipeline • Connectivity & storage matching the application's data flow: distributed registers, sub-ranges; multiple memories, sub-ranges
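One of the addressing modes listed above, post-modification, can be sketched as a software model. The dictionary-based AGU register is purely for exposition, a hypothetical illustration rather than any vendor's actual mechanism:

```python
def load_post_modify(mem, agu, step):
    """Model of post-modification (post-increment) addressing:
    the load uses the current address, then the address-generation
    unit (AGU) register is updated within the same instruction."""
    value = mem[agu["addr"]]
    agu["addr"] += step   # the address update is 'free' in hardware
    return value

# Stream four samples without separate pointer-update instructions
mem = [10, 20, 30, 40]
agu = {"addr": 0}
samples = [load_post_modify(mem, agu, 1) for _ in range(4)]
print(samples)   # [10, 20, 30, 40]
```

On a general-purpose processor the address update would be a separate add instruction per access; folding it into the load is a classic DSP-style specialization.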
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
32-bit ARC HS Processors: High Performance for Embedded Applications • Over 3100 DMIPS @ 1.6 GHz* • 53 mW* of power; 0.12 mm² area in a 28-nm process* • HS Family products • HS34 CCM, HS36 CCM plus I&D cache • HS234, HS236 dual-core • HS434, HS436 quad-core • Configurable so each instance can be optimized for performance and power • Custom instructions enable integration of proprietary hardware [Block diagram: ARCv2 ISA/DSP core with 10-stage pipeline, MAC & SIMD, multiplier, ALU, divider, late ALU; optional floating-point unit, JTAG, user-defined extensions, real-time trace, memory protection unit, instruction/data CCMs and caches] *Worst-case 28-nm silicon and conditions
Pedestrian Detection and HOG • Pedestrian detection • Standard feature in luxury vehicles • Moving to mid-size and compact vehicles in the next 5-10 years, driven in part by legislation • Implementation requirements • Low cost • Low power (small form factor, and/or battery powered) • Programmable (to allow for in-field SW upgrades) • The most popular algorithm for pedestrian detection is the Histogram of Oriented Gradients (HOG)
Histogram of Oriented Gradients • Scale to Multiple Resolutions: use a fixed 64x128-pixel detection window and apply this detection window to scaled frames. • Gradient Computation: apply the horizontal and vertical 3x3 Sobel operators.
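The gradient computation step can be sketched in plain Python, assuming the standard 3x3 Sobel kernels and the unsigned (0-180 degree) orientations that HOG typically uses:

```python
import math

# 3x3 Sobel kernels for horizontal (Gx) and vertical (Gy) gradients
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_at(img, y, x):
    """Gradient magnitude and unsigned orientation (degrees, 0-180)
    at an interior pixel of a grayscale image (nested lists)."""
    gx = sum(GX[j][i] * img[y + j - 1][x + i - 1]
             for j in range(3) for i in range(3))
    gy = sum(GY[j][i] * img[y + j - 1][x + i - 1]
             for j in range(3) for i in range(3))
    magnitude = math.hypot(gx, gy)
    orientation = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, orientation

# Vertical edge: left half dark, right half bright
img = [[0, 0, 255, 255]] * 3
mag, ang = sobel_at(img, 1, 1)   # mag == 1020.0, ang == 0.0
```

Production implementations vectorize this inner loop; that is precisely what the SIMD datapaths of the ASIPs below exploit.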
Histogram of Oriented Gradients • Histogram Computation: the image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients. • Normalization of the Histograms: (1) L2 normalization, (2) clipping (saturation), (3) L2 normalization. • Support Vector Machine: linear classification of histograms for every 64x128 window position. • Non-Max Suppression: cluster the multi-scale dense scan of detection windows and select unique detections.
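The histogram and normalization steps can be sketched as follows. This is a simplified model: hard bin assignment instead of HOG's usual vote interpolation, and illustrative bin/clip parameters:

```python
import math

def cell_histogram(magnitudes, orientations, nbins=9):
    """Unsigned-gradient histogram for one cell: each pixel votes
    its magnitude into the bin covering its orientation (0-180 deg).
    Standard HOG interpolates votes between bins; this sketch uses
    hard assignment for clarity."""
    hist = [0.0] * nbins
    width = 180.0 / nbins
    for m, o in zip(magnitudes, orientations):
        hist[int(o / width) % nbins] += m
    return hist

def l2hys(block, clip=0.2, eps=1e-6):
    """Block normalization: L2-normalize, clip, L2-normalize again,
    matching the three steps on the slide."""
    norm = math.sqrt(sum(v * v for v in block)) + eps
    clipped = [min(v / norm, clip) for v in block]
    norm2 = math.sqrt(sum(v * v for v in clipped)) + eps
    return [v / norm2 for v in clipped]
```

The clip step bounds the influence of any single strong gradient, which is what makes the descriptor robust to local contrast changes.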
HOG Functional Validation on ARC HS (640 x 480 pixels) • OpenCV floating-point profiling results: 2.6 G cycles per frame • Fixed-point profiling results: 2.4 G cycles per frame [Block diagram: HOG pipeline stages (grey-scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression) mapped onto ASIP1…ASIPn over a dedicated streaming interconnect (FIFOs), with AXI local interconnect, DMA, sync & I/O, ARC HS with DCCM, and subsystem control]
Task Assignment #2 [Block diagram: the HOG pipeline stages mapped onto ASIP1, ASIP2 and ASIP4 over the dedicated streaming interconnect (FIFOs), with AXI local interconnect, DMA, sync & I/O, ARC HS with DCCM, subsystem control, and L3 external DRAM]
ASIP Example: HISTOGRAM • Vector slot next to existing scalar instructions (VLIW) • 16x(8/16)-bit vector register files • 16x8-bit SRAM interface • 16x8-bit FIFO interfaces • Vector arithmetic instructions • Special registers and instructions to compute histograms • 4x size increase and 200x speedup (relative to the RISC template) • Implemented in less than 1 week
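The effect of the special histogram registers and instructions can be modeled in software. The 16-lane width below follows the slide's vector configuration, but the instruction name and exact semantics are an illustrative assumption:

```python
def vhist_update(hist, bin_indices, weights, lanes=16):
    """Software model of a hypothetical 16-lane vector
    histogram-update instruction: 16 weighted votes land in one
    issue slot. On a plain RISC this is 16 dependent
    load/add/store sequences; dedicated histogram registers in
    the ASIP remove that serial bottleneck."""
    assert len(bin_indices) == len(weights) == lanes
    for b, w in zip(bin_indices, weights):
        hist[b] += w
    return hist

# 16 pixels voting into a 9-bin orientation histogram
hist = vhist_update([0.0] * 9, [0, 4, 4, 8] * 4, [1.0] * 16)
```

Turning this read-modify-write hot loop into a single instruction is a plausible source of the reported 200x speedup over the RISC baseline.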
Task Assignment #3 [Block diagram: the HOG pipeline stages mapped onto ASIP1 to ASIP4 over the dedicated streaming interconnect (FIFOs), with AXI local interconnect, DMA, sync & I/O, ARC HS with DCCM, subsystem control, and L3 external DRAM]
Task Assignment #4 [Block diagram: as #3, with ASIP1 replaced by an enhanced ASIP1']
Task Assignment #4' [Block diagram: as #4, with an L2 SRAM added next to the ARC HS and its DCCM]
Comparison [Chart comparing task assignments #1 to #4]
Final Results • 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAMs, L2 SRAM • 30 frames/second at 500 MHz • Functionally identical to the OpenCV reference • TSMC 28 nm • ASIP gate count: 330k gates • ASIP power consumption: ~130 mW • Power/performance/area scaling via ASIPs, due to multicore partitioning, specialization and SIMD usage • Performance gains and power efficiency due to tailored instruction sets and a dedicated memory architecture
Scenario: Need for a Flexible FEC Core • Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi • Instead of duplicating FEC cores, there is a need for a re-configurable architecture at minimum power and area [Diagram: separate FEC cores (DVB-X LDPC, 802.11n LDPC, 802.11n Viterbi, 802.16e LDPC, 3GPP-LTE turbo, UMTS turbo) consolidated into a single FlexFEC (turbo/LDPC/Viterbi) core]
Architecture Refinement to Increase Throughput: ILP Increased from 2 to 6 • Before: ILP of 2, two FUs (scalar + vector unit) • After: ILP of 6, six FUs (1 scalar + 5 vector units) • No duplication of arithmetic functionality • 2 FUs for local memory access • ILP exploited to increase throughput
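A first-order model of why more functional units raise throughput, assuming the work is fully parallelizable (an idealization; real FU utilization is exactly what the exploration on the following slides measures):

```python
import math

def cycles(num_ops, fus, utilization=1.0):
    """First-order throughput estimate: independent operations
    spread across `fus` functional units at a given utilization."""
    return math.ceil(num_ops / (fus * utilization))

# Going from 2 to 6 FUs: ideally 3x fewer cycles for parallel work
print(cycles(1200, 2), cycles(1200, 6))   # 600 200
```

Dependences and unbalanced FU usage push the real number above this bound, which is why the utilization charts below matter.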
Fast Area/Performance Trade-off (40-nm logic synthesis, processor only) [Chart: 0.189 mm² vs. 0.177 mm²]
Architectural Exploration • FU Utilization: 2 → 5 FUs • Vector slot separated into different FUs without overlapping functionality • Local memory access congestion
Architectural Exploration • More Balanced FU Utilization: 5 → 6 FUs
Blox-LDPC ASIP • Latest IP available from IMEC • Instances available
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
Conclusion • ASIPs enable programmable accelerators • IP Designer enables efficient design and programming of ASIPs • "Programmable datapath" ASIPs offer performance, area and power comparable to hardwired accelerators • ASIPs enable balanced multicore SoC architectures