
Low Power System Level Design Methodologies

Low Power System Level Design Methodologies. Young-Chul Kim Chonnam National Univ. Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr/~yckim. Contents. Introduction to System Level Design Hardware and Software Co-design Re-configurable Processors Other Low Power System Level Designs.


Presentation Transcript


  1. Low Power System Level Design Methodologies Young-Chul Kim Chonnam National Univ. Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr/~yckim

  2. Contents • Introduction to System Level Design • Hardware and Software Co-design • Re-configurable Processors • Other Low Power System Level Designs

  3. Introduction to SOC • SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures. • In SOC design, chips are assembled at the IP block level (design reuse) and through IP interfaces rather than at the gate level. • SOC specs come from ICT system engineers rather than from RTL descriptions.

  4. Common Fabric for IP Blocks • Soft IP blocks are portable, but not as predictable as hard IP. • Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process. • A common fabric is required for both portability and predictability. • Wide availability: Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.

  5. Four main applications • Set-top box: mobile multimedia system, base station for the home local-area network. • Digital PCTV: concurrent use of TV, 3D graphics, and Internet services • Set-top box LAN service: wireless home networks, multi-user wireless LAN • Navigation system: steer and control traffic and/or goods transportation

  6. Types of System-on-a-Chip Designs

  7. Silicon in 2010 • Die Area: 2.5 x 2.5 cm • Voltage: 0.6 V • Technology: 0.07 µm

  8. Why Lower Power • Portable systems: long battery life, light weight, small form factor • IC priority list: power dissipation, cost, performance • Technology direction: reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed

  9. Microprocessor Power Dissipation (figure: power in W, 0 to 50, versus year, 1980 to 2000, for processors including the i286, i386 DX, i486 family, PowerPC 601/604/750, P5, P6, Pentium II 300, Pentium III 500, and Alpha 21064/21164/21264)

  10. Levels for Low Power Design

  11. Power-hungry Applications • Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management • Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders

  12. New Computing Platforms • SOC power efficiency of more than 10 GOPS/W • Higher on-chip system integration: COTS: 100 W, SOAC: 10 W (inter-chip capacitive loads, I/O buffers) • Speed & performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures • Mixed-signal systems • Reuse of IP blocks • Multiprocessor, configurable computing • Domain-specific, combined memory-logic

  13. Physical gap • Timing closure problem: layout-driven logic and RT-level synthesis • Energy efficiency requires locality of computation and storage: a good match for stream-based data processing of speech, images, and multimedia-system packets. • Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.

  14. Low Power Design Flow I

  15. Low Power Design Flow II

  16. Three Factors affecting Energy • Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing • All-in-one approach (SOC): I/O pin and buffer reduction • Voltage-reducible hardware • 2-D pipelining (systolic arrays) • SIMD: parallel processing, useful for data with parallel structure • VLIW: flexible approach

  17. Example 1: Filter: Eliminating Redundant Computations

  18. Example 2: IBM’s PowerPC Lower Power Architecture • Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution • 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU) • FPU is pipelined so a multiply-add instruction can be issued every clock cycle • Low-power 3.3-volt design • Use small complex instructions with smaller instruction length • IBM’s PowerPC 603e is RISC • Superscalar: CPI < 1 • 603e issues as many as three instructions per cycle • Low power management • 603e provides four software-controllable power-saving modes. • Copper processor with SOI • IBM’s Blue Logic ASIC: new design reduces power by a factor of 10

  19. Power-Down Techniques • Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
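The difference between slowing only the clock and lowering the voltage together with the clock can be illustrated with a first-order CMOS switching model, E = N · C_eff · V². This is a minimal sketch: the model, the one-operation-per-cycle assumption, and all numeric values (`C_EFF`, `VDD`, `F_CLK`, `N_OPS`) are illustrative assumptions, not figures from the slides.

```python
# First-order model (assumption): each operation switches an effective
# capacitance C_eff once, so energy per operation is C_eff * Vdd^2.

def switching_energy(n_ops, c_eff, vdd):
    """Energy in joules to perform n_ops operations at supply voltage vdd."""
    return n_ops * c_eff * vdd ** 2

def run_time(n_ops, f_clk):
    """Time in seconds, assuming one operation per clock cycle."""
    return n_ops / f_clk

# Hypothetical operating point, for illustration only.
N_OPS = 1_000_000      # fixed amount of work
C_EFF = 1e-9           # farads switched per operation
VDD = 3.3              # volts
F_CLK = 100e6          # hertz

baseline_energy = switching_energy(N_OPS, C_EFF, VDD)

# Halving only the clock: the work takes twice as long, but the energy
# for the fixed amount of work does not change in this model.
clock_only_energy = switching_energy(N_OPS, C_EFF, VDD)
clock_only_time = run_time(N_OPS, F_CLK / 2)

# Halving the voltage along with the clock: energy per operation drops
# with the square of the voltage.
scaled_energy = switching_energy(N_OPS, C_EFF, VDD / 2)

print("energy ratio, clock only:     ", clock_only_energy / baseline_energy)
print("energy ratio, voltage + clock:", scaled_energy / baseline_energy)
```

In this model, slowing the clock alone lowers average power but not the energy per operation, which is why the slide pairs voltage reduction with clock reduction.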

  20. Voltage vs Delay • Use variable voltage scaling or scheduling for real-time processing • Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
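The trade-off above can be sketched numerically. The block below assumes a simple long-channel delay model, t_d ∝ V / (V − V_t)², and ignores the extra capacitance that duplicated hardware and multiplexing would add. The threshold voltage, the reference supply, and the function names are all hypothetical; the point is only to show how N-way parallelism lets each unit run slower at a lower supply while throughput stays constant.

```python
def gate_delay(vdd, vt=0.7, k=1.0):
    """Relative gate delay under the assumed model t_d = k * V / (V - Vt)^2."""
    return k * vdd / (vdd - vt) ** 2

def min_voltage_for_slowdown(v_ref, slowdown, vt=0.7, tol=1e-6):
    """Lowest supply at which delay is at most `slowdown` times the delay at v_ref.

    Uses bisection; in this model delay falls monotonically as vdd rises above vt.
    """
    target = slowdown * gate_delay(v_ref, vt)
    lo, hi = vt + 1e-3, v_ref
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid    # too slow at this voltage, need a higher supply
        else:
            hi = mid    # fast enough, try a lower supply
    return hi

V_REF = 3.3        # hypothetical reference supply (volts)
N_PARALLEL = 2     # two units, each allowed to run at half the rate

v_low = min_voltage_for_slowdown(V_REF, N_PARALLEL)

# N units at f/N switch the same total capacitance per second as one unit
# at f (overhead ignored), so the power ratio reduces to (V_low / V_ref)^2.
power_ratio = (v_low / V_REF) ** 2
print("reduced supply (V):", round(v_low, 3))
print("power ratio at equal throughput:", round(power_ratio, 3))
```

A real design would add the area and switching overhead of the parallel datapath to this estimate, which reduces the saving.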

  21. Low Voltage Main Memories

  22. Why Copper Processor? • Motivation: Aluminum resists the flow of electricity as wires are made thinner and narrower. • Performance: 40% speed-up • Cost: 30% less expensive • Power: Less power from batteries • Chip Size: 60% smaller than Aluminum chip

  23. Silicon-on-Insulator • How does SOI reduce capacitance? • Junction capacitance is eliminated by placing an insulating layer (similar to glass) between the impurities and the silicon substrate • High performance, low power, low soft-error rate

  24. SOC Co-Design Challenges • Current systems are complex and heterogeneous, containing many different types of components • Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC • Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.

  25. Configurability • One million gates of reconfigurable logic, one million gates of hardwired logic. • 50 GIPS for programmable components or 500 GIPS for dedicated hardware • Reduce design risks, for which NRE costs will become dominant • 1 V supply with power in the watt range

  26. Bridging the architectural gap • Product reliability: design at a level far above the RT level, with reuse factors in excess of 100 • Trade-off: 100 MOPS/watt (microprocessor) versus 100 GOPS/watt (hardwired) • Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)

  27. Implementing Digital Systems

  28. H/W and S/W Co-design

  29. Hardware/Software Co-Design Flow

  30. Three Co-Design Approaches • N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development”, IFIP International Conference FORTE/PSTV’98, Nov. ’98 • ASIP co-design: starts with an application, builds a specific programmable processor, and translates the application into software code. H/w and s/w partitioning includes the instruction set design. • H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma • H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation, and communication transformation. Coware (powerful), Siera (reuse), Ptolemy (DSP)

  31. Mixing H/W and S/W • Argument: Mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc. • Counterpoint: From a design standpoint, it is the worst of both worlds • Simulation: problems of verification and test become harder • Interface: too many tools, too many interactions, too much heterogeneity • Hardware/software partitioning is “AI-complete”!

  32. Partitioning • Performance Requirements • Some functions are easier to implement in hardware • Blocks that are used repeatedly • Blocks with a parallel structure • Modifiability • Blocks implemented in software are easy to modify • Implementation Cost • Blocks implemented in hardware can be shared • Scheduling • Schedule the blocks partitioned into HW and SW so that they meet the given constraints • SW operations must be scheduled sequentially • SW and HW can be scheduled concurrently as long as there are no data or control dependencies

  33. Low power partitioning approach • Different HW resources are invoked according to the instruction executed at a specific point in time • During the execution of an add operation, the ALU and registers are used, but the multiplier is idle. • Non-active resources still consume energy since their circuits continue to switch • Calculate the wasted energy • Add application-specific cores and run them selectively: whenever one core is executing, all the other cores are shut down
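The "calculate the wasted energy" step can be sketched as follows. The resource names, the per-cycle idle energies, and the instruction-to-resource map are hypothetical values chosen for illustration; the slides do not give the actual cost model.

```python
# Hypothetical idle energy per cycle (in pJ) burned by each resource
# when it is clocked but not used by the current instruction.
IDLE_ENERGY_PJ = {"alu": 2, "multiplier": 5, "regfile": 1}

# Hypothetical map from instruction to the resources it actually uses.
USES = {
    "add": {"alu", "regfile"},
    "mul": {"multiplier", "regfile"},
}

def wasted_energy(trace, shut_down_idle=False):
    """Sum the idle energy of every resource not used in each cycle.

    With shut_down_idle=True, unused resources are assumed to be fully
    shut down, so nothing is wasted in this simple model.
    """
    if shut_down_idle:
        return 0
    total = 0
    for instr in trace:
        for resource, idle_pj in IDLE_ENERGY_PJ.items():
            if resource not in USES[instr]:
                total += idle_pj
    return total

trace = ["add", "add", "mul"]
print("wasted energy (pJ):", wasted_energy(trace))
print("with shutdown (pJ):", wasted_energy(trace, shut_down_idle=True))
```

For the two add cycles the idle multiplier is charged, and for the mul cycle the idle ALU is charged, matching the add example in the bullets above.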

  34. Partitioning Process - Derive a graph G (operations and connections) - Decompose G into a set of clusters (cluster: a set of operations) - Calculate bus-traffic energy - Pre-select clusters with constraints - Set the number of resources - List scheduling - Test the utilization rate (ASIC or µP); the utilization rate of the µP is provided by a SW estimation tool

  35. Design Flow - Up to 94% energy saving and, in most cases, even reduced execution time - 16k cell overhead
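The list-scheduling and utilization-rate steps for one cluster might look like the sketch below. It assumes unit-latency operations, a single resource type with `n_units` identical units, an acyclic dependency graph, and alphabetical order as the priority function (a real list scheduler would typically use a priority such as critical-path length). The function names and the example graph are hypothetical.

```python
def list_schedule(deps, n_units):
    """Assign a start cycle to each operation of a cluster.

    deps maps each operation to the set of operations it depends on.
    Assumes an acyclic graph and unit latency for every operation.
    """
    remaining = {op: set(preds) for op, preds in deps.items()}
    start = {}
    cycle = 0
    while remaining:
        # An operation is ready once all predecessors finished in an earlier cycle.
        ready = sorted(
            op for op, preds in remaining.items()
            if all(p in start and start[p] < cycle for p in preds)
        )
        for op in ready[:n_units]:      # at most n_units operations per cycle
            start[op] = cycle
            del remaining[op]
        cycle += 1
    return start

def utilization_rate(schedule, n_units):
    """Fraction of unit-cycles that actually execute an operation."""
    length = max(schedule.values()) + 1
    return len(schedule) / (n_units * length)

# Hypothetical cluster: c needs a and b, d needs c.
cluster = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
schedule = list_schedule(cluster, n_units=2)
rate = utilization_rate(schedule, n_units=2)
print(schedule, rate)
```

A low utilization rate on the chosen resources suggests the cluster is a poor candidate for dedicated hardware, which is the kind of test the last step of the process describes.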

  36. Interface • Need for an interface block • Transfers data between hardware and software blocks • Only an efficient interface block can reduce the overhead between HW/SW blocks • Interface methods • Shared memory • FIFO • Handshaking protocol

  37. Logical Bus Architecture • System Bus Signals • address, data, control signals • address space consists of the memory space & I/O space • memory space: memory of the SW component • I/O space: ports within SW & registers in other HW • Port Signals • These are specialized signals capable of directly interfacing between SW & HW components • Interrupt Signals • Raised when SW & HW components have completed an operation, or when an error condition is detected

  38. Co-Simulation • Need for co-simulation • Simulating the HW part and the SW part together makes it possible to predict the results of the assembled system • Predicts system performance so that the system can be redesigned to meet the given spec before synthesis • Predicts the characteristics of each sub-block for HW/SW partitioning • Co-simulation tools • Ptolemy • COSSAP • POLIS
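Of the interface methods listed, the FIFO is the easiest to sketch in software. The class below is a behavioral model only, assuming a fixed-depth ring buffer with `full` and `empty` status flags that a producer (for example the SW side) and a consumer (for example a HW accelerator) would poll. The class name and its methods are hypothetical and do not come from the slides.

```python
class HwSwFifo:
    """Behavioral model of a bounded FIFO between a SW block and a HW block."""

    def __init__(self, depth):
        self._buf = [None] * depth
        self._depth = depth
        self._head = 0      # next position to read
        self._tail = 0      # next position to write
        self._count = 0     # number of words currently stored

    @property
    def full(self):
        return self._count == self._depth

    @property
    def empty(self):
        return self._count == 0

    def push(self, word):
        """Producer side: returns False (word not accepted) when full."""
        if self.full:
            return False
        self._buf[self._tail] = word
        self._tail = (self._tail + 1) % self._depth
        self._count += 1
        return True

    def pop(self):
        """Consumer side: returns None when there is nothing to read."""
        if self.empty:
            return None
        word = self._buf[self._head]
        self._head = (self._head + 1) % self._depth
        self._count -= 1
        return word
```

The depth is the main design parameter: a deeper FIFO decouples the two sides more but costs area, while a shallow one forces the producer to stall more often.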

  39. Partitioning Example: CDMA Searcher - vada Lab. SKKU

  40. Approach - vada Lab. SKKU - Software oriented design - Dark block: Hardware - Interface: Control signal generation - Partitioned in terms of speed cost - Change from SW to HW 1. Implementation speed 2. Parallel architecture

  41. Result - vada Lab. SKKU

  42. Low Power CDMA Searcher Project at SKKU • Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques • Development period: 1999.3.1 - 2000.2.28 (12 months) • Purpose and method: RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for CDMA handsets. Operating frequency: 12.5 MHz. Low-power design using rescheduling on the data flow graph, pre-computation, strength reduction, and a synchronous accumulator; area and power reduced by up to 67.68% and 41.35%, respectively. H/W and S/W co-design techniques applied. • San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999. • Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.

  43. Application-Specific Instruction Processor • Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control) • ASIP characteristics • Greater design cost (processor + compiler) • Higher performance and lower power than commercial cores, more flexibility than ASIC

  44. ASIP Design • Given a set of applications, determine the microarchitecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set) • To accurately evaluate the performance of the processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code. • The microarchitecture of the processor is a design parameter!

  45. ASIP Design Flow

  46. Compiler Optimizations • Machine independent optimizations • Parallelizing transformations, common sub-expression elimination, constant propagation, strength reduction, loop-invariant code motion • Machine dependent optimizations • Loop unrolling and software pipelining • Static allocation (non-recursive procedure calls) • Storage layout (arrays, scalars) • Optimization of mode setting instructions • Instruction selection, scheduling, and register allocation

  47. Cross-Disciplinary nature • Software for low power: loop transformation leads to much higher temporal and spatial locality of data. • Code size becomes an important objective; software will eventually become a part of the chip • Behavior-platform-compiler codesign: codesigned with C++ or Java, describing their h/w and s/w implementation. • Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)
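Two of the machine-independent optimizations named above, loop-invariant code motion and strength reduction, can be shown as a hand-applied before/after pair. A compiler would perform these on its intermediate representation; the Python functions and their names here are only a hypothetical illustration of the source-level effect.

```python
def row_addresses_naive(rows, width, base):
    """Before: one multiply and one invariant recomputation per iteration."""
    out = []
    for r in range(rows):
        header = base * 2           # does not depend on r (loop invariant)
        offset = r * width          # multiplication by the loop index
        out.append(header + offset)
    return out

def row_addresses_optimized(rows, width, base):
    """After: invariant hoisted, multiplication replaced by a running sum."""
    out = []
    header = base * 2               # loop-invariant code motion
    offset = 0
    for _ in range(rows):
        out.append(header + offset)
        offset += width             # strength reduction: add instead of multiply
    return out
```

Both functions return the same list for the same arguments; the optimized version simply executes fewer and cheaper operations per iteration, which is where the energy saving comes from.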

  48. VLSI Signal Processing Design Methodology • pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering • bit-serial, bit-parallel and digit-serial architectures, carry save architecture • redundant and residue systems • Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems

  49. Low Power DSP • DO-LOOP dominant • VSELP Vocoder: 83.4% • 2D 8x8 DCT: 98.3% • LPC computation: 98.0% • DO-LOOP power minimization ==> DSP power minimization • VSELP: Vector Sum Excited Linear Prediction • LPC: Linear Prediction Coding

  50. Loop unrolling • The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality. For an unrolling factor of 2, loop overhead is cut in half because two original iterations are performed in each iteration of the unrolled loop. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
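The slide's loop itself is not in the transcript, so the sketch below uses an assumed loop body, a(i) = a(i) + a(i-1) * a(i+1), chosen because it references A(i) and A(i+1) twice once unrolled by a factor of 2. It is written in Python for readability; a compiler would apply the same transformation to the generated loop code.

```python
def update_rolled(values):
    """Original loop: one element updated per iteration, step 1."""
    a = list(values)
    n = len(a)
    for i in range(1, n - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def update_unrolled(values):
    """Same computation unrolled by u = 2: two updates per iteration, step 2."""
    a = list(values)
    n = len(a)
    i = 1
    # Main loop: runs while both i and i + 1 are valid original iterations.
    while i + 1 < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]          # uses a[i], a[i+1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]      # uses a[i], a[i+1] again
        i += 2
    # Cleanup: one leftover iteration when the trip count is odd.
    if i < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a
```

The cleanup step is what keeps the unrolled version correct for any array length; without it, an odd trip count would silently skip the last update.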
