Lecture on High Performance Processor Architecture (CS05162)

Introduction on Reconfigurable Computing. An Hong (han@ustc.edu.cn), Fall 2007. University of Science and Technology of China, Department of Computer Science and Technology.


Presentation Transcript


  1. Lecture on High Performance Processor Architecture (CS05162): Introduction on Reconfigurable Computing
     An Hong, han@ustc.edu.cn
     Fall 2007
     University of Science and Technology of China, Department of Computer Science and Technology

  2. Outline
     • Understand Reconfigurable Computing
       • What is Reconfigurable Computing?
       • Motivation: Why do we need Reconfigurable Computing?
       • Principles of FPGA programmability
       • Reconfigurable Computing Paradigm
       • Reconfigurable Computing Systems
     • Case Study
       • DISC-PAS: A SoC (Supercomputing on a Chip) system on Reconfigurable Platforms
     • Issues and Challenges
     CAS ICT AN Hong

  3. What is Reconfigurable Computing (RC)?
     • Definition 1:
       • "Excuse me for a minute while I reconfigure my computer..."
     • Definition 2:
       • Using reconfigurable electronics to build systems with advantages over conventional technology in any of these areas:
         • Time to market
         • Performance
         • Power
         • Size/weight
         • Flexibility
         • Life cycle cost

  4. What is Reconfigurable Computing?
     • [Figure: data flowing through an ASIC, through a processor driven by a program held in memory, and through an RC device driven by a configuration; RC is placed on performance and flexibility axes.]

  5. What is Reconfigurable Computing?
     • Configuration
       • "Programming" the hardware
       • Implements the desired functionality in hardware
       • For example, if addition is needed, an adder is built in hardware
     • Reconfiguration
       • Changing the configuration of the device
       • Typical overhead: roughly microseconds to milliseconds
       • Two types: static and dynamic

  6. Types of Reconfiguration
     • Static/Compile-Time Reconfiguration (CTR)
       • Features
         • Configure the device once for the application
         • Reconfiguration time is not crucial
         • Do not reconfigure until the application has finished, or do not reconfigure at all
         • Hardware is still flexible in the design phase
       • Application
         • High-performance computing that needs hardware performance without the high cost of designing an ASIC
       • Uses
         • Standard boards serving multiple applications
         • In-field upgrades
         • Bug fixes
         • Functionality enhancements
       • A standard CAD flow may be acceptable

  7. Types of Reconfiguration
     • Dynamic/Run-Time Reconfiguration (RTR)
       • Motivation
         • Eliminate idle circuitry
       • Features
         • Hardware is reconfigured while the application is executing
         • Reconfiguration time is crucial
       • Partial reconfiguration
         • Part of the reconfigurable hardware is reconfigured while the rest stays the same and continues to execute
       • Duration between reconfigurations varies
         • Long duration: e.g. a router updating tables and/or protocols, a cell phone switching protocols
         • Short duration: e.g. regular expression matching, a DSP application switching between stages
       • Research issues
         • Devices to support rapid reconfiguration
         • Architectures to support RTR
         • CAD support for RTR designs
         • Application development
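
Note: the statement that reconfiguration time is crucial for RTR can be made concrete with a small cost model. The Python sketch below is only an illustration under stated assumptions: the function names, the idea of a single "speedup" factor, and the example numbers are hypothetical and do not come from the lecture.

```python
def hw_total_time(sw_time_s, speedup, reconfig_time_s):
    """Time to run a stage in hardware, including loading its configuration."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return reconfig_time_s + sw_time_s / speedup


def break_even_sw_time(speedup, reconfig_time_s):
    """Software run time above which reconfigure-then-accelerate is faster.

    Solves reconfig + t / speedup < t for t,
    which gives t > reconfig / (1 - 1 / speedup).
    """
    if speedup <= 1:
        raise ValueError("no break-even point unless the hardware is faster")
    return reconfig_time_s / (1.0 - 1.0 / speedup)


# Hypothetical numbers: 1 ms reconfiguration, hardware 10x faster than software.
threshold = break_even_sw_time(speedup=10, reconfig_time_s=1e-3)
print(f"stage must take more than {threshold * 1e3:.3f} ms in software")
```

Under this model, short-duration stages only benefit from RTR when the device reconfigures quickly, which is one way to read the "devices to support rapid reconfiguration" research issue.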

  8. Motivation: Why do we need Reconfigurable Computing?
     • Goal: balance flexibility and performance
     • Performance: application-specific integrated circuits (ASICs)
       • Highest performance
         • The architecture is optimally suited to the application
       • Lowest flexibility
         • No programmable resources
     • Flexibility: general-purpose processors (GPPs) or digital signal processors (DSPs)
       • Highest flexibility
         • ISA => programmability => flexibility
       • Performance?
         • Rather inefficient in both performance and power consumption
     • (Hardware) performance and (software) flexibility: reconfigurable computing (RC)
       • Combines features of the processor and ASIC approaches
         • High flexibility
         • High performance
       • Programmable logic allows the hardware circuits to be altered
       • Made practical by the introduction of the FPGA

  9. Motivation: Why do we need Reconfigurable Computing?
     • Application domains for accelerators based on reconfigurable hardware
       • Data encryption and decryption
       • Network security
       • Signal and image processing
       • Gene sequencing
       • Medical imaging
       • Oil and gas exploration
       • ...

  10. FPGA Basics
      • An FPGA consists of
        • A matrix of programmable logic cells
          • Implement any logic function (AND, OR, NOT, etc.)
          • Groups of cells make up higher-level structures (adders, multipliers, state logic, etc.)
        • Programmable interconnects
          • Connect the logic cells to one another
        • Embedded features
          • ASICs within the FPGA fabric for specific functions
          • Embedded multipliers, on-chip memory, microprocessors
      • Most FPGAs are SRAM-based
        • The device is configured by writing to configuration memory
      • [Figure: logic cells and interconnects]

  11. Xilinx Virtex-II Pro
      • Xilinx terminology
        • Four-input look-up table (LUT): programmable block implementing any four-input logic function
        • Slice: 2 LUTs, flip-flops, multiplexers, arithmetic logic, carry logic, and dedicated internal routing
        • Routing: the interconnection between logic resources that gives them the desired functionality
      • Features
        • Up to 44,096 slices in one chip
        • Up to 7,992 Kbits of embedded RAM
        • Up to 444 embedded multipliers
        • Up to 1,200 user I/O pins
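
Note: a four-input LUT can be pictured as a 16-entry truth table held in configuration memory, so "configuring" it means writing 16 bits. The Python sketch below is a software model for illustration only; the function names and the bit ordering (input a as the least significant index bit) are assumptions, not Xilinx definitions.

```python
def make_lut4(config_bits):
    """Return a 4-input logic function defined by a 16-bit configuration word.

    Bit i of config_bits is the output for the input combination whose
    binary encoding is i (input a is the least significant bit).
    """
    if not 0 <= config_bits < (1 << 16):
        raise ValueError("a 4-input LUT holds exactly 16 configuration bits")

    def lut(a, b, c, d):
        index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3)
        return (config_bits >> index) & 1

    return lut


def config_from_function(fn):
    """Build the 16-bit configuration word for any 4-input Boolean function."""
    bits = 0
    for index in range(16):
        a, b, c, d = [(index >> k) & 1 for k in range(4)]
        if fn(a, b, c, d):
            bits |= 1 << index
    return bits


# "Configure" a LUT as a 4-input AND, then build a second one as a 4-input XOR.
and4 = make_lut4(config_from_function(lambda a, b, c, d: a & b & c & d))
xor4 = make_lut4(config_from_function(lambda a, b, c, d: a ^ b ^ c ^ d))
print(and4(1, 1, 1, 1), xor4(1, 0, 0, 0))
```

The same lookup logic serves every function; only the stored bits change, which is the sense in which the hardware is "programmed" by writing configuration memory.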

  12. Device Trends
      • Xilinx Virtex IV (domain-optimized platform FPGA; Q4 2004)
        • Up to 200,000 logic cells
        • Up to 512 "XtremeDSP" slices
          • 18-bit by 18-bit two's complement multiplier with a full-precision 36-bit result, sign-extended to 48 bits
          • Three-input, flexible 48-bit adder/subtracter with optional registered accumulation feedback
          • Over 40 dynamic user-controlled operating modes that adapt XtremeDSP slice functions from clock cycle to clock cycle
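
Note: the arithmetic described for the XtremeDSP slice (18 x 18 two's complement multiply, 48-bit accumulate) can be modeled in a few lines. The Python sketch below reproduces only the bit widths stated on the slide; the function names are hypothetical, and it does not model the slice's operating modes, pipeline registers, or third adder input.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac48(acc, a, b):
    """One multiply-accumulate step: acc + a * b, wrapped to 48 bits."""
    a18 = to_signed(a, 18)
    b18 = to_signed(b, 18)
    product = a18 * b18  # always fits in 36 bits; Python ints carry the sign
    return to_signed(acc + product, 48)


def dot(xs, ys):
    """Dot product computed with repeated 48-bit multiply-accumulate steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac48(acc, x, y)
    return acc


print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

The 12 bits of headroom between the 36-bit product and the 48-bit accumulator let many products be summed before the accumulator can wrap.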

  13. Device Trends (figure only)

  14. FPGA: The Basis of Current Reconfigurable Computing (FPGA vs. ASIC)
      • Similarity
        • Both can implement application-specific circuits
      • Key difference
        • FPGA circuits are programmed by a configuration data stream that specifies the logic functionality and connectivity

  15. Hardware vs. Software => Spatial vs. Temporal
      • ASIC
        • Implements tasks by spatially composing operations provided by dedicated functional units such as adders or multipliers
      • Processor
        • General, fixed architecture
        • Implements tasks by temporally composing operations provided by an ALU or FPU
      • RC
        • Combines both approaches: implements tasks both spatially, as ASICs do, and temporally, as processors do

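  16. Temporal vs. Spatial Computation: y = Ax^2 + Bx + C
      • Temporal computation (one ALU, registers r1 and r2)
        • r1 := x
        • r2 := A * r1
        • r2 := r2 + B
        • r2 := r2 * r1
        • y := r2 + C
      • Spatial computation
        • [Figure: dataflow graph of multipliers and adders that takes x, A, B and C as inputs and produces y]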
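
Note: the two styles of evaluating y = Ax^2 + Bx + C can be compared side by side in software. In the Python sketch below, temporal() follows the register sequence on the slide, while the wiring in spatial() (three multipliers and two adders) is an assumption, because the original dataflow figure is not recoverable from the transcript.

```python
def temporal(x, A, B, C):
    """Temporal computation: one ALU reused over five sequential steps."""
    r1 = x          # r1 := x
    r2 = A * r1     # r2 := A * r1
    r2 = r2 + B     # r2 := r2 + B
    r2 = r2 * r1    # r2 := r2 * r1
    y = r2 + C      # y  := r2 + C
    return y


def spatial(x, A, B, C):
    """Spatial computation: every operator is its own unit in a dataflow graph."""
    x_squared = x * x       # multiplier 1
    bx = B * x              # multiplier 2, independent of multiplier 1
    ax2 = A * x_squared     # multiplier 3
    partial = ax2 + bx      # adder 1
    return partial + C      # adder 2


print(temporal(3, 2, 5, 7), spatial(3, 2, 5, 7))
```

Both functions return the same value. The temporal version needs one functional unit and five steps; the spatial version needs five units, but independent operations such as the two first multiplications can proceed at the same time in hardware.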

  17. Spatially Configurable Implementation
      • [Figure: five ALUs, numbered 01 to 05, configured as three multipliers and two adders and connected so that y is computed from the inputs x, A, B and C]

  18. Coarse Design Space for Computing Implementations
      • When is the computation defined? Pre-fabrication (hardware) or post-fabrication (software)
      • How is the computation distributed? In space or in time
      • Pre-fabrication, space: ASIC, gate array
      • Post-fabrication, space: reconfigurable
      • Post-fabrication, time: processor

  19. Instruction Binding Time
      • [Figure: spectrum from hardware to software, ordered by when the function is bound to the medium]
        • Custom VLSI: first mask
        • Gate array: metal mask
        • One-time programmable: fuse program
        • FPGA: load configuration
        • Processors: every cycle

  20. Performance vs. Flexibility
      • Goal: the performance of ASICs with the flexibility of programmable processors
      • Current RC research is working to quantify:
        • Performance
        • Power
        • Size/weight
        • Life cycle cost
        • Applicability

  21. Reconfigurable Computing Paradigm
      • Multi-mode hardware (needs static/compile-time reconfiguration)
        • Several different algorithms are executed concurrently or sequentially on the same reconfigurable hardware
      • Temporal partitioning (needs dynamic/run-time reconfiguration)
        • An algorithm is partitioned into several sections, each implemented by an individual circuit and executed sequentially
      • [Figure labels: RC1 to RC4 paired with Algorithms 1 to 4; a sequence of algorithms (Algorithm 11, 12, 13, ..., 14) feeding one RC device]
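
Note: temporal partitioning can be illustrated with a simple packing routine that splits an algorithm's stages into configurations small enough to fit the device. The Python sketch below uses a greedy strategy chosen purely for illustration; the stage names, area units, and capacity are hypothetical, and real partitioners also weigh reconfiguration time and data transfer between partitions.

```python
def temporal_partition(stages, capacity):
    """Greedily pack consecutive stages into configurations that fit the device.

    stages   -- list of (name, area) pairs in execution order
    capacity -- area available on the reconfigurable device
    Returns a list of configurations, each a list of stage names.
    """
    partitions, current, used = [], [], 0
    for name, area in stages:
        if area > capacity:
            raise ValueError(f"stage {name!r} does not fit on the device")
        if used + area > capacity:
            # Device is full: close this configuration and start a new one.
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        partitions.append(current)
    return partitions


# Hypothetical pipeline whose total area (170) exceeds the device capacity (100).
stages = [("fft", 60), ("filter", 30), ("ifft", 60), ("detect", 20)]
print(temporal_partition(stages, capacity=100))
```

Each returned configuration would be loaded in turn at run time, so the number of partitions is also the number of reconfigurations the algorithm pays for.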

  22. Reconfigurable Computing Paradigm
      • Co-processor
        • Execution is sped up by implementing the critical parts of an algorithm in reconfigurable hardware
        • [Figure: the algorithm runs on a microprocessor (uP) while its critical parts run on the RC device]
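
Note: how much a co-processor helps depends on how much of the run time the critical parts account for. The Python sketch below applies Amdahl's law, which the slides do not name, as one standard way to estimate this; the parameter names and the optional overhead term for moving data to the RC device are assumptions.

```python
def coprocessor_speedup(critical_fraction, kernel_speedup, overhead_fraction=0.0):
    """Overall speedup when only the critical part is moved to hardware.

    critical_fraction -- share of the original run time spent in the critical part
    kernel_speedup    -- how much faster that part runs on the RC device
    overhead_fraction -- extra time (as a share of the original run time)
                         spent on data transfer or reconfiguration
    """
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    new_time = (1.0 - critical_fraction) + critical_fraction / kernel_speedup
    return 1.0 / (new_time + overhead_fraction)


# Hypothetical case: 90% of the time is in the kernel, hardware runs it 10x faster.
print(round(coprocessor_speedup(0.9, 10), 2))
```

Even an extremely fast kernel cannot push the overall speedup past 1 / (1 - critical_fraction), which is why choosing the right parts to offload matters more than raw hardware speed.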

  23. Reconfigurable Computing Paradigm
      • Hardware-on-demand (needs dynamic/run-time reconfiguration)
        • Functions are switched on command, i.e. at arbitrary rather than predefined points in time
      • Dynamic adaptation (needs dynamic/run-time reconfiguration)
        • The algorithm implementation is adapted at run time depending on the incoming data
        • E.g. neural networks, adaptive filters, constant propagation
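
Note: constant propagation, listed above as an example of dynamic adaptation, means specializing a circuit once an operand becomes known at run time. The Python sketch below shows the idea for a multiplier with a known non-negative coefficient, which reduces to shifts and adds. The shift-and-add decomposition and all names are illustrative assumptions, not the method used in the lecture.

```python
def specialize_multiplier(constant):
    """Specialize 'x * constant' for a constant known at run time.

    Returns (shifts, multiply): the shift amounts a hardware version would
    wire up, and a function that multiplies using only shifts and adds.
    """
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    # One shifted copy of x per set bit in the constant.
    shifts = [i for i in range(constant.bit_length()) if (constant >> i) & 1]

    def multiply(x):
        return sum(x << s for s in shifts)

    return shifts, multiply


# A filter coefficient of 10 (binary 1010) needs two shifted copies of x.
shifts, times10 = specialize_multiplier(10)
print(shifts, times10(7))
```

When the coefficient changes, a new specialized circuit would be generated and loaded, trading reconfiguration time for a smaller and faster datapath than a general multiplier.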

  24. Reconfigurable Computing Systems
      • Combine general-purpose processors, FPGAs, memory and interconnect
      • FPGAs act as programmable processors or co-processors
      • The interconnect connects the processors and FPGAs
      • FPGAs have access to multiple levels of memory

  25. The Nature of Reconfigurable Computing Systems
      • System composition (elements and architectures)
        • Programmable logic
        • Processor or specialized custom components
      • Run-time reconfiguration (RTR): the capability of reprogramming the hardware circuits at run time
        • A crucial factor for the functionality and performance of reconfigurable systems
      • Configuration granularity: the number and size of the partitions (configurations) that make up an algorithm
        • Depends strongly on the amount of resources that the reconfigurable system provides
      • Configuration scheduling
        • Static scheduling and run-time scheduling

  26. Reconfigurable Computing Systems Architectures
      • A reconfigurable system is much more than some FPGAs on a board...
      • [Figure: layered system]
        • Application
        • Software Environment II: development system (HDL, compiler, CAD tools)
        • Software Environment I: run-time environment
        • Configuration management
        • Reconfigurable computing engine

  27. Reconfigurable Computing Systems Elements
      • A reconfigurable computing engine
        • Provides the required programmable logic
      • A configuration management mechanism
        • Controls the execution of tasks on the available reconfigurable resources
        • Its complexity and form vary widely depending on the application and the computation approach
      • A software environment
        • The development system is a vital part
        • A well-defined design flow
        • Compilers: allow easy and fast compilation of high-level software descriptions into hardware circuits
        • CAD tools: support RC mechanisms such as run-time reconfiguration

  28. Hardware Organization: DECPeRLe-1 (DEC Paris) (figure only)

  29. Hardware Organization: Splash 2 (SRC) (figure only)

  30. Organization: CHAMP (Lockheed Sanders) (figure only)

  31. Organization: SLAAC (ISI, BYU, UCLA, Sandia)
      • Multiple memories per FPGA
      • Form factors planned in the near term: PCI, PMC
      • [Figure label: outside world]

  32. Hardware Organization: SLAAC System (figure only)

  33. Software Systems (figure only)

  34. SLAAC Runtime System Goals
      • Make the network transparent
      • Support heterogeneous systems
      • Support dynamic task allocation
      • Support system scaling

  35. DISC-PAS: 973 project at ICT and USTC
      • DISC-PAS: A SoC (Supercomputing on a Chip) system on Reconfigurable Platforms; next-generation general-purpose processors
      • DISC (Distributed Instruction Stream Computer) ISA
        • Compare:
          • CISC (Complex Instruction Set Computer)
          • RISC (Reduced Instruction Set Computer)
      • PAS (Polymorphous, Adaptive and Scalable) processors
        • P (Polymorphous)
          • ILP, TLP, MLP, DLP
        • A (Adaptive)
          • Implemented via Reconfigurable and Polymorphous Computing Architecture technology
        • S (Scalable)
          • System feature

  36. The Mainstream Shape of Next-Generation General-Purpose Processor Architecture: Concept
      • 2003: ICT basic research fund project
        • Called the Scalable Adaptive Polymorphous Processors (SAPP) chip architecture
      • 2005: 973 project
        • Called the DISC-PAS chip architecture
      • Features: polymorphous + multi-core + distributed shared memory + reconfigurable resource allocation
      • Key enabling technologies
        • Microarchitecture
        • Multithreading and microprocessor, vector
        • Innovative on-chip memory
        • Communication and synchronization
        • Reconfigurable and Polymorphous Computing
        • ISA and compiler
        • Parallel programming
        • Virtual machine

  37. The Mainstream Shape of Next-Generation General-Purpose Processor Architecture: Concept
      • [Figure: a grid of processor (P) and memory (M) tiles partitioned into regions for network processing (TLP), pipelined processing (ILP), word processing (ILP), vector processing (DLP) and GUI (ILP), plus reconfigurable (redundant) resources]

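  38. The Mainstream Shape of Next-Generation General-Purpose Processor Architecture: Features
      • Polymorphism (adapts to a wide range of application types): P (Polymorphous)
        • ILP, TLP, MLP, DLP
      • Adaptivity (coarse-grained reconfigurability of the underlying hardware): A (Adaptive)
        • A compilable architecture that matches architectural resources to application characteristics and raises resource utilization
        • A collection of multiple processor core modules: what kinds of cores are needed?
          • RISC, VLIW, DSP, ASIC, ASIP, graphics chips?
        • Large amounts of on-chip local memory: what kinds of local memory are needed?
        • Interconnect and I/O: which kinds of parallelism should be optimized? How should architectural resources be balanced?
          • Processor side: DLP (accelerates multimedia), TLP (accelerates OLTP), ILP (accelerates control flow)
          • Memory side: part of a local memory, a group of local memories
      • Scalability (modular/distributed design): S (Scalable)
        • Overcomes the wire-delay problem
        • Makes distributed clock control easy to implement
        • Low power
        • Reduces design/verification/test time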

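  39. The Mainstream Shape of Next-Generation General-Purpose Processor Architecture: Expectations
      • Performance
        • Architectural resources (computation, storage and communication) match application characteristics effectively
        • Single-chip performance that scales to rival today's special-purpose designs
          • ILP: more than 10x?
          • TLP/DLP: linear speedup
          • Teraops-class chip
      • Programmability
        • Suits all important applications
        • Easy to use and compile for: code is written in a high-level language and parallelized automatically by the compiler
      • Cost and complexity of design and implementation
        • Complex designs are composed from simple ones (massive replication of identical simple structures)
        • Design once, implement many times
      • Power
        • Minimize wasted work
        • Distribute power evenly
      • Reliability
        • Redundant design to improve reliability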

  40. Conclusion
      • Conclusion
        • FPGAs provide new opportunities for improving the performance of scientific computing
        • Hardware features and constraints lead to new design tradeoffs
      • New opportunities
        • Libraries (?), IP cores
          • E.g. BLAS library on FPGA-augmented systems
        • Programming models/tools
      • Issues
        • What are the benefits and tradeoffs?
        • Can we quantify them across a range of application areas?
        • What role will reconfigurable computing play in future systems?

  41. Conclusion
      • Challenges
        • Building blocks inadequate
        • Tough to program
        • Minimal runtime support
        • Verification
        • Innovative algorithms

  42. Building Blocks Inadequate?
      • Current FPGA devices
        • ASIC replacement
          • Single-bit functions (LUTs)
          • Glue logic
        • Stand-alone devices
        • Slow configuration times
      • Current platforms
        • Lack of standards/interoperability
      • Research issues
        • Devices
          • Multi-bit functional units
          • Better arithmetic support
          • Faster reconfiguration
          • Integration with processors
        • Systems
          • Standard platforms

  43. Tough to Program?
      • Research issues
        • RTR
          • CAD support for RTR (design entry? simulation?)
          • Sharing circuitry/state between configurations
        • Programming
          • Programming models
          • Compilation from HLLs

  44. Minimal Runtime Support?
      • Research issues
        • Look and act like computers
          • Development/debug environments
          • Runtime monitoring/control
          • Programming models
        • All in the context of compile-time and run-time reconfiguration...

  45. Verification?
      • Research issues
        • Benchmarking
          • Comparisons to conventional technology
        • Performance tools
        • Analysis

  46. Algorithms and Applications?
      • HPEC'98 talks:
        • Song: A Two Teraops Embedded Mixed-Signal Radar Receiver/Processor
        • Wallace: The Role of Field Programmable Gate Arrays in Augmenting an Embedded AN/SPY-1 Radar Signal Processor
        • Pellon: RF Noise Shaping Digital Receiver Technology
        • Lucas: Configurable Micro-Accelerators for Sensor Processing
        • McCloskey: Reconfigurable Computing for High Performance Embedded Processing Systems

  47. Conclusion: New Direction?
      • General-purpose processors (including workstation/server processors, embedded processors, low-power processors, and DSPs)
        • Instruction-based
          • Flexibility to execute many varied programs
        • Fixed hardware resources
          • ALU, FPU, etc.
          • Not optimized for any particular application
        • Fixed memory access and cache replacement policies
        • Good for control-intensive applications
      • FPGAs
        • No instructions
          • Hardware is configured for a particular application
        • Parallelism
          • Multiple functional units of a given type
          • Better resource utilization than a general-purpose processor
        • Can be reconfigured for a new application
        • Memory structure tailored to the application
        • Good for data-intensive applications
