This presentation provides an in-depth look at scalable vector processors tailored for embedded systems, focusing on low power consumption and performance scalability. It introduces the Vector IRAM (VIRAM) instruction set, a coprocessor extension to the MIPS architecture. Key topics include the design of the compiler, which uses advanced vectorization techniques, and the evaluation of a clustered processor design for enhanced data-level parallelism. The approach aims to optimize area, power, and performance, making it suitable for multimedia and telecommunications applications.
Scalable Vector Processors for Embedded Systems Kozyrakis, Patterson Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing Architectures
Outline • Introduction • Instruction Set • Compiler • The Design • Evaluation • Clustered Processor • Conclusion
Introduction • Embedded processors require low power and low complexity • Performance and scalability are primary goals • Superscalar and VLIW designs exploit instruction-level parallelism (ILP) • Superscalar requires complex hardware to detect dependences • VLIW requires a very thorough compiler • Scaling either approach is difficult
Introduction • Multimedia and telecommunications workloads have data-level parallelism (DLP) • Revisit the vector architectures used in supercomputers • Introduce Vector IRAM (VIRAM)
Instruction Set • Coprocessor extension to MIPS • Vector Register File (VRF) • 32 Registers • Integer and floating point • Flag register • Vector operations • Arithmetic: integer and floating point • Logical operations • Other functions e.g. population count
Instruction Set • Loads and stores support three common access patterns and virtual addressing • Elements can be 64, 32, or 16 bits wide • The 64-bit datapath can execute multiple narrow elements in parallel • Element permutation support is limited to the patterns needed for dot products and fast Fourier transforms • Supports speculative execution using the flag register
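The slide does not name the three access patterns. In conventional vector architectures they are unit-stride, strided, and indexed (gather), so the sketch below assumes those three. The function names and the flat `mem` list are hypothetical, chosen only for illustration; they are not VIRAM mnemonics.

```python
# Sketch of three vector load patterns over a flat memory array.
# Assumption: the "three common access patterns" are unit-stride,
# strided, and indexed. All names here are illustrative only.

def load_unit_stride(mem, base, vl):
    """Load vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]

def load_strided(mem, base, stride, vl):
    """Load vl elements spaced `stride` apart, starting at base."""
    return [mem[base + i * stride] for i in range(vl)]

def load_indexed(mem, base, indices):
    """Gather: load one element per offset held in an index vector."""
    return [mem[base + j] for j in indices]

mem = list(range(100, 132))  # 32 words holding the values 100..131
unit = load_unit_stride(mem, 4, 4)
strided = load_strided(mem, 0, 8, 4)
gathered = load_indexed(mem, 2, [5, 0, 3])
```

Unit-stride access suits contiguous arrays, strided access suits walking a matrix column, and indexed access covers irregular gather/scatter patterns.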
The Compiler • Based on the PDGCS compilation system for Cray supercomputers • Uses extensive vectorization techniques: • Outer-loop vectorization • Handling of partially vectorizable constructs • Does not require special functions or custom libraries • Requires pragmas for irregular scatter/gather patterns
The Compiler • Selects the operation and element width • Recognizes reductions
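Outer-loop vectorization can be illustrated with a matrix-vector product in which the inner loop is short or carries a dependence, so the vector dimension is taken from the outer (row) loop instead. This is a minimal sketch of the idea, not the output of the VIRAM compiler; the function name is hypothetical and Python lists stand in for vector registers.

```python
# Sketch: vectorizing the OUTER loop of y[i] = sum_j A[i][j] * x[j].
# Each pass over j stands for one vector multiply-add across all rows i.
# Illustrative only; names are assumptions, not compiler-generated code.

def matvec_outer_vectorized(A, x):
    rows, cols = len(A), len(A[0])
    y = [0] * rows                             # vector accumulator, one element per row
    for j in range(cols):                      # inner loop stays sequential
        col = [A[i][j] for i in range(rows)]   # strided load of column j
        y = [yi + cij * x[j] for yi, cij in zip(y, col)]  # vector multiply-add
    return y

result = matvec_outer_vectorized([[1, 2], [3, 4], [5, 6]], [10, 1])
```

With many rows and few columns, this keeps vectors long even though the inner loop has only a handful of iterations.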
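A reduction such as a dot product is typically vectorized by accumulating element-wise partial sums in a vector register and then combining them pairwise. The sketch below assumes a hypothetical maximum vector length `mvl` that is a power of two; it shows the general technique, not the specific code the compiler emits.

```python
# Sketch: a dot-product reduction expressed as strip-mined vector
# multiply-adds followed by a pairwise (tree) reduction.
# Assumption: mvl (maximum vector length) is a power of two.

def vector_dot(a, b, mvl=32):
    acc = [0] * mvl                          # vector of partial sums
    for start in range(0, len(a), mvl):      # strip-mining over the arrays
        chunk = min(mvl, len(a) - start)     # last strip may be shorter
        for i in range(chunk):               # stands for one vector multiply-add
            acc[i] += a[start + i] * b[start + i]
    width = mvl
    while width > 1:                         # halve the vector each step
        half = width // 2
        for i in range(half):
            acc[i] += acc[i + half]
        width = half
    return acc[0]

total = vector_dot(list(range(100)), [2] * 100)
```

The halving step is where limited element permutation support matters: each step adds the upper half of the vector to the lower half.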
The Design • Coprocessor to a 64-bit MIPS core • VRF capacity is 8 KB • Each vector register can hold 32 64-bit, 64 32-bit, or 128 16-bit elements • Each lane has two 64-bit ALUs and a vector load/store unit • 13 MB of on-chip DRAM organized as 8 banks • The scalar core is a single-issue, in-order MIPS
The Design • Operates at 200 MHz with 2 W power consumption
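The element counts on this slide follow from the VRF capacity and the 32-register count given on the Instruction Set slide. A short worked calculation, with constant names chosen here for illustration:

```python
# Worked arithmetic: elements per vector register and per 64-bit datapath op.
# Values come from the slides; the constant and function names are illustrative.

VRF_BYTES = 8 * 1024      # 8 KB vector register file
NUM_VREGS = 32            # 32 vector registers
DATAPATH_BITS = 64        # width of one ALU datapath

BITS_PER_REG = VRF_BYTES * 8 // NUM_VREGS   # 2048 bits per register

def elements_per_register(width_bits):
    """How many elements of the given width fit in one vector register."""
    return BITS_PER_REG // width_bits

def elements_per_datapath_op(width_bits):
    """How many narrow elements one 64-bit datapath can process at once."""
    return DATAPATH_BITS // width_bits

per_register = {w: elements_per_register(w) for w in (64, 32, 16)}
per_op = {w: elements_per_datapath_op(w) for w in (64, 32, 16)}
```

Halving the element width doubles both the vector length and the number of elements each 64-bit ALU handles per operation.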
Clustered Processor • VIRAM has a complex VRF • Approx. 3 ports per functional unit (FU) • Proposed: replace the centralized VRF with a clustered VRF • A cluster has a datapath for one FU and a few vector registers • Each cluster has an interface to the intercluster network • Area, power, and latency per cluster are constant
Clustered Processor • Renaming is used to exploit the clustered configuration • It is done with a renaming table that identifies the source and destination registers • It can be used to implement more than 32 registers • Clustering improves scaling
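One way to picture the renaming table is as a map from each architectural vector register to a (cluster, local register) pair. The sketch below is a simplified model under that assumption; the class name, the round-robin allocation policy, and the immediate freeing of old mappings are all hypothetical and are not taken from the design described in the slides.

```python
# Sketch of a renaming table for a clustered vector register file.
# Assumptions (illustrative only): locations are (cluster, local_reg)
# pairs, allocation is round-robin across clusters, and the previous
# mapping of a destination is freed immediately. Real hardware would
# wait until all pending readers of the old value have finished.

class RenameTable:
    def __init__(self, num_clusters, regs_per_cluster):
        # Interleave clusters so consecutive allocations spread out.
        self.free = [(c, r) for r in range(regs_per_cluster)
                     for c in range(num_clusters)]
        self.map = {}  # architectural register name -> (cluster, local_reg)

    def rename(self, dest, sources):
        """Return (destination location, list of source locations)."""
        src_locs = [self.map[s] for s in sources]  # look up sources first
        new_loc = self.free.pop(0)                 # allocate a fresh location
        old_loc = self.map.get(dest)
        self.map[dest] = new_loc
        if old_loc is not None:
            self.free.append(old_loc)              # simplified reclamation
        return new_loc, src_locs

table = RenameTable(num_clusters=4, regs_per_cluster=2)
first = table.rename("v1", [])
second = table.rename("v2", ["v1"])
third = table.rename("v1", ["v1", "v2"])
```

Because the number of physical locations is `num_clusters * regs_per_cluster`, the table can back more physical registers than the 32 architectural ones.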
Conclusion • Designed for embedded systems • Area, power, and performance • Exploits DLP • Instruction set and VRF • Vectorizing compiler • Evaluation • Clustered configuration