
Quantitative Evaluation of MPSoC with Many Accelerators; Challenges and Potential Solutions








  1. Quantitative Evaluation of MPSoC with Many Accelerators; Challenges and Potential Solutions

  2. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of the current accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  3. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  4. Heterogeneous MPSoCs • Integrated solutions for a group of evolving markets • ILP processors (e.g., CPU, DSP, or even GPU) • + Flexibility • - Power dissipation • Custom HW accelerators (ACCs) for compute-intensive kernels • + Power efficiency • - Cost • - Inflexibility • What is the trend?

  5. Specialization as an MPSoC trend • Increasing demand for high-performance, low-power computing • Market examples: embedded vision, Software Defined Radio (SDR), Cyber-Physical Systems (CPS) • Tens of billions of operations per second • Less than a few watts of power • Trend: domain-specific specialization • Proliferating number of ACCs per system • ACC-rich MPSoC

  6. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  7. Principles of the current accelerator-rich MPSoC • Example event sequence for one job, all coordinated by the ILP: 1. Input done, 2. DMA start, 3. DMA done, 4. DMA start, 5. DMA done, 6. ACC1 start, 7. ACC1 done, 8. DMA start, 9. DMA done, 10. DMA start, 11. DMA done, 12. Output start, 13. Output done • ILP + HW-ACC composition • HW-ACC: executes compute-intensive kernels/applications • ILP: executes the remaining application code, orchestrates the HW-ACCs, and coordinates data movement • On-chip scratchpad memory (SPM): keeps data between the ILP and the ACCs on-chip, avoiding costly off-chip memory accesses
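
To make the orchestration load concrete, here is a minimal, self-contained Python sketch (hypothetical `Device` class and names, not the deck's SpecC model) of what the ILP does for a single job in this template: it starts every DMA transfer and every ACC run, and services an interrupt when each completes.

```python
# Sketch of the ILP-side orchestration for one job in the current ACC-rich
# template; every "done" event corresponds to an interrupt serviced by the ILP.

class Device:
    """Stand-in for a DMA engine or an accelerator that signals completion."""
    def __init__(self, name):
        self.name = name
        self.interrupts = 0

    def start(self, src, dst):
        print(f"{self.name}: start {src} -> {dst}")

    def wait_done(self):
        # In hardware this is an interrupt to the ILP; here we just count it.
        self.interrupts += 1
        print(f"{self.name}: done (interrupt to ILP)")


def ilp_process_one_job(dma, acc):
    dma.start("input I/F SPM", "ACC1 input SPM")    # steps 2-3: DMA start/done
    dma.wait_done()
    acc.start("ACC1 input SPM", "ACC1 output SPM")  # steps 6-7: ACC1 start/done
    acc.wait_done()
    dma.start("ACC1 output SPM", "shared memory")   # steps 8-9: DMA start/done
    dma.wait_done()


dma, acc1 = Device("DMA"), Device("ACC1")
ilp_process_one_job(dma, acc1)
print("interrupts serviced by the ILP for one job:", dma.interrupts + acc1.interrupts)
```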

  8. MPSoC with many accelerators • Control and interrupt lines for ACC configuration • Centralized vs. dedicated DMA for streaming data transfers • Scratchpad memories (SPMs): two per accelerator and one per I/O interface, each holding one input job

  9. Challenges with an increasing number of accelerators • NEED to quantitatively consider this architecture! • 1. Memory requirement: two SPMs per ACC, one SPM per I/O interface, plus shared memory to hold the data handed over between accelerators • 2. High traffic volume over the system fabric: no point-to-point connections between ACCs, so DMA data transfers are required • 3. ILP synchronization load: among accelerators, I/O interfaces, and DMA transfers

  10. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  11. Previous work on composing ACCs • Composing bigger applications out of many accelerators, e.g. Accelerator-Rich CMPs [1] and CHARM [2] • Imposes considerable traffic and requires considerable on-chip buffering for accelerator data exchange • Leaves the ILP loaded with orchestrating the system of accelerators • [1] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference (DAC '12), pages 843–849, 2012. • [2] M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. IEEE Computer Architecture Letters, 9(2):53–56, Feb. 2010.

  12. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  13. Quantitative exploration of accelerator-rich MPSoC: why and how • Applicability of quantitative exploration • Quantifies the potential challenges • Exposes the ACC-rich bottlenecks as the number of ACCs increases • Helps system architects properly size the system knobs (SPM sizes, # of ACCs, communication BW) • Motivates our proposed architecture-template solution • Approaches to quantitative exploration: 1. First-order, calculation-based analysis 2. Simulation-based analysis of the ACC-rich MPSoC

  14. Exploration overview • Assumptions: one HD-resolution frame as input, divided into smaller jobs; on-chip memory only (off-chip memory is avoided for now) • Exploration steps: memory requirement as the number of ACCs increases; sizing the SPMs to satisfy the memory budget; interrupt-rate load on the ILP

  15. Memory size analysis (calculation based) • Memory size = SPMs + shared memory • Each SPM holds one job, so the job size determines the minimum SPM and shared-memory sizes • The shared memory holds all jobs exchanged among the ACCs • More ACCs require more memory; bigger jobs require more memory • Limiting the memory budget therefore means sizing the job with respect to that budget
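
A first-order sketch of this memory model, with the SPM counts taken from the earlier slides and an assumed shared-memory occupancy of one buffered job per ACC (the exact occupancy is not given in the deck):

```python
# First-order on-chip memory model: two SPMs per ACC, one SPM per I/O
# interface, each holding one job, plus shared memory for handed-over jobs.

def total_onchip_memory(num_accs, num_io, job_bytes):
    spm_bytes = (2 * num_accs + num_io) * job_bytes
    shared_bytes = num_accs * job_bytes   # assumption: one buffered job per ACC
    return spm_bytes + shared_bytes

# More ACCs or a bigger job both inflate the on-chip memory requirement.
for n_acc in (4, 8, 16):
    total = total_onchip_memory(n_acc, num_io=2, job_bytes=64 * 1024)
    print(n_acc, "ACCs ->", total // 1024, "KiB of on-chip memory")
```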

  16. Job sizing (calculation based) • Count the number of interrupts and measure the ILP load required to service them • A smaller job size issues more interrupts to the ILP, since the ILP must synchronize every ACC transaction • The smaller the memory, the smaller the job; the more accelerators, the smaller the job
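
A sketch of this trade-off under illustrative assumptions (the HD frame format, the memory budget, and two interrupts per job per ACC are assumptions, not figures from the deck): inverting the memory model gives the largest job that fits the budget, and the interrupt count then follows from the number of jobs per frame.

```python
# Job sizing and its interrupt cost: a fixed on-chip budget forces a smaller
# job as ACCs are added; smaller jobs mean more jobs per frame and therefore
# more interrupts for the ILP to service.
import math

FRAME_BYTES = 1920 * 1080 * 2   # one HD frame, assuming 2 bytes per pixel
IRQS_PER_JOB_PER_ACC = 2        # assumption: e.g. a DMA-done and an ACC-done per job

def max_job_bytes(budget_bytes, num_accs, num_io=2):
    # Invert the memory model above: budget = (2*ACCs + IO + ACCs) * job_size
    return budget_bytes // (3 * num_accs + num_io)

def interrupts_per_frame(budget_bytes, num_accs):
    job = max_job_bytes(budget_bytes, num_accs)
    jobs = math.ceil(FRAME_BYTES / job)
    return jobs * num_accs * IRQS_PER_JOB_PER_ACC

for n_acc in (4, 8, 16):
    print(n_acc, "ACCs ->", interrupts_per_frame(2 * 1024 * 1024, n_acc), "interrupts per frame")
```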

  17. Simulation platform (SCE refinement) • A simulation model developed in the SpecC SLDL • Scalable number of ACCs, with different or identical data rates • ILPs, DMAs, and memories (SPMs, shared memory, on-chip and off-chip memory) • Generated ACC-rich simulation model: BFM AMBA-AHB communication fabric, ARM9 (ISA v6) for ILP execution, priority-based arbitration, dedicated interrupt lines, centralized DMAs
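
For reference, a plain configuration sketch (hypothetical field names, independent of the SpecC/SCE tooling) of the knobs the scalable simulation model is generated from:

```python
# Hypothetical description of the generator knobs listed on this slide.
from dataclasses import dataclass

@dataclass
class AccRichConfig:
    num_accs: int                    # scalable number of accelerators
    acc_data_rates: list             # different or identical per-ACC data rates (bytes/s)
    num_ilps: int = 1                # ARM9 (ISA v6) cores running the ILP software
    num_dmas: int = 1                # centralized DMA engines
    spm_bytes: int = 64 * 1024       # per-SPM size (assumption)
    shared_mem_bytes: int = 1 << 20  # shared on-chip memory size (assumption)
    fabric: str = "AMBA-AHB BFM, priority-based arbitration"

cfg = AccRichConfig(num_accs=8, acc_data_rates=[100e6] * 8)
print(cfg)
```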

  18. # of interrupts by scaling #ACCs (simulation based) • Smaller memory / more ACCs -> smaller job • More interrupts reach the ILP with a smaller job size: significant utilization, or even saturation, of the ILP just for driving the accelerators • Plot: # of interrupts vs. number of accelerators, for different on-chip memory sizes

  19. Communication overhead analysis (calculation based) • Communication overhead = data exchanged through the system fabric • More ACCs means heavier traffic on the system fabric
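
A sketch of this estimate under stated assumptions: without point-to-point ACC links, each job is DMA-transferred into and out of every ACC over the shared fabric, so per-frame traffic grows linearly with the number of ACCs.

```python
# Fabric traffic per frame in the current template; constants are assumptions.
FRAME_BYTES = 1920 * 1080 * 2   # one HD frame, assuming 2 bytes per pixel

def fabric_traffic_bytes(num_accs, transfers_per_acc=2):
    # assumption: every byte of the frame is DMA'd in and out of each ACC
    return FRAME_BYTES * num_accs * transfers_per_acc

for n_acc in (4, 8, 16):
    print(n_acc, "ACCs ->", fabric_traffic_bytes(n_acc) / 1e6, "MB moved per frame")
```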

  20. Exploration summary • Problems associated with the current accelerator-rich architecture: on-chip memory requirements, ILP synchronization load, heavy communication traffic on the system fabric • Demand for an improved ACC-centric design that tackles the challenges of the current ACC-rich architecture

  21. Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation

  22. Goals of the proposed ACC-centric architecture • The proposed solution: an autonomous accelerator chain • Relieves the ILP's synchronization load • Point-to-point connections between accelerators • No need for a large SPM per accelerator • No frequent DMA data transfers • No heavy traffic on the system fabric

  23. Simulation platform (SCE refinement) • The developed SpecC model is modified to support an autonomous chain of accelerators, with gateways managing the chain • A second ACC-rich simulation model is generated: BFM AMBA-AHB communication fabric, ARM9 (ISA v6) for ILP execution, dedicated interrupt lines from the gateways to the ILP, centralized DMA

  24. The proposed accelerator-centric architecture template • Point-to-point accelerator connections: little added memory requirement and few DMA data transfers • Autonomous ACC chain: light ILP synchronization load no matter how many accelerators • Operation: 1. The DMA brings data to the input gateway's SPM 2. The input gateway receives the data and starts passing it through the chain 3. The chain works on the data 4. The output gateway gathers the results in its SPM 5. The DMA moves the data back to memory • Gateways, controlled by the ILP, manage the whole chain: an SPM to receive/send data from/to memory, control lines from the ILP to the gateways for configuration, interrupt lines from the gateways to the ILP • Point-to-point connections within the chain, with small buffers in between • The chain works independently of the ILP
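
A minimal behavioural sketch (hypothetical classes, not the SpecC model) of the operation numbered above: the ILP only configures the two gateways, and jobs flow ACC to ACC through small point-to-point buffers rather than over the system fabric.

```python
# Autonomous chain: the gateways are the only blocks the ILP talks to.

class Acc:
    def __init__(self, name):
        self.name = name

    def process(self, job):
        return f"{job}|{self.name}"      # stand-in for the kernel computation


class Gateway:
    """Input/output gateway with a small SPM for exchanging data with memory."""
    def __init__(self):
        self.spm = []


def run_chain(accs, input_gw, output_gw, jobs):
    input_gw.spm.extend(jobs)            # 1. DMA fills the input gateway's SPM
    for job in input_gw.spm:             # 2. gateway pushes jobs into the chain
        for acc in accs:                 # 3. chain forwards each job ACC to ACC
            job = acc.process(job)
        output_gw.spm.append(job)        # 4. output gateway gathers the results
    return output_gw.spm                 # 5. DMA drains results; ILP notified once


chain = [Acc("ACC1"), Acc("ACC2"), Acc("ACC3")]
print(run_chain(chain, Gateway(), Gateway(), ["job0", "job1"]))
```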

  25. Evaluation • More ACCs: current arch -> smaller job; proposed arch -> almost the same job size • More ACCs: current arch -> linear growth in memory requirement; proposed arch -> almost constant memory requirement • More ACCs: current arch -> heavier traffic; proposed arch -> almost the same data traffic • More ACCs: current arch -> exponential growth in interrupts; proposed arch -> the same number of interrupts

  26. Summary • Specialization as a growing trend in MPSoCs • Accelerator-rich architectures • Exploration of the challenges in the current accelerator-rich architecture: memory requirement, communication overhead, synchronization load • The proposed accelerator-centric architecture template: an autonomous accelerator chain with no large memory requirement, no heavy communication traffic, and no critical synchronization load

  27. Questions? Again, thanks to Professor Schirner for all his support. Thanks to Hamed for what I have been learning from him. Thank you to all ESL members for your attendance!
