
Executive Summary



Presentation Transcript


  1. Executive Summary

  2. Overall Architecture of ARC
  • Multiple cores and accelerators
  • Global Accelerator Manager (GAM)
  • Shared L2 cache banks and NoC routers shared between multiple accelerators
  [Figure: GAM, accelerator + DMA + SPM, shared router, shared L2 $, core, memory controller]

  3. What are the Problems with ARC?
  • Dedicated accelerators are inflexible
    • An LCA may be useless for new algorithms or new domains
    • Often under-utilized
  • LCAs contain many replicated structures
    • fp-ALUs, DMA engines, SPMs
    • Unused when the accelerator is unused
  • We want flexibility and better resource utilization
    • Solution: CHARM
  • Private SPM is wasteful
    • Solution: BiN

  4. A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
  • Motivation
    • Tasks performed by accelerators tend to have a great deal of data parallelism
    • Variety of LCAs with possible overlap; utilization of any particular LCA is somewhat sporadic
    • It is expensive to have both sufficient diversity of LCAs to handle the various applications and sufficient quantity of a particular LCA to handle the parallelism
    • Overlap in functionality: LCAs can be built from a limited number of smaller, more general blocks, the accelerator building blocks (ABBs)
  • Idea
    • Flexible accelerator building blocks (ABBs) that can be composed into accelerators
    • Leverage economy of scale

  5. Microarchitecture of CHARM
  • ABB: accelerator building block
    • Primitive component that can be composed into accelerators
  • ABB island
    • Multiple ABBs
    • Shared DMA controller, SPM, and NoC interface
  • ABC: Accelerator Block Composer
    • Orchestrates the data flow between ABBs to create a virtual accelerator
    • Arbitrates requests from cores
  • Other components: cores, L2 banks, memory controllers

  6. CAMEL
  • What are the problems with CHARM?
    • What if new algorithms introduce new ABBs?
    • What if we want to use this architecture in multiple domains?
  • Add programmable fabric to increase longevity and domain span

  7. CAMEL
  • The ABC is now also responsible for allocating the programmable fabric

  8. Extensive Use of Accelerators
  • Accelerators provide high power efficiency relative to general-purpose processors
    • IBM wire-speed processor, Intel Larrabee
    • ITRS 2007 system-drivers prediction: close to 1,500 accelerators by 2022
  • Two kinds of accelerators
    • Tightly coupled: part of the datapath
    • Loosely coupled: shared via the NoC
  • Challenges
    • Accelerator extraction and synthesis
    • Efficient accelerator management: scheduling, sharing, virtualization, ...
    • Friendly programming models

  9. Architecture Support for Accelerator-Rich CMPs (ARC) [DAC’12]
  • Motivation
    • Managing accelerators through the OS is expensive: "open" invokes the driver and returns a handle (called once), while read/write is called many times
    • In an accelerator-rich CMP, management should be cheaper in both time and energy
  [Figure: App, OS, Accelerator Manager, CPU, Accelerator]

  10. Overall Architecture of ARC
  • Multiple cores and accelerators
  • Global Accelerator Manager (GAM)
  • Shared L2 cache banks and NoC routers shared between multiple accelerators
  [Figure: GAM, accelerator + DMA + SPM, shared router, shared L2 $, core, memory controller]

  11. Overall Communication Scheme in ARC, step 1: the core requests a given type of accelerator (lcacc-req).

  12. Overall Communication Scheme in ARC, step 2: the GAM responds with a "list + waiting time" or a NACK.

  13. Overall Communication Scheme in ARC, step 3: the core reserves an accelerator (lcacc-rsv) and waits.

  14. Overall Communication Scheme in ARC, step 4: the GAM ACKs the reservation and sends the core ID to the accelerator.

  15. Overall Communication Scheme in ARC, step 5: the core shares a task description with the accelerator through memory and starts it (lcacc-cmd).
  • The task description consists of:
    • Function ID and input parameters
    • Input/output addresses and strides

  16. Overall Communication Scheme in ARC, step 6: the accelerator reads the task description and begins working, overlapping reads/writes from/to memory with compute, and interrupting the core on a TLB miss.

  17. Overall Communication Scheme in ARC, step 7: when the accelerator finishes its current task, it notifies the core.

  18. Overall Communication Scheme in ARC, step 8: the core then sends a message to the GAM freeing the accelerator (lcacc-free).
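The eight-step handshake above can be sketched in software. This is a minimal simulation, not the paper's implementation: the lcacc-req/lcacc-rsv/lcacc-free instructions are modeled as method calls, and the task-description field names are assumptions based on slide 15.

```python
class GAM:
    """Global Accelerator Manager: tracks free loosely coupled accelerators."""
    def __init__(self, lcas):
        self.free = dict(lcas)                 # accelerator type -> free LCA ids

    def lcacc_req(self, lca_type):
        """Steps 1-2: core asks for a type; GAM answers with a list or NACK."""
        ids = self.free.get(lca_type, [])
        return ("LIST", list(ids), 0) if ids else ("NACK",)

    def lcacc_rsv(self, lca_type, lca_id, core_id):
        """Steps 3-4: core reserves; GAM ACKs and passes the core id on."""
        self.free[lca_type].remove(lca_id)
        return ("ACK", core_id)

    def lcacc_free(self, lca_type, lca_id):
        """Step 8: core releases the accelerator back to the GAM."""
        self.free[lca_type].append(lca_id)

# Step 5: task description shared through memory (fields from slide 15;
# the concrete values here are invented for illustration).
task_description = {
    "function_id": 0x2D,                       # which function the LCA runs
    "parameters": [1.0, 0.5],                  # input parameters
    "in_addr": 0x1000, "in_stride": 64,        # input address and stride
    "out_addr": 0x2000, "out_stride": 64,      # output address and stride
}

gam = GAM({"FFT": [0, 1]})
status, ids, wait = gam.lcacc_req("FFT")             # steps 1-2
ack, core = gam.lcacc_rsv("FFT", ids[0], core_id=3)  # steps 3-4
# Steps 5-7: lcacc-cmd starts the LCA on task_description; the LCA reads it,
# overlaps DMA with compute, and notifies the core when done.
gam.lcacc_free("FFT", ids[0])                        # step 8
```

Note how the GAM only brokers the reservation: the actual task data flows between the core and the accelerator through shared memory, which is what keeps the OS off the critical path.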

  19. Accelerator Chaining and Composition
  • Chaining: efficient accelerator-to-accelerator communication
  • Composition: constructing virtual accelerators
  [Figure: two accelerators linked through their scratchpads and DMA controllers; FFT virtualization composing M-point 1-D FFTs into an N-point 2-D FFT]

  20. Accelerator Virtualization
  • The application programmer or compilation framework selects high-level functionality
  • Implementation via
    • A monolithic accelerator, or
    • Distributed accelerators composed into a virtual accelerator, using software decomposition libraries
  • Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs

  21. Accelerator Virtualization, step 1: 1-D FFT on rows 1 and 2

  22. Accelerator Virtualization, step 2: 1-D FFT on rows 3 and 4

  23. Accelerator Virtualization, step 3: 1-D FFT on columns 1 and 2

  24. Accelerator Virtualization, step 4: 1-D FFT on columns 3 and 4
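The four steps above are the standard row-column decomposition, which can be checked in a few lines of software. A sketch, with a direct 4-point DFT standing in for one 1-D FFT accelerator:

```python
import cmath

def fft4(x):
    """Direct 4-point DFT; stands in for one 4-point 1-D FFT accelerator."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 4) for n in range(4))
            for k in range(4)]

def fft2d_4x4(a):
    # Steps 1-2: 1-D FFT on rows 1-2, then rows 3-4 (two rows per pass of
    # the virtual accelerator built from two 1-D FFT units)
    rows = [fft4(row) for row in a]
    # Steps 3-4: 1-D FFT on columns 1-2, then columns 3-4
    cols = [fft4([rows[r][c] for r in range(4)]) for c in range(4)]
    # Transpose back so cols[c][r] lands at result[r][c]
    return [[cols[c][r] for c in range(4)] for r in range(4)]

out = fft2d_4x4([[1.0] * 4 for _ in range(4)])
# A constant input concentrates all energy in the DC bin: out[0][0] == 16
```

With two physical 4-point 1-D FFT units, each of the four steps processes two rows or two columns at once, which is exactly the pairing the slides animate.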

  25. Light-Weight Interrupt Support [Figure: CPU, GAM, LCA]

  26. Light-Weight Interrupt Support: request/reserve confirmations and NACKs are sent by the GAM.

  27. Light-Weight Interrupt Support: the LCA raises TLB-miss and task-done interrupts.

  28. Light-Weight Interrupt Support: the core sends logical addresses to the LCA, and the LCA keeps a small TLB for the addresses it is working on.

  29. Light-Weight Interrupt Support
  • Why logical addresses?
    • Accelerators can work on irregular addresses (e.g. indirect addressing)
    • Using a large page size could be a solution but would affect other applications
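The small per-LCA TLB described above can be sketched as follows. This is an assumed behavioral model, not the paper's design: the page size, capacity, eviction policy, and the `ask_core` callback (modeling the TLB-miss interrupt that the core services) are all invented for illustration.

```python
PAGE = 4096  # assumed page size

class LcaTlb:
    """Tiny per-accelerator TLB covering only the addresses in flight."""
    def __init__(self, capacity=8):
        self.entries = {}                 # virtual page -> physical page
        self.capacity = capacity
        self.misses = 0

    def translate(self, vaddr, ask_core):
        vpage, off = divmod(vaddr, PAGE)
        if vpage not in self.entries:
            self.misses += 1              # would raise a TLB-miss interrupt
            if len(self.entries) >= self.capacity:
                # evict the oldest entry (dicts preserve insertion order)
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpage] = ask_core(vpage)   # core fills the entry
        return self.entries[vpage] * PAGE + off

# The "core" resolves a miss; an offset mapping is used here purely for
# illustration.
tlb = LcaTlb()
paddr = tlb.translate(0x1234, ask_core=lambda vp: vp + 100)
```

Because the accelerator touches only a small working set at a time, even a handful of entries suffices, and irregular (indirectly addressed) accesses still translate correctly, which is the motivation given on slide 29.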

  30. Light-Weight Interrupt Support: handling interrupts via the OS is expensive.

  31. Light-Weight Interrupt Support: the core is extended with light-weight interrupt (LWI) support.

  32. Light-Weight Interrupt Support
  • Two main components added:
    • A table to store ISR information
    • An interrupt controller to queue and prioritize incoming interrupt packets
  • Each thread registers the address of its ISR, its arguments, and the lw-int source
  • Limitations:
    • Can only be used while the thread the LW interrupt belongs to is running
    • Falls back to an OS-handled interrupt otherwise
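The two added structures can be sketched together. This is a behavioral model only; the interrupt-source names, the priority encoding (lower number = higher priority), and the API are assumptions, not the slide's exact layout.

```python
import heapq

class LwiController:
    """ISR table plus a queue that prioritizes incoming interrupt packets."""
    def __init__(self):
        self.isr_table = {}               # source -> (handler, args)
        self.queue = []                   # min-heap of (priority, seq, source)
        self.seq = 0                      # tie-breaker keeps FIFO order

    def register(self, source, handler, *args):
        """A thread registers its ISR address and arguments for a source."""
        self.isr_table[source] = (handler, args)

    def deliver(self, source, priority):
        """An interrupt packet arrives and is queued."""
        heapq.heappush(self.queue, (priority, self.seq, source))
        self.seq += 1

    def dispatch(self):
        """Run the highest-priority pending ISR (lower number = higher)."""
        _, _, source = heapq.heappop(self.queue)
        handler, args = self.isr_table[source]
        return handler(*args)

log = []
lwi = LwiController()
lwi.register("TLB_MISS", log.append, "refill")
lwi.register("TASK_DONE", log.append, "done")
lwi.deliver("TASK_DONE", priority=2)
lwi.deliver("TLB_MISS", priority=1)       # misses preempt completion notices
lwi.dispatch()
lwi.dispatch()
```

The point of the table is that dispatch needs no OS involvement: as long as the registering thread is the one running, the ISR can be invoked directly; otherwise, per the slide, the interrupt falls back to the OS path.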

  33. Programming Interface to ARC: platform creation, and application mapping & development.

  34. Evaluation Methodology
  • Benchmarks
    • Medical imaging
    • Vision & navigation

  35. Application Domain: Medical Image Processing
  • Reconstruction: compressive sensing
  • Denoising: total variational algorithm
  • Registration: fluid registration
  • Segmentation: level set methods
  • Analysis: Navier-Stokes equations

  36. Area Overhead
  • AutoESL (from Xilinx) for C-to-RTL synthesis
  • Synopsys for ASIC synthesis, with the 32 nm Synopsys educational library
  • CACTI for L2, Orion for NoC
  • One UltraSPARC IIIi core (area scaled to 32 nm): 178.5 mm^2 at 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)

  37. Experimental Results: Performance (N cores, N threads, N accelerators)
  • Performance improvement over SW-only approaches: 168x on average, up to 380x
  • Performance improvement over OS-based approaches: 51x on average, up to 292x

  38. Experimental Results: Energy (N cores, N threads, N accelerators)
  • Energy improvement over SW-only approaches: 241x on average, up to 641x
  • Energy improvement over OS-based approaches: 17x on average, up to 63x

  39. What are the Problems with ARC?
  • Dedicated accelerators are inflexible
    • An LCA may be useless for new algorithms or new domains
    • Often under-utilized
  • LCAs contain many replicated structures
    • fp-ALUs, DMA engines, SPMs
    • Unused when the accelerator is unused
  • We want flexibility and better resource utilization
    • Solution: CHARM
  • Private SPM is wasteful
    • Solution: BiN

  40. A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
  • Motivation
    • Tasks performed by accelerators tend to have a great deal of data parallelism
    • Variety of LCAs with possible overlap; utilization of any particular LCA is somewhat sporadic
    • It is expensive to have both sufficient diversity of LCAs to handle the various applications and sufficient quantity of a particular LCA to handle the parallelism
    • Overlap in functionality: LCAs can be built from a limited number of smaller, more general blocks, the accelerator building blocks (ABBs)
  • Idea
    • Flexible accelerator building blocks (ABBs) that can be composed into accelerators
    • Leverage economy of scale

  41. Microarchitecture of CHARM
  • ABB: accelerator building block
    • Primitive component that can be composed into accelerators
  • ABB island
    • Multiple ABBs
    • Shared DMA controller, SPM, and NoC interface
  • ABC: Accelerator Block Composer
    • Orchestrates the data flow between ABBs to create a virtual accelerator
    • Arbitrates requests from cores
  • Other components: cores, L2 banks, memory controllers

  42. An Example ABB Library (for Medical Imaging) [Figure: ABB library table and the internals of the Poly ABB]

  43. Example of an ABB Flow-Graph (Denoise) [Figure: the denoise data-flow graph]

  44. Example of an ABB Flow-Graph (Denoise) [Figure: flow-graph of six subtractions, six multiplies (squaring), an adder tree, sqrt, and 1/x]

  45. Example of an ABB Flow-Graph (Denoise) [Figure: the same flow-graph partitioned into ABBs: the multiply/add trees map to ABB1 and ABB2 (Poly), sqrt to ABB3 (Sqrt), and 1/x to ABB4 (Inv)]

  46. Example of an ABB Flow-Graph (Denoise) [Figure: final ABB mapping: Poly, Poly, Sqrt, Inv]
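Read as arithmetic, the denoise flow-graph on slides 44-46 squares and sums six neighbor differences, then takes the reciprocal of the square root. A software sketch of that computation, with the operand wiring inferred from the figure's shape (the exact operand order is an assumption):

```python
import math

def denoise_weight(diffs):
    """Evaluate the denoise flow-graph for one pixel's neighbor differences."""
    squares = [d * d for d in diffs]   # the six multiplies (ABB1/ABB2: Poly)
    s = sum(squares)                   # the adder tree (also Poly)
    return 1.0 / math.sqrt(s)          # ABB3: Sqrt, then ABB4: Inv

# Example: four unit differences and two zero differences -> 1/sqrt(4) = 0.5
w = denoise_weight([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
```

The partition on slide 45 falls out naturally: the multiply/add portions reuse the generic Poly ABB twice, while sqrt and 1/x each need their own specialized ABB, which is exactly the reuse argument behind CHARM.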

  47. LCA Composition Process [Figure: four ABB islands around the ABC and a core; the islands provide ABB types x, y, w, and z]

  48. LCA Composition Process
  • Core initiation: the core sends the ABC a task description, i.e. the task flow-graph of the desired LCA together with the polyhedral space for input and output (here, 10x10 input and output)
  [Figure: the core sends the task description to the ABC; ABB islands provide types x, y, w, z]

  49. LCA Composition Process
  • Task-flow parsing and task-list creation: the ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data size, and fills the task list (generated internally by the ABC)
    • Needed ABBs: "x", "y", "z"
    • With a task size of one 5x5 block, the ABC generates 4 tasks
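The tiling step above can be sketched directly: splitting the 10x10 region into 5x5 blocks yields the four tasks the slide mentions, each carrying the ABB chain it needs. The tiling logic and task-record fields here are illustrative assumptions, not the ABC's actual format.

```python
def make_task_list(rows, cols, tile, abbs):
    """Tile a rows x cols region into tile x tile tasks, each running `abbs`."""
    tasks = []
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            tasks.append({
                "origin": (r, c),                               # block corner
                "size": (min(tile, rows - r), min(tile, cols - c)),
                "abbs": abbs,                                   # ABB chain
            })
    return tasks

# 10x10 input/output with 5x5 tiles and the x -> y -> z ABB chain
tasks = make_task_list(10, 10, 5, abbs=["x", "y", "z"])
# a 10x10 region with 5x5 tiles yields 4 tasks
```

Smaller tasks let the ABC overlap the blocks across ABB islands and keep every island's SPM footprint bounded, which is why the request is decomposed rather than dispatched whole.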
