ECLIPSE Extended CPU Local Irregular Processing Structure - PowerPoint PPT Presentation

Presentation Transcript

  1. ECLIPSE Extended CPU Local Irregular Processing Structure IST E. van Utteren DS & PC A. van Gorkum IC Design G. Beenker LEP, HVE T. Doyle IPA W.J. Lippmann ESAS A. van der Werf IT E. Dijkstra ViPs G. Depovere DD&T C. Niessen AV & MS Th. Brouste PROMMPT J.T.J. v. Eijndhoven ECLIPSE CPU Jos.van.Eijndhoven@philips.com CRB 1992-412

  2. DVP: design problem Nexperia media processors?

  3. DVP: application domain • High volume consumer electronics products: future TV, home theatre, set-top box, etc. • Media processing: audio, video, graphics, communication

  4. DVP: SoC platform • Nexperia line of media processors for mid- to high-end consumer media processing systems is based on DVP • DVP provides template for System-on-a-Chip • DVP supports families of evolving products • DVP is part of corporate HVE strategy

  5. DVP: system requirements • High degree of flexibility, extendability and scalability • unknown applications • new standards • new hardware blocks • High level of media processing power • hardware coprocessor support

  6. DVP: architecture philosophy • High degree of flexibility is achieved by supporting media processing in software • High performance is achieved by providing specialized hardware coprocessors • Problem: How to mix & match hardware based and software based media processing?

  7. DVP: model of computation [diagram: processes A, B, C; each process loops: Read FIFO, Execute, Write] The model of computation is Kahn Process Networks. The Kahn model allows ‘plug and play’: • Parallel execution of many tasks • Configures different applications by instantiating and connecting tasks • Maintains functional correctness independent of task scheduling issues • TSSA: API to transform C programs into Kahn models

  8. DVP: model of computation [diagram: Application (parallel tasks, streams) mapped statically onto Architecture (programmable): a graph of CPU, coproc1, coproc2]

  9. DVP: architecture philosophy • Kahn processes (nodes) are mapped onto (co)processors • Communication channels (graph edges) are mapped onto buffers in centralized memory • Scheduling and synchronization (notification & handling of empty or full buffers) is performed by control software • Communication pattern between modules (data flow graph) is freely programmable

  10. DVP: generic architecture • Shared, single address space, memory model • Flexible access • Transparent programming model • Physically centralized random access memory • Flexible buffer allocation • Fits well with stream processing • Single memory-bus for communication • Simple and cost effective

  11. DVP: example architecture instantiation [block diagram: SDRAM; serial I/O; video-in; video-out; audio-in; audio-out; PCI bridge; timers; I2C I/O; image scaler; VLIW CPU with I$ and D$; MIPS CPU with I$ and D$]

  12. DVP: TSSA abstraction layer [diagram: TSSA-Appl1 and TSSA-Appl2 on top of TSSA-OS on the TM-CPU (software), next to traditional coarse-grain TM co-processors; TSSA stream data is buffered in off-chip SDRAM, with synchronization through CPU interrupts]

  13. DVP: TSSA abstraction layer • Hides implementation details: • graph setup • buffer synchronization • Runs on pSOS (and other RTKs) • Provides standard API • Defines standard data formats

  14. Outline • DVP • Eclipse DVP subsystem • Eclipse architecture • Eclipse application programming • Simulator • Status

  15. Eclipse DVP subsystem Objective Increase flexibility of DVP systems, while maintaining cost-performance. Customer • Semiconductors: Consumer Systems (Transfer to TTI) • Consumer Electronics: Domain 2 (BG-TV Brugge) • Research Products Mid- to high-end DVP / TSSA systems: DTVs and STBs

  16. Eclipse DVP subsystem: design problem [diagram: HDVO, condor, MPEG and CPU blocks around SDRAM] DVP/TSSA system: • Coarse-grain ‘solid’ function blocks (reuse, HW/SW?) • Stream data buffered in off-chip memory (bandwidth, power?) Eclipse goals: • Increase application flexibility through re-use of medium-grain function blocks, in HW and SW • Keep streaming data on-chip But? • More bandwidth visible • Limited memory size • High synchronization rate • CPU unfriendly

  17. Design problem: new DVP subsystem [diagram: CPU and external memory; 1394 and VO interfaces; a new Eclipse subsystem containing a CPU, DVD decode, MPEG2 decode and MPEG2 encode]

  18. Eclipse DVP subsystem: application domain Now, target for 1st instance: • Dual MPEG2 full HD decode (1920 x 1080 @ 60i) • MPEG2 SD transcoding and HD decoding Anticipate: • Range of formats (DV, MJPEG, MPEG4) • 3D-graphics acceleration • Motion-compensated video processing

  19. Application domain: MPEG2 decoding (HD)

  20. Application domain: MPEG2 encoding (SD)

  21. Application domain: MPEG-4 video decoding [data-flow diagram with per-edge bandwidth annotations: MPEG-4 ES into Variable Length Decoding, Inverse Scan, Inverse Quantization, IDCT; Context Arithmetic Decoding, Shape MV Prediction and Shape Motion Compensation; MV Decoder and Motion Comp.; DC & AC Prediction; Picture Reconst. writing Reference Pictures]

  22. MPEG-4: system level application partitioning [diagram: Network layer, De-multiplex, Decompression of Audio, Video and 3D Gfx objects (partitioned over Sandra, Eclipse and the CPU), Scene description, Composition and rendering]

  23. MPEG-4: partitioning Eclipse - SANDRA [block diagram: SDRAM and MMI; Media CPU with I$ and D$; VI and VO (SANDRA); Eclipse with VLD, DCT, MBS and MC coprocessors and local SRAM]

  24. Eclipse DVP subsystem: current TSSA style [diagram: TSSA-Appl1 and TSSA-Appl2 on top of TSSA on the TM-CPU (software), next to traditional coarse-grain TM co-processors; TSSA stream data is buffered in off-chip SDRAM, with synchronization through CPU interrupts]

  25. Eclipse DVP subsystem: Eclipse tasks embedded in TSSA [diagram: TSSA-Appl1 and TSSA-Appl2 composed of TSSA tasks on DVP HW, TSSA tasks in SW, and TSSA tasks on Eclipse (reached through the Eclipse driver), the latter split into Eclipse tasks on HW and Eclipse tasks in SW; TSSA data streams go via off-chip memory, Eclipse data streams via on-chip memory]

  26. Eclipse DVP subsystem: scale down Hierarchy in the DVP system: • Computational model which fits neatly inside DVP & TSSA Scale down from SoC to subsystem: • Limited internal distances • High data bandwidth and local storage • Fast inter-task synchronization

  27. Outline • DVP • Eclipse DVP subsystem • Eclipse architecture • Model of computation • Generic architecture • Eclipse application programming • Simulator • Status

  28. Eclipse architecture: model of computation [diagram: Application (parallel tasks, streams) mapped statically onto Architecture (programmable, medium grain, multitasking): CPU, coproc1, coproc2]

  29. Model of computation: architecture philosophy The Kahn model allows ‘plug and play’: • Parallel execution of many tasks • Application configuration by instantiating and connecting tasks. • Functional correctness independent of task scheduling issues. Eclipse is designed to accomplish this with: • A mixture of HW and SW tasks. • High data rates (GB/s) and medium buffer sizes (KB). • Re-use of co-processors over applications through multi-tasking • Runtime application reconfiguration.

  30. Allow proper balance in HW/SW combination [chart: energy efficiency versus application flexibility of given silicon; function-specific engines are high efficiency / low flexibility, the DSP-CPU is low efficiency / high flexibility, Eclipse sits in between]

  31. Previous Kahn style architectures in PRLE • CPA: data driven, HW synchronization, multitasking coprocs. But? Dynamic applications, CPU in media processing. • C-Heap: explicit synchronization, shared memory model, mixed HW/SW. But? High performance, variable packet sizes. Eclipse draws on both lines.

  32. Outline • DVP • Eclipse DVP subsystem • Eclipse architecture • Model of computation • Generic architecture • Coprocessor shell interface • Shell communication interface • Architecture instantiation • Eclipse application programming • Simulator • Status

  33. Generic architecture: inter-processor communication • On-chip, dedicated network for inter-processor communication: • Medium grain functions • High bandwidth (up to several GB/s) • Keep data transport on-chip • Use DVP-bus for off-chip communication only

  34. Generic architecture: communication network [diagram: CPU and two coprocessors connected by the communication network]

  35. Generic architecture: memory • Shared, single address space, memory model • Flexible access • Software programming model • Centralized wide memory • Flexible buffer allocation • Fits well with stream processing • Single wide memory-bus for communication • Simple and cost effective

  36. Generic architecture: shared on-chip memory [diagram: CPU and two coprocessors connected through the communication network to the memory]

  37. Generic architecture: task level interface Partition functionality between application-dependent core and generic support. • Introduce the (co-)processor shell: • Shell is responsible for application configuration, task scheduling, data transport and synchronization • Shell (parameterized) micro-architecture is re-used for each coprocessor instance • Allow future updates of communication network while re-using (co-)processor core design • Implementations in HW or SW

  38. Generic architecture: layering [diagram: computation layer (CPU, coprocessors); task-level interface; generic support layer (Shell-SW on the CPU, Shell-HW per coprocessor); communication interface; communication network layer (communication network, memory)]

  39. Task level interface: five primitives Multitasking, synchronization, and data transport: • int GetTask( location, blocked, error, &task_info) • bool GetSpace( port_id, n_bytes) • Read( port_id, offset, n_bytes, &byte_vector) • Write( port_id, offset, n_bytes, &byte_vector) • PutSpace( port_id, n_bytes) GetSpace is used for both get_data and get_room calls. PutSpace is used for both put_data and put_room calls. The processor has the initiative, the shell answers.
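As an illustration, a minimal software model of the port side of this interface. Only the primitive names and the inquire / read-write / commit order come from the slides; the single-stream state (buf, space, point) and all function bodies are hypothetical.

```c
#include <stdbool.h>

/* Toy, single-stream model of the port interface; the shell state
   below is illustrative, only the primitive names are Eclipse's. */
enum { BUF_SIZE = 16 };
static unsigned char buf[BUF_SIZE];  /* shared circular buffer          */
static int space = BUF_SIZE;         /* bytes this port may still claim */
static int point = 0;                /* current access point            */

/* Inquiry: succeeds iff a window of n_bytes is available; no state change. */
bool GetSpace(int port_id, int n_bytes) {
    (void)port_id;
    return n_bytes <= space;
}

/* Random access inside the granted window, relative to the access point. */
void Write(int port_id, int offset, int n_bytes, const unsigned char *bytes) {
    (void)port_id;
    for (int i = 0; i < n_bytes; i++)
        buf[(point + offset + i) % BUF_SIZE] = bytes[i];
}

/* Commit: the access point moves ahead and the window shrinks. */
void PutSpace(int port_id, int n_bytes) {
    (void)port_id;
    space -= n_bytes;
    point = (point + n_bytes) % BUF_SIZE;
}
```

One put_data step then reads as: if (GetSpace(p, n)) { Write(p, 0, n, data); PutSpace(p, n); }.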

  40. Task level interface: port IO [diagram: Task A operating on a ‘data tape’] a: Initial situation of the ‘data tape’ with current access point. b: Inquiry action provides a window on the requested space (n_bytes1). c: Read/Write actions on the contents (offset). d: Commit action moves the access point ahead (n_bytes2).

  41. Task level interface: communication through streams Kahn model: Task A streams to Task B. Implementation with a shared circular buffer: [diagram: empty space, granted window for writer A, space filled with data, granted window for reader B] The shell takes care that the access windows have no overlap.
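The invariant behind this picture can be sketched in a few lines of toy bookkeeping (field and function names are illustrative, not the Eclipse shell's): the writer's free space and the reader's filled space always sum to the buffer size, so the granted windows cannot overlap.

```c
/* Toy space accounting for one stream in a shared circular buffer. */
typedef struct {
    int size;   /* total circular-buffer size in bytes */
    int data;   /* bytes currently filled with data    */
} stream_t;

int writer_space(const stream_t *s) { return s->size - s->data; } /* get_room */
int reader_space(const stream_t *s) { return s->data; }           /* get_data */

void writer_commit(stream_t *s, int n) { s->data += n; }          /* put_data */
void reader_commit(stream_t *s, int n) { s->data -= n; }          /* put_room */
```

At every step writer_space(s) + reader_space(s) == s->size, which is exactly what lets the shell guarantee non-overlapping windows.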

  42. Task level interface: multicast Forked streams: Task A streams to both Task B and Task C. The task implementations are fixed (HW or SW). Application configuration is a shell responsibility. [diagram: circular buffer with empty space, granted window for writer A, space filled with data, and granted windows for readers B and C]

  43. Task level interface: characteristics • Linear (fifo) synchronization order is enforced • Random access read/write inside the acquired window through the offset argument • Shells operate on unformatted sequences of bytes; any semantic interpretation is left to the processor • A task is not aware of where its streams connect to, or of other tasks sharing the same processor • The shell maintains the application graph structure • The shell takes care of: fifo size, fifo memory location, wrap-around addressing, caching, cache coherency, bus alignment

  44. Task level interface: multi-tasking • Non-preemptive task scheduling • Coprocessor provides explicit task-switch moments • Task switches separate ‘processing steps’ (granularity: tens or hundreds of clock cycles) • Shell is responsible for task selection and administration • Coprocessor provides feedback to the shell on task progress: int GetTask( location, blocked, error, &task_info)
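The non-preemptive loop this implies can be sketched as follows. The round-robin selection policy and every name except GetTask are assumptions for illustration; the real GetTask also carries location, blocked, error and task_info arguments.

```c
/* Toy shell-side task selection: plain round-robin over three tasks. */
enum { NUM_TASKS = 3 };
static int next_task = 0;
static int steps_done[NUM_TASKS];   /* per-task progress, for inspection */

int GetTask(void) {                 /* shell picks the next task to run */
    int t = next_task;
    next_task = (next_task + 1) % NUM_TASKS;
    return t;
}

/* Coprocessor side: one bounded processing step, then a switch moment. */
void process_step(int task) { steps_done[task]++; }

void run(int n_steps) {
    for (int i = 0; i < n_steps; i++)
        process_step(GetTask());    /* explicit task-switch moment */
}
```

Because the coprocessor only yields between processing steps, the shell never needs to save or restore mid-step state.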

  45. Generic architecture: generic support [diagram: computation layer (CPU, coprocessors); task-level interface; generic support layer (Shell-SW on the CPU, Shell-HW per coprocessor); communication interface; communication network layer (communication network, memory)]

  46. Generic support: the Shell The shell takes care of: • The application graph structure, supporting run-time reconfiguration • The local memory map and data transport (fifo size, fifo memory location, wrap-around addressing, caching, cache coherency, bus alignment) • Task scheduling and synchronization The distributed implementation: • Allows fast interaction with the local coprocessor • Creates a scalable solution

  47. Generic support: synchronization • PutSpace and GetSpace return after a local update or inquiry. • Delay in messaging does not affect functional correctness. [diagram: coprocessor A calls PutSpace( port, n ); its shell updates space -= n locally and sends the message putspace( gsid, n ) over the communication network; coprocessor B’s shell updates space += n on receipt, so B’s GetSpace( port, m ) is a local check m <= space]
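The safety argument can be made concrete in a few lines (all state and names below are illustrative): a shell's local space counter is only ever increased by the peer's putspace message, so a delayed message can at worst make GetSpace fail conservatively; it can never grant overlapping windows.

```c
#include <stdbool.h>

/* One shell's local, possibly stale, view of a stream's free space. */
static int local_space = 0;

bool GetSpace(int m)        { return m <= local_space; } /* local inquiry   */
void PutSpace(int n)        { local_space -= n; }        /* own commit      */
void on_putspace_msg(int n) { local_space += n; }        /* peer's commit   */
```

A late on_putspace_msg leaves local_space smaller than reality, which only makes GetSpace more pessimistic: correctness is preserved, as the slide claims.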

  48. Generic support: application configuration Shell tables are accessible through a PI-bus interface. [diagram: each shell holds a stream table (addr, size, space, gsid, ...) indexed by stream_id and a task table (budget, info, str_id, ...) indexed by task_id, attached to the communication network]

  49. Generic support: data transport caching • Translates the byte-oriented coprocessor interface to wide, aligned bus transfers • Separate caches for read and write • Direct mapped: two adjacent words per port • Coherency is enforced as a side-effect of GetSpace and PutSpace • Supports automatic prefetching and preflushing

  50. Generic support: cache coherency