林孟諭 Dept. of Electrical Engineering National Cheng Kung University Tainan, Taiwan, R.O.C

Performance Analysis of a JPEG Encoder Mapped To a Virtual MPSoC-NoC Architecture Using TLM 2.0.1

Outline • Abstract • Introduction • NoC Architecture • Encoder Task Graph • Task Profiling • App. Perform on NoC • App. Mapping on Processors • Results Analysis • Conclusion

Abstract • Networks on Chip (NoCs) are commonly used to integrate complex embedded systems and multiprocessor platforms because of their scalability and versatility. • Modeling tools are used to describe such architectures at the functional level. • Co-design and error correction are now performed concurrently. • This work takes a JPEG encoder and maps it onto a configurable M×N NoC architecture that implements Message Passing Interface (MPI) communication between cores.

Introduction • Complexity, scalability and portability are becoming essential issues to address when designing digital systems. • Advances in fabrication technology have allowed embedded platforms to integrate a large amount of hardware resources, • while the technology used to interconnect them has been moving from typical hierarchical bus connections to network-based solutions called Networks on Chip. • To ease and optimize information exchange in many-core architectures, one way to interconnect cores is through networks. • Designing NoCs also poses challenges in both the HW and SW fields: • On the HW side, choices of topology, router architecture and network interface structure can lead to considerably different results. • On the SW side, the main obstacle is defining the programming model for the NoC-based system, as both shared and distributed memory approaches have drawbacks. • The paper found the distributed memory model more suitable for a network-based architecture and used it with a message-passing structure, the Message Passing Interface (MPI). • The MPI approach allows several mappings to be performed with little programming effort.

NoC Architecture (1/2) • The core of the NoC is composed of routers and network interface cards (NICs): • routers deliver the information in the form of packets (flits) from source to destination; • network cards receive transactions from end modules, translate them into flits and send them to the router network for distribution. • The router model is defined with the following structure: • Switching Technique: Wormhole packet-based. • Routing Algorithm: Either XY, West-First or North-Last. • Flow Control: Handshaking ACK/NACK signals. • Virtual Circuits: Four at each input; one per output port. Variable depth. • Link width: 32 bits. • Output Arbitration: Round-Robin.
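Of the routing algorithms listed above, XY is the simplest to illustrate. The following is a minimal sketch of dimension-ordered XY routing on a 2D mesh, not the router model from the paper: the port names, the (x, y) coordinate convention (NORTH meaning increasing y) and the helper names `xy_next_hop` and `xy_path` are assumptions made for illustration.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in the mesh (assumed convention)


def xy_next_hop(current: Coord, dest: Coord) -> str:
    """Pick the output port under XY routing: correct X first, then Y."""
    cx, cy = current
    dx, dy = dest
    if cx != dx:
        return "EAST" if dx > cx else "WEST"
    if cy != dy:
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"  # the flit has reached its destination router


def xy_path(src: Coord, dest: Coord) -> List[Coord]:
    """List every router visited from src to dest, endpoints included."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = [src]
    cur = src
    while True:
        port = xy_next_hop(cur, dest)
        if port == "LOCAL":
            return path
        mx, my = step[port]
        cur = (cur[0] + mx, cur[1] + my)
        path.append(cur)


if __name__ == "__main__":
    print(xy_path((0, 0), (2, 1)))
```

Because a flit never turns from the Y dimension back into the X dimension, XY routing is deadlock-free on a mesh, at the cost of offering no path diversity.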

NoC Architecture (2/2) • Because the application is written in MPI, every call to mpi_send() on one core must match one mpi_receive() on another. • End-to-end flow control is handled as follows: • Call to mpi_send(): the core notifies the NIC to start packing data and keep it in a local buffer, ready to be sent. • Call to mpi_receive(): the core asks the NIC to send a data-request message (1 flit long) to the corresponding address so that the transfer starts. • A timer is set to re-send the request if no data is received after a while. Fig. 1. Proposed parameterizable NoC architecture.
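The request-and-retry behavior of the receive side can be sketched with a small tick-based model. This is only an illustration under stated assumptions: network latency is ignored, a request that arrives before the sender has packed its data is assumed to go unanswered, and the function name `simulate_receive` and its parameters are hypothetical rather than part of the paper's model.

```python
def simulate_receive(data_ready_tick: int,
                     timeout_ticks: int = 4,
                     max_requests: int = 10):
    """Return (tick at which data is delivered, number of requests sent).

    data_ready_tick: tick at which the sender's NIC has the data packed
                     in its local buffer (i.e. after mpi_send()).
    timeout_ticks:   timer period after which the request is re-sent.
    """
    tick = 0
    for requests_sent in range(1, max_requests + 1):
        # mpi_receive(): the NIC sends a 1-flit data-request message.
        if tick >= data_ready_tick:
            # The sender already holds the packed data, so the transfer starts.
            return tick, requests_sent
        # No data came back: wait for the timer to expire, then re-send.
        tick += timeout_ticks
    raise TimeoutError("no data received after %d requests" % max_requests)


if __name__ == "__main__":
    print(simulate_receive(data_ready_tick=0))
    print(simulate_receive(data_ready_tick=5))
```

In this model a receiver that posts mpi_receive() before the matching mpi_send() pays for the mismatch in extra request flits and in waiting time rounded up to the timer period.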

Encoder Task Graph • In order to obtain a detailed and optimized functional partitioning, a task graph was created to identify parallelism and temporal dependence. Fig. 2. JPEG Encoding Algorithm Task Graph.

Task Profiling • Some criteria are needed before mapping each task to the NoC platform; therefore, each task is profiled to identify heavy computations and algorithm bottlenecks. • Costs were assigned to operations to measure processor time: • 1 time unit for sums, loads, stores and logical operations • 2 time units for multiplications and divisions • For fixed tasks such as RGB to YUV conversion, DCT and quantization, it is possible to estimate the number of operations. • Encoding and bit-stream writing are block-dependent operations, and their computing cost depends on the amount of redundant information in the image. Table 1. Average cost of the JPEG encoding algorithm (per iteration).
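The cost model above reduces to a weighted sum over operation counts. The sketch below applies the stated unit costs to a hypothetical operation mix; the counts in `rgb_to_yuv_pixel` are illustrative assumptions (a 3×3 matrix multiply plus offsets per pixel) and are not the values of Table 1.

```python
from typing import Dict

# Unit costs taken from the profiling rules: 1 time unit for sums, loads,
# stores and logical operations; 2 time units for multiplications and divisions.
UNIT_COST: Dict[str, int] = {
    "add": 1, "load": 1, "store": 1, "logic": 1,
    "mul": 2, "div": 2,
}


def task_cost(op_counts: Dict[str, int]) -> int:
    """Total processor time units for a task, given its operation counts."""
    return sum(UNIT_COST[op] * count for op, count in op_counts.items())


# Hypothetical operation mix for converting ONE pixel from RGB to YUV.
# These counts are an assumption for illustration, not data from the paper.
rgb_to_yuv_pixel = {"load": 3, "mul": 9, "add": 9, "store": 3}

if __name__ == "__main__":
    per_pixel = task_cost(rgb_to_yuv_pixel)
    print("time units per pixel:", per_pixel)
    print("time units per 512x512 image:", per_pixel * 512 * 512)
```

For fixed tasks the per-image cost follows directly from such a count, whereas for the encoding and bit-stream writing tasks the operation counts themselves depend on the image content.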

App. Perform on NoC • Based on the task graph and profiling, this work performs different mappings of the JPEG encoding application onto the NoC to analyze its performance. • Each of the listed tasks was manually assigned to a processing unit according to its cost.

App. Mapping on Processors (1/4) • Three parallel branches compose the JPEG encoding: • DCT • quantization • Huffman encoding • There is also sequential behavior at 2 points: • RGB to YUV • bit-stream file writing • The mappings shown in Fig. 3, on 2×2, 3×2 and 3×3 NoCs, were proposed for evaluation. Fig. 3. JPEG Encoder Evaluated Mappings. Tests were carried out with 4, 6 and 8 processors. Each processor computes one of the tasks shown in Fig. 2 for specific image components.

App. Mapping on Processors (2/4) • A simulation was conducted for each mapping with a 512×512 BMP image. • The parameters set during simulation were: mesh topology, XY routing, virtual circuit depth of 2 to 10 and network speed half the processor's. • In all cases, increasing the VC depth only slightly reduces the execution time of the algorithm, • which implies that, for the proposed router architecture, a depth of 2 flits per virtual circuit is more than enough. Fig. 4. JPEG encoder performance on mesh NoCs with XY routing, 2 flits/VC and network speed equal to half the processor's. Changes in router parameters, such as routing algorithm, topology and VC depth, don't yield significant improvements.

App. Mapping on Processors (3/4) • To analyse the impact of synthesis technology on the NoC, the router and NIC speeds were lowered to -3X and -4X, i.e. 3 and 4 times slower, • where X is the processor's speed. • From Fig. 5, the mapping improves the encoding • for 6 and 8 processors when the network is 3X slower • for 8 processors when the network is 4X slower. Fig. 5. JPEG relative performance for network speeds -3X & -4X (X is processor speed). Image size was 512×512 pixels.

App. Mapping on Processors (4/4) • To generalize the results, a final simulation was performed with different image sizes (see Fig. 6). • For the proposed task partitioning and mapping, the gain is around 24-25% with 4 processors, 45-46% with 6 and 49-50% with 8, irrespective of the image size. Fig. 6. Application performance for different image sizes.
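The reported gains can be turned into speedup and per-core efficiency figures with a short calculation. The sketch below assumes that "gain" means the fractional reduction in execution time relative to a single-processor run, which the slides do not state explicitly; the values in `REPORTED_GAIN` are approximate midpoints of the ranges quoted above, and the function names are hypothetical.

```python
from typing import Dict


def speedup_from_gain(gain: float) -> float:
    """Speedup T_1 / T_n when gain = (T_1 - T_n) / T_1 (assumed definition)."""
    if not 0.0 <= gain < 1.0:
        raise ValueError("gain must lie in [0, 1)")
    return 1.0 / (1.0 - gain)


def summarize(gains: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Map each core count to its speedup and parallel efficiency."""
    summary = {}
    for cores, gain in sorted(gains.items()):
        speedup = speedup_from_gain(gain)
        summary[cores] = {"speedup": speedup, "efficiency": speedup / cores}
    return summary


# Approximate midpoints of the quoted ranges (24-25%, 45-46%, 49-50%).
REPORTED_GAIN = {4: 0.245, 6: 0.455, 8: 0.495}

if __name__ == "__main__":
    for cores, row in summarize(REPORTED_GAIN).items():
        print(cores, "cores:",
              "speedup %.2f," % row["speedup"],
              "efficiency %.2f" % row["efficiency"])
```

Under that assumption the speedup grows from roughly 1.3 to roughly 2.0 while the per-core efficiency falls steadily, which is one way to quantify the diminishing return of adding processors.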

Results Analysis • The results in the previous section show one consistent behavior: performance improves (execution time decreases) as the number of cores increases. • From Fig. 4, • the gain obtained by going from 1 to 4 and from 4 to 6 processors is around 25~27% each • the enhancement from 6 to 8 cores is only 8~9%, while the area cost is very high. • Even though an attempt was made to cover the most significant simulation aspects at a high level, it is not clear which criterion should be considered best: • latency, execution time, computation/communication rate, traffic distribution, area consumption, etc. • There is no single criterion that resolves this trade-off; • only design restrictions and specifications might guide the designer to a satisfactory answer.

Conclusion • It was possible to validate the system correctly at the functional and architectural levels. • Several simulations were executed in a short time, allowing numerous analyses. • The previous results give the designer an overview of the number of variables that have to be taken into account when dealing with multi-processor platforms on NoC structures.
