The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly¹, Manuel Saldaña² and Paul Chow¹
¹Department of Electrical and Computer Engineering, University of Toronto
²Arches Computing Systems, Toronto, Canada
Outline • Background and Motivation • Embedded Processor-Based Optimizations • Hardware Engine-Based Optimizations • Conclusions and Future Work
Motivation • Message Passing Interface (MPI) is a programming model for distributed memory systems • Popular in high performance computing (HPC), cluster-based systems
Motivation
[Diagram: Processor 1 and Processor 2, each with its own local memory]
Problem: sum of the numbers from 1 to 100
for (i = 1; i <= 100; i++) sum += i;
• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC), cluster-based systems
Motivation
[Diagram: Processor 1 and Processor 2, each with its own local memory]
Processor 1:
  sum1 = 0;
  for (i = 1; i <= 50; i++) sum1 += i;
  MPI_Recv(sum2, ...);
  sum = sum1 + sum2;
Processor 2:
  sum1 = 0;
  for (i = 51; i <= 100; i++) sum1 += i;
  MPI_Send(sum1, ...);
• Message Passing Interface (MPI) is a programming model for distributed memory systems
• Popular in high performance computing (HPC), cluster-based systems
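For readers who want to run the toy example, a minimal complete version is sketched below. It assumes a standard MPI installation with two ranks; the 0/1 rank assignment and the message tag are illustrative choices, not taken from the slides.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, i, sum1 = 0, sum2 = 0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0: sum 1..50, then receive the partial sum for 51..100 */
        for (i = 1; i <= 50; i++) sum1 += i;
        MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum = sum1 + sum2;
        printf("sum = %d\n", sum);  /* expected: 5050 */
    } else if (rank == 1) {
        /* Rank 1: sum 51..100 and send the partial result to rank 0 */
        for (i = 51; i <= 100; i++) sum1 += i;
        MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}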
Motivation • Strong interest in adapting MPI for embedded designs: • Increasingly difficult to interface heterogeneous resources as FPGA chip size increases • MPI provides key benefits: • Unified protocol • Low weight and overhead • Abstraction of end points (ranks) • Easy prototyping
Motivation • Interaction classes arising from heterogeneous designs: • Class I: Software-software interactions • Collections of embedded processors • Thoroughly investigated; will not be discussed • Class II: Software-hardware interactions • Embedded processors with hardware engines • Large variety in processing speed • Class III: Hardware-hardware interactions • Collections of hardware engines • Hardware engines are capable of significant concurrency compared to processors
Background • Work builds on TMD-MPI[1] • Subset implementation of the MPI standard • Allows hardware engines to be part of the message passing network • Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP • Software libraries for MicroBlaze, PowerPC, Intel X86 [1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.
Class II: Processor-based Optimizations • Background • Direct Memory Access MPI Hardware Engine • Non-Interrupting, Non-Blocking Functions • Series of MPI Messages • Results and Analysis
Class II: Processor-based Optimizations - Background • Problem 1 • Standard message paradigm for HPC systems • Plentiful memory but high message latency • Favours combining data into a few large messages, which are stored in memory and retrieved as needed • Embedded designs provide a different trade-off • Little memory but short message latency • 'Just-in-time' paradigm is preferred • Sending just enough data for one unit of computation, on demand (illustrated in the sketch below)
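As a rough illustration of the two paradigms, the sketch below contrasts a batched HPC-style send with a 'just-in-time' per-unit send. The buffer size, tag, destination rank, and function name are hypothetical and only serve the comparison; it assumes MPI has already been initialized.

#include <mpi.h>

/* Hypothetical sizes, tag, and destination rank -- for illustration only */
#define N_UNITS   64
#define UNIT_SIZE 16
#define TAG_WORK  1

void send_work(int work_buf[N_UNITS * UNIT_SIZE], int worker)
{
    int i;

    /* HPC-style: combine everything into one large message that the
     * receiver buffers in its (plentiful) memory */
    MPI_Send(work_buf, N_UNITS * UNIT_SIZE, MPI_INT,
             worker, TAG_WORK, MPI_COMM_WORLD);

    /* 'Just-in-time' style: one unit of computation per message,
     * sent only as the consumer needs it */
    for (i = 0; i < N_UNITS; i++)
        MPI_Send(&work_buf[i * UNIT_SIZE], UNIT_SIZE, MPI_INT,
                 worker, TAG_WORK, MPI_COMM_WORLD);
}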
Class II: Processor-based Optimizations - Background • Problem 2 • Homogeneity of HPC systems • Each rank has similar processing capabilities • Heterogeneity of FPGA systems • Hardware engines are tailored for a specific set of functions – extremely fast processing • Embedded processors play the vital role of control and memory distribution – little processing
Class II: Processor-based Optimizations - Background • 'Just-in-time' + Heterogeneity = producer-consumer model • Processors produce messages for hardware engines to consume • Generally, the message production rate of the processor is the limiting factor
Class II: Processor-based Optimizations - Direct Memory Access MPI Engine • Typical MPI implementations use only software • A DMA engine offloads the time-consuming part of messaging: memory transfers • Frees the processor to continue execution • Can implement burst memory transactions • Time required to prepare a message is independent of message length • Allows messages to be queued
Class II: Processor-based Optimizations - Direct Memory Access MPI Engine • MPI_Send(...) • Processor writes 4 words: destination rank, address of data buffer, message size, message tag • PLB_MPE decodes the message header • PLB_MPE transfers the data from memory
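The slides do not give the PLB_MPE register interface, so the following is only a hedged sketch of what the software side of MPI_Send might reduce to with the DMA engine present: the processor writes the 4-word header to a memory-mapped command FIFO (the address and layout below are invented for illustration) and returns immediately.

#include <stdint.h>

/* Hypothetical memory-mapped command FIFO of the DMA MPI engine (PLB_MPE).
 * The real base address and register layout are implementation-specific
 * and are not given in the slides. */
#define PLB_MPE_CMD_FIFO ((volatile uint32_t *)0x80000000u)

/* Sketch of the processor side of a send: write the 4-word header and
 * return; the PLB_MPE decodes the header and performs the burst memory
 * transfer itself. */
static void mpe_send(uint32_t dest_rank, const void *buf,
                     uint32_t size_words, uint32_t tag)
{
    *PLB_MPE_CMD_FIFO = dest_rank;                /* destination rank       */
    *PLB_MPE_CMD_FIFO = (uint32_t)(uintptr_t)buf; /* address of data buffer */
    *PLB_MPE_CMD_FIFO = size_words;               /* message size           */
    *PLB_MPE_CMD_FIFO = tag;                      /* message tag            */
    /* No further processor involvement: the time to issue a send is
     * independent of message length, and several sends can be queued. */
}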
Class II: Processor-based Optimizations - Direct Memory Access MPI Engine • MPI_Recv(...) • Processor writes 4 words: source rank, address of data buffer, message size, message tag • PLB_MPE decodes the message header • PLB_MPE transfers the data to memory • PLB_MPE notifies the processor
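For symmetry, a similarly hedged sketch of the receive side: post the 4-word header, then wait for the engine's completion notification. Polling a status register is shown here purely for illustration; the actual PLB_MPE notification mechanism, and all addresses and bit names below, are assumptions rather than details from the slides.

#include <stdint.h>

/* Hypothetical registers; the actual PLB_MPE interface is not specified. */
#define PLB_MPE_CMD_FIFO ((volatile uint32_t *)0x80000000u)
#define PLB_MPE_STATUS   ((volatile uint32_t *)0x80000004u)
#define MPE_RECV_DONE    0x1u

/* Sketch of the processor side of a receive: post the 4-word header,
 * then wait until the PLB_MPE has written the incoming message to memory. */
static void mpe_recv(uint32_t src_rank, void *buf,
                     uint32_t size_words, uint32_t tag)
{
    *PLB_MPE_CMD_FIFO = src_rank;                 /* source rank            */
    *PLB_MPE_CMD_FIFO = (uint32_t)(uintptr_t)buf; /* address of data buffer */
    *PLB_MPE_CMD_FIFO = size_words;               /* message size           */
    *PLB_MPE_CMD_FIFO = tag;                      /* message tag            */

    while ((*PLB_MPE_STATUS & MPE_RECV_DONE) == 0)
        ;  /* PLB_MPE is still transferring the data into memory */
}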
Class II: Processor-based Optimizations - Direct Memory Access MPI Engine • The DMA engine is completely transparent to the user • The exact same MPI functions are called • DMA setup is handled by the implementation
Class II: Processor-based Optimizations - Non-Interrupting, Non-Blocking Functions • Two types of MPI message functions • Blocking functions: return only when the buffer can be safely reused • Non-blocking functions: return immediately • A request handle is required so the message status can be checked later • Non-blocking functions are used to overlap communication and computation
Class II: Processor-based Optimizations - Non-Interrupting, Non-Blocking Functions
Typical HPC non-blocking use case:
  MPI_Request request;
  ...
  MPI_Isend(..., &request);
  prepare_computation();
  MPI_Wait(&request, ...);
  finish_computation();
Class II: Processor-based Optimizations - Non-Interrupting, Non-Blocking Functions • Class II interactions have a different use case • Hardware engines are responsible for computation • Embedded processors only need to send messages as fast as possible • DMA hardware allows messages to be queued • 'Fire-and-forget' message model • Message status is not important • Request handles are serviced by expensive interrupts
Class II: Processor-based Optimizations - Non-Interrupting, Non-Blocking Functions
The standard MPI protocol provides a mechanism for 'fire-and-forget':
  MPI_Request request_dummy;
  ...
  MPI_Isend(..., &request_dummy);
  MPI_Request_free(&request_dummy);
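A minimal sketch of how this pattern might look in a producer loop feeding a hardware-engine rank; only standard MPI calls are used, but the function name, buffer layout, tag, and rank below are illustrative assumptions.

#include <mpi.h>

#define UNIT_SIZE 16  /* words per unit of computation (illustrative) */
#define TAG_WORK  1

/* 'Fire-and-forget' producer: queue each send and drop the handle. */
void produce(const int *work, int n_units, int hw_rank)
{
    MPI_Request request_dummy;
    int i;

    for (i = 0; i < n_units; i++) {
        MPI_Isend(&work[i * UNIT_SIZE], UNIT_SIZE, MPI_INT,
                  hw_rank, TAG_WORK, MPI_COMM_WORLD, &request_dummy);
        /* Freeing the active request avoids MPI_Wait and interrupts;
         * the send still completes, but the caller gets no notification,
         * so the buffer must stay valid until it has been consumed. */
        MPI_Request_free(&request_dummy);
    }
}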