Symmetrical Decentralized Processing Architecture Design and Implementation

High Speed Digital Systems Lab, Winter 2007/08
Project’s Final Presentation - Part A
Project Extent: Two Semesters
Symmetrical Decentralized Processing Architecture Design and Implementation
Joint with the company:
Students: Eran Tuchman, Gad Tuchman
Instructor: Mr. Evgeny Fiksman

Agenda
• Introduction (background, goals, development environment)
• Project Design
• Testing Arrangement
• Resources, Performance and Statistics
• Part B Timeline/Outcomes
• Lab Demonstration

Background
• Enhancing processing power with multiple CPUs demands load balancing.
• Two approaches: symmetrical vs. asymmetrical.
• Symmetrical systems also have variable processing time.

Project Objectives
• Design and implementation of a symmetrical CPU management architecture.
• Efficient use of system resources.
• Efficient use of data transfer channels.

General Overview
• Input: variable-length input vector packets from the PCI bus.
• Output: variable-length output result packets from the processing units.
• The management system balances the processing load between as many processing units as possible.
[Figure: PCI bus, symmetrical decentralized processing management system, symmetrical processing units. Link labels: 32-bit/33 MHz (132 MB/s) and 32-bit/100 MHz (400 MB/s).]
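The two link figures on this slide are consistent with peak bandwidth = bus width in bytes × clock frequency. A minimal sketch of that arithmetic, assuming one transfer per clock cycle (the function name is hypothetical):

```python
def peak_bandwidth_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak throughput in MB/s, assuming one full-width transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


print(peak_bandwidth_mb_per_s(32, 100))  # 400.0
print(peak_bandwidth_mb_per_s(32, 33))   # 132.0
```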

Development Board
• Main clock shared between all FPGAs with a maximal skew of 50 ps.

System Top Hierarchy Diagram

System Top Hierarchy (Zoom In)
[Figure label: 32-bit/33 MHz]

Customized NIOS-II
• Added a custom instruction for outputting state flags.
• Each CPU provides two state flags: Cpu_Ready, Data_Ready.
• Each CPU polls the following control bits: Remove_Flags, Get_Ready, Start_Working.

DPR Memory Mapping
[Figure: memory map with the fields Control Bits, Output packet offset, Statistics offset, Input Packet, Preprocessed Statistics, Output Packet.]

Packets Structure
[Figure: input and output packet layouts, each 64 bits wide. Fields shown: SYNC words, Control Bits, Len, ID, Input Payload / Output Payload, Start Time, End Time, (Optional Padding).]
• Packet structure matches the structure decided by the Host and Nios development groups, for compatibility.
• The engine handles timing and ID tagging.

Engine Structure
• Handles START, END and ID tags.
• Uses the ID as the base address for access to the appropriate DPR.

The Main Selector
[Figure: Main Selector with four Selector Ports and a MainSelector Port.]
• Works with the engine’s clock.
• Receives the selected CPU ID and the needed operation type from each selection port.
• Provides the base address and operation type to the engine.

The Selector
• Works with the CPUs' clock.
• Uses a 4-phase protocol for asynchronous coordination with the Main Selector.
• Receives 2*n flags and provides the selected CPU ID and the required operation type.
• The selection algorithm prevents parasitic preference of specific CPUs over the others, and grants service within a single selection rotation.

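The Selection Algorithm
Worked example from the diagram (bit order: Nios 3, Nios 2, Nios 1, Nios 0):
• Binary flags register (n bits), holding the Nios-ready flags: 1 0 1 1
• One-hot fairness register (n bits): 0 0 1 0
• Subtractor output (flags minus fairness): 1 0 0 1
• Inverter ("not") output: 0 1 1 0
• AND of the flags (1 0 1 1) with the inverter output (0 1 1 0): 0 0 1 0
• Selection: Nios 1
• One-hot to binary conversion uses only log2(n) "or" gates, with n/2 inputs each.
• The Nios ID is used for the select signals.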
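The values in the selection diagram (flags 1011, fairness 0010, subtractor output 1001, inverted 0110, AND result 0010) are consistent with the expression flags AND NOT (flags - fairness). The sketch below follows that reading. The wraparound handling (doubling the flags word) and the fairness update rule (rotating the grant left by one position) are assumptions that do not come from the slides, and all function names are hypothetical.

```python
def select_one_hot(flags: int, fairness: int, n: int) -> int:
    """Return a one-hot grant for the first ready CPU at or above the fairness bit.

    The flags word is doubled (an assumption) so the search wraps around to CPU 0.
    Returns 0 when no CPU is ready.
    """
    wide_mask = (1 << (2 * n)) - 1
    doubled = (flags << n) | flags
    diff = (doubled - fairness) & wide_mask      # subtractor
    grant_wide = doubled & ~diff & wide_mask     # inverter followed by AND
    return (grant_wide | (grant_wide >> n)) & ((1 << n) - 1)


def one_hot_to_binary(one_hot: int, n: int) -> int:
    """Bit k of the index is the OR of the one-hot inputs whose position has bit k set."""
    index = 0
    for k in range((n - 1).bit_length()):
        if any((one_hot >> i) & 1 for i in range(n) if (i >> k) & 1):
            index |= 1 << k
    return index


def next_fairness(grant: int, n: int) -> int:
    """Assumed update rule: rotate a nonzero one-hot grant left by one position."""
    return ((grant << 1) | (grant >> (n - 1))) & ((1 << n) - 1)


grant = select_one_hot(0b1011, 0b0010, 4)
print(format(grant, "04b"), one_hot_to_binary(grant, 4))  # 0010 1
```

With the flags held at 1011 and the assumed update rule, successive grants visit Nios 0, Nios 1, Nios 3 and then Nios 0 again, so every ready CPU is served within one rotation.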

The Preprocessor
• Processes the packets during transfers to reduce CPU processing time.
• Examples: finds minimum, maximum and average values.
• Preprocessor results are added to the designated DPR at the end of the transfer.
• Sample calculation: [Figure: Preprocessor Port feeding a Comparator and a Mux that choose between the New Value and the Largest Old Value.]
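To illustrate computing statistics while a packet streams through, the sketch below updates the minimum, maximum and a running sum one word at a time, in the spirit of the comparator-and-mux step named on the slide. The class name and the word-at-a-time interface are assumptions for illustration, not the project's hardware design.

```python
class StreamStats:
    """Running minimum, maximum and average over values seen so far."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def update(self, value: int) -> None:
        # Comparator + mux: keep the new value only if it beats the stored one.
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        self.total += value
        self.count += 1

    def average(self) -> float:
        if self.count == 0:
            raise ValueError("no values seen")
        return self.total / self.count


stats = StreamStats()
for word in [5, 2, 9, 4]:
    stats.update(word)
print(stats.minimum, stats.maximum, stats.average())  # 2 9 5.0
```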

Simulation array
• Waveform simulation using ModelSim.
• FIFO simulation using our custom modules, which support text files for input and output packets.

Real-Time testing array
[Figure blocks: Data Display, Packets Generator, PC, Driver, symmetrical decentralized processing management system, ProcStar-II Development Board, Nios-II systems array.]

Tested system
• Tested with 10 CPUs using fast Quartus synthesis.

Our Examination Software
• Examination of system response to different processing delays.
• Creation of input and output packets with varying random lengths.
• Provides the ability to edit, transfer and analyze packets.

Time Tags
• Measures routing system delays from the head of the input queue until the end of transfer to the output queue, excluding the processing delay.
• Delay = End Tag – Start Tag – Processing Delay + Output packet transfer delay.
Timeline stages shown in the diagram (t in usec):
• Packet passes from the computer to the head of the input queue (INPUT FIFO).
• Packet awaits treatment and passes to the CPU.
• Packet is processed.
• Packet awaits treatment and passes to the output queue.
• Packet passes to the computer from the output queue (OUTPUT FIFO).
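The delay formula on this slide can be evaluated directly. In the sketch below the function name, parameter names and sample values are hypothetical, and all times are assumed to be in the same unit (microseconds, as on the slide's time axis).

```python
def routing_delay(start_tag: float, end_tag: float,
                  processing_delay: float, output_transfer_delay: float) -> float:
    """Delay = End Tag - Start Tag - Processing Delay + Output packet transfer delay."""
    return end_tag - start_tag - processing_delay + output_transfer_delay


# Made-up sample values, in microseconds.
print(routing_delay(start_tag=100.0, end_tag=180.0,
                    processing_delay=50.0, output_transfer_delay=12.0))  # 42.0
```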

System Delays
• 2048 packets sent, with random lengths.
• Tested system includes the Engine with 10 CPUs and a 100 MHz clock.

Transfer Rates
• Tested system returns output packets with altered random length.

Processing Time
• Uniform distribution of packets over a total of 10 CPUs.

Amount of packets in the system
• The tested processing time is within the same order of magnitude as the packet transfer time.

Resources Usage Comparison
• The limiting resource is the number of M4K blocks.
• The routing system does not use the limiting resource.
• Logic Utilization = 1621.8 x (CPUs) + 3421
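The logic utilization figure on this slide is a linear fit in the number of CPUs. A small sketch that evaluates it is shown below; the function name is hypothetical, and the slide does not state the unit of the result.

```python
def logic_utilization(cpus: int) -> float:
    """Evaluate the slide's fit: 1621.8 per CPU plus a fixed 3421 for the routing system."""
    return 1621.8 * cpus + 3421


for cpus in (1, 10):
    print(cpus, round(logic_utilization(cpus), 1))
```

For the tested 10-CPU system the fit gives about 19639, of which the constant 3421 is the share that does not grow with the CPU count.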

Part A - Project Achievements
• Functional system using the PROCStarII-60 board.
• Supports the Avalon Bus and the required packet formats.
• Interfacing FIFOs, Selector, Engine and 10 NIOS systems.
• Our Host software for loading/offloading the FIFOs.

Part B - Schedule
• 12/10 – 26/10/2008: Documenting and writing of project book, Part A.
• 27/10 – 10/11/2008: Integrating with the other Host software and C2H hardware.
• 11/11 – 25/11/2008: Design and implementation of the Preprocessor and MainSelector.
• 26/11 – 9/12/2008: Fitting multiple FPGAs over the 180 board and splitting clocks.
• 10/12 – 24/12/2008: Measuring performance using our software; documenting and writing of project book, Part B.

Appendix

Stratix-II Content

Stratix-II Logic Elements

Stratix-II Memory Blocks
• M4K and M-RAM blocks can be halved into two single-port blocks.

DPR Memory blocks
• M4K and M-RAM blocks cannot be halved into two DPR blocks.

NIOS-II Types

Trace Delays

Shared Bus Delays
MainBus[84:0] trace delays (1.8 V, 10 mA):
• 0.5 ns ~ 2.4 ns while IC1/IC4 drives the line.
• 0.5 ns ~ 3.2 ns while IC2/IC3 drives the line.
Worst case constraints for internal M-RAM blocks:
Worst case constraints for internal M4K blocks:

Shared bus delays (cont.)
Delay of input/output pads (EP2S180C3), over the main bus:
• Delay from I/O output register to output pad
• Delay from input pad to I/O input register
• Delay from input I/O datain to output pad
• Delay from input pad to I/O dataout to core

The tri-state bridge and shared-bus width
• Allows peripheral components outside the FPGA to be combined with the internal SOPC system using a bi-directional bus.
• Maximal address bus: 42 CPUs x 4 FPGAs x 2K DWORDs each → 19-bit address bus.
• Total of 54 lines: Address[18:0], Data[31:0], Read, Write, OutputEnable.
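The 19-bit address width and the 54-line total on this slide can be checked with a short calculation. The sketch assumes DWORD (32-bit word) addressing, which is what makes the slide's numbers work out; the function name is hypothetical.

```python
def address_bits(cpus_per_fpga: int, fpgas: int, dwords_per_cpu: int) -> int:
    """Number of address bits needed to index every DWORD in the shared space."""
    total_dwords = cpus_per_fpga * fpgas * dwords_per_cpu
    return (total_dwords - 1).bit_length()


bits = address_bits(cpus_per_fpga=42, fpgas=4, dwords_per_cpu=2 * 1024)
data_lines = 32
control_lines = 3  # Read, Write, OutputEnable
print(bits, bits + data_lines + control_lines)  # 19 54
```

The product 42 x 4 x 2048 is 344064 DWORDs, which lies between 2^18 and 2^19, so 19 address bits are needed.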

DPR components
We will test two types of DPR for Part B of the project:
• DPR - 1: Avalon-MM Slave Port and Avalon-MM Slave Port
• DPR - 2: Avalon-MM Tristate Slave Port and Avalon-MM Slave Port
• The Part A project uses standard dual-port RAM components.

System Tests
• Simulation with a designated test bench using ModelSim.
• Real-time test with designated software we developed for sending, receiving and analyzing packets.

NIOS-II Processor
• We expanded the basic processor with a custom instruction that outputs control flags.

SOPC Systems
[Figure: Engine Domain (Engine, Engine Gate), Shared Avalon Bus, NIOS Domain (Domain Gate, Dual Port Ram, NIOS System).]
• Creating an SOPC environment is necessary for interfacing with the Avalon Bus.

Test Packets
[Figure: input and output test packet layouts, each 64 bits wide. Fields shown: SYNC words, Control Bits, Len, ID, Requested Length, Input Payload / Output Payload, Start Time, End Time, CPU Interval, (Optional Padding).]
• Input and output packet lengths are determined by the test software.

Resources usage comparison (cont.)
• Part B expectation: the routing system won't become a bottleneck.

Semester B Milestones
• Construction of the Preprocessor.
• Construction of the full Main Selector.
• Integration with the other Host software and C2H system.
• Timing tests and documentation.
• Development with the PROCStar-II 180 board and multiple FPGAs (limitation: availability in the lab).
