Symmetrical Decentralized Processing Architecture Design

High Speed Digital Systems Lab, Winter 2007/08
Symmetrical Decentralized Processing Architecture Design and Implementation
Final Presentation
Project Extent: Two Semesters
Joint with the company:
Students: Eran Tuchman, Gad Tuchman
Instructor: Mr. Evgeny Fiksman

Agenda
• Introduction (background, goals, development environment)
• Project Design
• Testing Arrangement
• Resources, Performance and Statistics
• Lab Demonstration

Background
• Enhancing processing power by using multiple CPUs demands load balancing.
• Two approaches: symmetrical vs. asymmetrical.
• Symmetrical systems also have variable processing times.

Project Objectives
• Design and implementation of a symmetrical CPU management architecture.
• Efficient use of system resources.
• Efficient use of data transfer channels.

General Overview
• Input: variable-length input vector packets from the PCI bus.
• Output: variable-length output result packets from the processing units.
• The management system balances the processing load between as many processing units as possible.
[Diagram labels: PCI bus, 32-bit/65 MHz, 260 MB/s; 32-bit/33 MHz, 132 MB/s; symmetrical decentralized processing management system; symmetrical processing units, 32-bit/146 MHz]
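The bandwidth figures on this slide are consistent with bus width multiplied by clock rate. The short sketch below reproduces that arithmetic; it assumes one full-width transfer per clock cycle (peak rate), and the function name is purely illustrative.

```python
def peak_bandwidth_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in MB/s, assuming one full-width transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz


# Figures quoted on the slide
print(peak_bandwidth_mb_per_s(32, 65))  # 260.0
print(peak_bandwidth_mb_per_s(32, 33))  # 132.0
```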

Development Board
• Main clock shared between all FPGAs, with a maximum skew of 50 ps.

System Top Hierarchy Diagram

Customized NIOS-II
• Added a custom instruction for outputting state flags.
• Each CPU provides two state flags, Cpu_Ready and Data_Ready, and polls the following control bits: Remove_Flags, Get_Ready, Start_Working.

DPR Memory Mapping
[Diagram labels: Control Bits, Output packet offset, Input Packet, Output Packet]

Packets Structure
[Diagram: input and output packet layouts, each 64 bits wide. Fields shown: Control Bits, Len, ID, Input Payload / Output Payload, Start Time, End Time, (Optional Padding), SYNC]
• The packet structure matches the structure agreed on by the Host and Nios development groups, for compatibility.
• The engine handles timing and NIOS ID tagging.

Engine Structure
• Handles START, END and NIOS ID tags.
• Uses the ID as the base address for access to the appropriate DPR.

The Main Selector
• Works with the engine's clock.
• Receives locally selected CPU IDs and the needed operation types from each selection port.
• Provides the base address and operation type to the engine.

The Selector
• Works with the CPUs' clock.
• Uses a 4-phase protocol for asynchronous coordination with the Main Selector.
• Receives 2*n flags and provides the selected CPU ID and the required operation type.
• The selection algorithm prevents parasitic preference of specific CPUs over the others, and grants service within a single selection rotation.
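The 4-phase (return-to-zero) protocol mentioned on this slide can be illustrated with a small behavioural model. This is only a sketch of the generic handshake, not the project's HDL: the signal names `req` and `ack`, the class name, and the choice of which side acts as sender are all assumptions.

```python
class FourPhaseLink:
    """Behavioural model of a generic 4-phase req/ack handshake (illustrative)."""

    def __init__(self):
        self.req = False      # driven by the sender (hypothetical name)
        self.ack = False      # driven by the receiver (hypothetical name)
        self.data = None      # bundled data, valid while req is high
        self.received = []    # values latched by the receiver
        self.trace = []       # (req, ack) after each phase

    def _log(self):
        self.trace.append((int(self.req), int(self.ack)))

    def transfer(self, value):
        # Both signals must be idle before a new transfer starts.
        assert not self.req and not self.ack
        # Phase 1: sender places data and raises req.
        self.data = value
        self.req = True
        self._log()
        # Phase 2: receiver latches the data and raises ack.
        self.received.append(self.data)
        self.ack = True
        self._log()
        # Phase 3: sender sees ack and lowers req.
        self.req = False
        self._log()
        # Phase 4: receiver sees req low and lowers ack; link is idle again.
        self.ack = False
        self._log()


link = FourPhaseLink()
link.transfer({"cpu_id": 1, "operation": "read"})  # hypothetical payload
print(link.trace)  # [(1, 0), (1, 1), (0, 1), (0, 0)]
```

Because each signal changes only in response to the other, neither clock domain needs to know the other's frequency, which is the usual reason for choosing this handshake between unrelated clocks.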

Simulation Array
• Waveform simulation using ModelSim.
• FIFO simulation using our custom modules, which support text files for input and output packets.

Real-Time Testing Array
[Diagram labels: Data Display, Packets Generator, PC Computer, Driver, Symmetrical decentralized processing management system, ProcStar-II Development Board, Nios-II systems array]

Tested System
• Tested with 40 CPUs using fast Quartus synthesis.

Our Examination Software
• Examination of system response to different processing delays.
• Creation of input and output packets with varying random lengths.
• Provides the ability to edit, transfer and analyze packets.

Time Tags
• Measurement of the routing system delay from the head of the input queue to the end of transfer to the output queue, excluding the processing delay.
• Delay = End Tag - Start Tag - Processing Delay + Output packet transfer delay.
[Timeline diagram, t in usec. Stages shown:]
• Packet passes from the computer to the head of the input queue (INPUT FIFO).
• Packet awaits treatment and passes to the CPU.
• Packet is processed.
• Packet awaits treatment and passes to the output queue.
• Packet passes to the computer from the output queue (OUTPUT FIFO).
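The delay formula on this slide can be written out as a small helper. This is a sketch only: the function and parameter names are illustrative, all quantities are assumed to be in microseconds (matching the slide's time axis), and the numbers in the usage line are hypothetical, not measured results.

```python
def routing_delay_us(start_tag: float, end_tag: float,
                     processing_delay: float,
                     output_transfer_delay: float) -> float:
    """Delay = End Tag - Start Tag - Processing Delay + Output packet transfer delay."""
    return end_tag - start_tag - processing_delay + output_transfer_delay


# Hypothetical values, chosen only to exercise the formula
print(routing_delay_us(start_tag=100.0, end_tag=180.0,
                       processing_delay=50.0,
                       output_transfer_delay=5.0))  # 35.0
```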

System Transfer Rates
• The tested system returns output packets with altered random lengths.

System's Response Under Load
• Uniform distribution of packets over a total of 40 CPUs.
• Tested using proprietary NIOS software.

Deinterleaver Processing Rates
• Performance of the third-party software implementation compared to a PC.
• Proof of concept has been achieved.
• Tested using third-party deinterleaver NIOS software.
• 40 CPUs at 146 MHz compared against a Core 2 Duo at 2400 MHz.

Deinterleaver Processing Time
• The architecture is optimal for the specified algorithm.
• Tested using third-party deinterleaver NIOS software.
• 40 CPUs at 146 MHz compared against a Core 2 Duo at 2400 MHz.

Assumptions
• Utilizing the FPGA's logic resources using C2H may increase the system's performance by a factor of 3.
• Optimizing the memory usage of the third-party implementation may increase the number of processing units by 20 percent.
• New FPGA families have more resources and may run at a faster clock.
• Using both cores of the E6600 may increase PC performance by a factor of 1.8.

Summary
• Functional system using the PROCStarII-180 board.
• Running 40 SOPC systems simultaneously.
• Integration with third-party Host and NIOS software.
• Our own host software for loading/offloading FIFOs.

Conclusions
• The system's bottleneck is the third-party algorithm implementation.
• Moderate use of memory resources may increase the number of processing units.
• Utilizing the FPGA's logic resources using C2H should improve the third-party software implementation of the algorithm.

Appendix

Stratix-II Content

Stratix-II Logic Elements

Stratix-II Memory Blocks
• M4K and M-RAM blocks can be halved into two single-port blocks.

DPR Memory Blocks
• M4K and M-RAM blocks cannot be halved into two DPR blocks.

NIOS-II Types

Trace Delays

Shared Bus Delays
MainBus[84:0] trace delays (1.8 V, 10 mA):
• 0.5 ns ~ 2.4 ns while IC1/IC4 drives the line.
• 0.5 ns ~ 3.2 ns while IC2/IC3 drives the line.
Worst-case constraints for internal M-RAM blocks:
Worst-case constraints for internal M4K blocks:

Shared Bus Delays (cont.)
Delay of input/output pads (EP2S180C3), over the main bus:
• Delay from I/O output register to output pad
• Delay from input pad to I/O input register
• Delay from input I/O datain to output pad
• Delay from input pad to I/O dataout to core

The Tri-State Bridge and Shared-Bus Width
• Allows peripheral components outside the FPGA to be combined with the internal SOPC system using a bi-directional bus.
• Maximal address bus: 64 CPUs x 4 FPGAs x 2K DWORDs each, giving a 19-bit address bus.
• Total of 54 lines: Address[18:0], Data[31:0], Read, Write, OutputEnable.
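The bus-width figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated on the slide; the variable names are illustrative.

```python
cpus_per_fpga = 64
fpgas = 4
dwords_per_cpu = 2 * 1024  # 2K DWORDs each

# Total addressable DWORDs across all CPUs and FPGAs
total_dwords = cpus_per_fpga * fpgas * dwords_per_cpu

# Number of address bits needed to index every DWORD
address_bits = (total_dwords - 1).bit_length()

data_bits = 32         # Data[31:0]
control_lines = 3      # Read, Write, OutputEnable
total_lines = address_bits + data_bits + control_lines

print(total_dwords)   # 524288
print(address_bits)   # 19
print(total_lines)    # 54
```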

System Tests
• Simulation with a designated test bench using ModelSim.
• Real-time test with designated software that we developed for sending, receiving and analyzing packets.

NIOS-II Processor
• We expanded the basic processor with a custom instruction that outputs control flags.

SOPC Systems
[Diagram labels: Engine Domain, Engine, Engine Gate, Shared Avalon Bus, NIOS Domain, Domain Gate, Dual Port RAM, NIOS System]
• Creating an SOPC environment is necessary for interfacing with the Avalon Bus.

Proprietary Test Packets
[Diagram: input and output test packet layouts, each 64 bits wide. Fields shown: Control Bits, Len, ID, Requested Length, Input Payload / Output Payload, Start Time, End Time, CPU Interval, (Optional Padding), SYNC]
• Input and output packet lengths are determined by the test software.

The Selection Algorithm
[Diagram: worked example with n = 4 CPUs, Nios 3 down to Nios 0]
• Binary flags register (n bits), holding the Nios-ready flags: 1 0 1 1
• One-hot fairness register (n bits): 0 0 1 0
• Subtractor (flags minus fairness): 1 0 0 1
• Inverter ("not"): 0 1 1 0
• AND of the flags with the inverter output: 1 0 1 1 AND 0 1 1 0 = 0 0 1 0
• One-hot to binary conversion uses only log2(n) "or" gates, with n/2 inputs each.
• Selection: Nios 1. The Nios ID is used for the select signals.
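The subtract, invert and AND steps in this diagram can be modelled in software as follows. This is a minimal sketch, not the project's hardware description: the slide does not show how wrap-around past the highest CPU or the fairness-register update are handled, so the doubled-vector trick and the rotate-left update below are assumptions, as are all function names.

```python
def onehot_to_binary(onehot: int, n: int) -> int:
    """Convert a one-hot value to its bit index using one OR per output bit."""
    index = 0
    for k in range(max(1, (n - 1).bit_length())):
        # Output bit k is the OR of all one-hot inputs whose position has bit k set.
        if any((onehot >> i) & 1 for i in range(n) if (i >> k) & 1):
            index |= 1 << k
    return index


def select(flags: int, fairness: int, n: int):
    """Return (cpu_id, one_hot_grant), or None if no flag is set.

    Core step from the slide: grant = flags AND NOT (flags - fairness).
    `fairness` is assumed to be one-hot. Doubling the flags vector so the
    search wraps around is an assumption, not taken from the slide.
    """
    mask = (1 << n) - 1
    flags &= mask
    if flags == 0:
        return None
    doubled = (flags << n) | flags
    grant2 = doubled & ~(doubled - fairness)
    grant = (grant2 | (grant2 >> n)) & mask
    return onehot_to_binary(grant, n), grant


def next_fairness(grant: int, n: int) -> int:
    """Assumed update rule: rotate the grant left by one position."""
    mask = (1 << n) - 1
    return ((grant << 1) | (grant >> (n - 1))) & mask


# Values from the slide: flags 1011, fairness 0010
print(select(0b1011, 0b0010, 4))  # (1, 2), i.e. Nios 1
```

Under the assumed update rule, repeated calls with flags 1011 visit Nios 1, Nios 3 and Nios 0 in turn, so every ready CPU is served within one rotation.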

System Delays
• 2048 packets sent with random lengths (in 32-bit DWORDs).
• The tested system includes the Engine and 40 CPUs, running at 65 MHz/146.25 MHz.

System Top Hierarchy (Zoom In)
32-bit/33 MHz

Project Achievements
• Functional system using the PROCStarII-180 board.
• Running 40 SOPC systems simultaneously.
• Integration with third-party Host and NIOS software.
• Our own host software for loading/offloading FIFOs.
