Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab Winter 2009 Implementing a NoMC on the Gidel platform – mid-semester presentation Instructor: Evgeny Fiksman Students: Meir Cohen, Daniel Marcovitch
Project goals • Implementing a parallel processing system which contains several NoCs, each chip containing several sub-networks of processors. • Converting the existing router to support the Altera platform. • Expanding the router to enable communication between similar sub-networks. • Implementing a processor network which supports communication with the PC, enabling: • Use of the PC's CPU as part of the processing network. • Simple I/O between the PC and the rest of the processing network.
Top-level design elements • Network topology – basic network architecture: • Intra-chip • Inter-chip • Chip-PC • CPU architecture – various models of Nios II: • Processor performance • Data/Code cache • Internal multiplier/divider • CPU/router interface – • Synchronous transfer (interrupt + CPU controls transfer of data) • Asynchronous transfer (DMA read/write) • Routing algorithms – • Routing • Broadcast
Fabric topology • Alternatives: SPIN, CLICHÉ, Torus • Considerations: • Router-to-processor ratio • Number of ports necessary (area, simplicity, inter-router scheduling) • Scalability • Congestion • Latency
Our topology: • Fabric has a CLICHÉ (mesh) structure: easily scalable, needs at most 5 ports • Each FR is connected to a cluster of CPUs • Each cluster uses a local router • Higher CPU-to-router ratio (R ≈ P in CLICHÉ, R ≈ P/2 in our topology) • Increased latency is masked by the CPU-to-router clock speed ratio • Increased congestion is tolerable with the number of CPUs we plan to use
Top-level structure of the expanded network • Each white square represents a single FPGA on the Gidel board. • FPGA-FPGA and FPGA-PC routes go via designated routers (GWs). • The GWs' design/protocols are the same as the internal routers'.
Fabric topology • Each fabric router is connected to a cluster of processors using a local router. • GWs are used to connect to other chips and to the PC. • Primary/secondary ICGW determined by the number of chips to the left/right. • (Interconnects/GWs will be explained when we deal with the routing algorithm.)
Structure of a single sub-network (from the previous project) – the single fabric element for our project
CPU architecture • CPU architecture – various models of Nios II: • Processor speed • Data/Code cache • Internal multiplier/divider • Pros and cons: • Performance • Area • I/O, CPU contention (more in the next slide).
CPU/router I/F • Synchronous interface (using custom instructions): • Connect interrupt using PIO. • Connect FIFO directly to the CPU to avoid Avalon-bus access cycles. • Asynchronous interface (using DMA): • Connect interrupt using PIO. • Connect FIFO directly to the Avalon bus using a router-FIFO interface. • The same interface also raises an interrupt to start the DMA read. • (Diagrams: synchronous – CPU with custom instruction/PIO connected to the router-FIFO over the Avalon bus; asynchronous – CPU, memory, DMA and FIFO I/F connected to the router-FIFO over the Avalon bus.)
CPU/router I/F • Synchronous/asynchronous transfer: • Problem: synchronous transfer requires the CPU to direct I/O, disabling the CPU's ability to perform calculations during I/O. • Solution: introduce asynchronous transfer using DMA; data is copied directly from the input FIFO to the Nios II memory. • Requirements: the CPU needs a data cache to prevent congestion on the Avalon bus (CPU memory reads vs. DMA-FIFO memory reads/writes). • Pros and cons: larger CPU (area) vs. simplicity of implementation.
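To make the asynchronous path concrete, here is a minimal sketch of the receive side on a Nios II, assuming the legacy HAL interrupt API (alt_irq_register, IORD/IOWR); the base addresses, register offsets and IRQ number are hypothetical stand-ins, not the project's actual memory map:

```c
/* Sketch only: base addresses, register offsets and IRQ number are
 * hypothetical; the legacy Nios II HAL calls themselves are standard. */
#include "io.h"
#include "sys/alt_irq.h"

#define RX_FIFO_BASE 0x80000000u /* hypothetical router-FIFO slave  */
#define DMA_BASE     0x80001000u /* hypothetical DMA controller     */
#define RX_IRQ       3           /* hypothetical IRQ number         */

static volatile unsigned char rx_buf[1024]; /* DMA destination; real code
                                               must keep it cache-coherent */

/* ISR: the router-FIFO interface signalled a waiting packet; start a
 * DMA read from the FIFO into local memory and return immediately,
 * so the CPU keeps computing while the transfer runs. */
static void rx_isr(void *context, alt_u32 id)
{
    IOWR(DMA_BASE, 0, RX_FIFO_BASE);    /* source: input FIFO           */
    IOWR(DMA_BASE, 1, (alt_u32)rx_buf); /* destination: local buffer    */
    IOWR(DMA_BASE, 2, sizeof rx_buf);   /* length; this write starts DMA */
}

int main(void)
{
    alt_irq_register(RX_IRQ, 0, rx_isr);
    for (;;) {
        /* ... useful computation overlaps the transfer ... */
        if (IORD(DMA_BASE, 3))          /* hypothetical "done" flag     */
            ; /* consume rx_buf, clear the flag, re-arm the interrupt   */
    }
}
```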
Summary (comparing total number of CPUs) • Estimation of router/processor area according to the existing router/Nios and the total LEs available on a Stratix II FPGA. • A slow CPU enables almost double the number of CPUs, but its performance is less than half. • We chose the fast CPU with DMA, using our topology. • (Chart: relative area of router/processor.)
Software design • Software layers: • Application layer: MPI functions interface (add async. functions) • Network layer: hardware-independent implementation of these functions • Data layer: relies on command bit fields • Physical layer: designed for the FSL bus (adjust to conform with the Altera I/F)
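As an illustration of this layering, a hedged sketch of a send path passing through the four layers; every function name and the header format below are assumptions for illustration, not the project's actual API:

```c
/* Illustrative layering only; names and packet format are assumptions. */

/* Physical layer: push one word into the router FIFO (Altera I/F). */
extern void phy_write_word(unsigned word);

/* Data layer: encode the command bit fields into a header word.
 * Field widths here are hypothetical. */
static unsigned make_header(unsigned cmd, unsigned dest, unsigned len)
{
    return (cmd << 24) | (dest << 16) | (len & 0xFFFFu);
}

/* Network layer: hardware-independent packetize-and-send. */
static void net_send(unsigned dest, const unsigned *payload, unsigned len)
{
    phy_write_word(make_header(/*CMD_DATA=*/1, dest, len));
    for (unsigned i = 0; i < len; i++)
        phy_write_word(payload[i]);
}

/* Application layer: the MPI-style interface seen by user code. */
int MPI_send(const void *buf, int len_bytes, int dest_node)
{
    net_send((unsigned)dest_node, (const unsigned *)buf,
             (unsigned)len_bytes / sizeof(unsigned));
    return 0;
}
```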
Software design • Asynchronous transfer requires additional MPI functions: • MPI_isend – non-blocking send • MPI_irecv – non-blocking receive • MPI_test – test whether data has arrived in the receive buffer • MPI_wait – blocking wait for data to arrive
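A sketch of how these four calls might be combined to overlap computation with communication; the prototypes are assumptions modelled on the slide's function list, and the request-handle type is hypothetical:

```c
/* Assumed prototypes, modelled on the slide's function list; the real
 * signatures may differ. 'mpi_request' is a hypothetical handle type. */
typedef struct mpi_request mpi_request;

extern int MPI_isend(const void *buf, int len, int dest, mpi_request *req);
extern int MPI_irecv(void *buf, int len, int src, mpi_request *req);
extern int MPI_test(mpi_request *req, int *done); /* non-blocking check */
extern int MPI_wait(mpi_request *req);            /* block until done   */

void exchange(int peer, mpi_request *rreq, mpi_request *sreq)
{
    static int out[256], in[256];
    int done = 0;

    MPI_irecv(in, sizeof in, peer, rreq);   /* post the receive early  */
    MPI_isend(out, sizeof out, peer, sreq); /* DMA pushes the data out */

    while (!done) {
        /* ... compute while the router/DMA move data ... */
        MPI_test(rreq, &done);
    }
    MPI_wait(sreq);            /* make sure 'out' can be reused safely */
}
```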
Routing algorithms • Routing categories: • Static/dynamic • Source/hop-by-hop • Centralized/distributed • Splitting/non-splitting • Intra-chip algorithm to be used: • Static, centralized, hop-by-hop, non-splitting • Static routing tables to be loaded into the routers • Run the algorithm on the node map of each chip, find shortest paths; each router holds a table with the next hop for every other node (#nodes ~ 60), as sketched below.
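A minimal sketch of how such next-hop tables could be computed offline; the adjacency representation and the port encoding are assumptions, and with ~60 nodes one breadth-first search per destination is cheap:

```c
#include <string.h>

#define NUM_NODES 60  /* ~60 nodes per chip, per the slide            */
#define MAX_PORTS 5   /* at most 5 ports per router, per the topology */

/* adj[n][p] = node reached through port p of node n, or -1 if unused;
 * assumed to be filled in from the chip's node map. */
extern int adj[NUM_NODES][MAX_PORTS];

/* next_hop[n][d] = output port node n uses toward destination d.
 * Loaded into each router's read-only table after this runs. */
int next_hop[NUM_NODES][NUM_NODES];

void build_tables(void)
{
    for (int d = 0; d < NUM_NODES; d++) {
        int queue[NUM_NODES], head = 0, tail = 0;
        int seen[NUM_NODES];
        memset(seen, 0, sizeof seen);
        queue[tail++] = d;
        seen[d] = 1;
        /* BFS outward from the destination: when node m is first seen
         * from node n, m's shortest path toward d starts with the port
         * that leads back to n. */
        while (head < tail) {
            int n = queue[head++];
            for (int p = 0; p < MAX_PORTS; p++) {
                int m = adj[n][p];
                if (m >= 0 && !seen[m]) {
                    seen[m] = 1;
                    queue[tail++] = m;
                    for (int q = 0; q < MAX_PORTS; q++)
                        if (adj[m][q] == n)
                            next_hop[m][d] = q;
                }
            }
        }
    }
}
```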
Routing algorithms • Inter-chip routing • Adjacent FPGAs connect using a "neighbour bus" • FPGAs 1, 4 connect using the slower "main bus" • Distribute routes evenly (each of the arcs a–d carries exactly 3 "traffic units"), e.g. arc (a) is used to carry data from 2, 3 to 1 and from 1 to 2, 3. • Result: assuming evenly distributed communication between processing units, each FPGA uses the interconnect on one side twice as much as the interconnect on the other side, hence the "primary" and "secondary" interconnect gateways. • (Figure: chips 1–4 in a ring, arcs a–d.)
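The slide's exact route split is not reproduced here; as an illustrative stand-in, a chip could pick its gateway by shortest ring distance, which produces the same 2:1 load asymmetry between the two gateways:

```c
#define NUM_CHIPS 4 /* FPGAs on the board, numbered 0..3 here */

enum gw { GW_PRIMARY, GW_SECONDARY };

/* Illustrative rule only (not the project's actual route table): the
 * neighbour buses plus the main bus form a ring, so pick the gateway
 * by shortest ring distance and break the 2-hop tie by source parity.
 * Each chip then reaches two destinations through one gateway and one
 * destination through the other: the 2:1 asymmetry from the slide. */
enum gw pick_gw(int src, int dst)
{
    int cw  = (dst - src + NUM_CHIPS) % NUM_CHIPS; /* clockwise hops   */
    int ccw = NUM_CHIPS - cw;                      /* other direction  */
    if (cw < ccw) return GW_PRIMARY;               /* primary: clockwise */
    if (cw > ccw) return GW_SECONDARY;
    return (src & 1) ? GW_PRIMARY : GW_SECONDARY;  /* balance the tie  */
}
```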
Routing algorithms • Primary/secondary GW intra-chip implications: • Because of the above assumption, more internal fabric routers will be connected to one gateway than to the other. • Assume that I/O to the PC isn't dominant enough to justify connecting all fabric routers to the PCGW. • Connect the S-ICGW and PCGW so as to minimize internal congestion in the fabric.
Broadcast algorithms • Build a static broadcast tree: • Run an algorithm to build a spanning tree over the network nodes • Each node stores the collection of ports (= arcs) which are part of the tree – read-only tables stored in HW • Upon initiating a broadcast message: send to all BC ports • Upon receiving a broadcast message on a BC port: send to all other BC ports
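The forwarding rule is small enough to show as a sketch; the bitmask encoding of the tree ports and the output routine are assumptions:

```c
#define MAX_PORTS 5 /* up to 5 ports per router, per the topology */

/* Read-only per-node table: bit p is set iff port p is an arc of the
 * precomputed spanning tree (the node's "BC ports"). */
static unsigned char bc_ports;

/* Hypothetical output routine provided by the router hardware I/F. */
extern void send_on_port(int port, const void *msg, int len);

/* Forward a broadcast to every tree port except the one it arrived on;
 * in_port < 0 means this node initiated the broadcast, so all BC ports
 * get a copy. The tree guarantees each node receives the message once. */
void bc_forward(int in_port, const void *msg, int len)
{
    for (int p = 0; p < MAX_PORTS; p++)
        if ((bc_ports & (1u << p)) && p != in_port)
            send_on_port(p, msg, len);
}
```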
Questions