NePSim: A Network Processor Simulator with Power Evaluation Framework (IEEE Micro, Sept/Oct 2004)

Presentation Transcript


    1. NePSim: A Network Processor Simulator with Power Evaluation Framework IEEE Micro, Sept/Oct 2004 Source code at http://www.cs.ucr.edu/~yluo/nepsim Yan Luo, Jun Yang, Laxmi N. Bhuyan, Li Zhao Computer Science & Engineering University of California at Riverside

    2. NP Architecture Design - Research Goals. NPs provide performance and programmability for packet processing without O/S overhead. Future designs of NPs (with hardware accelerators) should be based on accurate estimation of execution performance gain instead of intuition. Power consumption of NPs is becoming a big concern, and techniques are needed to save power when traffic is low. There is a need for an open-source, execution-driven simulator that can be used to explore architectural modifications for improved performance and power consumption. An ST200 edge router can support up to 8 NP boards, each of which consumes 95~150 W; the total power of such an 88.4 cm x 44 cm x 58 cm router can reach 2700 W when two chassis are supported in a single rack! (Laurel Networks ST series router data sheet)

    3. NP Simulation Tools. Intel IXA SDK: + accuracy, visualization; - closed-source, low speed, inflexibility, no power model, and it can't incorporate new hardware designs. SimpleScalar: + open-source, popular, power model (Wattch); - uniprocessor architecture, so there is a disparity with real NPs. NePSim: + open-source, models a real NP, power model, accuracy; currently targets the IXP1200, with IXP2400 support under development. About 60 software downloads and 300 hits at http://www.cs.ucr.edu/~yluo/nepsim/

    4. Objectives of NePSim. An open-source simulator for a real NP (Intel IXP1200, later IXP2400/2800). Cycle-level accuracy of performance simulation. Flexibility for users to add new instructions and functional units. Integrated power model to enable power dissipation simulation and optimization. Extensibility for future NP architectures. Faster simulation compared to the SDK.

    5. NePSim Software Architecture. ME core (ME ISA, 5-stage pipeline, GPRs, transfer (xfer) registers); SRAM and SDRAM (data memory and controllers with command queues); FBI unit (IX bus and CSRs); Device (network interface with in/out buffers, packet streams); Dlite (a lightweight debugger); Stats (collection of statistics data); traffic generator, program parser, and loader. Each block is a software module in NePSim.

    6. NePSim Overview. The SDK compiler compiles microengine C programs. NePSim includes a parser that converts the compiler-generated code into an internal format, and NePSim takes programs in this internal format as input. NePSim currently cannot take assembler-generated code or microcode. The host C compiler is used to build the NePSim executable from the NePSim source code.

    7. NePSim Internals (I). NePSim internal instruction, command, and event data structures. The number of cycles taken by each instruction depends on the stages of execution in the pipeline; the details were derived from the SDK hardware manual.
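
    The actual layouts live in the NePSim sources; the following is only a hypothetical C sketch of how instruction, command, and event records in a simulator of this kind might be organized (all field names here are assumptions, not the real NePSim definitions):

        /* Hypothetical sketch only; the real NePSim structures differ. */
        typedef struct inst {
            unsigned int pc;            /* microstore address of the instruction */
            unsigned int opcode;        /* decoded operation                      */
            unsigned int src_a, src_b;  /* source register indices                */
            unsigned int dest;          /* destination register index             */
            int          me_id;         /* issuing microengine                    */
            int          thread_id;     /* issuing hardware thread                */
        } inst_t;

        typedef struct command {
            int          unit;          /* SRAM, SDRAM, or FBI                    */
            int          type;          /* read or write                          */
            unsigned int addr;          /* memory address                         */
            int          xfer_reg;      /* transfer register holding the data     */
            int          me_id, thread_id;
            long long    issue_cycle;   /* cycle the command entered its queue    */
        } command_t;

        typedef struct event {
            int          type;          /* e.g. data-ready, signal-done           */
            long long    ready_cycle;   /* cycle at which the event fires         */
            int          me_id, thread_id;  /* thread to wake up                  */
            struct event *next;         /* events kept in a cycle-ordered queue   */
        } event_t;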

    8. NePSim Internals (II). The yellow box is the ME, which has a 5-stage pipeline. Instructions (I) go through this pipeline. Pipeline stage 4 (P4) may generate memory-reference commands (C) to the SRAM/SDRAM/FBI controllers. Each controller has queues with different priority levels: the SRAM controller has 3 queues and the SDRAM controller has 4. Arbiters service commands according to their arbitration policies. When a memory command is served, a corresponding data-ready event is put into the event queue. The head of the event queue generates a wake-up signal to sleeping threads.
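
    A minimal C sketch of the service-and-wake-up flow just described, with hypothetical names and a fixed access latency (the real NePSim arbiters and queue disciplines are more detailed):

        #include <stdlib.h>

        /* Data-ready events are kept in a linked list sorted by ready cycle;
         * the head of the list wakes the corresponding sleeping thread. */
        typedef struct event {
            long long     ready_cycle;
            int           me_id, thread_id;
            struct event *next;
        } event_t;

        static event_t *event_queue;

        static void event_queue_insert(event_t *ev) {
            event_t **p = &event_queue;
            while (*p && (*p)->ready_cycle <= ev->ready_cycle)
                p = &(*p)->next;
            ev->next = *p;
            *p = ev;
        }

        /* Called when a controller's arbiter has picked a command to service. */
        void schedule_data_ready(int me_id, int thread_id,
                                 long long now, int access_latency) {
            event_t *ev = malloc(sizeof *ev);
            ev->me_id       = me_id;
            ev->thread_id   = thread_id;
            ev->ready_cycle = now + access_latency;   /* fixed latency for the sketch */
            event_queue_insert(ev);
        }

        /* Called every simulated cycle: fire due events and wake their threads. */
        void tick_event_queue(long long now, void (*wake_thread)(int me, int thread)) {
            while (event_queue && event_queue->ready_cycle <= now) {
                event_t *ev = event_queue;
                event_queue = ev->next;
                wake_thread(ev->me_id, ev->thread_id);  /* thread leaves wait state */
                free(ev);
            }
        }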

    9. NePSim Parameters. Some example parameters for running NePSim simulations:
        -d: enable debug messages
        -I: start simulation in Dlite debugger mode
        -proc:speed: processor speed in MHz
        -vdd: default chip power supply
        -me0: program for microengine 0
        -script: script file used for initialization
        -strmconf: stream config file of packet traces to network devices
        -max:cycle: maximum number of cycles to execute
        -indstrm: mark the packet stream file as indefinitely repeated
        -power: flag to enable power calculations
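
    As a usage illustration only (the executable name, file names, and exact argument syntax below are assumptions; only the option names come from the list above), a power-enabled run at 232 MHz and 1.3 V might look roughly like:

        ./nepsim -proc:speed 232 -vdd 1.3 -me0 ipfwdr.fmt -strmconf stream.cfg -max:cycle 1000000 -power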

    10. Dlite Debugger. Similar to Dlite in SimpleScalar: run the simulation in debug mode, set/delete breakpoints, step through pipeline execution and check the ALU condition codes, examine a thread's PC and status, examine register contents, and examine memory (SRAM, SDRAM) contents. We provide an integrated debugger; users can run simulations in debug mode, which is very helpful for debugging both benchmark applications and NePSim itself.

    11. IVERI - a Verification Tool. Verifying NePSim against the IXP1200 is not a trivial task: there are multiple MEs, threads, and memory units; a huge number of events (pipeline and memory) is generated during simulation; and errors have to be pin-pointed by scanning huge log traces. The IVERI tool performs assertion checking based on Linear Temporal Logic (LTL) and Logic of Constraints (LOC). Architectural events are logged in both NePSim and the IXP1200 (with the SDK) as tuples <cycle, PC, alu_out, address, event_type>, where event_type can be pipeline, sram_enq, sdram_deq, etc. LOC assertions specify performance requirements, e.g. the execution time of an instruction in the NePSim pipeline is no more than D cycles away from the execution time of the same instruction on the IXP1200: PC(pipeline(I)) == PC(pipeline_IXP(I)) and |cycle(pipeline(I)) - cycle(pipeline_IXP(I))| <= D. IVERI generates verification code based on the assertions; the verification code scans the log traces and reports the number and locations of constraint violations.
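
    A hedged C sketch of the kind of checker such an assertion could generate (this is not IVERI's actual output; the file names and the simplified parsing, which reads only the cycle and PC fields and assumes the two traces are aligned instruction-for-instruction, are assumptions):

        #include <stdio.h>
        #include <stdlib.h>

        #define D 10   /* allowed cycle deviation; illustrative value */

        int main(void) {
            /* Hypothetical log files holding <cycle, PC, ...> tuples. */
            FILE *sim = fopen("nepsim_pipeline.log", "r");
            FILE *hw  = fopen("ixp1200_pipeline.log", "r");
            long long c_sim, c_hw;
            unsigned  pc_sim, pc_hw;
            long violations = 0, i = 0;

            if (!sim || !hw) { perror("fopen"); return 1; }
            while (fscanf(sim, "%lld %x", &c_sim, &pc_sim) == 2 &&
                   fscanf(hw,  "%lld %x", &c_hw,  &pc_hw)  == 2) {
                i++;
                /* LOC assertion: same PC, cycle times within D of each other. */
                if (pc_sim != pc_hw || llabs(c_sim - c_hw) > D) {
                    printf("violation at instruction %ld: PC %x/%x, cycle %lld/%lld\n",
                           i, pc_sim, pc_hw, c_sim, c_hw);
                    violations++;
                }
            }
            printf("%ld constraint violations\n", violations);
            return 0;
        }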

    12. Performance Validation of NePSim: throughput. (The validation figures are not reproduced in this transcript.)

    13. Power Model. GPR = general-purpose registers; XFER = transfer registers for memory references.
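
    The slide itself shows the model's block diagram rather than its equations. As background only (not a transcription of NePSim's actual model), Wattch-style power models estimate the dynamic power of each unit (ALU, GPR file, XFER registers, control store, and so on) with the standard CMOS switching equation:

        P_{\mathrm{dynamic}} = \alpha \, C_{\mathrm{eff}} \, V_{dd}^{2} \, f

    where alpha is the unit's per-cycle activity factor, C_eff its effective switched capacitance, V_dd the supply voltage, and f the clock frequency. The quadratic dependence on V_dd is also why the DVS scheme in the later slides scales voltage and frequency together.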

    14. Benchmarks.
        Ipfwdr: IPv4 forwarding (header validation, trie-based lookup); medium SRAM access.
        Url: examining the payload for URL patterns, used in content-aware routing; heavy SDRAM access.
        Nat: network address translation; medium SRAM access.
        Md4: message digest (computes a 128-bit message signature); heavy computation and SDRAM access.

    15. Performance Implications. More MEs do not necessarily bring performance gains: more MEs cause more memory contention, and ME idle time is abundant (up to 42%). A faster ME core results in more ME idle time with the same memory. The rcv/xmit configuration for NAT is non-optimal (the transmitting ME is a bottleneck). The speed ratio is the ratio between processor speed and memory speed; memory speed is fixed at 116 MHz, so the bars represent processor speeds of 232 and 464 MHz. All figures are for 4 receiving and 2 transmitting MEs.

    16. Where Does the Power Go? Power dissipation by rcv and xmit MEs is similar across benchmarks; transmitting MEs consume ~5% more than receiving MEs. The ALU consumes significant power, ~45% (Wattch model); the control store uses ~28% (it is accessed almost every cycle); GPRs burn ~13%, the shifter ~7%, and static power accounts for ~7%. The ALU power is high, possibly because we derive it from the Wattch model: Wattch's ALU is more complex than the ME's, and Wattch derives its number from an old reference. The figures show the power consumed by different MEs; the top blue part is the power consumed by the SRAM and SDRAM controllers, command bus, etc. The IX bus is not modeled.

    17. Power Efficiency Observations. Power consumption increases faster than performance: more MEs/threads bring more idle time due to memory contention. This is the motivation for DVS.

    18. Dynamic Voltage Scaling in NPs. During ME idle time, all threads are in the ``wait'' state and the pipeline has no activity. Applying DVS while MEs are not very active can reduce the total power consumption substantially. DVS control scheme (a sketch follows below): observe the ME idle time (%) periodically; when idle > threshold, scale down the voltage and frequency (VF for short) by one step unless the minimum allowable VF is reached; when idle < threshold, scale up the VF by one step unless it is already at the maximum allowable values.
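
    A minimal C sketch of the control loop just described, assuming discrete VF steps; the operating-point table, threshold value, and hook function are illustrative assumptions, with only the 1.3 V / 600 MHz top point taken from the results slide:

        /* Hypothetical DVS controller: called once per DVS period, it moves the
         * ME's voltage/frequency pair one discrete step down when the ME is
         * mostly idle, and one step up otherwise. */
        typedef struct { double vdd; int mhz; } vf_point_t;

        static const vf_point_t vf_table[] = {
            {0.9, 300}, {1.0, 400}, {1.1, 466}, {1.2, 533}, {1.3, 600}
        };
        #define NUM_VF (sizeof(vf_table) / sizeof(vf_table[0]))

        static int    vf_level       = NUM_VF - 1;  /* start at 1.3 V / 600 MHz */
        static double idle_threshold = 0.25;        /* illustrative threshold   */

        extern void set_me_voltage_and_frequency(double vdd, int mhz);  /* hypothetical hook */

        /* Called once per DVS period (e.g. every 15K-30K cycles). */
        void dvs_decide(double idle_fraction) {
            if (idle_fraction > idle_threshold && vf_level > 0)
                vf_level--;                          /* scale VF down one step */
            else if (idle_fraction <= idle_threshold && vf_level < (int)NUM_VF - 1)
                vf_level++;                          /* scale VF back up       */
            set_me_voltage_and_frequency(vf_table[vf_level].vdd,
                                         vf_table[vf_level].mhz);
        }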

    19. DVS Considerations.
        Transition step: whether to use continuous or discrete changes in VF; we use discrete VF steps.
        Transition status: whether the ME is allowed to continue working during VF regulation; we pause the ME while regulating VF.
        Transition time between two different VF states: 10 us [Burd][Shang][Sidiropoulos].
        Transition logic complexity: the overhead of the control circuit that monitors and determines a transition.

    20. DVS Power-Performance. Initial VF = 1.3 V, 600 MHz. DVS period: every 15K, 20K, or 30K cycles a DVS decision is made to reduce or increase the VF. Up to 17% power savings with less than 6% performance loss; on average, 8% power savings with <1% performance degradation. Threshold = ?

    21. Ongoing and Future Work. Extend NePSim to the IXP2400/2800; dynamically shut down/activate MEs; dynamically allocate tasks to MEs; model SRAM and SDRAM module power; integrate StrongARM/XScale simulation. Dynamically allocating tasks means an ME can be configured or instructed to run a rcv/xmit/processing task based on the current system status; for example, if transmitting is the bottleneck, a receiving ME can be turned into a transmitting ME at run time.

    22. NP-Based Projects at UCR.
        Intel IXA Program: NP Architecture Lab - architecture research; CS 162 assignments based on the IXP2400.
        NSF: Design and Analysis of a Web Switch (Layer 5/7 switch) using an NP - TCP splicing, load balancing, etc.
        Intel and UC Micro: Architectures to Accelerate Data Center Servers - TCP and SSL offload to dedicated servers and NPs, XML servers, etc.
        Los Alamos National Lab: Intelligent NP-Based NIC Design for Clusters - O/S bypass protocols, user-level communication, etc.

    23. References.
        [Burd] T. Burd and R. Brodersen, "Design issues for dynamic voltage scaling," International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
        [Shang] L. Shang, L.-S. Peh, and N. K. Jha, "Dynamic voltage scaling with links for power optimization of interconnection networks," 9th International Symposium on High-Performance Computer Architecture, pp. 91-102, 2003.
        [Sidiropoulos] S. Sidiropoulos, D. Liu, J. Kim, G. Wei, and M. Horowitz, "Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers," IEEE Symposium on VLSI Circuits, pp. 124-127, 2000.
