
Integrated Management of Power Aware Computing & Communication Technologies


Presentation Transcript


  1. Integrated Management of Power Aware Computing & Communication Technologies Review Meeting Nader Bagherzadeh, Pai H. Chou, Fadi Kurdahi, UC Irvine; Jean-Luc Gaudiot, USC; Nazeeh Aranki, Benny Toomarian, JPL DARPA Contract F33615-00-1-1719 June 13, 2001 JPL -- Pasadena, CA

  2. Agenda • Administrative • Review of milestones, schedule • Technical presentation • Progress • Applications (UAV/DAATR, Rover, Deep Impact, distributed sensors) • Scheduling (system-level pipelining) • Advanced microarchitecture power modeling (SMT) • Architecture (mode selection with overhead) • Integration (Copper, JPL, COTS data sheet) • Lessons learned • Challenges, issues • Next accomplishments • Questions & action items review.

  3. Quad Chart
• Design flow (diagram): behavior (behavioral system model, high-level components, composition operators) and architecture (parameterizable components, system architecture, busses, protocols) feed high-level simulation, functional partitioning & scheduling, architecture mapping, and system integration & synthesis, yielding a static configuration and dynamic power management
• Innovations • Component-based power-aware design • Exploit off-the-shelf components & protocols • Best price/performance, reliable, cheap to replace • CAD tool for global power policy optimization • Optimal partitioning, scheduling, configuration • Manage entire system, including mechanical & thermal • Power-aware reconfigurable architectures • Reusable platform for many missions • Bus segmentation, voltage/frequency scaling
• Schedule: kickoff 2Q 00; Year 1 ends 2Q 01; Year 2 ends 2Q 02
• Year 1 • Static & hybrid optimizations: partitioning/allocation, scheduling, bus segmentation, voltage scaling • COTS component library • FireWire and I2C bus models • Static composition authoring • Architecture definition • High-level simulation • Benchmark identification
• Year 2 • Dynamic optimizations: task migration, processor shutdown, bus segmentation, frequency scaling • Parameterizable components library • Generalized bus models • Dynamic reconfiguration authoring • Architecture reconfiguration • Low-level simulation • System benchmarking
• Impact • Enhanced mission success • More tasks for the same power • Dramatic reduction in mission completion time • Cost savings over a variety of missions • Reusable platform & design techniques • Fast turnaround time by configuration, not redesign • Confidence in complex design points • Provably correct functional/power constraints • Retargetable optimization to eliminate overdesign • Power protocol for massive scale

  4. Program Overview • Power-aware system-level design • Amdahl's law applies to power as well as performance • Enhance mission success (time, task) • Rapid customization for different missions • Design tool • Exploration & evaluation • Optimization & specialization • Technique integration • System architecture • Statically configurable • Dynamically adaptive • Use COTS parts & protocols

  5. Personnel & teaming plans • UC Irvine - Design tools • Nader Bagherzadeh - PI • Pai Chou - Co-PI • Fadi Kurdahi • Jinfeng Liu • Dexin Li • Duan Tran • USC - Component power optimization • Jean-Luc Gaudiot - faculty participant • Seong-Won Lee - student • JPL - Applications & benchmarking • Nazeeh Aranki • Nikzad “Benny” Toomarian

  6. Milestones & Schedule • Year 1 • Static & hybrid optimizations: partitioning/allocation, scheduling, bus segmentation, voltage scaling • COTS component library • FireWire and I2C bus models • Static composition authoring • Architecture definition • High-level simulation • Benchmark identification • Year 2 • Dynamic optimizations: task migration, processor shutdown, bus segmentation, frequency scaling • Parameterizable components library • Generalized bus models • Dynamic reconfiguration authoring • Architecture reconfiguration • Low-level simulation • System benchmarking

  7. Review of Progress • May'00 Kickoff meeting (Scottsdale, AZ) • Sept'00 Review meeting (UCI) • Scheduling formulation, UI mockup, system-level configuration • Examples: Pathfinder & X-2000 (manual solution) • Nov'00 PI meeting (Annapolis, MD) • Tools: scheduler + UI v.1 (Java) • Examples: Pathfinder & X-2000 (automated) • Apr'01 PI meeting (San Diego, CA) • Tools: scheduler + UI v.2 - v.3 (Jython) • Examples: Pathfinder & initial UAV (pipelined) • June'01 Review meeting -- we are here!

  8. New for this Review (June '01) • Tools • Scheduler + UI v.4 (pipelined, buffer matching) • Mode selector v.1 (mode-change overhead, constraint based) • SMT model • Examples: • Pathfinder, µAMPS sensors (mode selection) • UAV, Wavelet (dataflow) (pipelined, detailed estimates) • Deep Impact (command driven) (planning) • Integration • Input from COPPER: timing/power estimation (PowerPC simulation model) • Output to COPPER: power profile + budget (COPPER compiler) • Within IMPACCT: initial scheduler + mode selector integration

  9. Overview of Design Flow • Input • Tasks, constraints, component library • Estimation (measurement or simulation via COPPER) • Refinement Loop • Scheduling (pipeline/transform…) • Mode Selection (either before or after scheduling) • System level simulation (planned integration) • Output: to COPPER • Interchange Format: • Power Profile, Schedule, Selected modes • Code Generation • Microarchitecture Simulation

  10. Design Flow (diagram) • Task allocation, component selection, and the task model with timing/power constraints drive the IMPACCT scheduler and mode selector, backed by the component library, mode model, and high-level simulator • The resulting power profile and C program feed the COPPER compiler, whose executable runs on the power simulator and low-level simulator, returning power + timing estimation

  11. Power Aware Scheduling • Execution model • Multiple processors, multiple power consumers • Multiple domains: digital, thermal, mechanical • Constraint driven • Min / Max power • Min / Max timing constraints • Handles problems in different domains • Time Driven • System level pipelining -- in time and in space • Parallelism extraction • Experimental results • Coarse to fine grained parallelism tradeoffs

  12. Prototype of GUI scheduling tool • Power-aware Gantt chart • Time view • Timing of all tasks on parallel resources • Power consumption of each task • Power view • System-level power profile • Min/max power constraint, energy cost • Interactive scheduling • Automated schedulers – timing, power, loop • Manual intervention – drag & drop • Demo available

  13. Power-Aware Scheduling • New constraint-based application model [paper at CODES'01] -- see the sketch after this slide • Min/max timing constraints • Precedence, subsumes dataflow, general timing, shared resources • Dependency across iteration boundaries – loop pipelining • Execution delay of tasks – enables frequency/voltage scaling • Power constraints • Max power – total power budget • Min power – controls power jitter or forces utilization of free power sources • System-level, multi-scenario scheduling [paper at DAC'01] • 25% faster while saving 31% energy cost • Exploits "free" power (solar, nuclear min-output) • System-level loop pipelining [working papers] • Borrow time and power across iteration boundaries • Aggressive design space exploration by new constraint classification • Achieves 49% speedup and 24% energy reduction
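To make the constraint classes concrete, here is one hedged formalization in standard scheduling notation; the symbols (s_i for start time, d_i for execution delay, p_i for power draw, l_ij/u_ij for timing bounds) are ours, not the CODES'01 paper's:

    % A minimal sketch of the constraint model, in our own notation.
    \begin{align*}
      s_j &\ge s_i + d_i
        && \text{precedence } i \to j \text{ (subsumes dataflow)} \\
      l_{ij} &\le s_j - s_i \le u_{ij}
        && \text{min/max timing between tasks } i \text{ and } j \\
      P_{\min} &\le \sum_{i \,:\, s_i \le t < s_i + d_i} p_i \le P_{\max}
        && \text{power envelope at every instant } t
    \end{align*}

Loop pipelining then amounts to letting a precedence edge cross an iteration boundary, i.e. relating s_j of iteration k+1 to s_i of iteration k.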

  14. Scheduling case study: Mars Pathfinder • System specification • 6 wheel motors • 4 steering motors • System health check • Hazard detection • Power supply • Battery (non-rechargeable) • Solar panel • Power consumption • Digital • Computation, imaging, communication, control • Mechanical • Driving, steering • Thermal • Motors must be heated in a low-temperature environment

  15. Scheduling case study: Mars Pathfinder • Input • Time-constrained tasks • Min/max power constraints • Rationale: control jitter, ensure utilization of free power • Core algorithm • Static analysis of slack properties • Solves time constraints by branch & bound • Solves power constraints by local movements within slacks (a toy sketch follows this slide) • Target architecture • X-2000-like configurable space platform • Symmetric multiprocessors, multi-domain power consumers, solar/battery • Results • Ability to track power availability • Finishes tasks faster while incurring less energy cost
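The slack-based repair step lends itself to a small illustration. This is a toy sketch under our own assumptions (a discrete timeline, a greedy push-right policy), not the IMPACCT algorithm:

    /* Toy sketch of "local movements within slacks": after the timing
     * pass, each task keeps a slack window [est, lst] for its start
     * time; max-power violations are repaired by sliding a running
     * task right within that window.  All names/policies are ours. */
    #include <stdio.h>

    typedef struct { int start, dur, power, est, lst; } Task;

    /* Total power drawn at discrete time t. */
    static int power_at(const Task *v, int n, int t) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            if (t >= v[i].start && t < v[i].start + v[i].dur)
                sum += v[i].power;
        return sum;
    }

    int main(void) {
        enum { PMAX = 10, HORIZON = 12 };
        Task v[] = {           /* start dur power est lst             */
            { 0, 4, 6, 0, 0 }, /* no slack: pinned at t = 0           */
            { 0, 3, 6, 0, 6 }, /* 6 units of slack for its start time */
        };
        int n = 2;
        for (int t = 0; t < HORIZON; t++)
            while (power_at(v, n, t) > PMAX) {        /* max power violated */
                int moved = 0;
                for (int i = 0; i < n && !moved; i++) /* slide one task right */
                    if (v[i].start <= t && t < v[i].start + v[i].dur
                            && v[i].start < v[i].lst) {
                        v[i].start++;                 /* within its slack */
                        moved = 1;
                    }
                if (!moved) { puts("no feasible local move"); return 1; }
            }
        for (int i = 0; i < n; i++)
            printf("task %d starts at t = %d\n", i, v[i].start);
        return 0;
    }

Here the second task slides from t = 0 to t = 4 so the two 6 W tasks never overlap under the 10 W cap; a real scheduler would also re-check timing constraints after each move.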

  16. More aggressive scheduling: system-level pipelining • Borrow tasks across iterations • Alleviates "hot spots" by spreading work to another iteration • Smooths out utilization by borrowing across iterations • Core techniques • Formulation: separate pseudo dependency from true dependency • Static analysis and task transformation • Augmented scheduler for the new dependency • Results -- on the Mars Pathfinder example • Additional energy savings with speedup • Smoother power profile

  17. Scheduling case study: UAV DAATR • An example of a very different nature! • Algorithm, rather than "system", example • Target architecture • C code -- unspecified; assume sequential execution, no parallelism • MATLAB -- unmapped • Algorithm • Sequential, given in MATLAB or C • Potential parallelism in space, not in time • Constraints & dependencies • Dataflow: partial ordering • Timing: latency; no pairwise min/max timing • Power: budget for different resolutions

  18. Scheduling case study: UAV example (cont'd) • Challenge: parallelism extraction • Essential to enable scheduling • Difficult to automate; needs manual code rewrite • Different pipeline stages must be relatively similar in length • Rewritten code • Inserted checkpoints for power estimation • Error-prone buffer mapping between iterations • Found a dozen bugs in the benchmark C code • Missing summation in the standard deviation calculation • Frame buffer off by one line (an illustrative fix follows this slide) • Dangling pointers not exposed until pipelined
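For flavor, here is a hypothetical reconstruction of the "frame buffer off by one line" pattern named above; the function, names, and dimensions are invented for illustration:

    /* Hypothetical reconstruction of the off-by-one frame-buffer bug
     * class (names and sizes are ours).  The corrected bound is shown;
     * the buggy version looped to ROWS - 1 and silently dropped the
     * last line of every frame -- invisible in serial code, fatal once
     * pipeline stages hand whole frames to each other. */
    #include <string.h>

    #define ROWS 480
    #define COLS 512

    void copy_frame(unsigned char dst[ROWS][COLS],
                    const unsigned char src[ROWS][COLS]) {
        for (int r = 0; r < ROWS; r++)   /* buggy bound: r < ROWS - 1 */
            memcpy(dst[r], src[r], COLS);
    }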

  19. ATR application: what we are given (diagram) • 1 frame → Target Detection → m detections → six FFT blocks → 3 filters → six Filter/IFFT blocks → four ComputeDistance blocks • a "Bugs" callout marks the given flow

  20. Bug report • Misread input data file • OK, no effect on the algorithm • Miscalculated mean, std for image • OK, these values not used (currently) • Wrong filter data for SUN/PowerPC • OK for us, since we operate on different platforms • Bad for SUN/PowerPC users: wrong results • Misplaced FFT module • The algorithm is wrong • However, these problems do not show up in the output image files

  21. What it should look like (diagram) • 1 frame → Target Detection → m detections → two FFT blocks → 3 filters → six Filter/IFFT blocks → four ComputeDistance blocks → k distances

  22. What it really should look like (diagram) • 1 frame → Target Detection → m detections → two FFT blocks → 3 filters → six Filter/IFFT blocks → four ComputeDistance blocks → k distances

  23. Problems • Limited parallelism • Serial data flow with tight dependencies • Parallelism available (different detections, filters, etc.) but limited • Limited ability to extract parallelism • Limited by the serial execution model (C implementation) • No parallel platforms available • Limited scalability • Cannot guarantee response time for big images (O(N²) complexity) • Cannot apply optimization to small images (each block is too small) • Limited system-level knowledge • High-level knowledge lost in a particular implementation

  24. Our vision: 2-dimensional partitioning (diagram) • Vertical flow: a single DFG per frame (Target Detection → m detections → FFT → 3 filters → Filter/IFFT → k distances → ComputeDistance) • Horizontal duplication: cluster N such DFGs for N simultaneous input frames • Partitioning (horizontal cuts) groups the stages: N frames (N target detections), M targets (M FFTs), M targets (3M IFFTs), K distances (2K IFFTs) • Output: target detection with distance for N simultaneous frames

  25. System-level blocks (diagram) • Input: N simultaneous frames → Target Detection (N frames, N target detections) → FFT (M targets, M FFTs) → Filter/IFFT (M targets, 3M IFFTs) → Compute Distance (K distances, 2K IFFTs) → Output: target detection with distance for N simultaneous frames

  26. Our vision (diagram) • Each stage -- Target Detection, FFT, Filter/IFFT, Compute Distance -- replicated into four parallel instances, all running concurrently

  27. System-level pipelining (diagram) • Input: N simultaneous frames • The four stages (Target Detection → FFT → Filter/IFFT → Compute Distance) run concurrently, with work groups (Group 0 through Group 5) staggered so that each stage processes a different group at each pipe step • Output: target detection with distance for N simultaneous frames

  28. What does it buy us? • Parallelism • All modules run in PARALLEL • Each module processes N (M, K) INDEPENDENT instances, which could all be processed in parallel • NO DATA DEPENDENCY between modules • Throughput • Throughput scales with the number of processing units • Processes N frames at a reduced response time • Better utilization of resources

  29. What does it buy us? (cont'd) • Flexibility • Insert / remove modules at any time • Adjust N, (M or K) at any time • Make each module parallel / serial at any time • More knobs to tune: parallelism / response time / throughput / power • Driven by run-time constraints • Scalability • Reduced response time on big images (small N and/or deeper pipe) • Better utilization/throughput on small images • More compiler support • Simple control / data flow: each module is just a simple loop, which is essentially parallel • Need an automatic partitioning tool to take horizontal cuts

  30. What does it buy us: how power-aware is it? • Subsystem shutdown • Turn on/off any time based on power budget • Split/merge (migrate) modules on demand • Power-aware scheduling • Each task can be scheduled at any time during one pipe stage, since they are totally independent • More scheduling opportunity with an entire system • Dynamic voltage/frequency scaling (a sketch follows this slide) • The amount of computation N (M or K) is known ahead of time • Scaling factor = C / N (very simple!) • Less variance in code behavior => strong guarantee to meet deadlines, more accurate power estimates • Run-time code versioning • Select the right code based on N (M or K)
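A minimal sketch of the scaling knob, assuming the simple proportionality above; the function and all numbers are illustrative, not IMPACCT code:

    /* Because the workload N per pipe stage is known ahead of time,
     * the clock can be set just fast enough to finish the stage by its
     * deadline.  Names and constants are our own assumptions. */
    #include <stdio.h>

    /* cycles_per_item: worst-case cycles for one frame/target        */
    /* n_items:         workload (N, M, or K) for this stage          */
    /* stage_s:         length of one pipe stage in seconds           */
    static double required_freq_hz(double cycles_per_item,
                                   int n_items, double stage_s) {
        return cycles_per_item * n_items / stage_s;
    }

    int main(void) {
        /* e.g. 2M cycles per frame, N = 4 frames, a 1/8 s pipe stage */
        double f = required_freq_hz(2.0e6, 4, 0.125);
        printf("clock the stage at >= %.0f MHz\n", f / 1e6); /* 64 MHz */
        return 0;
    }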

  31. Experimental implementation: pipelining transformation • Goal • To make everything completely independent • Methodology • Dataflow graph extraction (vertical) • Initial partitioning (currently manual, with some aid from COPPER) • Horizontal clustering • Horizontal cut (final partitioning) • Techniques • Buffer assignment: each module gets its own buffer • Buffer renaming: read/write on different buffers • Circular buffer: each module gets a window of fixed buffer size • Our approach: a combination of the three

  32. Buffer rotation (diagram) • A circular buffer B is shared by pipe stages a, b, c, d • At each step (Time = 0 through Time = 5) every stage works on a different slot, and the stage-to-slot assignment rotates around the buffer

  33. Background - acyclic dataflow (diagram: stages a → b → c → d) • Single circular buffer: one serial dataflow path; all data flows are of the same type and size • Multiple buffers: multiple dataflow paths; different types and sizes

  34. A more complete picture (diagram) • Circular buffers A and B, pipe stages a, b, c, d, snapshots at Time = 0 through Time = 5 • A head pointer advances over the slots, which pass through four states: 1. buffer ready (raw data, e.g. ATR images) 2. buffer live 3. lifetime spent in pipeline 4. buffer dead

  35. How does it work? • Raw data is dumped into the buffer from the data sources • A head pointer keeps incrementing • The buffer is ready, but not yet live (active in the pipeline) • For example, ATR image data coming from sensors • The buffer becomes live in the pipeline • Raw data is consumed and/or forwarded • New data is produced/consumed • When a buffer is no longer needed by any pipeline stage, it is dead and recycled • Is everything really independent? • Yes! • At each snapshot, each module is operating on different data (see the sketch below)
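The following toy C model (our own structure, not IMPACCT code) walks a head pointer through a circular buffer and shows that the four stages always touch different slots:

    /* Toy model of slides 32-35: raw data claims a "ready" slot as the
     * head pointer advances; pipe stage s (a-d) then works on the slot
     * filled s steps earlier, so no two stages ever share a slot, and
     * the slot behind the last stage is dead and gets recycled. */
    #include <stdio.h>

    #define SLOTS  6   /* circular buffer size (must be >= STAGES) */
    #define STAGES 4   /* pipe stages a, b, c, d                   */

    int main(void) {
        int head = 0;                  /* next slot to fill          */
        for (int t = 0; t < 8; t++) {  /* one pass = one pipe step   */
            printf("t=%d: slot %d ready;", t, head % SLOTS);
            head++;
            for (int s = 0; s < STAGES; s++) {
                int idx = head - 1 - s;  /* item stage s works on    */
                if (idx >= 0)            /* pipe still filling up    */
                    printf(" %c->slot %d", 'a' + s, idx % SLOTS);
            }
            printf("\n");  /* slot head-1-STAGES (if any) is now dead */
        }
        return 0;
    }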

  36. What are we trading off? (diagram) • A three-way trade space: Speed (computation intensity, parallelism, throughput, power) vs. Time (response time, delay) vs. Workload (amount of computation, energy)

  37. 3-D design space navigation (diagram) • Axes: Speed, Time, Workload (N frames) • Example points: N = 2, t = T/2; N = 4, t = T/4

  38. Design flow (diagram) • C source code → IMPACCT pipeline code versioning (via DFG extraction) → pipelined C source code → COPPER power simulator → a 3-D table indexed by Power, Time, and Workload (P, T, N) → IMPACCT scheduler and mode selection, driven by task-level and system-level constraints → power-aware schedule

  39. Scheduling case study: Wavelet compression (JPL) • Algorithm in C • Wavelet decomposition • Compression: a "knob" to choose a lossy factor or lossless • Example category • Dataflow, similar to DAATR • Finer grained, better structured • IMPACCT improvements • Transformation to enable pipelining • Exploit the lossy factor in the trade space

  40. Wavelet Algorithm • Wavelet Decomposition • Quantization • Entropy coding
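The slides do not give JPL's actual filter, so as a stand-in here is one level of the classic reversible integer wavelet (the S-transform) in C; quantization and entropy coding would follow it:

    /* One level of a lossless integer wavelet decomposition -- the
     * classic S-transform (pairwise difference + average), shown only
     * as a stand-in for the unspecified decomp()/FWT in the JPL code. */
    #include <stdio.h>

    /* Split in[0 .. 2n-1] into n approximation (lo) and n detail (hi)
     * coefficients.  Reusing the same truncating division in the
     * inverse (b = lo - hi/2; a = b + hi) makes it exactly reversible. */
    static void s_transform(const int *in, int *lo, int *hi, int n) {
        for (int i = 0; i < n; i++) {
            int a = in[2 * i], b = in[2 * i + 1];
            hi[i] = a - b;           /* detail: pairwise difference   */
            lo[i] = b + hi[i] / 2;   /* approximation: ~pair average  */
        }
    }

    int main(void) {
        int in[8] = { 10, 12, 9, 7, 8, 8, 15, 1 }, lo[4], hi[4];
        s_transform(in, lo, hi, 4);
        for (int i = 0; i < 4; i++)
            printf("lo[%d]=%d hi[%d]=%d\n", i, lo[i], i, hi[i]);
        return 0;
    }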

  41. Wavelet Algorithm structure (diagram) • Initialization (check params, allocate memory) • For all image blocks: block init, set params, read image block → decomp() (lossless FWT) → bit_plane_decomp (set decomp param; 1st-level entropy coding) → output result to file (bit-plane encoding) • Blocks execute sequentially • No data dependency between image blocks (remove overlap)

  42. Wavelet: experiments • Experiments being conducted • Checkpoints marked up manually • Initial power estimation obtained • Code being manually rewritten / restructured for pipelining • Appears better structured than UAV example • Trade space • High performance to low power • Pipelining in space and in time, similar to UAV example • Lossy compression parameter

  43. Ongoing scheduling case study: Deep Impact • "Planning"-level example • Coarse grained, system level • Hardware architecture • COTS PowerPC 750 babybed, emulating a rad-hard PPC at 4x => models the X-2000 architecture using DS1 software • COTS PowerPC 603e board, emulating I/O devices in real time • Software architecture • vxWorks, static priority driven, preemptive • JPL's own software architecture -- command based • 1/8-second time steps; 1-second control loops • Task set • 60 tasks to schedule, 255 priority levels

  44. NASA Deep Impact project • Platform • X-2000 configurable architecture • to use the RAD750 (rad-hard PowerPC 750 @ 133 MHz) • Testbed (JPL Autonomy Lab) • PPC 750 single-board computer -- runs flight software • Prototype @ 233 MHz, real flight @ 133 MHz • COTS board, L1 only, no L2 cache • PowerPC 603e -- emulates the I/O devices • connected via compact PCI • DS1: Deep Space One (legacy flight software) • Software architecture: • 8 Hz ticks, command based (a toy dispatch loop follows this slide) • running on top of vxWorks • Perfmon: performance-monitoring utility in DS1 • 11 test activities • 60 tasks
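As a toy rendition of the two software layers (ours, not DS1 flight code): the RTOS side delivers 8 Hz ticks, and the command layer dispatches whatever is due on each 1/8 s step, with a 1 s control loop on top:

    /* Toy command-driven loop: 8 Hz ticks (1/8 s steps), a 1 s control
     * loop, and time-tagged commands dispatched when due.  Command
     * names and times are invented for illustration. */
    #include <stdio.h>

    #define TICK_HZ 8

    typedef struct { int due_tick; const char *name; } Command;

    int main(void) {
        Command cmds[] = { { 3, "heater_on" }, { 8, "take_image" },
                           { 14, "downlink" } };  /* sorted by due_tick */
        int n = 3, next = 0;
        for (int tick = 0; tick < 2 * TICK_HZ; tick++) {   /* 2 seconds */
            if (tick % TICK_HZ == 0)              /* 1 s control loop   */
                printf("t=%ds: control loop\n", tick / TICK_HZ);
            while (next < n && cmds[next].due_tick == tick)
                printf("  tick %d: dispatch %s\n", tick, cmds[next++].name);
        }
        return 0;
    }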

  45. Deep Impact example (cont'd) • Available form: real-time traces • Collected using the Babybed • 90 seconds of trace, time-stamped tasks, L1 cache • Input needed • Algorithm (not available) • Timing/power constraints (easy) • Functional constraints • Sequence of events • Combinations of illegal modes • Challenges • Modeling two layers of software architecture (RTOS + command)

  46. Design Flow (diagram, repeated from slide 10) • Task allocation, component selection, and the task model with timing/power constraints drive the IMPACCT scheduler and mode selector, backed by the component library, mode model, and high-level simulator • The resulting power profile and C program feed the COPPER compiler, whose executable runs on the power simulator and low-level simulator, returning power + timing estimation

  47. SMT Power Simulator • Simulator Features • Compatible with SimpleScalar 3.0b • Execute PISA and EV6 binaries • Portability – Run on most kinds of computers • Handling Simultaneous Multithreading • Run up to 8 threads simultaneously • Similar to UW SMT model • Power Aware Features • Same analytic power model as WATTCH • Clock Gating • Parameterized Models • 42 functional unit classifications (WATTCH has 12) • 10 dynamic activity factors (WATTCH has 4)

  48. Examples of Module Classification

  Event                      SMT Power Simulator units       WATTCH units
  Cache hit                  cache tag, cache array          cache
  Cache miss                 cache tag ×2, cache array ×2    --
  Normal integer operation   integer ALU, integer reg        ALU, integer reg
  Integer mult operation     integer MULT, integer reg       --
  Normal FP operation        FP ALU, FP reg                  --
  FP mult operation          FP MULT, FP reg                 --

• Functional units include • Arithmetic units: ALU, FPU, etc. • Control units: instruction decoder, etc. • Memory units: caches, CAM, etc. • Buses: result bus • Cache access • Cache hit: read tag & data • Cache miss: read tag; update tag & data; read data • Arithmetic operations: 4 groups • Int ALU: +, -, bit operations • Int MULT: ×, ÷ • FP ALU: +, - • FP MULT: ×, ÷
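The event-to-unit mapping above suggests the accounting loop below; a heavily simplified sketch of the WATTCH-style analytic model, with invented unit energies and activity factors:

    /* Simplified per-unit accounting: each event charges the energy of
     * every unit it touches, scaled by an activity factor; untouched
     * units are assumed clock gated and charge nothing.  The unit
     * energies below are invented placeholders, not measured values. */
    #include <stdio.h>

    enum Unit { CACHE_TAG, CACHE_ARRAY, INT_ALU, INT_REG, NUNITS };

    static const double unit_energy_nj[NUNITS] = { 0.10, 0.45, 0.30, 0.15 };
    static double total_nj;

    static void touch(enum Unit u, double activity, int times) {
        total_nj += unit_energy_nj[u] * activity * times;
    }

    int main(void) {
        touch(CACHE_TAG, 1.0, 1); touch(CACHE_ARRAY, 1.0, 1); /* cache hit */
        touch(CACHE_TAG, 1.0, 2); touch(CACHE_ARRAY, 1.0, 2); /* miss: x2  */
        touch(INT_ALU, 0.8, 1);   touch(INT_REG, 0.8, 1);     /* int op    */
        printf("accumulated energy: %.2f nJ\n", total_nj);
        return 0;
    }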

  49. SMT Power Simulator • Project Status • Performance Simulator – done • Power Simulator – implementation done • Power parameter verification ongoing • Verification Methodology • Analytic model • Proven models from WATTCH • Comparison with COTS processors • Motorola PowerPC 7450 • Intel mobile Pentium III • Alpha 21264

  50. Example of Verification with COTS Processors

  PowerPC 7450 power consumption:
  Freq (MHz)  Voltage (V)  Typ (W)  Max (W)  Max Vec (W)  Doze (W)  Nap (W)  Sleep (W)  Deep Sleep (W)
  533         1.8          14.3     15.2     17.8         TBD       1.6      0.8        0.41
  600         1.8          16.1     17.1     20.0         TBD       1.8      0.9        0.46
  667         1.8          17.9     19.0     22.2         TBD       2.1      1.0        0.51

• Typical/Maximum power consumption • Typical -> average power consumption of applications • Maximum -> peak power consumption of applications • Benchmark simulations are needed to verify • Modules in operation • Deep Sleep: nothing -> static power dissipation • Sleep: PLL working -> static + PLL power dissipation • Nap: bus snooping -> static + PLL + I/O power dissipation • Doze: no instruction fetch -> no information
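Reading the table the way the slide does, the sleep states let the "typical" number be peeled apart; a back-of-the-envelope decomposition of the 667 MHz row, not a Motorola specification:

    /* Deep Sleep ~ static power; Sleep adds the PLL; Nap adds bus
     * snooping (I/O); the remainder of Typ is core dynamic power.
     * The breakdown is our own illustration of the slide's reading. */
    #include <stdio.h>

    int main(void) {
        double typ = 17.9, deep_sleep = 0.51, sleep = 1.0, nap = 2.1;
        double pll = sleep - deep_sleep; /* ~0.49 W for the PLL        */
        double io  = nap - sleep;        /* ~1.10 W snooping / I/O     */
        double dyn = typ - nap;          /* ~15.8 W core dynamic power */
        printf("static %.2f W, PLL %.2f W, I/O %.2f W, dynamic %.2f W\n",
               deep_sleep, pll, io, dyn);
        return 0;
    }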
