
Power-aware Design - Part II Reduction & Management


Presentation Transcript


  1. Power-aware Design - Part II: Reduction & Management EE202A (Fall 2004): Lecture #9

  2. Reading List for This Lecture
  • Required
  • Sandy Irani, Sandeep Shukla, and Rajesh Gupta. Online Strategies for Dynamic Power Management in Systems with Multiple Power Saving States. ACM Transactions on Embedded Computing Systems, August 2003. http://portal.acm.org/citation.cfm?id=860180&jmp=cit&dl=GUIDE&dl=ACM
  • V. Raghunathan, C. Pereira, M. B. Srivastava, and R. K. Gupta. Energy-aware Wireless Systems with Adaptive Power-Fidelity Trade-offs. Accepted for IEEE Transactions on VLSI Systems. http://www.ee.ucla.edu/~vijay/files/tvlsi04_dvs.pdf
  • V. Raghunathan, S. Ganeriwal, C. Schurgers, and M. B. Srivastava. Energy Efficient Wireless Packet Scheduling and Fair Queuing. ACM Transactions on Embedded Computing Systems, February 2004. http://www.ee.ucla.edu/~vijay/files/tecs04_wfq.pdf
  • Recommended
  • C. Schurgers, V. Raghunathan, and M. B. Srivastava. Power Management for Energy-aware Communication Systems. ACM Transactions on Embedded Computing Systems, August 2003. http://www.ee.ucla.edu/~vijay/files/tecs03_dpm.pdf
  • F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, Milwaukee, WI, USA, 23-25 Oct. 1995, pp. 374-382.
  • F. Gruian. Hard Real-time Scheduling for Low-Energy Using Stochastic Data and DVS Processors. Proceedings of the 2001 ACM International Symposium on Low Power Electronics and Design, August 2001, pp. 46-51.
  • M. Anand, E. Nightingale, and J. Flinn. Self-tuning Wireless Network Power Management. ACM MobiCom 2003. http://portal.acm.org/citation.cfm?id=939004&jmp=indexterms&coll=portal&dl=GUIDE
  • Yung-Hsiang Lu, Luca Benini, and Giovanni De Micheli. Power-Aware Operating Systems for Interactive Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 2, April 2002. http://citeseer.nj.nec.com/lu02poweraware.html
  • Others: none

  3. Power in Digital Hardware

  4. Power Consumption in CMOS Digital Logic • Dynamic power consumption • charging and discharging capacitors • Short circuit currents • short circuit path between supply rails during switching • Leakage • leaking diodes and transistors • problem even when in standby!

  5. Power Consumption in CMOS Digital Logic (contd.) P = A·C·V²·f + A·Isw·V·f + Ileak·V, where A = activity factor (probability of a 0→1 transition), C = total chip capacitance, V = voltage swing (usually near the power supply voltage), f = clock frequency, Isw = short-circuit current when a logic level changes, Ileak = leakage current in diodes and transistors
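
The formula above can be turned into a small calculator; a minimal sketch, with all parameter values purely illustrative (not taken from any real process):

```python
def cmos_power(A, C, V, f, I_sw, I_leak):
    """Total CMOS power per the slide's formula: dynamic + short-circuit + leakage."""
    dynamic = A * C * V**2 * f        # charging/discharging capacitance
    short_circuit = A * I_sw * V * f  # supply-rail current during transitions
    leakage = I_leak * V              # flows even in standby
    return dynamic + short_circuit + leakage

# Purely illustrative values: 10 nF switched C, 3.3 V supply, 100 MHz, A = 0.15
p = cmos_power(A=0.15, C=10e-9, V=3.3, f=100e6, I_sw=1e-10, I_leak=1e-6)
print(f"{p:.3f} W")   # dominated by the dynamic term at these values
```

At realistic parameter values the quadratic dynamic term dominates, which is why the remaining slides focus on reducing V.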

  6. Why not simply lower V? • Total P can be minimized by lowering V • lower voltages are a natural result of smaller feature sizes • But… transistor speeds decrease dramatically as V is reduced close to the "threshold voltage" • performance goals may not be met • td = C·V / (k·(V − Vt)^α), where α is between 1 and 2 • Why not lower this "threshold voltage"? • doing so worsens the noise margin and Ileak! • Need to do smarter voltage scaling!
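
The delay model above makes the slowdown near the threshold voltage concrete; a sketch with assumed values Vt = 0.7 V and α = 1.5 (k and C normalized to 1):

```python
def gate_delay(V, Vt=0.7, alpha=1.5, k=1.0, C=1.0):
    """Alpha-power-law delay: t_d = C*V / (k * (V - Vt)**alpha), valid for V > Vt."""
    assert V > Vt, "gate does not switch below the threshold voltage"
    return C * V / (k * (V - Vt) ** alpha)

# Normalized delay relative to a 3.3 V supply: grows steeply as V approaches Vt
ref = gate_delay(3.3)
for V in (3.3, 2.0, 1.2, 0.9):
    print(f"V = {V:.1f} V -> delay = {gate_delay(V) / ref:.1f}x")
```

This reproduces the shape of the speed-vs-voltage curve on the next slide: a modest delay penalty at mid-range voltages, then a blow-up close to Vt.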

  7. Approaches to Energy Efficiency P = A·C·V²·f • "Event-Driven": latency is important (burst throughput) • make f low or 0 • shut down when inactive • e.g., X display server, disk I/O • "Continuous": only throughput is important • reduce V • increase h/w and algorithmic concurrency • e.g., speech coding, video compression • Reduce C and A • communication • energy-efficient s/w • system partitioning • efficient circuits & layouts

  8. Speed vs. Voltage [figure: normalized gate delay vs. supply voltage — delay grows from 1.0 at 3.0 V to roughly 7.0 as the supply approaches 1.0 V]

  9. Reducing the Supply Voltage: an Architectural Approach • Operate at reduced voltage and lower speed • Use architecture optimization to compensate for the slower operation • e.g. concurrency, pipelining via compiler techniques • Architecture bottlenecks limit voltage reduction • degradation of speed-up • interconnect overheads • Similar idea for memory: slower and parallel • Trade off AREA for lower POWER

  10. Example: Voltage-Parallelism Trade-off [figure: speedup vs. parallelism N (1-8), shown against the ideal speedup, alongside normalized delay vs. supply voltage (1.0-3.0 V)]

  11. Example: Reference Datapath (from "Digital Integrated Circuits" by Rabaey) • Critical path delay: Tadder + Tcomparator = 25 ns • Frequency: fref = 40 MHz • Total switched capacitance = Cref • Vdd = Vref = 5 V • Power for reference datapath = Pref = Cref·Vref²·fref

  12. Parallel Datapath (from "Digital Integrated Circuits" by Rabaey) • The clock rate can be reduced by 2x with the same throughput: fpar = fref/2 = 20 MHz • Total switched capacitance = Cpar = 2.15·Cref • Vpar = Vref/1.7 • Ppar = (2.15·Cref)(Vref/1.7)²(fref/2) ≈ 0.36·Pref

  13. Pipelined Datapath (from "Digital Integrated Circuits" by Rabaey) • fpipe = fref, Cpipe = 1.1·Cref, Vpipe = Vref/1.7 • Voltage can be dropped while maintaining the original throughput • Ppipe = Cpipe·Vpipe²·fpipe = (1.1·Cref)(Vref/1.7)²·fref ≈ 0.37·Pref
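
The power ratios in the last two slides follow directly from P = C·V²·f; a quick sanity check (using Vref/1.7 exactly gives values a hundredth or so above the slides' 0.36 and 0.37, which come from a rounded scaled voltage):

```python
v_scale = 1 / 1.7                  # V_par = V_pipe = V_ref / 1.7

# Parallel datapath: 2.15x switched capacitance at half the clock rate
p_par = 2.15 * v_scale**2 * 0.5
# Pipelined datapath: 1.1x switched capacitance at the original clock rate
p_pipe = 1.1 * v_scale**2 * 1.0

print(f"P_par  = {p_par:.2f} P_ref")    # ~0.37
print(f"P_pipe = {p_pipe:.2f} P_ref")   # ~0.38
```

Either way, both architectures buy roughly a 3x power reduction at the same throughput, at the cost of extra area.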

  14. Datapath Architecture-Power Trade-off Summary

  15. Example of Voltage Scaling [figure: ideal vs. actual speedup and % communication overhead vs. number of processors N (1-8); supply voltage at fixed throughput vs. N; normalized power vs. N, showing a ~3.3x power reduction]

  16. Low Power Software Techniques

  17. Low-power Software • Wireless industry → constantly evolving standards • Systems have to be flexible and adaptable • A significant portion of system functionality is implemented as software running on a programmable processor • Software drives the underlying hardware • Hence, it can significantly impact system power consumption • Significant energy savings can be obtained by clever software design

  18. Low-power Software Strategies [diagram: CPU - Cache - Memory] • Code running on the CPU → code optimizations for low power • Code accessing memory objects → s/w optimizations for memory • Data flowing on the buses → I/O coding for low power • Compiler-controlled power management

  19. Code Optimizations for Low Power • High-level operations (e.g. C statement) can be compiled into different instruction sequences • different instructions & ordering have different power • Instruction Selection • Select a minimum-power instruction mix for executing a piece of high level code • Instruction Packing & Dual Memory Loads • Two on-chip memory banks • Dual load vs. two single loads • Almost 50% energy savings

  20. Code Optimizations for Low Power (contd.) • Reorder instructions to reduce switching effect at functional units and I/O buses • E.g. Cold scheduling minimizes instruction bus transitions [Su94] • Operand swapping • Swap the operands at the input of multiplier • Result is unaltered, but power changes significantly! • Other standard compiler optimizations • Intermediate level: Software pipelining, dead code elimination, redundancy elimination • Low level: Register allocation and other machine specific optimizations • Use processor-specific instruction styles • e.g. on ARM the default int type is ~ 20% more efficient than char or short as the latter result in sign or zero extension • e.g. on ARM the conditional instructions can be used instead of branches

  21. Minimizing Memory Access Costs • Reduce memory accesses; make better use of registers • A register access consumes far less power than a memory access • Straightforward way: minimize the number of read-write operations • Cache optimizations • Reorder memory accesses to improve cache hit rates • Can use existing techniques for high-performance code generation

  22. Minimizing Memory Access Costs (contd.) • Loop optimizations such as loop unrolling and loop fusion also reduce memory power consumption • More effective: explicitly target minimizing switching activity on I/O busses and exploiting the memory hierarchy • Data allocation to minimize I/O bus transitions • e.g. mapping large arrays with known access patterns to main memory to minimize address bus transitions • works in conjunction with coding of address busses • Exploiting the memory hierarchy • e.g. organizing video and DSP data to maximize use of the higher (lower-power) levels of the memory hierarchy

  23. Energy Efficient I/O Encoding • The C of system busses is >> C inside chips • a large amount of power goes to I/O interfaces • 10-15% in µPs, 25-50% in FPGAs, 50-80% in logic • encoding bus data can reduce the power significantly • but need to handle the encoding/decoding cost (power, latency) [diagram: Subsystem #1 and Subsystem #2 connected by a bus, each with an encoder (ENC) and decoder (DEC), plus control lines]

  24. Examples • Compression to remove redundancy • Gray code on address busses • addresses usually increment sequentially by 1 • modified code that increments by 4 or 8 for word-oriented CPUs • T0 code for address busses • add a redundant INC line • INC=0: the address equals the bus lines • INC=1: Tx freezes the other bus lines, and Rx increments the previously transmitted address by a pre-agreed stride • Better than Gray code: asymptotically zero transitions for sequential accesses
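
The transition-count advantage of Gray coding on sequential addresses is easy to verify; a sketch using the standard binary-reflected Gray code over 64 sequential word addresses:

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def transitions(codes):
    """Total number of bus lines that toggle across a sequence of codes."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))

addrs = list(range(64))                         # 64 sequential word addresses
t_bin = transitions(addrs)                      # plain binary addressing
t_gray = transitions([gray(a) for a in addrs])  # Gray: exactly 1 toggle per step
print(t_bin, t_gray)
```

Gray coding toggles exactly one line per increment (63 toggles here vs. 120 for plain binary); the T0 code described above does even better on purely sequential streams, since the frozen lines never toggle at all.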

  25. Examples (contd.) • Bus-Invert Coding • transmit D or invert(D), whichever results in fewer transitions from the previously transmitted code • an extra signal indicates the polarity • performance • at most N/2 lines switch • on average: the code is optimal among 1-bit-redundancy codes • better for small N (25% savings for N=2, 18.2% for N=8, 14.6% for N=16) • partition into k sub-busses with k polarity bits • but then no longer optimal among redundant codes • Encode based on statistical analysis of bus traces • calculate spatio-temporal correlation (on-line or off-line)
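
A minimal sketch of bus-invert coding for an 8-bit bus, assuming the bus starts at all zeros; counting the polarity line in the transition total is an implementation choice made here for a fair comparison:

```python
def bus_invert(words, width=8):
    """Bus-invert coding: send each word as-is or complemented, whichever flips
    fewer bus lines; an extra polarity line signals the choice to the receiver."""
    mask = (1 << width) - 1
    prev, prev_pol = 0, 0          # bus assumed to start at all zeros
    encoded, total = [], 0
    for w in words:
        if bin(prev ^ w).count("1") > width // 2:
            w, pol = w ^ mask, 1   # inverting flips fewer data lines
        else:
            pol = 0
        total += bin(prev ^ w).count("1") + (pol ^ prev_pol)  # incl. polarity line
        encoded.append((w, pol))
        prev, prev_pol = w, pol
    return encoded, total

enc, total = bus_invert([0x00, 0xFF, 0x0F, 0xF0])
print(total)   # 7 transitions vs. 20 on the unencoded bus
```

The worst-case guarantee is visible in the decision rule: after encoding, at most width/2 data lines can ever switch per transfer.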

  26. Examples (contd.) • Mixed bus encoding T0_BI • Uses two redundant lines: INC and INV • Good for shared address/data busses • Use the SEL line of the bus to distinguish data from addresses • Use T0 when SEL indicates an address, BI otherwise • Choice depends on the type of bus • data busses: traffic resembles random white noise • address busses: spatio-temporal correlations

  27. Example: Normalized # of Transitions for Typical UNIX Files [chart from [Stan97]; unencoded bus = 100%]

  28. Power Management via Shutdown

  29. Shutdown for Energy Saving • Subsystems may have small duty factors • CPU, disk, and wireless interface are often idle • Huge difference between "on" & "off" power • Some low-power CPUs: StrongARM 400 mW (active) / 50 mW (idle) / 0.16 mW (sleep) • 2.5" hard disk [Harris95]: 1.35 W (idle, spinning) / 0.4 W (standby) / 0.2 W (sleep) / 4.7 W (start-up) [diagram: alternating Active "on" (Tactive) and Blocked "off" (Tblock) periods] • Ideal improvement = 1 + Tblock/Tactive

  30. Potential CPU Power Reduction in a Wireless X Terminal • 96-98% of the time is spent in the blocked state • Average time in the blocked state is short (<< a second)

  31. Generic Power-managed System • An abstract & flexible interface between power-manageable components (chips, disk driver, display driver, etc.) & the power manager • but need insight on how & when to power manage • the power management policy • Essentially the PM is a controller that needs to be synthesized • Components (service providers) with several internal states • corresponding to power and service levels • can be abstracted as a power state machine • power and service annotations on states • power and delay annotations on edges [diagram: the Power Manager observes the Service Requestor, the request Queue, and the Service Provider, and issues commands (on, off) to the Service Provider]

  32. Example: SA-1100 CPU • RUN: 400 mW • IDLE: 50 mW • CPU stopped when not in use • monitoring for interrupts • SLEEP: 0.16 mW • on-chip activity shut down [state machine: RUN↔IDLE transitions take about 10 µs each way; RUN→SLEEP about 90 µs; SLEEP→RUN about 160 ms]

  33. Example: Fujitsu MHF 2043 AT Disk [state machine] • Working: 2.2 W (spinning + I/O) • Idle: 0.95 W (spinning); "I/O done" moves Working→Idle, "read/write" moves Idle→Working • Sleep: 0.13 W (spinning stopped) • spin-up: 4.4 J, 1.6 s • shutdown: 0.36 J, 0.67 s

  34. Example: IBM Mobile Hard Drive

  35. When is DPM useful? • If Ttr = 0 and Ptr = 0, the DPM policy is trivial • stop a component when it is not needed • If, as is usual, Ttr ≠ 0 and Ptr ≠ 0 • shut down only when the idleness is going to be long enough to make it worthwhile • a complex decision if the time spent in a state is not deterministic [diagram: each Active "on" ↔ Blocked "off" transition costs power Ptr over time Ttr]

  36. Problems in Shutdown • Cost of restarting: a latency vs. power trade-off • increase in latency (response time) • e.g. time to save/restore CPU state, spin up the disk • increase in power consumption • e.g. higher start-up current in disks • When to shut down: optimal vs. idle-time threshold vs. predictive • When to wake up: optimal vs. on-demand vs. predictive • Cross-over point for shutdown to be effective

  37. Conventional Reactive Approach "Go to reduced-power mode after the user has been idle for a few seconds/minutes, and restart on demand" [timeline: alternating RUN (Trun[i]) and BLOCK (Tblock[i]) periods; each blocked period consists of an idle-wait overhead, the reduced-power mode, and a wake-up overhead]

  38. Predictive Shutdown Approach "Use computation history to predict whether Tblock[i] is large enough (Tblock[i] ≥ Tcost)" • Example of a heuristic: predict Tblock[i] ≥ Tcost if Trun[i] ≤ Ton_threshold • up to 20x power reduction with 3% slowdown on X server traces • compared to 2x with the non-predictive approach • Eliminates the power wasted while waiting for a time-out
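
The heuristic above can be stated in a few lines; the trace and both threshold values below are hypothetical, chosen only to illustrate the "short run burst → long blocked period" correlation it relies on:

```python
# Predictive rule from the slide: predict Tblock[i] >= Tcost (shut down at once)
# whenever the preceding run burst was short, i.e. Trun[i] <= Ton_threshold.
T_ON_THRESHOLD = 10.0   # ms, hypothetical tuning parameter
T_COST = 50.0           # ms, hypothetical breakeven idle length

# Hypothetical (Trun, Tblock) history: short bursts precede long blocked periods
trace = [(5, 200), (80, 15), (4, 500), (60, 10), (3, 300)]

correct = sum((t_run <= T_ON_THRESHOLD) == (t_block >= T_COST)
              for t_run, t_block in trace)
print(f"{correct}/{len(trace)} correct shutdown decisions")
```

On workloads where the correlation fails, mispredictions either waste energy (needless shutdown) or add latency, which is what motivates the adaptive techniques on later slides.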

  39. Pre-wakeup • System wakeup takes time, adversely hurting performance • One could pre-wake the system by predicting the occurrence of the next wakeup signal [timeline: reactively, a request arriving during sleep waits through the resume delay; with predictive pre-wakeup, resume completes before the request arrives]

  40. Breakeven Point • Breakeven point: the minimum idle time that makes it worthwhile to shut down • DPM is worthwhile when TBE < average Tidle
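
TBE follows from equating the energy of staying on for an idle period T with the energy of shutting down and waking back up, i.e. solving Pon·T = Etr + Psleep·(T − Ttr); a sketch using the Fujitsu disk numbers from slide 33 (the lumping of spin-up and shutdown into one transition cost is an assumption of this model):

```python
def breakeven_time(p_on, p_sleep, e_transition, t_transition):
    """Smallest idle time for which shutting down saves energy:
    solve P_on * T = E_tr + P_sleep * (T - T_tr) for T."""
    t_be = (e_transition - p_sleep * t_transition) / (p_on - p_sleep)
    return max(t_be, t_transition)   # cannot break even within the transition itself

# Fujitsu MHF 2043 AT numbers: 0.95 W idle, 0.13 W sleep,
# spin-up 4.4 J / 1.6 s, shutdown 0.36 J / 0.67 s
t = breakeven_time(p_on=0.95, p_sleep=0.13,
                   e_transition=4.4 + 0.36, t_transition=1.6 + 0.67)
print(f"T_BE = {t:.2f} s")
```

For this disk, idle periods shorter than about 5.4 s are not worth a spin-down, which is why the predictive and stochastic policies in the following slides matter.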

  41. DPM Approaches: Predictive • Exploit the correlation between the recent past & the future • Predict the idle time and schedule shutdown and/or wakeup accordingly • Static techniques • E.g. fixed timeout Tthreshold with on-demand wakeup • works when P(Tidle > Tthreshold + TBE | Tidle > Tthreshold) ≈ 1 • Tthreshold = TBE yields energy consumption no more than 2x worse than the ideal oracle policy • worst case when point activities are separated by Tidle = 2·TBE • Adaptive techniques • E.g. maintain a set of timeout values and track how successful each would have been • E.g. weighted timeouts, with weights based on performance relative to the oracle policy • E.g. increase and decrease the timeout based on its performance
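
The 2x bound for the fixed timeout Tthreshold = TBE can be checked by sweeping idle-period lengths, in a simplified model with zero sleep power and a single lump-sum wake cost (these simplifying assumptions are mine, not the slide's):

```python
P_ON, E_WAKE = 1.0, 10.0
T_BE = E_WAKE / P_ON              # breakeven time with zero sleep power

def cost_timeout(t_idle, t_threshold=T_BE):
    """Stay on for the timeout, then shut down and (later) pay the wake cost."""
    if t_idle <= t_threshold:
        return P_ON * t_idle
    return P_ON * t_threshold + E_WAKE

def cost_oracle(t_idle):
    """Off-line optimum: shut down immediately iff the idle period is long enough."""
    return min(P_ON * t_idle, E_WAKE)

ratios = [cost_timeout(t) / cost_oracle(t) for t in (x / 10 for x in range(1, 500))]
print(f"worst-case ratio = {max(ratios):.2f}")   # 2.00, reached just past T_BE
```

The worst case is an idle period barely longer than TBE: the policy first burns TBE·Pon waiting, then pays the wake cost the oracle would have paid alone.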

  42. DPM Approaches: Stochastic • Predictive approaches handle workload uncertainty • but assume deterministic response and transition times • The abstract system model introduces uncertainty • Predictive algorithms are based on a 2-state model of the system • Real-life systems have multiple power states • decide not only when to change state but also which state to go to • Stochastic approaches formulate DPM policy selection as optimization under uncertainty • Service requestor (SR): a Markov chain with state set R that models the arrival of service requests • Service provider (SP): a controlled Markov chain with S states that models the system. The states represent modes of operation, and transitions are probabilistic; the probabilities are controlled by the power manager. • Power manager (PM): implements a function f: S × R → A from the states of the SP and SR to a set of possible commands A. Each function represents a decision process: the PM observes the state of the system and the workload, takes a decision, and issues a command to control the future state of the system • Cost metrics: associate power and performance values with each system state-command pair in S × R × A • Captures a global view of the system, with possibly multiple inactive states and resources • Performance and power are expected values

  43. Competitive Analysis • DPM is an inherently on-line problem • decisions must be made without seeing the entire input • e.g. there is no way of knowing the length of an idle period until it ends • The competitive ratio is a way to characterize solutions to such problems • it compares the cost of an on-line algorithm with that of an optimal off-line one that knows the input in advance (the "oracle" solution) • An algorithm is c-competitive if, for any input, the cost of the on-line approach is bounded by c times the cost of the optimal off-line approach for the same input • The competitive ratio (CR) of an algorithm is the infimum over all c such that the algorithm is c-competitive • Competitive analysis is done by case analysis of adversarial scenarios or via formal theorem proving • Provides assurance about worst-case performance • but this can be quite pessimistic!

  44. Classical Results on CR • The best CR achieved by any deterministic on-line algorithm is 2 • so the fixed timeout Tthreshold = TBE is optimal in this sense • Methods exist to determine an on-line DPM algorithm for a given idle-period distribution such that, for any distribution, the corresponding DPM algorithm is within a factor of e/(e−1) ≈ 1.58 of the optimal off-line algorithm • The result is tight: there is at least one distribution for which the ratio is exactly e/(e−1)

  45. Multi-state DPM with Optimal CR • Let there be k+1 states • Let state k be the shut-down state and state 0 the active state • Let αi be the energy dissipation rate (power) in state i • Let βi be the total energy dissipated in moving from state i back to state 0 • States are ordered such that αi+1 ≤ αi • αk = 0 and β0 = 0 (without loss of generality) • The power-down energy cost can be incorporated into the power-up cost for analysis (if additive) • Now formulate an optimization problem to determine the state-transition thresholds

  46. Lower Envelope Idea • For each state i, plot energy vs. idle time t as the line αi·t + βi; the strategy follows the lower envelope of these lines (State 1 … State 4), transitioning to the next state at the intersection times t1, t2, t3 • LEA can be deterministic or probabilistic • DLEA is 2-competitive, while PLEA is e/(e−1)-competitive • Learn p(t): On-line Probability Based Algorithm (OPBA) • keep a histogram of the previous w idle intervals and calculate the thresholds from it
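
The lower-envelope thresholds are just the pairwise intersections of the per-state lines αi·t + βi; a sketch with hypothetical (αi, βi) values (not from any real device):

```python
# Hypothetical states as (alpha_i = power rate, beta_i = energy to return to state 0),
# ordered from active (high power, zero wake-up cost) to fully off
states = [(4.0, 0.0), (2.0, 3.0), (1.0, 8.0), (0.0, 20.0)]

def lea_thresholds(states):
    """Deterministic Lower Envelope: descend to state i+1 at the time where the
    line beta_{i+1} + alpha_{i+1}*t crosses below state i's line."""
    return [(b2 - b1) / (a1 - a2)
            for (a1, b1), (a2, b2) in zip(states, states[1:])]

print(lea_thresholds(states))   # thresholds t1, t2, t3 of the lower envelope
```

With αi decreasing and βi increasing, these intersection times come out in increasing order, so the policy simply descends through the states as the idle period stretches on.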

  47. Implementing DPM • Clock gating • Supply shutdown • Display shutdown • Motor shutdown

  48. Shutdown vs. Variable Voltage

  49. Voltage Reduction is Better • Example: a task with a 100 ms deadline requires 50 ms of CPU time at full speed • the normal system gives 50 ms of computation, then 50 ms of idle/stopped time • a half-speed/half-voltage system gives 100 ms of computation and 0 ms of idle • same number of CPU cycles, but the energy drops to 1/4 [diagram: tasks T1 and T2 stretched to fill their periods at lower speed — same work, lower energy]
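
The 1/4 figure follows from E = C·V² per switched cycle with an unchanged cycle count; a sketch with assumed capacitance and clock values (idle and leakage power of the full-speed system are ignored here, which actually favors it):

```python
CYCLES = 50e6            # cycles of work (50 ms at an assumed 1 GHz full-speed clock)
C_EFF, V_FULL = 1e-9, 3.3

def energy(v, cycles=CYCLES, c=C_EFF):
    """Switching energy E = C * V^2 per cycle; idle/leakage power ignored."""
    return c * v**2 * cycles

e_full = energy(V_FULL)       # run at full speed, then sit idle to the deadline
e_half = energy(V_FULL / 2)   # half voltage at half speed, finishing at the deadline
print(e_half / e_full)        # 0.25 -- same cycle count, a quarter of the energy
```

This is the core argument for dynamic voltage scaling: slowing down only wastes the slack that shutdown-based policies would have spent idling anyway.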

  50. Problem with Voltage Reduction • The voltage is dictated by the tightest (critical) timing constraint • not a problem if latency is unimportant • throughput can always be improved by pipelining, parallelism, etc. • but real systems have bursty throughput and latency-critical tasks • Solution: dynamically vary the voltage!
