
Power-aware Design - Part II Reduction & Management


Presentation Transcript


  1. Power-aware Design - Part II: Reduction & Management EE202A (Fall 2004): Lecture #9

  2. Reading List for This Lecture
  • Required
  • Sandy Irani, Sandeep Shukla, and Rajesh Gupta. Online Strategies for Dynamic Power Management in Systems with Multiple Power Saving States. ACM Transactions on Embedded Computing Systems, August 2003. http://portal.acm.org/citation.cfm?id=860180&jmp=cit&dl=GUIDE&dl=ACM
  • V. Raghunathan, C. Pereira, M. B. Srivastava, and R. K. Gupta. Energy-aware Wireless Systems with Adaptive Power-Fidelity Trade-offs. Accepted for IEEE Transactions on VLSI Systems. http://www.ee.ucla.edu/~vijay/files/tvlsi04_dvs.pdf
  • V. Raghunathan, S. Ganeriwal, C. Schurgers, and M. B. Srivastava. Energy Efficient Wireless Packet Scheduling and Fair Queuing. ACM Transactions on Embedded Computing Systems, February 2004. http://www.ee.ucla.edu/~vijay/files/tecs04_wfq.pdf
  • Recommended
  • C. Schurgers, V. Raghunathan, and M. B. Srivastava. Power Management for Energy-aware Communication Systems. ACM Transactions on Embedded Computing Systems, August 2003. http://www.ee.ucla.edu/~vijay/files/tecs03_dpm.pdf
  • F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, Milwaukee, WI, USA, 23-25 Oct. 1995, pp. 374-382.
  • F. Gruian. Hard Real-time Scheduling for Low-Energy Using Stochastic Data and DVS Processors. Proceedings of the 2001 ACM International Symposium on Low Power Electronics and Design, August 2001, pp. 46-51.
  • M. Anand, E. Nightingale, and J. Flinn. Self-tuning Wireless Network Power Management. ACM MobiCom 2003. http://portal.acm.org/citation.cfm?id=939004&jmp=indexterms&coll=portal&dl=GUIDE
  • Yung-Hsiang Lu, Luca Benini, and Giovanni De Micheli. Power-Aware Operating Systems for Interactive Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 2, April 2002. http://citeseer.nj.nec.com/lu02poweraware.html
  • Others: none

  3. Power in Digital Hardware

  4. Power Consumption in CMOS Digital Logic • Dynamic power consumption • charging and discharging capacitors • Short circuit currents • short circuit path between supply rails during switching • Leakage • leaking diodes and transistors • problem even when in standby!

  5. Power Consumption in CMOS Digital Logic (contd.) P = A·C·V²·f + A·Isw·V·f + Ileak·V, where A = activity factor (probability of a 0→1 transition), C = total chip capacitance, V = voltage swing (usually near the power supply voltage), f = clock frequency, Isw = short-circuit current when a logic level changes, Ileak = leakage current in diodes and transistors
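
The formula above can be turned into a small calculator; a minimal sketch, with all parameter values purely illustrative (not taken from any real process):

```python
def cmos_power(A, C, V, f, I_sw, I_leak):
    """Total CMOS power per the slide's formula: dynamic + short-circuit + leakage."""
    dynamic = A * C * V**2 * f        # charging/discharging capacitance
    short_circuit = A * I_sw * V * f  # supply-rail current during transitions
    leakage = I_leak * V              # flows even in standby
    return dynamic + short_circuit + leakage

# Purely illustrative values: 10 nF switched C, 3.3 V supply, 100 MHz, A = 0.15
p = cmos_power(A=0.15, C=10e-9, V=3.3, f=100e6, I_sw=1e-10, I_leak=1e-6)
print(f"{p:.3f} W")   # dominated by the dynamic term at these values
```

At realistic parameter values the quadratic dynamic term dominates, which is why the remaining slides focus on reducing V.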

  6. Why not simply lower V? • Total P can be minimized by lowering V • lower voltages are a natural result of smaller feature sizes • But… transistor speeds decrease dramatically as V is reduced close to the "threshold voltage" • performance goals may not be met • td = C·V / (k·(V − Vt)^α), where α is between 1 and 2 • Why not lower this "threshold voltage"? • doing so worsens the noise margin and Ileak! • Need to do smarter voltage scaling!
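
The delay model above makes the slowdown near the threshold voltage concrete; a sketch with assumed values Vt = 0.7 V and α = 1.5 (k and C normalized to 1):

```python
def gate_delay(V, Vt=0.7, alpha=1.5, k=1.0, C=1.0):
    """Alpha-power-law delay: t_d = C*V / (k * (V - Vt)**alpha), valid for V > Vt."""
    assert V > Vt, "gate does not switch below the threshold voltage"
    return C * V / (k * (V - Vt) ** alpha)

# Normalized delay relative to a 3.3 V supply: grows steeply as V approaches Vt
ref = gate_delay(3.3)
for V in (3.3, 2.0, 1.2, 0.9):
    print(f"V = {V:.1f} V -> delay = {gate_delay(V) / ref:.1f}x")
```

This reproduces the shape of the speed-vs-voltage curve on the next slide: a modest delay penalty at mid-range voltages, then a blow-up close to Vt.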

  7. Approaches to Energy Efficiency P = A·C·V²·f • "Event-Driven": latency is important (burst throughput) • make f low or 0 • shut down when inactive • e.g., X display server, disk I/O • "Continuous": only throughput is important • reduce V • increase h/w and algorithmic concurrency • e.g., speech coding, video compression • Reduce C and A • communication • energy-efficient s/w • system partitioning • efficient circuits & layouts

  8. Speed vs. Voltage [figure: normalized gate delay vs. supply voltage — delay grows from 1.0 at 3.0 V to roughly 7.0 as the supply approaches 1.0 V]

  9. Reducing the Supply Voltage: an Architectural Approach • Operate at reduced voltage and lower speed • Use architecture optimization to compensate for the slower operation • e.g. concurrency, pipelining via compiler techniques • Architecture bottlenecks limit voltage reduction • degradation of speed-up • interconnect overheads • Similar idea for memory: slower and parallel • Trade off AREA for lower POWER

  10. Example: Voltage-Parallelism Trade-off [figure: speedup vs. parallelism N (1-8), shown against the ideal speedup, alongside normalized delay vs. supply voltage (1.0-3.0 V)]

  11. Example: Reference Datapath (from "Digital Integrated Circuits" by Rabaey) • Critical path delay: Tadder + Tcomparator = 25 ns • Frequency: fref = 40 MHz • Total switched capacitance = Cref • Vdd = Vref = 5 V • Power for reference datapath = Pref = Cref·Vref²·fref

  12. Parallel Datapath (from "Digital Integrated Circuits" by Rabaey) • The clock rate can be reduced by 2x with the same throughput: fpar = fref/2 = 20 MHz • Total switched capacitance = Cpar = 2.15·Cref • Vpar = Vref/1.7 • Ppar = (2.15·Cref)(Vref/1.7)²(fref/2) ≈ 0.36·Pref

  13. Pipelined Datapath (from "Digital Integrated Circuits" by Rabaey) • fpipe = fref, Cpipe = 1.1·Cref, Vpipe = Vref/1.7 • Voltage can be dropped while maintaining the original throughput • Ppipe = Cpipe·Vpipe²·fpipe = (1.1·Cref)(Vref/1.7)²·fref ≈ 0.37·Pref
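
The power ratios in the last two slides follow directly from P = C·V²·f; a quick sanity check (using Vref/1.7 exactly gives values a hundredth or so above the slides' 0.36 and 0.37, which come from a rounded scaled voltage):

```python
v_scale = 1 / 1.7                  # V_par = V_pipe = V_ref / 1.7

# Parallel datapath: 2.15x switched capacitance at half the clock rate
p_par = 2.15 * v_scale**2 * 0.5
# Pipelined datapath: 1.1x switched capacitance at the original clock rate
p_pipe = 1.1 * v_scale**2 * 1.0

print(f"P_par  = {p_par:.2f} P_ref")    # ~0.37
print(f"P_pipe = {p_pipe:.2f} P_ref")   # ~0.38
```

Either way, both architectures buy roughly a 3x power reduction at the same throughput, at the cost of extra area.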

  14. Datapath Architecture-Power Trade-off Summary

  15. Example of Voltage Scaling [figure: ideal vs. actual speedup and % communication overhead vs. number of processors N (1-8); supply voltage at fixed throughput vs. N; normalized power vs. N, showing a ~3.3x power reduction]

  16. Low Power Software Techniques

  17. Low-power Software • Wireless industry → constantly evolving standards • Systems have to be flexible and adaptable • A significant portion of system functionality is implemented as software running on a programmable processor • Software drives the underlying hardware • Hence, it can significantly impact system power consumption • Significant energy savings can be obtained by clever software design

  18. Low-power Software Strategies [diagram: CPU - Cache - Memory] • Code running on the CPU → code optimizations for low power • Code accessing memory objects → s/w optimizations for memory • Data flowing on the buses → I/O coding for low power • Compiler-controlled power management

  19. Code Optimizations for Low Power • High-level operations (e.g. C statement) can be compiled into different instruction sequences • different instructions & ordering have different power • Instruction Selection • Select a minimum-power instruction mix for executing a piece of high level code • Instruction Packing & Dual Memory Loads • Two on-chip memory banks • Dual load vs. two single loads • Almost 50% energy savings

  20. Code Optimizations for Low Power (contd.) • Reorder instructions to reduce switching effect at functional units and I/O buses • E.g. Cold scheduling minimizes instruction bus transitions [Su94] • Operand swapping • Swap the operands at the input of multiplier • Result is unaltered, but power changes significantly! • Other standard compiler optimizations • Intermediate level: Software pipelining, dead code elimination, redundancy elimination • Low level: Register allocation and other machine specific optimizations • Use processor-specific instruction styles • e.g. on ARM the default int type is ~ 20% more efficient than char or short as the latter result in sign or zero extension • e.g. on ARM the conditional instructions can be used instead of branches

  21. Minimizing Memory Access Costs • Reduce memory accesses; make better use of registers • A register access consumes far less power than a memory access • Straightforward way: minimize the number of read-write operations • Cache optimizations • Reorder memory accesses to improve cache hit rates • Can use existing techniques for high-performance code generation

  22. Minimizing Memory Access Costs (contd.) • Loop optimizations such as loop unrolling and loop fusion also reduce memory power consumption • More effective: explicitly target minimizing switching activity on I/O busses and exploiting the memory hierarchy • Data allocation to minimize I/O bus transitions • e.g. mapping large arrays with known access patterns to main memory to minimize address bus transitions • works in conjunction with coding of address busses • Exploiting the memory hierarchy • e.g. organizing video and DSP data to maximize use of the higher (lower-power) levels of the memory hierarchy

  23. Energy Efficient I/O Encoding • The C of system busses is >> C inside chips • a large amount of power goes to I/O interfaces • 10-15% in µPs, 25-50% in FPGAs, 50-80% in logic • encoding bus data can reduce the power significantly • but need to handle the encoding/decoding cost (power, latency) [diagram: Subsystem #1 and Subsystem #2 connected by a bus, each with an encoder (ENC) and decoder (DEC), plus control lines]

  24. Examples • Compression to remove redundancy • Gray code on address busses • addresses usually increment sequentially by 1 • modified code that increments by 4 or 8 for word-oriented CPUs • T0 code for address busses • add a redundant INC line • INC=0: the address equals the bus lines • INC=1: Tx freezes the other bus lines, and Rx increments the previously transmitted address by a pre-agreed stride • Better than Gray code: asymptotically zero transitions for sequential accesses
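
The transition-count advantage of Gray coding on sequential addresses is easy to verify; a sketch using the standard binary-reflected Gray code over 64 sequential word addresses:

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def transitions(codes):
    """Total number of bus lines that toggle across a sequence of codes."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))

addrs = list(range(64))                         # 64 sequential word addresses
t_bin = transitions(addrs)                      # plain binary addressing
t_gray = transitions([gray(a) for a in addrs])  # Gray: exactly 1 toggle per step
print(t_bin, t_gray)
```

Gray coding toggles exactly one line per increment (63 toggles here vs. 120 for plain binary); the T0 code described above does even better on purely sequential streams, since the frozen lines never toggle at all.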

  25. Examples (contd.) • Bus-Invert Coding • transmit D or invert(D), whichever results in fewer transitions from the previously transmitted code • an extra signal indicates the polarity • performance • at most N/2 lines switch • on average: the code is optimal among 1-bit-redundancy codes • better for small N (25% savings for N=2, 18.2% for N=8, 14.6% for N=16) • partition into k sub-busses with k polarity bits • but then no longer optimal among redundant codes • Encode based on statistical analysis of bus traces • calculate spatio-temporal correlation (on-line or off-line)
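
A minimal sketch of bus-invert coding for an 8-bit bus, assuming the bus starts at all zeros; counting the polarity line in the transition total is an implementation choice made here for a fair comparison:

```python
def bus_invert(words, width=8):
    """Bus-invert coding: send each word as-is or complemented, whichever flips
    fewer bus lines; an extra polarity line signals the choice to the receiver."""
    mask = (1 << width) - 1
    prev, prev_pol = 0, 0          # bus assumed to start at all zeros
    encoded, total = [], 0
    for w in words:
        if bin(prev ^ w).count("1") > width // 2:
            w, pol = w ^ mask, 1   # inverting flips fewer data lines
        else:
            pol = 0
        total += bin(prev ^ w).count("1") + (pol ^ prev_pol)  # incl. polarity line
        encoded.append((w, pol))
        prev, prev_pol = w, pol
    return encoded, total

enc, total = bus_invert([0x00, 0xFF, 0x0F, 0xF0])
print(total)   # 7 transitions vs. 20 on the unencoded bus
```

The worst-case guarantee is visible in the decision rule: after encoding, at most width/2 data lines can ever switch per transfer.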

  26. Examples (contd.) • Mixed bus encoding T0_BI • Uses two redundant lines: INC and INV • Good for shared address/data busses • Use the SEL line of the bus to distinguish data from addresses • Use T0 when SEL indicates an address, BI otherwise • Choice depends on the type of bus • data busses: traffic resembles random white noise • address busses: spatio-temporal correlations

  27. Example: Normalized # of Transitions for Typical UNIX Files [chart from [Stan97]; unencoded bus = 100%]

  28. Power Management via Shutdown

  29. Shutdown for Energy Saving • Subsystems may have small duty factors • CPU, disk, and wireless interface are often idle • Huge difference between "on" & "off" power • Some low-power CPUs: StrongARM 400 mW (active) / 50 mW (idle) / 0.16 mW (sleep) • 2.5" hard disk [Harris95]: 1.35 W (idle, spinning) / 0.4 W (standby) / 0.2 W (sleep) / 4.7 W (start-up) [diagram: alternating Active "on" (Tactive) and Blocked "off" (Tblock) periods] • Ideal improvement = 1 + Tblock/Tactive

  30. Potential CPU Power Reduction in a Wireless X Terminal • 96-98% of the time is spent in the blocked state • Average time in the blocked state is short (<< a second)

  31. Generic Power-managed System • An abstract & flexible interface between power-manageable components (chips, disk driver, display driver, etc.) & the power manager • but need insight on how & when to power manage • the power management policy • Essentially the PM is a controller that needs to be synthesized • Components (service providers) with several internal states • corresponding to power and service levels • can be abstracted as a power state machine • power and service annotations on states • power and delay annotations on edges [diagram: the Power Manager observes the Service Requestor, the request Queue, and the Service Provider, and issues commands (on, off) to the Service Provider]

  32. Example: SA-1100 CPU • RUN: 400 mW • IDLE: 50 mW • CPU stopped when not in use • monitoring for interrupts • SLEEP: 0.16 mW • on-chip activity shut down [state machine: RUN↔IDLE transitions take about 10 µs each way; RUN→SLEEP about 90 µs; SLEEP→RUN about 160 ms]

  33. Example: Fujitsu MHF 2043 AT Disk [state machine] • Working: 2.2 W (spinning + I/O) • Idle: 0.95 W (spinning); "I/O done" moves Working→Idle, "read/write" moves Idle→Working • Sleep: 0.13 W (spinning stopped) • spin-up: 4.4 J, 1.6 s • shutdown: 0.36 J, 0.67 s

  34. Example: IBM Mobile Hard Drive

  35. When is DPM useful? • If Ttr = 0 and Ptr = 0, the DPM policy is trivial • stop a component when it is not needed • If, as is usual, Ttr ≠ 0 and Ptr ≠ 0 • shut down only when the idleness is going to be long enough to make it worthwhile • a complex decision if the time spent in a state is not deterministic [diagram: each Active "on" ↔ Blocked "off" transition costs power Ptr over time Ttr]

  36. Problems in Shutdown • Cost of restarting: a latency vs. power trade-off • increase in latency (response time) • e.g. time to save/restore CPU state, spin up the disk • increase in power consumption • e.g. higher start-up current in disks • When to shut down: optimal vs. idle-time threshold vs. predictive • When to wake up: optimal vs. on-demand vs. predictive • Cross-over point for shutdown to be effective

  37. Conventional Reactive Approach "Go to reduced-power mode after the user has been idle for a few seconds/minutes, and restart on demand" [timeline: alternating RUN (Trun[i]) and BLOCK (Tblock[i]) periods; each blocked period consists of an idle-wait overhead, the reduced-power mode, and a wake-up overhead]

  38. Predictive Shutdown Approach "Use computation history to predict whether Tblock[i] is large enough (Tblock[i] ≥ Tcost)" • Example of a heuristic: predict Tblock[i] ≥ Tcost if Trun[i] ≤ Ton_threshold • up to 20x power reduction with 3% slowdown on X server traces • compared to 2x with the non-predictive approach • Eliminates the power wasted while waiting for a time-out
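
The heuristic above can be stated in a few lines; the trace and both threshold values below are hypothetical, chosen only to illustrate the "short run burst → long blocked period" correlation it relies on:

```python
# Predictive rule from the slide: predict Tblock[i] >= Tcost (shut down at once)
# whenever the preceding run burst was short, i.e. Trun[i] <= Ton_threshold.
T_ON_THRESHOLD = 10.0   # ms, hypothetical tuning parameter
T_COST = 50.0           # ms, hypothetical breakeven idle length

# Hypothetical (Trun, Tblock) history: short bursts precede long blocked periods
trace = [(5, 200), (80, 15), (4, 500), (60, 10), (3, 300)]

correct = sum((t_run <= T_ON_THRESHOLD) == (t_block >= T_COST)
              for t_run, t_block in trace)
print(f"{correct}/{len(trace)} correct shutdown decisions")
```

On workloads where the correlation fails, mispredictions either waste energy (needless shutdown) or add latency, which is what motivates the adaptive techniques on later slides.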

  39. Pre-wakeup • System wakeup takes time, adversely hurting performance • One could pre-wake the system by predicting the occurrence of the next wakeup signal [timeline: reactively, a request arriving during sleep waits through the resume delay; with predictive pre-wakeup, resume completes before the request arrives]

  40. Breakeven Point • Breakeven point: the minimum idle time that makes it worthwhile to shut down • DPM is worthwhile when TBE < average Tidle
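
TBE follows from equating the energy of staying on for an idle period T with the energy of shutting down and waking back up, i.e. solving Pon·T = Etr + Psleep·(T − Ttr); a sketch using the Fujitsu disk numbers from slide 33 (the lumping of spin-up and shutdown into one transition cost is an assumption of this model):

```python
def breakeven_time(p_on, p_sleep, e_transition, t_transition):
    """Smallest idle time for which shutting down saves energy:
    solve P_on * T = E_tr + P_sleep * (T - T_tr) for T."""
    t_be = (e_transition - p_sleep * t_transition) / (p_on - p_sleep)
    return max(t_be, t_transition)   # cannot break even within the transition itself

# Fujitsu MHF 2043 AT numbers: 0.95 W idle, 0.13 W sleep,
# spin-up 4.4 J / 1.6 s, shutdown 0.36 J / 0.67 s
t = breakeven_time(p_on=0.95, p_sleep=0.13,
                   e_transition=4.4 + 0.36, t_transition=1.6 + 0.67)
print(f"T_BE = {t:.2f} s")
```

For this disk, idle periods shorter than about 5.4 s are not worth a spin-down, which is why the predictive and stochastic policies in the following slides matter.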

  41. DPM Approaches: Predictive • Exploit the correlation between the recent past & the future • Predict the idle time and schedule shutdown and/or wakeup accordingly • Static techniques • E.g. fixed timeout Tthreshold with on-demand wakeup • works when P(Tidle > Tthreshold + TBE | Tidle > Tthreshold) ≈ 1 • Tthreshold = TBE yields energy consumption no more than 2x worse than the ideal oracle policy • worst case when point activities are separated by Tidle = 2·TBE • Adaptive techniques • E.g. maintain a set of timeout values and track how successful each would have been • E.g. weighted timeouts, with weights based on performance relative to the oracle policy • E.g. increase and decrease the timeout based on its performance
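
The 2x bound for the fixed timeout Tthreshold = TBE can be checked by sweeping idle-period lengths, in a simplified model with zero sleep power and a single lump-sum wake cost (these simplifying assumptions are mine, not the slide's):

```python
P_ON, E_WAKE = 1.0, 10.0
T_BE = E_WAKE / P_ON              # breakeven time with zero sleep power

def cost_timeout(t_idle, t_threshold=T_BE):
    """Stay on for the timeout, then shut down and (later) pay the wake cost."""
    if t_idle <= t_threshold:
        return P_ON * t_idle
    return P_ON * t_threshold + E_WAKE

def cost_oracle(t_idle):
    """Off-line optimum: shut down immediately iff the idle period is long enough."""
    return min(P_ON * t_idle, E_WAKE)

ratios = [cost_timeout(t) / cost_oracle(t) for t in (x / 10 for x in range(1, 500))]
print(f"worst-case ratio = {max(ratios):.2f}")   # 2.00, reached just past T_BE
```

The worst case is an idle period barely longer than TBE: the policy first burns TBE·Pon waiting, then pays the wake cost the oracle would have paid alone.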

  42. DPM Approaches: Stochastic • Predictive approaches handle workload uncertainty • but assume deterministic response and transition times • The abstract system model introduces uncertainty • Predictive algorithms are based on a 2-state model of the system • Real-life systems have multiple power states • decide not only when to change state but also which state to go to • Stochastic approaches formulate DPM policy selection as optimization under uncertainty • Service requestor (SR): a Markov chain with state set R that models the arrival of service requests • Service provider (SP): a controlled Markov chain with S states that models the system. The states represent modes of operation, and transitions are probabilistic; the probabilities are controlled by the power manager. • Power manager (PM): implements a function f: S × R → A from the states of the SP and SR to a set of possible commands A. Each function represents a decision process: the PM observes the state of the system and the workload, takes a decision, and issues a command to control the future state of the system • Cost metrics: associate power and performance values with each system state-command pair in S × R × A • Captures a global view of the system, with possibly multiple inactive states and resources • Performance and power are expected values

  43. Competitive Analysis • DPM is an inherently on-line problem • decisions must be made without seeing the entire input • e.g. there is no way of knowing the length of an idle period until it ends • The competitive ratio is a way to characterize solutions to such problems • it compares the cost of an on-line algorithm with that of an optimal off-line one that knows the input in advance (the "oracle" solution) • An algorithm is c-competitive if, for any input, the cost of the on-line approach is bounded by c times the cost of the optimal off-line approach for the same input • The competitive ratio (CR) of an algorithm is the infimum over all c such that the algorithm is c-competitive • Competitive analysis is done by case analysis of adversarial scenarios or via formal theorem proving • Provides assurance about worst-case performance • but this can be quite pessimistic!

  44. Classical Results on CR • The best CR achieved by any deterministic on-line algorithm is 2 • so the fixed timeout Tthreshold = TBE is optimal in this sense • Methods exist to determine an on-line DPM algorithm for a given idle-period distribution such that, for any distribution, the corresponding DPM algorithm is within a factor of e/(e−1) ≈ 1.58 of the optimal off-line algorithm • The result is tight: there is at least one distribution for which the ratio is exactly e/(e−1)

  45. Multi-state DPM with Optimal CR • Let there be k+1 states • Let state k be the shut-down state and state 0 the active state • Let αi be the energy dissipation rate (power) in state i • Let βi be the total energy dissipated in moving from state i back to state 0 • States are ordered such that αi+1 ≤ αi • αk = 0 and β0 = 0 (without loss of generality) • The power-down energy cost can be incorporated into the power-up cost for analysis (if additive) • Now formulate an optimization problem to determine the state-transition thresholds

  46. Lower Envelope Idea • For each state i, plot energy vs. idle time t as the line αi·t + βi; the strategy follows the lower envelope of these lines (State 1 … State 4), transitioning to the next state at the intersection times t1, t2, t3 • LEA can be deterministic or probabilistic • DLEA is 2-competitive, while PLEA is e/(e−1)-competitive • Learn p(t): On-line Probability Based Algorithm (OPBA) • keep a histogram of the previous w idle intervals and calculate the thresholds from it
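
The lower-envelope thresholds are just the pairwise intersections of the per-state lines αi·t + βi; a sketch with hypothetical (αi, βi) values (not from any real device):

```python
# Hypothetical states as (alpha_i = power rate, beta_i = energy to return to state 0),
# ordered from active (high power, zero wake-up cost) to fully off
states = [(4.0, 0.0), (2.0, 3.0), (1.0, 8.0), (0.0, 20.0)]

def lea_thresholds(states):
    """Deterministic Lower Envelope: descend to state i+1 at the time where the
    line beta_{i+1} + alpha_{i+1}*t crosses below state i's line."""
    return [(b2 - b1) / (a1 - a2)
            for (a1, b1), (a2, b2) in zip(states, states[1:])]

print(lea_thresholds(states))   # thresholds t1, t2, t3 of the lower envelope
```

With αi decreasing and βi increasing, these intersection times come out in increasing order, so the policy simply descends through the states as the idle period stretches on.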

  47. Implementing DPM • Clock gating • Supply shutdown • Display shutdown • Motor shutdown

  48. Shutdown vs. Variable Voltage

  49. Voltage Reduction is Better • Example: a task with a 100 ms deadline requires 50 ms of CPU time at full speed • the normal system gives 50 ms of computation, then 50 ms of idle/stopped time • a half-speed/half-voltage system gives 100 ms of computation and 0 ms of idle • same number of CPU cycles, but the energy drops to 1/4 [diagram: tasks T1 and T2 stretched to fill their periods at lower speed — same work, lower energy]
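
The 1/4 figure follows from E = C·V² per switched cycle with an unchanged cycle count; a sketch with assumed capacitance and clock values (idle and leakage power of the full-speed system are ignored here, which actually favors it):

```python
CYCLES = 50e6            # cycles of work (50 ms at an assumed 1 GHz full-speed clock)
C_EFF, V_FULL = 1e-9, 3.3

def energy(v, cycles=CYCLES, c=C_EFF):
    """Switching energy E = C * V^2 per cycle; idle/leakage power ignored."""
    return c * v**2 * cycles

e_full = energy(V_FULL)       # run at full speed, then sit idle to the deadline
e_half = energy(V_FULL / 2)   # half voltage at half speed, finishing at the deadline
print(e_half / e_full)        # 0.25 -- same cycle count, a quarter of the energy
```

This is the core argument for dynamic voltage scaling: slowing down only wastes the slack that shutdown-based policies would have spent idling anyway.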

  50. Problem with Voltage Reduction • The voltage is dictated by the tightest (critical) timing constraint • not a problem if latency is unimportant • throughput can always be improved by pipelining, parallelism, etc. • but real systems have bursty throughput and latency-critical tasks • Solution: dynamically vary the voltage!
