Lecture 27 Power Aware Architecture Design

CS 15-447: Computer Architecture • Lecture 27: Power Aware Architecture Design • November 24, 2007 • Nael Abu-Ghazaleh • naelag@cmu.edu • http://www.qatar.cmu.edu/~msakr/15447-f08

Uniprocessor Performance (SPECint) • [Figure: uniprocessor SPECint performance over time, annotated "3X"; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006] • VAX: 25%/year 1978 to 1986 • RISC + x86: 52%/year 1986 to 2002 • RISC + x86: ??%/year 2002 to present • Sea change in chip design: what is emerging?

Three Walls • ILP Wall: • Not enough parallelism available in one thread • Very costly to find more • Implications: can't continue to grow IPC • VLIW? SIMD ISA extensions? • Memory Wall: • Growing gap between DRAM and processor speed • Caching helps, but only so much • Implications: cache misses are getting more expensive • Multithreaded processors? • Physics/Power Wall: • Can't continue to shrink devices; running into physical limits • Power dissipation is also increasing (more today) • Implications: can't rely on a performance boost from shrinking transistors • But we will continue to get more transistors

Multithreaded Processors • What support is needed? • I can use it to help ILP as well • Which designs help ILP in the picture to the right?

Power-Efficient Processor Design • Goals: • Understand why energy efficiency is important • Learn the sources of energy dissipation • Overview a selection of approaches to reduce energy

Why Worry About Power? • Embedded systems: • Battery life • High-end processors: • Cooling (costs $1 per chip per Watt if operating @ >40 W) • Power cost: 15 cents per kilowatt-hour (kWh) • A single 900 W server costs about 100 USD/month to run, not including cooling costs! • Packaging • Reliability

Why Worry About Power: Oak Ridge National Lab's Jaguar • Current highest-performance supercomputer • 1.3 petaflops sustained (1 petaflop = one quadrillion FP operations per second) • 45,000 processors, each a quad-core AMD Opteron • 180,000 cores! • 362 terabytes of memory; 10 petabytes of disk space • Check top500.org for a list of the most powerful supercomputers • Power consumption (without cooling)? • 7 megawatts! • 0.75 million USD/month to power • green500.org rates computers based on flops/Watt
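
The two cost figures quoted above (about 100 USD/month for a 900 W server, about 0.75 million USD/month for a 7 MW machine) can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 30-day month of continuous operation at the 15 cents/kWh rate from the slide; the function and constant names are chosen here for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30 days of continuous operation
PRICE_PER_KWH = 0.15       # USD per kWh, the rate quoted on the slide


def monthly_cost_usd(power_watts: float) -> float:
    """Electricity cost of running a constant load for one month."""
    kilowatt_hours = power_watts / 1000 * HOURS_PER_MONTH
    return kilowatt_hours * PRICE_PER_KWH


server_cost = monthly_cost_usd(900)        # single 900 W server
jaguar_cost = monthly_cost_usd(7_000_000)  # 7 MW supercomputer, no cooling

print(f"server: {server_cost:.2f} USD/month")
print(f"Jaguar: {jaguar_cost:,.0f} USD/month")
```

Under these assumptions the server comes to roughly 97 USD and the 7 MW machine to roughly 756,000 USD per month, consistent with the rounded numbers on the slides.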

Peak Power in Today’s CPUs • Alpha 21264: 95 W • AMD Athlon XP: 67 W • HP PA-8700: 75 W • IBM Power 4: 135 W • Intel Itanium: 130 W • Intel Xeon: 59 W • Even worse when we consider power density (W/cm²)

Where Is This Power Coming From? • Sources of power consumption in CMOS: • Dynamic or active power (due to the switching of transistors) • Short-circuit power • Leakage power • High temperature increases power consumption • Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…

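Power Consumption in CMOS • Dynamic Power Consumption • Charging and discharging capacitors • [Figure: two CMOS inverters driving a load capacitance C from Vdd; one with In = 0, Out = 1 (charging), one with In = 1, Out = 0 (discharging)] • Energy per transition: $E = CV^2$ • $P = E \cdot f = C V^2 f$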

Dynamic Power Consumption • $P = \alpha C V^2 f$ • Capacitance ($C$): function of wire length, transistor size • Supply voltage ($V$): has been dropping with successive process generations • Clock frequency ($f$): increasing • Activity factor ($\alpha$): how often do wires switch
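
The relation on this slide (power as the product of activity factor, capacitance, supply voltage squared, and clock frequency) can be sketched as a small function. The symbol `alpha` for the activity factor and the operating point below are assumptions made for illustration and are not taken from any real chip.

```python
def dynamic_power(alpha: float, capacitance_f: float, vdd: float, freq_hz: float) -> float:
    """Dynamic power in watts: activity factor * C * V^2 * f."""
    return alpha * capacitance_f * vdd ** 2 * freq_hz


# Hypothetical operating point: 10% activity, 50 nF switched capacitance.
base = dynamic_power(alpha=0.1, capacitance_f=50e-9, vdd=1.2, freq_hz=2e9)
half_voltage = dynamic_power(alpha=0.1, capacitance_f=50e-9, vdd=0.6, freq_hz=2e9)
half_frequency = dynamic_power(alpha=0.1, capacitance_f=50e-9, vdd=1.2, freq_hz=1e9)

print(f"baseline:       {base:.2f} W")
print(f"half voltage:   {half_voltage / base:.2f} of baseline")
print(f"half frequency: {half_frequency / base:.2f} of baseline")
```

Halving the voltage cuts dynamic power to a quarter, while halving the frequency only cuts it in half, which is why the supply voltage is the most attractive knob.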

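Power Consumption in CMOS • Short-circuit power • Both PMOS and NMOS are conducting • [Figure: CMOS inverter with the input at the 1/2 level, short-circuit current Isc flowing from Vdd, load capacitance C] • About 2% of the overall power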

Power Consumption in CMOS • Leakage power: transistors are not perfect switches and they leak • [Figure: CMOS inverter with In = 0, Out = 1, leakage current Isub, load capacitance C] • 20% now, expect 40% in next technology and growing

Cooling • All of the consumed power has to be dissipated • Done by means of heat pipes, heat sinks, fans, etc. • Different segments use different cooling mechanisms. • Costs $1-$3 or more per chip per Watt if operating @ >40W • We may soon need budgets for liquid-cooling or refrigeration hardware.

Voltage Scaling • Transistor switches slower at lower voltage. • Leakage current grows exponentially with decreases in threshold voltage • Leakage power goes through the roof

Technology Scaling: the Enabler • New process generation every 2-3 years • Ideal shrink for 30% reduction in size: • Voltage scales down by 30% • Gate delays are shortened by 30% • ~50% frequency gain (500 ps cycle = 2 GHz clock, 333 ps cycle = 3 GHz clock) • Transistor density increases by 2X • 0.7X shrink on a side, 2X area reduction • Capacitance/transistor reduced by 30%

Ideal Process Shrink: the Results • 2/3 reduction in energy/transition ($CV^2$: $0.7 \times 0.7^2 = 0.34\times$) • 1/2 reduction in power ($CV^2 f$: $0.7 \times 0.7^2 \times 1.5 = 0.5\times$) • But twice as many transistors, or more if area increases • Power density unchanged • Looks good!
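
The ideal-shrink arithmetic above can be reproduced directly. This is a minimal sketch that uses the slide's scaling factors (0.7X for capacitance and voltage, 1.5X for frequency); the variable names are chosen here for illustration.

```python
shrink = 0.7      # capacitance and voltage each scale by 0.7X
freq_gain = 1.5   # the ~50% frequency gain from the previous slide

energy_ratio = shrink * shrink ** 2                # C * V^2 per transition
power_per_transistor = energy_ratio * freq_gain    # C * V^2 * f
density_gain = 1 / shrink ** 2                     # transistors per unit area, about 2X
power_density_ratio = power_per_transistor * density_gain

print(f"energy/transition:    {energy_ratio:.3f}X")
print(f"power per transistor: {power_per_transistor:.3f}X")
print(f"power density:        {power_density_ratio:.2f}X")
```

Each transistor draws about half the power, but about twice as many fit in the same area, so the power density stays close to 1X.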

Process Technology – the Reality* • Performance does not scale with frequency • New designs increase frequency by 2X • New designs use 2X-3X more transistors to get 1.4X-1.8X performance* • So, every new process generation: • Power goes up by about 2X (3X transistors * 2X switches * 1/3 energy) • Leakage power is also increasing • Power density goes up 30%~80% (2X power / 1.X area) • Will get worse in future technologies, because voltage will scale down less • *Source: “Power – the Next Frontier: a Microarchitecture Perspective”, Ronny Ronen, keynote speech at the PACS’02 Workshop.

Ugly Numbers*

The Bottom Line • Circuits and process scaling alone can no longer solve all power problems • SYSTEMS must also be power-aware • OS • Compilers • Architecture • Techniques at the architectural level are needed to reduce the absolute power dissipation as well as the power density

Microarchitectural Techniques for Power Reduction

A Superscalar Datapath • Performance = N*f*IPC • [Figure: superscalar datapath with fetch (F1, F2), decode/dispatch (D1, D2), instruction dispatch and issue, issue queue (IQ), function units (FU1, FU2, ..., FUm), load/store queue (LSQ), reorder buffer (ROB), architectural register file (ARF), D-cache, and result/status forwarding buses] • Actually, it’s the whole system, but we focus on the processor

Microarchitectural Techniques: General Approach • Dynamic power: • Reduce the activity factor • Reduce the switching capacitance (usually not possible) • Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600 MHz, and its voltage dropped from 1.48 V to 0.95 V) • Leakage power: • Put some portions of the on-chip storage structures in a low-power standby mode, or even completely shut off the power supply to these partitions • Resizing • We usually give up some performance to save energy, but how much?
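
To get a feel for the Pentium M example above, the dynamic power at the low operating point can be compared with the high one. This sketch assumes that dynamic power scales as V²·f and that the activity factor and switched capacitance are the same at both points; leakage is ignored.

```python
def dynamic_power_ratio(v_low: float, v_high: float, f_low: float, f_high: float) -> float:
    """Ratio of dynamic power at the low point to the high point (V^2 * f scaling)."""
    return (v_low / v_high) ** 2 * (f_low / f_high)


# Operating points quoted on the slide: 1.6 GHz at 1.48 V versus 600 MHz at 0.95 V.
ratio = dynamic_power_ratio(v_low=0.95, v_high=1.48, f_low=600e6, f_high=1.6e9)
frequency_ratio = 600e6 / 1.6e9

print(f"frequency:     {frequency_ratio:.3f} of maximum")
print(f"dynamic power: {ratio:.3f} of maximum")
```

Under these assumptions the clock drops to 37.5% of its maximum while dynamic power drops to about 15%, so power falls much faster than performance.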

Guideline • If we reduce voltage, the maximum frequency (and performance) drops linearly • “The cube law”: $P = kV^3$ (a 1% drop in $V$ gives about a 3% drop in $P$) • If we use voltage scaling, we can approximately trade 1% of performance loss for 3% of power reduction • Any architectural technique that trades performance for power should do better than that (or at least as well). Otherwise, simple voltage scaling can be used to achieve better tradeoffs.
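
The guideline above can be turned into a quick check for any proposed technique. This is a sketch that assumes the cube law holds exactly and that performance tracks voltage linearly; the function names are illustrative.

```python
def voltage_scaling_saving(perf_loss: float) -> float:
    """Power saving obtained by voltage scaling for a given fractional performance loss."""
    return 1 - (1 - perf_loss) ** 3


def beats_voltage_scaling(perf_loss: float, power_saving: float) -> bool:
    """True if a technique saves at least as much power as voltage scaling would."""
    return power_saving >= voltage_scaling_saving(perf_loss)


baseline = voltage_scaling_saving(0.01)
print(f"voltage scaling at 1% performance loss saves {baseline:.2%} power")
print(beats_voltage_scaling(perf_loss=0.01, power_saving=0.05))  # worthwhile
print(beats_voltage_scaling(perf_loss=0.01, power_saving=0.02))  # not worthwhile
```

A technique that costs 1% of performance has to save about 3% of power or more to be worth implementing instead of simply lowering the voltage.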

Examples: Front-End Throttling • Speculation is used to increase performance • Wasted energy if it is wrong • Can we speculate only when we think we’ll be right? • Gating: temporarily prevent new instructions from entering the pipeline • Use gating to avoid speculating beyond branches with low prediction accuracy • The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long • Reported: 38% energy savings on wrong-path instructions with about 1% IPC loss
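
The gating rule described above can be sketched as a counter of unresolved low-confidence branches. This is an illustrative model only: the class name, the threshold of 2, and the idea of passing confidence in as a boolean are assumptions, and the confidence estimator itself is not modeled.

```python
class PipelineGate:
    """Gates fetch while too many low-confidence branches are unresolved."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold           # hypothetical gating threshold
        self.low_confidence_in_flight = 0    # unresolved low-confidence branches

    def on_branch_fetched(self, high_confidence: bool) -> None:
        if not high_confidence:
            self.low_confidence_in_flight += 1

    def on_branch_resolved(self, high_confidence: bool) -> None:
        if not high_confidence:
            self.low_confidence_in_flight -= 1

    def fetch_allowed(self) -> bool:
        # Fetch stays gated for as long as the count is at or above the threshold.
        return self.low_confidence_in_flight < self.threshold


gate = PipelineGate(threshold=2)
gate.on_branch_fetched(high_confidence=False)
gate.on_branch_fetched(high_confidence=False)
print(gate.fetch_allowed())   # gated: two low-confidence branches are pending
gate.on_branch_resolved(high_confidence=False)
print(gate.fetch_allowed())   # fetch resumes once one of them resolves
```

The counter answers both questions from the slide: when to gate (count reaches the threshold) and for how long (until enough branches resolve).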

Front-End Throttling (continued) • Just-in Time Instruction Delivery • Fetch stage is throttled based on the number of in-flight instructions. • If the number of in-flight instructions exceeds a predetermined threshold, the fetch is throttled • Threshold is adjusted through the “tuning cycle” • Reasons for energy savings: • Fewer instructions are processed along the mispredicted path • Instruction spends fewer cycles in the issue queue

Energy Reduction in the Register Files • General solutions: • Use of multi-banked RFs: each bank has fewer entries and fewer ports than the monolithic RF • Problems: • Possible bank conflicts -> IPC loss • Overhead of the port arbitration logic • Use of smaller cache-like structures to exploit access locality

Energy Reduction in the Register Files • Value Aging Buffer • At the time of writeback, the results are written into a FIFO-style cache called VAB • The RF is updated only when the values are evicted from the VAB. • In many situations, this can be avoided because a register may be deallocated during its residency in the VAB • If a register is read from the VAB, there is no need to access the RF. • Some performance loss due to the sequential access to the VAB and the RF.
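
The Value Aging Buffer behavior described above can be modeled in a few lines. This is a simplified sketch, not the published design: the class and counter names are invented here, the buffer size is arbitrary, and timing (the sequential VAB-then-RF access) is not modeled.

```python
from collections import OrderedDict


class ValueAgingBuffer:
    """FIFO-style buffer in front of the register file (RF)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # register -> value, oldest entry first
        self.rf = {}                  # backing register file
        self.rf_writes = 0
        self.rf_reads = 0

    def writeback(self, reg: str, value: int) -> None:
        self.entries.pop(reg, None)
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            # Only an evicted value is written into the RF.
            old_reg, old_value = self.entries.popitem(last=False)
            self.rf[old_reg] = old_value
            self.rf_writes += 1

    def deallocate(self, reg: str) -> None:
        # A register freed while still in the VAB never costs an RF write.
        self.entries.pop(reg, None)
        self.rf.pop(reg, None)

    def read(self, reg: str) -> int:
        if reg in self.entries:
            return self.entries[reg]  # VAB hit: no RF access
        self.rf_reads += 1
        return self.rf[reg]


vab = ValueAgingBuffer(capacity=2)
vab.writeback("P1", 10)
vab.writeback("P2", 20)
vab.deallocate("P1")        # P1 dies inside the VAB
vab.writeback("P3", 30)
vab.writeback("P4", 40)     # evicts P2 into the RF
print(vab.rf_writes, vab.read("P2"), vab.rf_reads)
```

Four results were produced but only one RF write happened, which is where the energy saving comes from.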

Isolation of short-lived operands

Out-of-Order Execution and In-Order Retirement • [Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), and in-order retirement (ROB, ARF)]

Register Renaming • Used to cope with false data dependencies • A new physical register is allocated for EVERY new result • P6 style: ROB slots serve as physical registers • Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4 • Renamed code: LOAD P31, P2, 100; SUB P32, P31, P3; ADD P33, P32, P4

Register Renaming: the Implementation • The Register Alias Table (RAT), also called the Rename Table (RT), maintains the mappings between logical and physical registers • Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4 • Renamed code, built up one instruction at a time: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4
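
The renaming walk-through above can be sketched with a dictionary acting as the alias table. This is a minimal model under stated assumptions: physical registers are handed out sequentially starting at P31 (the number used on the slides), sources with no mapping keep their logical names as in the slides, and there is no free list or register deallocation.

```python
def rename(program, first_free=31):
    """Rename destinations to fresh physical registers using an alias table."""
    rat = {}                 # alias table: logical register -> physical register
    next_free = first_free
    renamed = []
    for op, dest, *sources in program:
        # Sources are looked up before the destination mapping is updated.
        new_sources = [rat.get(src, src) for src in sources]
        rat[dest] = f"P{next_free}"
        next_free += 1
        renamed.append((op, rat[dest], *new_sources))
    return renamed


program = [
    ("LOAD", "R1", "R2", "100"),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]

for instruction in rename(program):
    print(instruction[0], ", ".join(instruction[1:]))
```

Note that R1 is mapped twice (first to P31, then to P33), which removes the false dependence between the LOAD and the ADD.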

Short-Lived Values • Definition: a value is short-lived if its destination register has already been renamed by the time the result is generated • Identified one cycle before the result writeback • A large percentage of all generated results are short-lived for SPEC 2000 benchmarks • Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4 • After the renamer: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4 • Example: the LOAD result (P31) is short-lived if the ADD, which redefines R1, is renamed before the LOAD produces its value

Percentage of Short-Lived Values • 96-entry ROB, 4-way processor • [Figure: percentage of short-lived values]

Why Keep Them? • Reasons for maintaining short-lived values: • Recovering from branch mispredictions • Reconstructing precise state if interrupts or exceptions occur • Original code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4 • Renamed code: LOAD P31, R2, 100; SUB P32, P31, R3; ADD P33, P32, R4

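Energy-Dissipating Events • [Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), and in-order retirement, with the energy-dissipating accesses marked: two writes and one read involving the ROB and ARF]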

Isolating Short-Lived Values: the Idea • Write short-lived values into a small dedicated RF (SRF) • [Figure: the same pipeline (in-order front end F, R, D; out-of-order core with Inst. Queue and Ex; in-order retirement) with an SRF added next to the ROB and ARF] • Example code: LOAD R1, R2, 100; SUB R5, R1, R3; ADD R1, R5, R4

Energy Reduction in Caches • Dynamically resizable caches • Dynamically estimates the program requirements and adapts to the required cache size • Cache is upsized or downsized at the end of periodic intervals based on the value of the cache miss counter • Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors • A bit mask is used to specify the number of address bits that are used for indexing into the set • The cache size always changes by a factor of two
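
The index masking and power-of-two resizing described above can be sketched as follows. This is an illustrative model: the class name, the set counts, the block size, and the policy of comparing the miss counter against a single bound are all assumptions, and putting sets into a low-leakage mode is only implied by shrinking the active range.

```python
class ResizableCacheIndex:
    """Computes set indices for a cache whose active size changes by factors of two."""

    def __init__(self, max_sets: int = 1024, min_sets: int = 64, block_bytes: int = 32):
        # All three values are assumed to be powers of two.
        self.max_sets = max_sets
        self.min_sets = min_sets
        self.offset_bits = block_bytes.bit_length() - 1
        self.active_sets = max_sets

    @property
    def index_mask(self) -> int:
        # The bit mask selects how many address bits index into the sets.
        return self.active_sets - 1

    def set_index(self, address: int) -> int:
        return (address >> self.offset_bits) & self.index_mask

    def end_of_interval(self, misses: int, miss_bound: int) -> None:
        if misses > miss_bound and self.active_sets < self.max_sets:
            self.active_sets *= 2     # upsize: wake the next block of sets
        elif misses < miss_bound and self.active_sets > self.min_sets:
            self.active_sets //= 2    # downsize: higher-numbered sets go to sleep


cache = ResizableCacheIndex(max_sets=8, min_sets=2, block_bytes=32)
address = 7 * 32
print(cache.set_index(address))                 # full size: uses all 3 index bits
cache.end_of_interval(misses=1, miss_bound=10)  # few misses: halve the cache
print(cache.active_sets, cache.set_index(address))
```

After downsizing, the same address maps into the lower half of the sets because one fewer address bit survives the mask.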

Energy Reduction within the Execution Units • Gating off portions of the execution units • Disables the upper bits of the ALUs where they are not needed (for small operands) • Energy can be reduced by 54% for integer programs • Packaging multiple narrow-width operations in a single ALU in the same cycle • Steering instructions to FUs based on the criticality information • Critical instructions are steered to fast and power-hungry execution units, non-critical instructions are steered to slow and power-efficient units

Encoding Addresses for Low Power • Using Gray code for the addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994) • Exploits the observation that programs often generate consecutive addresses • Gray code: there is only a single transition on the address bus when consecutive addresses are accessed • A 37% reduction in switching activity is reported • A Gray code encoder is placed at the transmitting end of the bus, and a decoder is needed at the receiving end
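
The single-transition property for consecutive addresses can be verified with the standard reflected binary Gray code. This sketch counts bit toggles on a hypothetical 4-bit address bus stepping through addresses 0 to 15; the helper names are chosen here for illustration.

```python
def to_gray(n: int) -> int:
    """Encoder at the transmitting end of the bus."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Decoder at the receiving end of the bus."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(sequence) -> int:
    """Total number of bus lines that toggle across a sequence of bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))
binary_toggles = transitions(addresses)
gray_toggles = transitions([to_gray(a) for a in addresses])

print(f"binary: {binary_toggles} toggles, Gray: {gray_toggles} toggles")
```

For this purely sequential stream the Gray-coded bus toggles exactly one line per access; real address streams are only partly sequential, so the measured savings are smaller.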

Encoding Data for Low Power • Bus-invert encoding • Uses redundancy to reduce the number of transitions • Adds one line to the bus to indicate if the actual data or its complement is transmitted • If the Hamming distance between the current value and the previous one is less than or equal to (n/2) (for n bits), the value is transmitted as such and the value of 0 is transmitted on the extra line. • Otherwise, the complement of the value is transmitted and the extra line is set to 1 • The average number of bus transitions per clock cycle is lowered by 25% as a result
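
The bus-invert rule above can be sketched as an encoder and decoder pair. This is an illustrative model with two assumptions that the slide does not spell out: the Hamming distance is measured against the value currently on the data lines (that is, the previously transmitted word), and the bus starts at all zeros.

```python
def bus_invert_encode(values, width: int = 8):
    """Return (data_lines, invert_line) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    bus = 0  # assumption: data lines start at all zeros
    encoded = []
    for value in values:
        distance = bin((value ^ bus) & mask).count("1")
        if 2 * distance <= width:
            bus, invert = value & mask, 0    # send the value as is
        else:
            bus, invert = ~value & mask, 1   # send the complement
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(word: int, invert: int, width: int = 8) -> int:
    mask = (1 << width) - 1
    return (~word & mask) if invert else word


data = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(data)
decoded = [bus_invert_decode(word, invert) for word, invert in encoded]

for (word, invert), original in zip(encoded, data):
    print(f"data {original:#04x} -> bus {word:#04x}, invert line {invert}")
```

With this rule, no transfer ever toggles more than half of the data lines, at the cost of one extra line that can itself toggle.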

OS and Compiler Techniques • Can the compiler help? • Can the OS help? • E.g., control voltage scaling • Control turning off devices
