Introduction to Energy Aware Computing
Explore the advancements and trends in energy-aware computing, from high-end processors consuming 10 kW to low-end processors delivering higher performance with lower energy consumption. Learn about power efficiency, top supercomputers' energy profiles, and strategies for reducing power consumption at all design levels.
Introduction to Energy Aware Computing
Henk Corporaal, www.ics.ele.tue.nl/~heco
ASCI Winterschool on Energy Aware Computing, Soesterberg, March 2012
Intel Trends
• Core i7: 3 GHz, 100 W
• #transistors follows Moore's law
• but frequency and performance per core do not
Types of compute systems
A 20nm scenario (high end processor)
• This means:
  • a 2 cm² processor consumes 10 kW
  • a bound of 100 W allows only 1% of the chip to be active: dark silicon
Intel's answer: 48-core x86
Power versus Energy
• Power $P = \alpha f C V_{dd}^2$
  • $\alpha$ switching activity (<1); $f$ frequency; $C$ switching capacitance; $V_{dd}$ supply voltage
  • heat / temperature constraint
  • wear-out
  • peak power delivery constraint
• Energy $E = P \cdot t$ or, for time-varying P: $E = \int P(t)\,dt$
  • battery life
  • cost: electricity bill
• Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation)
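The note about leakage can be checked with a small numeric sketch. It assumes a fixed number of cycles for the task, a constant leakage power, and an unchanged supply voltage; all parameter values below are hypothetical and chosen only to make the effect visible.

```python
def power_and_energy(alpha, f_hz, c_farad, vdd, p_leak_w, cycles):
    """Return (total power in W, total energy in J) for a task of `cycles` cycles.

    Dynamic power follows P = alpha * f * C * Vdd^2; leakage is modelled
    as a constant static power (an assumption of this sketch).
    """
    p_dynamic = alpha * f_hz * c_farad * vdd ** 2
    p_total = p_dynamic + p_leak_w
    run_time_s = cycles / f_hz
    return p_total, p_total * run_time_s


# Hypothetical operating point: 1 nF switched capacitance, 1 V, 0.1 W leakage.
p_fast, e_fast = power_and_energy(0.2, 1.0e9, 1e-9, 1.0, 0.1, 1e9)
p_slow, e_slow = power_and_energy(0.2, 0.5e9, 1e-9, 1.0, 0.1, 1e9)

print(f"f = 1.0 GHz: P = {p_fast:.2f} W, E = {e_fast:.2f} J")
print(f"f = 0.5 GHz: P = {p_slow:.2f} W, E = {e_slow:.2f} J")
```

Halving the frequency lowers the power, but the task runs twice as long, so the leakage term doubles and the total energy goes up. Energy only drops if the lower frequency is also used to lower Vdd.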
What's happening at the top
Top500 nr 1
• 1st: K Computer
  • 10.51 Petaflop/s on Linpack
  • 705024 SPARC64 cores (8 per die; 45 nm) (Fujitsu design)
  • Tofu interconnect (6-D torus)
  • 12.7 MegaWatt
Top500 nr 2
• 2nd: Chinese Tianhe-1A
  • 2.57 Petaflop/s
  • 186368 cores (Xeon + NVIDIA proc)
  • 4.0 MegaWatt
What's happening at the low end
• March 14, 2012: ARM announced the Cortex-M0+
• "The 32-bit Cortex-M0+ consumes just 9µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance"
• 2-stage pipeline
• option: 1-cycle MUL
Low end: How much energy in the air? [Rabaey 2009]
Computational efficiency (Mops/mW): what do we need?
[Figure: performance (Gops) versus power (Watts), with lines of constant power efficiency (Mops/mW). Plotted designs: IBM Cell, Imagine, VIRAM, Pentium M, TI C6X, SODA (90 nm), SODA (65 nm). Plotted requirements: 3G Wireless, 4G Wireless, Mobile HD Video. Source: Woh e.a., ISCA 2009]
This means 1 pJ / operation or 1 TeraOp/Watt
Green500: Top 10 in green supercomputing
Green500: evolution
• 2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation
• 2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation
  • Cell cluster, ranking 110 in Top500
• 2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation
  • IBM BlueGene/Q prototype 1, ranking 101 in Top500, peak performance: 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
• 2011: best result = 2097 MFlops/Watt => 477 pJ / floating-point operation
  • IBM BlueGene/Q prototype 2
  • power consumption: 41 kW / peak 85 TFlop/s
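The conversion from MFlops/Watt to energy per operation used on this slide is a plain reciprocal, since 1 Watt is 1 Joule per second. A minimal sketch (the function name is made up for illustration):

```python
def nanojoules_per_flop(mflops_per_watt):
    """Energy per floating-point operation in nJ.

    MFlops/Watt equals MFlop per Joule, so the energy per operation
    is the reciprocal, rescaled from Joule to nanojoule.
    """
    flops_per_joule = mflops_per_watt * 1e6
    return 1e9 / flops_per_joule


# Best Green500 results quoted on the slide.
for year, efficiency in [(2008, 536), (2009, 723), (2010, 1684), (2011, 2097)]:
    print(f"{year}: {efficiency} MFlops/Watt -> {nanojoules_per_flop(efficiency):.3f} nJ/flop")
```

The results agree with the figures on the slide to within rounding.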
Energy cost
At ~$1M per MW, energy costs are substantial
• 1 petaflop in 2010 uses 3 MW
• 1 exaflop in 2018 possible in 200 MW with “usual” scaling
• 1 exaflop in 2018 at 20 MW is the DOE (Department of Energy) target
• see also the MontBlanc EU project: www.montblanc-project.eu
  • goal: 200 PFlops for 10 MWatt in 2017
[Figure: normal scaling versus desired scaling; from: Katy Yelick, Berkeley]
Reducing power @ all design levels
• Algorithmic level
• Compiler level
• Architecture level
• Organization level
• Circuit level
• Silicon level
• Important concepts:
  • Lower Vdd and freq. (even if errors occur) / dynamically adapt Vdd and freq.
  • Reduce circuit
  • Exploit locality
  • Reduce switching activity, glitches, etc.
$P = \alpha f C V_{dd}^2$; $E = \int P\,dt$; $E/\text{cycle} = \alpha C V_{dd}^2$
Algorithmic level
• The best indicator for energy is ... the number of cycles
• Try alternative algorithms with lower complexity
  • E.g. quick-sort, O(n log n), instead of bubble-sort, O(n²)
  • ... but be aware of the 'constant': O(n log n) means c*(n log n)
• Heuristic approach
  • Go for a good solution, not the best!
Biggest gains at this level!
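To get a feeling for the gap between the two complexity classes, the sketch below counts element comparisons as a rough stand-in for cycles. That proxy, the input size, and the random seed are assumptions of this example; real cycle counts also depend on the 'constant' and on memory behaviour.

```python
import random


def bubble_sort_comparisons(values):
    """Sort a copy with bubble sort; return (sorted list, number of comparisons)."""
    a = list(values)
    count = 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            count += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a, count


def quick_sort_comparisons(values):
    """Sort a copy with a simple quick-sort; return (sorted list, comparisons).

    Each element is counted as compared once against the pivot per partition step.
    """
    a = list(values)
    if len(a) <= 1:
        return a, 0
    pivot, rest = a[0], a[1:]
    smaller, count_s = quick_sort_comparisons([x for x in rest if x < pivot])
    larger, count_l = quick_sort_comparisons([x for x in rest if x >= pivot])
    return smaller + [pivot] + larger, len(rest) + count_s + count_l


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.randrange(10_000) for _ in range(1000)]
_, bubble = bubble_sort_comparisons(data)
_, quick = quick_sort_comparisons(data)
print(f"bubble-sort: {bubble} comparisons, quick-sort: {quick} comparisons")
```

For 1000 random elements bubble-sort always needs n(n-1)/2 comparisons, while quick-sort needs far fewer on average.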
Compiler level
• Source-to-source transformations
  • loop trafos to improve locality
• Strength reduction
  • E.g. replace Const * A with adds and shifts
• Replace floating point with fixed point
• Reduce register pressure / number of accesses to register file
  • Use software bypassing
• Scenarios: current workloads are highly dynamic
  • Determine and predict execution modes
  • Group execution modes into scenarios
  • Perform special optimizations per scenario
    • DVFS: Dynamic Voltage and Frequency Scaling
    • More advanced loop optimizations
• Reorder instructions to reduce bit transitions
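As an illustration of strength reduction, the sketch below replaces a multiplication by a non-negative constant with one shift per set bit of the constant, followed by additions. The helper names are made up for this example; a real compiler may pick cheaper decompositions, for instance ones that also use subtraction.

```python
def shift_add_terms(constant):
    """Bit positions k that are set in a non-negative constant.

    constant * a then equals the sum of (a << k) over these positions.
    """
    if constant < 0:
        raise ValueError("this sketch only handles non-negative constants")
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def multiply_by_constant(a, constant):
    """Compute constant * a using only shifts and adds."""
    return sum(a << k for k in shift_add_terms(constant))


# 10 = 0b1010, so 10 * a becomes (a << 1) + (a << 3).
print(shift_add_terms(10))
print(multiply_by_constant(7, 10))
```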
Architecture level
• Going parallel
• Going heterogeneous
  • tune your architecture, exploit SFUs (special function units)
  • trade-off between flexibility / programmability / genericity and efficiency
• Add local memories
  • prefer scratchpad instead of cache
• Cluster FUs and register files (see next slide)
• Reduce bit-width
  • sub-word parallelism (SIMD)
Organization (micro-arch.) level
• Enabling Vdd reduction
  • Pipelining
    • cheap way of parallelism
    • enabling lower freq. => lower Vdd
    • Note 1: don't pipeline if you don't need the performance
    • Note 2: don't exaggerate (like the 31-stage Pentium 4)
• Reduce register traffic
  • avoid unnecessary reads and writes
  • make bypass registers visible
Circuit level
• Clock gating
• Power gating
• Multiple Vdd modes
• Reduce glitches: balancing digital paths
• Exploit zeros
• Special SRAM cells
  • normal SRAM cannot scale below Vdd = 0.7 - 0.8 Volt
• Razor method; replay
• Allow errors and add redundancy to architecturally invisible structures
  • branch predictor
  • caches
• ... and many more
Silicon level
• Higher Vt (V_threshold)
• Back biasing control
  • see thesis Maurice Meijer (2011)
• SOI (Silicon on Insulator)
  • silicon junction is above an electrical insulator (silicon dioxide)
  • lowers parasitic device capacitance
• Better transistors: FinFET
  • multi-gate
  • reduce leakage (off-state current)
• ... and many more
Wait for lectures of Pineda on Friday
Let's detail a few examples
• Algorithmic level
  • Exploiting locality
• Compiler level
  • Software bypassing
• Architecture level
  • Going parallel
• Organization level
  • Razor
• Circuit level
  • Exploit zeros in a multiplier
• Silicon level
  • Sub-threshold
Algorithm level: Exploiting locality
[Figure: generic platform with memory hierarchy levels 1 to 4: on chip, CPUs and HW accelerators with ICache, DCache and local memories on on-chip busses; L2 cache; main memory; disks on a SCSI bus. Levels are connected through busses, bus interfaces and bridges]
Data transfer and storage power
Power(memory) = 33 × Power(arithmetic)
Loop transformations
• Loop transformations
  • improve regularity of accesses
  • improve temporal locality: production => consumption
• Expected influence
  • reduce temporary storage and (anticipated) background storage
• Work horse: Loop Merging
  • typically many enabling trafos needed before you can merge loops
Loop transformations: Merging
Before:
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j],A[j]);
After merging:
  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = f(B[i],A[i]);
  }
Locality improved!
[Figure: location versus time plots of production and consumption(s), before and after merging]
Loop transformations: Example
Before:
  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[i]);
After merging:
  for (i=0; i<N; i++){
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  }
[Figure: CPU with foreground memory, connected through an external memory interface to background memory. Annotations: N cyc., 2N cyc., N cyc.; 2 background ports versus 1 backgr. + 1 foregr. ports]
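The merging step can be tried out directly. In the sketch below `f` and `g` are arbitrary placeholder functions, and the merged version additionally assumes that B is not needed after the loop, so the intermediate value can stay in a scalar (foreground storage) instead of an array.

```python
def f(x):
    """Placeholder producer function (hypothetical)."""
    return x + 1


def g(x):
    """Placeholder consumer function (hypothetical)."""
    return 2 * x


def separate_loops(a):
    """Two loops: all of B is produced before any of it is consumed."""
    n = len(a)
    b = [0] * n
    for i in range(n):
        b[i] = f(a[i])
    c = [0] * n
    for i in range(n):
        c[i] = g(b[i])
    return c


def merged_loop(a):
    """One loop: each value is consumed right after it is produced."""
    c = [0] * len(a)
    for i in range(len(a)):
        b_i = f(a[i])  # lives only for this iteration
        c[i] = g(b_i)
    return c


a = list(range(8))
print(separate_loops(a) == merged_loop(a))
```

Both versions compute the same C; the difference is only how long each intermediate value has to be kept.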
Loop transformations: Example, enabling trafo required
Before (storage size N):
  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];
After loop interchange and merging (storage size 1):
  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
      A[i] = foo(A[i]);
    }
    out[i] = A[i];
  }
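The enabling transformation here is a loop interchange: once the i-loop is outermost, the output loop can be merged into it. This is legal because each A[i] depends only on its own earlier value. A small sketch, with `foo` as a made-up placeholder function:

```python
def foo(x):
    """Placeholder update function (hypothetical)."""
    return (3 * x + 1) % 17


def original(a, m):
    """j-loop outermost: all N elements stay live across the M passes."""
    a = list(a)
    for _ in range(m):
        for i in range(len(a)):
            a[i] = foo(a[i])
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i]
    return out


def transformed(a, m):
    """i-loop outermost, output merged in: one live value at a time."""
    out = []
    for x in a:
        for _ in range(m):
            x = foo(x)
        out.append(x)
    return out


values = [5, 11, 2, 9]
print(original(values, 3) == transformed(values, 3))
```

In the transformed version a single scalar holds the element being updated, which corresponds to the storage size of 1 on the slide.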
Compiler level: Software bypassing
• Register file consumes considerable amount of total processor power
  • > 15% in simple 5-stage RISC (2R1W, 32bx32)
  • Even more in VLIW and SIMD as size and number of ports increase
[Figure: storage hierarchy from datapath buffers and RF to local memory and global memory, labelled "more efficient" and "larger"]
Reducing RF Accesses
Example:
  add r3, r4, r7
  add r12, r3, r7
  sw 0(r1), r12
Only 3 RF reads are actually needed.
• Many RF accesses can be eliminated
  • Bypass read: read operands from bypass network instead of RF
  • Writeback elimination: skip writeback if the variable is dead
  • Operand sharing: the same variable in the same port only needs to be read from RF once
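The count of 3 reads can be reproduced with a toy model. The sketch below is not the MOVE-Pro compiler; it is a simplified counter that assumes two read ports, a bypass only from the immediately preceding instruction, and operand sharing only between consecutive instructions on the same port. The tuple encoding of the instructions is invented for this example.

```python
def count_rf_reads(program, bypass=True, sharing=True):
    """Count register-file reads for a list of (dest, [port0_src, port1_src]).

    dest is None for instructions that write no register (e.g. a store).
    """
    reads = 0
    prev_dest, prev_srcs = None, [None, None]
    for dest, srcs in program:
        for port, reg in enumerate(srcs):
            if bypass and reg == prev_dest:
                continue  # value taken from the bypass network
            if sharing and reg == prev_srcs[port] and reg != prev_dest:
                continue  # same register still present on this port
            reads += 1
        prev_dest, prev_srcs = dest, srcs
    return reads


program = [
    ("r3", ["r4", "r7"]),    # add r3, r4, r7
    ("r12", ["r3", "r7"]),   # add r12, r3, r7
    (None, ["r1", "r12"]),   # sw 0(r1), r12
]

print(count_rf_reads(program, bypass=False, sharing=False))
print(count_rf_reads(program))
```

Without either technique the three instructions read the register file six times; with bypassing and operand sharing only r4, r7 and r1 are read.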
Move-Pro: an Improved TTA
• Original TTA has a few drawbacks:
  • Separate scheduling of operands may increase circuit activity
  • The trigger port introduces extra scheduling constraints
  • TTA code density is likely to be lower compared to RISC/VLIW
    • May need more slots for the same performance
    • Increases instruction fetching energy
• Being able to perform bypass is critical to code density:
  • FU output buffer is added to help
  • Eventually it is up to the compiler to get a good code density
• Unified input ports with buffer:
  • Isolate FUs
  • Enable operand sharing
[Figure: instruction format (32-bit, 16-bit x3) with example moves: R4 -> ALU[add].o; R7 -> ALU[add].t; ALU.o -> R3]
Compiler Framework
• Low level IR
  • Similar to RISC assembly
  • With extra metadata to the backend
• Local instruction scheduling
Scheduling Example
Software bypassing & scheduling
• Direct translation results in bad code density
  • More instructions also means worse performance
• Bypassing improves code density and reduces RF accesses
  • Performance and energy consumption are also improved
Graph-based Resource Model
• Nodes represent resources
  • Resources are duplicated for each cycle
• Edges represent connectivity or storage
• Each node has capacity and cost
  • Cost determined by power model
  • Instruction cost is taken into account
#Issue-Nodes are the same as #Issue-Slots
Energy Results
• 3 configurations
  • R1: RISC, 2R1W RF
  • M2: 2-issue MOVE-Pro, 2R1W RF
  • M3: 3-issue MOVE-Pro, 2R1W RF
  • 8KB (32-bit) / 9KB (48-bit) I-Mem
• Compared to RISC
  • RF energy saving >70%
  • No loss in instr-mem
  • R1 and M2 have the same performance
Architecture level: going parallel
• Running into the
  • Frequency wall
  • ILP wall
  • Memory wall
  • Energy wall
• Chip area enabler: Moore's law goes well below 22 nm
  • What to do with all this area?
  • Multiple processors fit easily on a single die
• Application demands
• Cost effective
  • Reuse: just connect existing processors or processor cores
• Low power: parallelism may allow lowering Vdd
Low power through parallelism
• Sequential processor (one CPU)
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • $P_1 = f C V^2$
• Parallel processor (CPU1 and CPU2: two times the number of units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • $P_2 = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2 < P_1$
• Check yourself whether this works for pipelining as well!
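Plugging numbers into the two power expressions shows that the whole saving comes from the voltage ratio. The operating points below (1 GHz, 1 nF, 1.0 V and 0.8 V) are hypothetical; how far the voltage can really drop at half the frequency depends on the technology.

```python
def dynamic_power(f_hz, c_farad, vdd, alpha=1.0):
    """Dynamic power P = alpha * f * C * Vdd^2, in Watt."""
    return alpha * f_hz * c_farad * vdd ** 2


# Sequential processor: capacitance C, frequency f, voltage V.
p1 = dynamic_power(1.0e9, 1e-9, 1.0)
# Parallel processor: capacitance 2C, frequency f/2, lower voltage V'.
p2 = dynamic_power(0.5e9, 2e-9, 0.8)

print(f"P1 = {p1:.2f} W, P2 = {p2:.2f} W, ratio = {p2 / p1:.2f}")
```

The ratio P2/P1 equals (V'/V)², so at the same throughput the parallel design wins only as far as the supply voltage can be lowered.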
4-D model of parallel architectures
How to speed up your favorite processor?
• Super-pipelining
• Powerful instructions
  • MD-technique
    • multiple data operands per operation
  • MO-technique
    • multiple operations per instruction
• Multiple instruction issue
  • Single stream: Superscalar
  • Multiple streams
    • Single core, multiple threads: Simultaneous Multi-Threading
    • Multiple cores
Architecture methods 1: Pipelined Execution of Instructions
• Purpose of pipelining:
  • Reduce #gate_levels in critical path
  • Reduce CPI close to one (instead of a large number for the multicycle machine)
  • More efficient hardware
• Some bad news: hazards or pipeline stalls
  • Structural hazards: add more hardware
  • Control hazards, branch penalties: use branch prediction
  • Data hazards: bypassing required
Simple 5-stage pipeline:
  CYCLE          1   2   3   4   5   6   7   8
  INSTRUCTION 1  IF  DC  RF  EX  WB
  INSTRUCTION 2      IF  DC  RF  EX  WB
  INSTRUCTION 3          IF  DC  RF  EX  WB
  INSTRUCTION 4              IF  DC  RF  EX  WB
IF: Instruction Fetch; DC: Instruction Decode; RF: Register Fetch; EX: Execute instruction; WB: Write Result Register
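The claim that pipelining brings CPI close to one follows from a simple cycle count. The sketch below assumes an ideal pipeline without any stalls and, for comparison, a multicycle machine that spends one cycle per stage for every instruction.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    return n_stages + n_instructions - 1


def multicycle_cycles(n_instructions, n_stages):
    """Cycles when every instruction walks through all stages on its own."""
    return n_instructions * n_stages


def cpi(cycles, n_instructions):
    """Average cycles per instruction."""
    return cycles / n_instructions


for n in (4, 1000):
    piped = pipelined_cycles(n, 5)
    multi = multicycle_cycles(n, 5)
    print(f"{n} instructions: pipelined {piped} cycles (CPI {cpi(piped, n):.3f}), "
          f"multicycle {multi} cycles (CPI {cpi(multi, n):.3f})")
```

For 4 instructions and 5 stages this gives the 8 cycles of the diagram; for long instruction sequences the CPI approaches 1, and hazards then add stall cycles on top.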
Architecture methods 1: Super pipelining
• Superpipelining:
  • Split one or more of the critical pipeline stages
  • Superpipelining degree S:
    $S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
    where f(op) is the frequency of operation op and lt(op) is the latency of operation op
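The superpipelining degree is a frequency-weighted average latency. A minimal sketch with a made-up instruction mix (the operation classes, frequencies and latencies below are purely illustrative):

```python
def superpipelining_degree(instruction_mix):
    """S = sum over operations of f(op) * lt(op).

    instruction_mix maps an operation name to (relative frequency, latency in cycles);
    the frequencies are expected to sum to 1.
    """
    total_frequency = sum(freq for freq, _ in instruction_mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("operation frequencies must sum to 1")
    return sum(freq * latency for freq, latency in instruction_mix.values())


# Hypothetical instruction mix.
mix = {
    "alu": (0.5, 1),
    "load": (0.2, 2),
    "branch": (0.2, 2),
    "mul": (0.1, 3),
}
print(superpipelining_degree(mix))
```

A machine in which every operation has a latency of one cycle has S = 1; splitting critical stages raises the latencies and therefore S.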
Architecture methods 2: Powerful Instructions (1)
• MD-technique
  • Multiple data operands per operation
  • SIMD: Single Instruction Multiple Data
Vector instruction:
  for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];
or
  c = a + 5*b
Assembly:
  set   vl,64
  ldv   v1,0(r2)
  mulvi v2,v1,5
  ldv   v1,0(r1)
  addv  v3,v1,v2
  stv   v3,0(r3)
Architecture methods 2: Powerful Instructions (1)
• SIMD computing
  • All PEs (Processing Elements) execute same operation
  • Typical mesh or hypercube connectivity
  • Exploit data locality of e.g. image processing applications
  • Dense encoding (few instruction bits needed)
[Figure: SIMD execution method; over time, PE1, PE2, ..., PEn all execute instruction 1, then instruction 2, instruction 3, ..., instruction n]
Architecture methods 2: Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on restricted scale:
  • Used for multi-media instructions
  • Many processors support this
• Examples
  • MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II
  • Example: $\sum_{i=1}^{4} |a_i - b_i|$
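The example operation, a sum of four absolute differences, can be emulated in software to show what a sub-word instruction computes. The sketch assumes four unsigned 8-bit lanes packed into one 32-bit word; the helper names are made up, and real multimedia instruction sets differ in lane width and saturation rules.

```python
def pack4(values):
    """Pack four unsigned 8-bit values into one 32-bit word, lane 0 in the low byte."""
    if len(values) != 4:
        raise ValueError("expected exactly four sub-words")
    word = 0
    for lane, value in enumerate(values):
        if not 0 <= value <= 0xFF:
            raise ValueError("sub-word out of 8-bit range")
        word |= value << (8 * lane)
    return word


def sad4(word_a, word_b):
    """Sum of absolute differences over the four 8-bit lanes of two words.

    Hardware would handle all lanes in one instruction; the loop only emulates that.
    """
    total = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total


a = pack4([10, 200, 30, 40])
b = pack4([12, 100, 30, 45])
print(sad4(a, b))
```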
Architecture methods 2: Powerful Instructions (2)
• MO-technique: multiple operations per instruction
• Two options:
  • CISC (Complex Instruction Set Computer)
    • this is what we did in the 'old' days of microcoded processors
  • VLIW (Very Long Instruction Word)
VLIW instruction example:
  field         FU 1            FU 2             FU 3             FU 4           FU 5
  instruction   sub r8, r5, 3   and r1, r5, 12   mul r6, r5, r2   ld r3, 0(r5)   bnez r5, 13
VLIW architecture: central Register File
[Figure: one central register file feeding exec units 1 to 9, grouped under issue slots 1 to 3]
Q: How many ports does the register file need for n-issue?
Clustered VLIW
• Clustering = splitting up the VLIW data path; the same can be done for the instruction path
• Exploit locality @ Level 0, for instructions and data
[Figure: Level 1 instruction cache feeding three loop buffers; nine FUs with three register files; Level 1 data cache; Level 2 (shared) cache]
Architecture methods 3: Multiple instruction issue (per cycle)
• Who guarantees semantic correctness?
  • can instructions be executed in parallel?
• User: he specifies multiple instruction streams
  • Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: run-time detection of ready instructions
  • Superscalar, single instruction stream
• Compiler: compile into dataflow representation
  • Dataflow processors
  • Multi-threaded processors
Four dimensional representation of the architecture design space <I, O, D, S>
• Axes: Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', Superpipelining degree 'S'
• $S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
• $M_{par} = I \cdot O \cdot D \cdot S$
You should exploit this amount of parallelism!
[Figure: design-space plot placing RISC, CISC, Superscalar, Superpipelined, VLIW, Vector, SIMD, MIMD and Dataflow along these axes; scale marks 0.1, 10, 100]
Examples of many core / PE architectures
• SIMD
  • Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ)
• VLIW
  • ADRES, TriMedia
  • more dynamic: Itanium (static sched., rt mapping), TRIPS/EDGE (rt scheduling)
• Multi-threaded
  • idea: hide long latencies
  • Denelcor HEP (1982), SUN Niagara (2005)
• Multi-processor
  • RaW, PicoChip, Intel/AMD, GRID, Farms, ...
• Hybrid, like Imagine, GPUs, XC-Core, Cell
  • actually, most are hybrid!