280 likes | 376 Vues
Ch. 3-3 CPU. Input and output. Supervisor mode, exceptions, traps. Memory management and address translation Caches CPU Performance and power consumption Co-processors. Elements of CPU performance. Cycle time. CPU pipeline. Memory system. Pipelining.
 
                
                E N D
Ch. 3-3 CPU Input and output.Supervisor mode, exceptions, traps.Memory management and address translationCachesCPU Performance and power consumptionCo-processors.
Elements of CPU performance • Cycle time. • CPU pipeline. • Memory system.
Pipelining • Several instructions are executed simultaneously at different stages of completion. • Various conditions can cause pipeline bubbles that reduce utilization: • branches; • memory system delays; • etc.
Pipeline structures • Both ARM and SHARC have 3-stage pipes: • fetch instruction from memory; • decode opcode and operands; • execute.
add r0,r1,#5 fetch decode execute fetch decode execute sub r2,r3,r6 cmp r2,#3 fetch decode execute time 1 2 3 4 ARM pipeline execution
Performance measures • Latency: time it takes for an instruction to get through the pipeline. • Throughput: number of instructions executed per time period. • Pipelining increases throughput without reducing latency.
Pipeline stalls • If every step cannot be completed in the same amount of time, pipeline stalls. • Bubbles introduced by stall increase latency, reduce throughput.
fetch decode ex ld r2 ex ld r3 fetch decode ex sub fetch decode ex cmp time Data stalls ldmia r0,{r2, r3} sub r2, r3, r6 cmp r2, #3
Control stalls • Branches often introduce stalls (branch penalty). • Stall time may depend on whether branch is taken. • May have to squash instructions that already started executing. • Don’t know what to fetch until condition is evaluated.
Branch stalls bne foo fetch decode ex bne ex bne ex bne sub r2,r3,r6 fetch decode . . . fetch foo add r0,r1,r2 fetch decode ex add time
Delayed branch • Requires that n instructions after branch be always executed whether branch is executed or not. • SHARC supports delayed and non-delayed branches. • Specified by a bit in the branch instruction. • If non-delayed, stall for two cycles.
L1=5; DM(I0,M1)=R1; L8=8; DM(I8,M9)=R2; CPU cannot use DAG onecycle just after loading DAG’s register. CPU performs NOP between register assign and DM. Example: SHARC code scheduling NOP inserted
L1=5; L8=8; DM(I0,M1)=R1; DM(I8,M9)=R2; Avoids two NOP cycles. Rescheduled SHARC code
Example: ARM execution time • Determine execution time of FIR filter: for (i=0; i<N; i++) f = f + c[i]*x[i]; • Only branch in loop test may take more than one cycle. • BLT loop takes 1 cycle in the best case, 3 in the worst case.
Execution time a for loop ; loop initiation code MOV r0,#0 ; use r0 for i MOV r8,#0 ; use separate index for arrays ADR r2,N ; get address for N LDR r1,[r2] ; get value of N MOV r2,#0 ; use r2 for f ADR r3,c ; load r3 with base of c ADR r5,x ; load r5 with base of x ;loop body loop LDR r4,[r3,r8] ; get c[i] LDR r6,[r5,r8] ; get x[i] MUL r4,r4,r6 ; compute c[i]*x[i] ADD r2,r2,r4 ; add into running sum ;update loop counter and array index ADD r8,r8,#4 ; add one word offset to array index ADD r0,r0,#1 ; add 1 to I ;test for exit CMP r0,r1 ; exit? BLT loop ; if i < N, continue tloop = tinit + N(tbody+tupdate) + (N-1)ttest,worst + ttest,best
Superscalar execution • Superscalar processor can execute several instructions per cycle. • Uses multiple pipelined data paths. • Programs execute faster, but it is harder to determine how much faster.
Data dependencies • Execution time depends on operands, not just opcode. • Superscalar CPU checks data dependencies dynamically: r0 r1 add r2,r0,r1 add r3,r2,r5 r2 r5 r3 data dependency
Memory system performance • Caches introduce indeterminacy in execution time. • Depends on order of execution. • Cache miss penalty: added time due to a cache miss. • Several reasons for a miss: compulsory, conflict, capacity.
CPU power consumption • Most modern CPUs are designed with power consumption in mind to some degree. • Power vs. energy: • Power is energy consumption per unit time • Heat generation depends on power consumption; • Battery life depends on energy consumption. • CMOS (Complementary metal oxide semiconductor)
CMOS power consumption • Voltage drops: power consumption proportional to V2. • Toggling: • A CMOS circuit uses most of the power which changing its output value • Reduce a circuit operation speed => more activity means more power. • Eliminating unnecessary changes to CMOS inputs • Leakage: • basic circuit characteristics • Some charge leaks out of circuit nodes even though CMOS is not active • can be eliminated by disconnecting power.
CPU power-saving strategies • Reduce power supply voltage. • Run at lower clock frequency. • Disable function units with control signals when not in use. • Disconnect parts from power supply when not in use.
Power management styles • Static power management: does not depend on CPU activity. • Example • user-activated power-down mode. • Power down by an instruction, but activated on the receipt of an interrupt or other event. • Dynamic power management: based on CPU activity. • Example: disabling function units of the CPU.
Application: PowerPC 603 energy features • Provides doze, nap, sleep modes, used by the program and operating system. • Dynamic power management features: • Can shut down unused execution units. • DC, IC, Load-Store, etc. • Cache organized into subarrays to minimize amount of active circuitry. • Two out of eight subarrarys used on any given clock cycle
PowerPC 603 activity • Percentage of time units are idle for SPEC integer/floating-point: unit Specint92 Specfp92 D cache 29% 28% I cache 29% 17% load/store 35% 17% fixed-point 38% 76% floating-point 99% 30% system register 89%97%
trs Prun run sleep Psleep tsr Power-down costs • Going into a power-down mode costs: • time; • energy. • Must determine if going into mode is worthwhile. • Can model CPU power states with power state machine.
SA-1100 VDD_FAULT VDD VDDX BATT_FAULT PWR_EN Application: StrongARM SA-1100 power saving • Processor takes two supplies: • VDD is main 3.3V supply. • VDDX is 1.5V. • Three power modes: • Run: normal operation. • Idle: stops CPU clock, with other logics still powered. • Real-time clock, operating system timer, interrupt control, general-purpose I/O, and power manager are operational. • Return to run mode upon receiving an interrupt • Sleep: shuts off most of the chip’s activity.
Sleep mode • Entering sleep mode • Shut down on-chip activity; • Reset the CPU; and • Negate the PWR_EN pin to tell that the chip’s power supply should be driven to 0 volts. • Enable a separate power supply for power manager • Low speed clock sufficient to manage sleep mode • CPU software sets several registers to prepare for sleep mode • Wake-up sequence • PWR_EN pin asserted and waits for 10 ms; • 3.686-MHz oscillator is ramped up to speed; and • Internal reset is negated and CPU boot sequence begins.
SA-1100 power state machine Prun = 400 mW run 10 ms 160 ms 90 ms 10 ms 90 ms idle sleep Pidle = 50 mW Psleep = 0.16 mW