Performance Measurements

Performance • The entire point of computer hardware is to “perform” • Operate correctly • Implement useful operations • Do so as fast as possible • What differences do we see in performance? • Almost all computers operate correctly (within reason) • Most computers implement useful operations • This is a matter of taste... • Computers all operate at different speeds • Speed is the most important performance metric 2.1

Measuring speed • Which is faster? • Ferrari: 170 MPH, 2 people • School Bus: 57 MPH, 40 people • Raw speed • Ferrari wins (170 MPH vs. 57 MPH) • Throughput • Ferrari: 340 passenger-MPH • School Bus: 2280 passenger-MPH • School Bus wins • Other issues... • Range, reliability, cost 2.1
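The raw-speed versus throughput comparison can be sketched in a few lines of Python. The helper name `throughput` and the dictionary layout are illustrative assumptions, not part of the slides.

```python
# Raw speed vs. throughput for the two vehicles on the slide.
# `throughput` is a hypothetical helper used only for illustration.

def throughput(speed_mph, passengers):
    """Passenger-MPH: passenger-miles moved per hour."""
    return speed_mph * passengers

# name -> (speed in MPH, passengers carried)
vehicles = {"Ferrari": (170, 2), "School Bus": (57, 40)}

fastest = max(vehicles, key=lambda name: vehicles[name][0])
highest_throughput = max(vehicles, key=lambda name: throughput(*vehicles[name]))

print(fastest)             # Ferrari
print(highest_throughput)  # School Bus
```

Which vehicle "wins" depends entirely on the metric chosen, which is the point the slide makes before turning to computers.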

Performance of computers • How long does it take to run my favorite program? • CPU time • Response time • Batch throughput • To compare two computers, we compare the execution time of the same program on the two computers • Faster one wins • Lower execution time is better 2.2

A little background... • Computer programs are (usually) written in a high-level language (e.g. C) • The compiler converts this code into machine-language instructions • The CPU interprets machine-language instructions and executes them • The performance of a program depends on: • The number and types of instructions executed • How fast the CPU can execute those instructions 2.2

Tick-tock • Almost all modern computers are based on a clock • All events are controlled by and synchronized to a regular clock • Clocks are just regular periodic waveforms • Cycle time: time for the waveform to repeat itself • Also known as the clock period • Frequency: 1/Period • Example: • 10 ns clock cycle --> period = 10^-8 s • Frequency = 1/10 ns = 1/(10^-8 s) = 10^8 cycles/sec = 100 MHz 2.2
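A minimal sketch of the period/frequency relationship, assuming times in seconds and frequencies in Hz. The function names are made up for this illustration.

```python
# Clock period and frequency are reciprocals of each other.
# Function names are illustrative assumptions, not from the slides.

def period_to_frequency(period_seconds):
    """Frequency in Hz (cycles/sec) for a given clock period in seconds."""
    return 1.0 / period_seconds

def frequency_to_period(frequency_hz):
    """Clock period in seconds for a given frequency in Hz."""
    return 1.0 / frequency_hz

print(period_to_frequency(10e-9))   # about 1e8 Hz, i.e. 100 MHz
print(frequency_to_period(200e6))   # about 5e-9 s, i.e. 5 ns
```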

Execution time • Since the cycle time of a computer is constant, we can express time in terms of CPU cycles • Performance can be improved by: • Decreasing the cycle time • Hardware solution: Use faster technology • Decreasing the number of cycles for the program • Software: Write a better program • Hardware: Re-design CPU • Time = cycles * cycle time • Time = cycles / clock frequency 2.3

Instruction execution time • Every instruction takes time to execute • Some instructions may take more or less time than others • The time for an instruction is expressed in terms of clock cycles

Example:
Instruction   Cycles
ADD           1
MULT          4
CMP           1
SUB           2

• The time to run a program depends on: • How many instructions • What type of instructions • 30 ADDs and 4 MULTs --> 30 * 1 + 4 * 4 = 46 cycles 2.3
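The cycle count for a given instruction mix can be tallied as below. This is a sketch using the example cycle costs from the slide; the names `CYCLES` and `total_cycles` are assumptions for illustration.

```python
# Per-instruction cycle costs from the example table on the slide.
CYCLES = {"ADD": 1, "MULT": 4, "CMP": 1, "SUB": 2}

def total_cycles(counts):
    """Sum cycles over a mapping of instruction name -> number executed."""
    return sum(CYCLES[op] * n for op, n in counts.items())

print(total_cycles({"ADD": 30, "MULT": 4}))  # 46
```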

Average CPI • The Cycles-Per-Instruction (CPI) varies depending on what instructions are used • Take an Average CPI • Cycles = Number of Instructions * Average CPI • Average CPI should reflect the mix of instructions in the program • A large proportion of 4-cycle MULTs should raise the CPI, a large proportion of 1-cycle ADDs should lower it • The average should be the weighted average 2.3

Weighting the average

Example mix of instructions:
Instruction   Cycles   %
ADD           1        40
MULT          4        10
CMP           1        20
SUB           2        30

Average CPI = 1 * 40% + 4 * 10% + 1 * 20% + 2 * 30% = .4 + .4 + .2 + .6 = 1.6 Notice: The average CPI depends on the code we’re executing! 2.3
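The weighted average can be computed directly from the mix. The sketch below assumes the fractions are given as decimals that sum to 1; `average_cpi` is a hypothetical helper name.

```python
def average_cpi(mix):
    """Weighted-average CPI.

    `mix` maps instruction name -> (cycles, fraction of executed instructions).
    """
    total_fraction = sum(fraction for _, fraction in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(cycles * fraction for cycles, fraction in mix.values())

mix = {
    "ADD": (1, 0.40),
    "MULT": (4, 0.10),
    "CMP": (1, 0.20),
    "SUB": (2, 0.30),
}

print(round(average_cpi(mix), 2))  # 1.6
```

Changing the fractions changes the result, which is why the average CPI is a property of the program being run and not of the processor alone.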

How long? • Execution time = Cycles * Cycle Time • Cycles = Average CPI * Instruction Count • Execution time = Instruction Count * CPI * Cycle Time • Remember, lower is better • Reducing any one of the three components reduces execution time • Instruction count - Reduced through better code, better compiler, change in CPU design • CPI - Reduced through better code, better compiler, change in CPU design • Cycle time - Reduced through technology change, change in CPU design 2.3

Examples
System A: 10 s to run a program. Clock period is 20 ns.
System B: Change the clock period to 10 ns, no other changes. How long does it take to run the same program on System B?
--> Time_A = CPI_A x Period_A x Instructions_A = 10 s
--> Time_B = CPI_A x Period_B x Instructions_A = ? (Period_B = Period_A x 0.5)
--> Time_B = CPI_A x (Period_A x 0.5) x Instructions_A = Time_A x 0.5 = 5 s
System C: 10 s to run a program, 20 ns clock, 400,000,000 instr. What is the CPI?
--> CPI_C = Time_C / (Period_C x Instr_C) = 10 s / (20 x 10^-9 s x 4 x 10^8) = 1.25
System D: 400,000,000 instr., 22 ns clock and a CPI of 1.10. How long does it take to run the program on System D?
--> Time_D = CPI_D x Period_D x Instructions_D = 1.10 x 22 ns x 4 x 10^8 = 9.68 s 2.3
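The four systems can be checked with the basic formula, Execution time = Instruction Count * CPI * Cycle Time. The helper names below are illustrative assumptions.

```python
def execution_time(instructions, cpi, period_s):
    """Execution time in seconds = instruction count * CPI * cycle time."""
    return instructions * cpi * period_s

def cpi_from_time(time_s, instructions, period_s):
    """Solve the same formula for CPI."""
    return time_s / (instructions * period_s)

# System B: same program and CPI as System A, clock period halved (20 ns -> 10 ns).
time_a = 10.0
time_b = time_a * (10e-9 / 20e-9)

# System C: 10 s, 20 ns clock, 400,000,000 instructions.
cpi_c = cpi_from_time(10.0, 400_000_000, 20e-9)

# System D: 400,000,000 instructions, 22 ns clock, CPI 1.10.
time_d = execution_time(400_000_000, 1.10, 22e-9)

print(round(time_b, 2), round(cpi_c, 2), round(time_d, 2))  # 5.0 1.25 9.68
```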

Examples
Assume an add takes 1 cycle, a mult 4 cycles, and a sub 2 cycles. Two different compilers produce the following loops for the same code:
A: mult, add, mult, sub (loop 1,000,000 times)
B: add, add, mult, sub, add, add (loop 1,000,000 times)
What’s the CPI?
CPI_A = (4 + 1 + 4 + 2)/4 = 2.75
CPI_B = (1 + 1 + 4 + 2 + 1 + 1)/6 = 1.667
How long does it take to run each program on a 200 MHz CPU (5 ns period)?
Time_A = CPI_A x Period x Instructions_A = 2.75 x 5 ns x 4,000,000 = 0.055 s
Time_B = CPI_B x Period x Instructions_B = 1.667 x 5 ns x 6,000,000 = 0.050 s
B executes more instructions but still finishes sooner. 2.3
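A small sketch of the two-compiler comparison, assuming the loop bodies listed on the slide and ignoring any loop-control instructions (as the slide does). `loop_stats` is a hypothetical helper name. At 200 MHz, loop A needs 11 million cycles, or about 0.055 s, and loop B needs 10 million cycles, or about 0.050 s.

```python
# Cycle costs assumed on the slide.
CYCLES = {"add": 1, "mult": 4, "sub": 2}

def loop_stats(body, iterations, clock_hz):
    """Return (CPI, instruction count, execution time in seconds) for a loop."""
    cycles_per_iteration = sum(CYCLES[op] for op in body)
    instructions = len(body) * iterations
    cpi = cycles_per_iteration / len(body)
    time_s = cycles_per_iteration * iterations / clock_hz
    return cpi, instructions, time_s

loop_a = ["mult", "add", "mult", "sub"]
loop_b = ["add", "add", "mult", "sub", "add", "add"]

cpi_a, instr_a, time_a = loop_stats(loop_a, 1_000_000, 200e6)
cpi_b, instr_b, time_b = loop_stats(loop_b, 1_000_000, 200e6)

print(round(cpi_a, 3), instr_a, round(time_a, 4))
print(round(cpi_b, 3), instr_b, round(time_b, 4))
```

The loop with the lower CPI also has the higher instruction count, so neither number alone predicts which loop is faster.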

Benchmarks and Performance Metrics

Performance metrics • I’m concerned with how long it takes to run my program • Chances are, that number isn’t published with the specs for the computer • Standardized metrics • Benchmarks (SPEC, etc.) • MIPS • MFLOPS 2.4

Benchmarks • Run a suite of benchmark programs, average the performance • Benchmarks - programs thought to be representative of commonly-used programs • Advantages • Actually corresponds to execution time! • Represents a wider range of programs • Disadvantages • Are they running your program? • Who picks the benchmarks? Be wary if the manufacturer does! 2.5

SPEC Benchmarks (SPEC web page: www.spec.org) • SPEC (Standard Performance Evaluation Corporation, originally the System Performance Evaluation Cooperative) maintains a set of benchmark suites • New tests use SPEC CPU2000 • CINT2000 - Performance on integer programs • CFP2000 - Performance on floating-point programs • Larger numbers indicate better performance • Tests prior to 2000 used CPU95 • CPU2000 has only a few years of data 2.6

Figure: SPECint95 Results for Intel Processors (SPECint95 vs. Clock Speed, 100 to 800 MHz). Annotation: better cache design (on-chip vs. off-chip). Note: Results depend on cache size, memory system, and motherboard

Figure: SPECfp95 Results for Intel Processors (SPECfp95 vs. Clock Speed, 100 to 800 MHz). Note: Results depend on cache size, memory system, and motherboard

Figure: CINT2000 Results for Various Processors (CINT2000 vs. Clock Speed in GHz). Athlon part numbers (1500+ to 3200+) are labeled on the graph. Note: Athlon part numbers are not the CPU MHz! Note: Results depend on cache size, memory system, and motherboard

Figure: CFP2000 Results for Various Processors (CFP2000 vs. Clock Speed in GHz). Athlon part numbers (1500+ to 3200+) are labeled on the graph. Note: Athlon part numbers are not the CPU MHz! Note: Results depend on cache size, memory system, and motherboard

Limited benefits... • Assume we’re running a program that spends 40% of its time accessing memory • Now, we upgrade the processor from 200 MHz to 800 MHz • How much faster does the program run? • We’ve reduced the time for 60% of the program by a factor of 4 • But we haven’t touched the memory access time • New total = Old * (40% + (60% / 4)) = Old * (40% + 15%) = Old * 55% • Speedup = 1 / 0.55 = about 1.8, not even twice as fast! 2.7

Amdahl’s Law
New execution time = (Execution time affected by improvement / Amount of improvement) + Unaffected execution time
• Practical effect: “Make the common case fast” • Corollary: “Forget about the rare case” • Example: 70% of my execution time is spent on integer ADDs, and 6% on floating-point ADDs. Total execution time is 100 seconds. • What’s the effect of making integer ADDs twice as fast? • New time = (100 * .70) / 2 + (100 * .30) = 35 + 30 = 65 seconds • What’s the effect of making F.P. ADDs twice as fast? • New time = (100 * .06) / 2 + (100 * .94) = 3 + 94 = 97 seconds 2.7
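Amdahl's Law as stated on the slide can be sketched as a single function. The name `amdahl_new_time` and its parameters are assumptions chosen for this illustration.

```python
def amdahl_new_time(old_time, affected_fraction, improvement):
    """New execution time after speeding up part of a program.

    affected_fraction: share of the old execution time that benefits (0..1).
    improvement: factor by which that share gets faster (2 means twice as fast).
    """
    affected = old_time * affected_fraction
    unaffected = old_time * (1.0 - affected_fraction)
    return affected / improvement + unaffected

# Integer ADDs (70% of the time) made twice as fast.
print(round(amdahl_new_time(100, 0.70, 2), 2))  # 65.0

# Floating-point ADDs (6% of the time) made twice as fast.
print(round(amdahl_new_time(100, 0.06, 2), 2))  # 97.0

# Clock upgrade that speeds up the non-memory 60% of a program by 4x.
print(round(amdahl_new_time(1.0, 0.60, 4), 2))  # 0.55
```

However large `improvement` becomes, the result never drops below the unaffected share of the old time, which is why speeding up the rare case pays off so little.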

(Native) MIPS • Million Instructions Per Second
MIPS = Instructions / (seconds x 10^6) = (cycles / second) x (1 / CPI) x 10^-6 = clock rate / (CPI x 10^6)
• MIPS does not take into account how many instructions must be executed in a program • Example: Same program, written two ways
1. 1,000 instructions, CPI 1.2, 1.0 MHz clock • Execution time = 1.2 ms, MIPS = 1/1.2 = .833
2. 500 instructions, CPI 2.0, 1.0 MHz clock • Execution time = 1.0 ms, MIPS = 1/2.0 = .500
• The second version finishes sooner even though its MIPS rating is lower 2.4
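The two-version example can be reproduced with a short sketch. `mips` and `execution_time` are illustrative helper names, with the clock rate given in Hz.

```python
def mips(clock_hz, cpi):
    """Native MIPS = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)

def execution_time(instructions, cpi, clock_hz):
    """Execution time in seconds = instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz

# (instruction count, CPI) for the two ways of writing the same program.
versions = {"version 1": (1000, 1.2), "version 2": (500, 2.0)}
clock_hz = 1.0e6

for name, (instructions, cpi) in versions.items():
    time_ms = execution_time(instructions, cpi, clock_hz) * 1e3
    print(name, round(time_ms, 3), "ms,", round(mips(clock_hz, cpi), 3), "MIPS")
```

Version 1 has the higher MIPS rating, yet version 2 has the lower execution time.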

Avoid MIPS (the metric, not the processor) • Higher MIPS doesn’t always mean better performance • MIPS = clock rate / (CPI * 1,000,000) • Highest MIPS corresponds to using the smallest (fastest) instructions to lower CPI • Peak MIPS is pointless • Peak MIPS is just the MIPS you get with the smallest instructions • Usually, CPI is 1.0 for this • That just re-expresses the clock rate in MHz 2.4

MFLOPS • Million Floating-point Operations Per Second • MFLOPS is similar to MIPS • Measures floating-point operations (mult, divide, add,...) • Suffers same problems as MIPS • Different operations cost different amounts • Peak MFLOPS is especially bad 2.4

Performance Summary • Execution time is the most important performance metric • Basic formula for performance: • Execution time = instructions * cycle time * CPI • Amdahl’s law describes how making limited improvements affects the bottom line • Only make improvements in areas that are commonly used • Standard benchmarks help us to compare performance of various computers • Beware of overly-simplified comparisons

Pitfalls and Fallacies • Processors with the same ISA can be compared by clock rate or a single benchmark suite alone • We don’t know the pipeline structure and memory system • Peak performance tracks observed performance • One processor may operate closer to peak performance most of the time than another • MIPS is an accurate measure of performance

Example
We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:

                   M1        M2
Clock Frequency    300 MHz   200 MHz

Two programs were run on both machines and the following measurements were made:

Program   Time on M1   Time on M2
1         6 seconds    4 seconds
2         8 seconds    10 seconds

The following additional measurements were made:

Program   Instructions executed on M1   Instructions executed on M2
1         180x10^6                      100x10^6

1. For each program, which machine is faster and by how much?
2. Find the clock cycles per instruction (CPI or average CPI) for Program 1 on both machines.
3. On M1, each multiplication instruction takes 20 clock cycles. Suppose 20% of the instructions in Program 1 running on M1 are multiplications. What percentage of the CPU time is spent doing multiplications during the execution of Program 1 on M1?
4. Find the instruction execution rate (i.e., the number of instructions executed per second) for each machine when running Program 1.
5. Assuming the CPI for each machine is constant, find the instruction count for Program 2 running on each machine using the execution times.

Solution
1. For Program 1, M2 is 2 s faster: its execution time is (6-4)/6 = 33% lower (speedup 6/4 = 1.5). For Program 2, M1 is 2 s faster: its execution time is (10-8)/10 = 20% lower (speedup 10/8 = 1.25).
2. t_M1P1 = INSTR_M1P1 x CPI_M1P1 x 1/f_M1 => CPI_M1P1 = (t_M1P1 x f_M1)/INSTR_M1P1 = (6 x 300x10^6)/(180x10^6) = 10. Likewise CPI_M2P1 = (4 x 200x10^6)/(100x10^6) = 8.
3. INSTR_MULT_M1P1 = 0.2 x 180x10^6 = 36x10^6 instructions. t_MULT_M1P1 = INSTR_MULT_M1P1 x 20 x 1/(300x10^6) = 720/300 = 2.4 s. t_MULT_M1P1 / t_M1P1 = 2.4/6 = 40%.
4. MIPS_M1P1 = (INSTR_M1P1 / t_M1P1) x 10^-6 = 180/6 = 30, i.e. 30x10^6 instructions per second. MIPS_M2P1 = (INSTR_M2P1 / t_M2P1) x 10^-6 = 100/4 = 25, i.e. 25x10^6 instructions per second.
5. t_M1P2 = INSTR_M1P2 x CPI_M1P2 x 1/f_M1 => INSTR_M1P2 = (t_M1P2 x f_M1)/CPI_M1P1 = (8 x 300x10^6)/10 = 240x10^6. INSTR_M2P2 = (t_M2P2 x f_M2)/CPI_M2P1 = (10 x 200x10^6)/8 = 250x10^6.
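The numbers in this solution can be re-derived from the measured data with a short script. All helper and variable names are illustrative assumptions; only the input values come from the example.

```python
def cpi(time_s, clock_hz, instructions):
    """Average CPI = total cycles / instructions executed."""
    return time_s * clock_hz / instructions

def instruction_count(time_s, clock_hz, cpi_value):
    """Instructions executed, given time, clock rate, and an assumed CPI."""
    return time_s * clock_hz / cpi_value

# Step 2: CPI for Program 1 on each machine.
cpi_m1 = cpi(6, 300e6, 180e6)
cpi_m2 = cpi(4, 200e6, 100e6)

# Step 3: 20% of M1's instructions are multiplies costing 20 cycles each.
mult_share_m1 = 0.20 * 20 / cpi_m1

# Step 4: instruction execution rate in millions of instructions per second.
mips_m1 = 180e6 / 6 / 1e6
mips_m2 = 100e6 / 4 / 1e6

# Step 5: Program 2 instruction counts, assuming the Program 1 CPI carries over.
instr_m1_p2 = instruction_count(8, 300e6, cpi_m1)
instr_m2_p2 = instruction_count(10, 200e6, cpi_m2)

print(cpi_m1, cpi_m2)
print(round(mult_share_m1, 2))
print(mips_m1, mips_m2)
print(instr_m1_p2, instr_m2_p2)
```

Step 5 rests on the assumption that CPI does not change between programs, which the earlier slides caution is not true in general.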

Review Questions • Is CPI constant for a given processor (does not change from one program to another)? • Two processors with the same Instruction Set Architecture have the same CPI • True • False • Is MIPS constant for a given processor (does not change from one program to another)? • Two processors with the same Instruction Set Architecture have the same MIPS • True • False

Review Questions • Which of the following performance metrics is generally easier for the programmer to improve? • The instruction count • The average CPI • The clock frequency • Peak MIPS • What would you consider most important when selecting the fastest processor for a certain application domain? • The operating clock frequency • MIPS • Peak MIPS • Execution time for relevant benchmarks • How can you increase a processor’s clock frequency? • Write a better program • Use a better compiler • Implement the processor in a faster VLSI technology • Use a larger memory

Example
We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:

                   M1        M2
Clock Frequency    800 MHz   1000 MHz

A program was run on both machines and the following measurements were made:

Time on M1    Time on M2
2.5 seconds   2 seconds

The following additional measurements were made:

Instructions executed on M1   Instructions executed on M2
100x10^6                      125x10^6

Finally, the frequency with which instructions occur in the program on M1 and M2 is shown in the following table:

Instruction   M1 %   M2 %
ADD           40     60
MULT          10     8
CMP           20     12
SUB           30     20

1. Find the clock cycles per instruction (CPI or average CPI) for the program on both machines.
2. How much faster will the program run on M1 and M2 respectively if we
   a. reduce the execution time of the ADD instruction by 20%, assuming that an ADD instruction requires 5 cycles on both machines
   b. reduce the execution time of the MULT instruction by 20%, assuming a MULT instruction requires 20 cycles on M1 and 25 cycles on M2
3. Which is better for M1 and which for M2?
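The slides give no solution for this exercise, so the following is only one possible way to work it, under stated assumptions: the average CPI is derived from the measured time, clock rate, and instruction count (as in the earlier example), and a 20% reduction shortens only the cycles of the named instruction while everything else stays the same. The helper names are hypothetical.

```python
def measured_cpi(time_s, clock_hz, instructions):
    """Average CPI implied by a measured run."""
    return time_s * clock_hz / instructions

def speedup_after_change(cpi, fraction, cycles, reduction):
    """Speedup when one instruction type gets `reduction` (0..1) cheaper.

    fraction: share of executed instructions of that type.
    cycles: cycles that instruction type currently takes.
    Assumes the instruction count and clock rate are unchanged.
    """
    saved_per_instruction = fraction * cycles * reduction
    return cpi / (cpi - saved_per_instruction)

cpi_m1 = measured_cpi(2.5, 800e6, 100e6)
cpi_m2 = measured_cpi(2.0, 1000e6, 125e6)

results = {
    "M1 ADD": speedup_after_change(cpi_m1, 0.40, 5, 0.20),
    "M1 MULT": speedup_after_change(cpi_m1, 0.10, 20, 0.20),
    "M2 ADD": speedup_after_change(cpi_m2, 0.60, 5, 0.20),
    "M2 MULT": speedup_after_change(cpi_m2, 0.08, 25, 0.20),
}

print(cpi_m1, cpi_m2)
for label, speedup in results.items():
    print(label, round(speedup, 3))
```

Under these assumptions the two changes help M1 by about the same amount (roughly 2%), while on M2 the ADD change helps more than the MULT change (roughly 3.9% versus 2.6%).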
