Chapter 4 Assessing and Understanding Performance

Chapter 4Assessing and Understanding Performance 授課教師: 張傳育博士 (Chuan-Yu Chang Ph.D.) E-mail: chuanyu@yuntech.edu.tw Tel: (05)5342601 ext. 4337

Which of these airplanes has the best performance? Airplane Passengers Range (mi) Speed (mph) Passenger throughput Boeing 777 375 4630 610 228,750 Boeing 747 470 4150 610 286,700 BAC/Sud Concorde 132 4000 1350 178,200 Douglas DC-8-50 146 8720 544 79,424 • How much faster is the Concorde compared to the 747? • How much bigger is the 747 than the Douglas DC-8? • For different types of applications, different performance metrics may be appropriate. • Understanding how best to measure performance and the limitations of performance measurements is important in selecting a machine.

Computer Performance: TIME, TIME, TIME • Response Time (execution time) • The time between the start and completion of a task. • How long does it take for my job to run? • How long does it take to execute a job? • How long must I wait for the database query? • Throughput • The total amount of work done in a given time. • How many jobs can the machine run at once? • What is the average execution rate? • How much work is getting done? • If we upgrade a machine with a new processor what do we increase? • Response time and throughput are improved. • If we add additional processor to a system that uses multiple processors for separate tasks what do we increase? • Throughput is increasing.

Computer Performance: • To maximize performance, to minimize response time or execution time for some task. • Relative Performance If the performance of X is greater than the performance of Y X is n times faster than Y

Book's Definition of Performance • For some program running on machine X, PerformanceX= 1 /ExecutiontimeX • "X is n times faster than Y"PerformanceX / PerformanceY = n • Example: • Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, how much faster is A than B? • Answer A is 1.5 times faster than B.

Execution Time • Time is the measure of computer performance • The computer that performs the same amount of work in the least time is the fastest. • Elapsed Time (response time) • counts everything (disk and memory accesses, I/O , etc.) • a useful number, but often not good for comparison purposes • CPU time (CPU execution time) • doesn't count I/O or time spent running other programs • can be broken up into system time, and user time • Our focus: user CPU time • time spent executing the lines of code that are "in" our program • CPU performance • System CPU time • Time spent in the OS performing tasks on behalf of the program • System performance

Clock Cycles time • Instead of reporting execution time in seconds, we often use cycles • Clock “ticks” indicate when to start activities: • cycle time = time between ticks = seconds per cycle • clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)A 200 Mhz. clock has a cycle time

CPU execution time for a program = CPU clock cycles for a program * clock cycle time CPU execution time for a program = CPU clock cycles for a program / clock rate How to Improve Performance • So, to improve performance (everything else being equal) you can either • Reduce the # of required cycles for a program, or • Reduce the clock cycle time or, • Increase the clock rate.

Example Answer • Our favorite program runs in 10 seconds on computer A, which has a 4GHz clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should we tell the designer to target?

Comparison of two different implementations the number of instructions of class i executed • The execution time must depend on the number of instructions in a program • CPU clock cycles for a program = Instructions for a program * Average clock cycles per instruction • CPI (clock cycles per instruction) • The average number of clock cycles each instruction takes to execute. • CPU time for a program = Instruction count * CPI * clock cycle time • CPU time for a program = (Instruction count * CPI ) / clock rate • If the CPU clock cycles are consist of different types of instruction and using individual clock cycle count.

CPI Example • Suppose we have two implementations of the same instruction set architecture (ISA). For some program,Computer A has a clock cycle time of 250 ps. and a CPI of 2.0 Computer B has a clock cycle time of 500 ps. and a CPI of 1.2 Which computer is faster for this program, and by how much? • Answer: Machine A is 1.2 times faster than machine B

Could assume that # of cycles = # of instructions? How many cycles are required for a program? 1st instruction 2nd instruction 3rd instruction ... 4th 5th 6th time This assumption is incorrect,different instructions take different amounts of time on different machines.Why?hint: remember that these are machine instructions, not lines of C code

Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Important point: changing the cycle time often changes the number of cycles required for various instructions Different numbers of cycles for different instructions time

A given program will require some number of instructions (machine instructions) some number of cycles some number of seconds We have a vocabulary that relates these quantities: cycle time (seconds per cycle) clock rate (cycles per second) CPI (cycles per instruction) a floating point intensive application might have a higher CPI MIPS (millions of instructions per second)this would be higher for a program using simple instructions Now that we understand cycles

Basic Components of Performance • Performance is determined by execution time • # of cycles to execute program • # of instructions in program • # of cycles per second • average # of cycles per instruction • average # of instructions per second

Example：# of Instructions • A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). • The first code sequence has 5 instructions: • 2 of A, 1 of B, and 2 of C • The second sequence has 6 instructions: • 4 of A, 1 of B, and 1 of C. • Which code sequence executes the most instructions? Which code sequence will be faster? What is the CPI for each sequence? Sequence 1 execute fewer instruction

# of Instructions Example (cont.) • Answer: So code sequence 2 is faster 只使用一個factor來衡量效能是危險的

Using MIPS as a performance metric • MIPS (Million instructions per second) • MIPS = Instruction count / Execution time*106 • Faster machines have a higher MIPS rating. • There are three problems with using MIPS as a measure for comparing machines: • MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions. • We cannot compare computers with different instruction sets using MIPS, since the instruction counts will certainly differ. • MIPS varies between programs on the same computer • A machine cannot have a single MIPS rating for all program. • MIPS can inversely with performance.

MIPS example (Ed. 2, page 78) • Two different compilers are being tested for a 500 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.The first compiler's code uses 5 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions. • Which sequence will be faster according to MIPS? • Which sequence will be faster according to execution time?

MIPS example (Ed. 2, page 78, cont.) Compiler 1 faster than compiler 2 Compiler 2 has higher MIPS than Compiler 1

Performance best determined by running a real application Use programs typical of expected workload The set of programs run would form a workload. Compare the execution time of the workload on the two machines. Using a set of benchmarks to evaluate machine. benchmark：Programs specifically chosen to measure performance. The best type of programs to use for benchmarks are real application e.g., compilers/editors, scientific applications, graphics, etc. Choosing programs to evaluate performance

Choosing programs to evaluate performance( cont.) • Small benchmarks • nice for architects and designers • easy to standardize • can be abused • Different classes and applications of computers will require different types of benchmarks. • For desktop computer, the most common benchmarks are either measures of CPU performance or benchmarks focusing on a specific task. • SPEC (System Performance Evaluation Cooperative) • companies have agreed on a set of real program and inputs • can still be abused (Intel’s “other” bug) • valuable indicator of performance (and compiler technology) • For embedded computer • EEMBC

SPEC ‘89 • Compiler “enhancements” and performance • SPEC89 performance ratios for the IBM Powerstation 550 using two different compilers. (compare with VAX-11/780)

The SPEC’95 Benchmarks

SPEC ‘95 Does doubling the clock rate double the performance? 造成clock rate増快1倍，但performance只增加1.3~1.7倍的原因在於記憶體的存取速度未提昇

SPEC CPU2000 (the latest release) 14個浮點數程式 12個整數程式

SPEC 2000 Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance?

Performance Report • The guiding principle in reporting performance measurements should be “reproducibility” • We should list everything another experimenter would need to duplicate the results. • The list must include • The version of the OS • Compiler • The input • Machine configuration • Ex.:(Fig.4.3)

Example of Performance Report

Example-Comparing and Summarizing Performance • How should a summary be computed?Execution times of two programs on two different machines. A is 10 times faster than B for program 1.B is 10 times faster than A for program 2.Total Execution Time (the simplest approach to summarizing relative performance) B is 9.1 times faster than A for programs 1 and 2 together.

Example-Comparing and Summarizing Performance (cont.) • Arithmetic Mean • The average of the execution times • Weighted Arithmetic Mean • Assign a weighting factor wi to each program to indicate the frequency of the program in that workload.

Amdahl's Law Execution Time After Improvement =Execution Time Unaffected +( Execution Time Affected / Amount of Improvement ) • Example:Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times and 5 times faster?“ • Solution:(1) run 4 times faster (2) run 5 times faster • Principle: Make the common case fast

Example • Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions? • Solution:Execution Time After Improvement = Execution Time Unaffected +( Execution Time Affected / Amount of Improvement )Execution Time After Improvement =5+5/5=6 speedup=10/6=1.667

Remember • Performance is specific to a particular program/s • Total execution time is a consistent summary of performance • For a given architecture performance increases come from three sources: • increases in clock rate (without adverse CPI affects) • improvements in processor organization that lower CPI • compiler enhancements that lower CPI and/or instruction count • Pitfall: (陷阱) • expecting improvement in one aspect of a machine’s performance to affect the total performance • Using MIPS as a performance metric • Using the arithmetic mean of normalized execution times to predict performance.

Example: The geometric mean is independent of which data series use for normalization because of

Fallacy(謬誤) • Hardware-independent metrics predict performance. • Use code size as a measure of speed. • Synthetic benchmarks predict performance. • The synthetic benchmarks are not real applications because these programs compute anything a user would not interesting. • Compiler and hardware optimization can inflate performance of these benchmarks. • The geometric mean of execution time ratios is proportional to total execution time. • 在圖2.10中，若執行程式一100次，程式二1次則Machine A的總執行時間=100+1000=1100Machine B的總執行時間=1000+100=1100所以Machine A和Machine B有相同的performance?

Chapter 4 Assessing and Understanding Performance