CS 1104 Help Session II Performance Measures

CS 1104 Help Session II Performance Measures Colin Tan ctank@comp.nus.edu.sg http://www.comp.nus.edu.sg/~ctank

Basic Concepts: Instruction Execution Cycles • Processors execute instructions in several steps: • Instruction Fetch (IF): • Instructions are fetched from memory and placed into an Instruction Register (IR). • Instruction Decode (ID): • The opcode portion of the instruction is sent to a decoder, which generates control signals. • Control signals tell the Arithmetic Logic Unit (ALU) what to do with the data: add, rotate the bits, etc. • The operand portion may be sent to the register file to fetch register data, or sent directly to the ALU to be operated on (for constants). • Operand Fetch (OF): • Data required for the operation is taken from memory or the register file and sent to the ALU inputs.

Basic Concepts: Instruction Execution Cycles • Execution steps (cont’d): • Instruction Execute (IE): • The ALU computes the results based on the data fetched and the control signals generated. • Writeback (WB): • The results are written back to the destination register or memory location.

Basic Concepts: The Need for Synchronization • How will the processor know: • When the instruction has been fetched and placed into IR? • If the instruction is not yet in IR, neither the opcodes nor operands will make sense! • Decoding nonsense and fetching invalid data leads to incorrect execution. • When the instruction has been decoded? • If the instructions have not been decoded completely, the ALU is receiving invalid control signals. • When the operands have been fetched? • If the operands have not yet been fetched from the registers or from memory, then the inputs to the ALU are invalid, and the ALU will compute invalid results!

Basic Concepts: Clock Cycles • The other steps (IE, OF, WB) also need to know when to proceed in order to work correctly. • To coordinate each step, the processor relies on a series of “ticks” called clock cycles (CC). • CC1: Perform IF • By the end of CC1, the instruction is definitely sitting in IR, and the decoder can proceed to interpret the opcode. • CC2: Perform ID • Decode the instruction in IR, and generate all the control signals by the end of this clock cycle. • CC3: Perform OF • Fetch the data from registers or from memory. Must get all the data ready and presented to the ALU by the end of this clock cycle.

Basic Concepts: Clock Cycles • CC4: Perform IE: • The ALU must operate (e.g. add, subtract, etc.) on the inputs and produce the results by the end of this clock cycle. • CC5: Perform WB: • The outputs of the ALU must be written back to register or memory by the end of this clock cycle. • CC6: Start IF of next instruction • If every step obeys the constraints laid out here, then each step will know for sure that the results of the previous step are already available before starting, and execution will proceed correctly.
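The CC1 to CC6 schedule above can be sketched in a few lines of Python. This is only an illustration of the slides' simplified model: it assumes every instruction spends exactly one clock cycle in each of the five stages and that instructions run strictly one after another. The function name `stage_at` is made up for this sketch.

```python
# Stage order as listed in the slides: IF, ID, OF, IE, WB.
STAGES = ["IF", "ID", "OF", "IE", "WB"]

def stage_at(clock_cycle):
    """Return (instruction_index, stage) for a 1-based clock cycle.

    Assumption (not a real processor model): one cycle per stage,
    no overlap between consecutive instructions.
    """
    if clock_cycle < 1:
        raise ValueError("clock cycles are numbered from 1")
    instruction, step = divmod(clock_cycle - 1, len(STAGES))
    return instruction, STAGES[step]

# Walk through CC1..CC6: five stages of instruction 0, then IF of instruction 1.
for cc in range(1, 7):
    instruction, stage = stage_at(cc)
    print(f"CC{cc}: instruction {instruction} performs {stage}")
```

Under these assumptions each instruction costs exactly 5 clock cycles, which is the figure the later slides start from.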

Basic Concepts: Instruction Classes • A typical processor supports many instructions. • Typically instructions are divided into groups: • Arithmetic Instructions: add, sub, mul, div, mod • Bitwise Instructions: rol, ror, shl, shr, and, or, not • Floating Point Instructions: fadd, fsub, fmul, fdiv • Load/Store Instructions: lw, sw • Etc.

Basic Concepts: Class Cycles Per Instruction • We have seen how instructions take several clock cycles to execute (in our example, each instruction takes 5 clock cycles). • In practice, different instructions take different numbers of clock cycles to execute, depending on how complex the instruction is, or how slow each of its stages is. • E.g. Floating Point Adds: More complex than integer adds, and require more clock cycles. • lw, sw access memory, and fetching an operand from memory takes more clock cycles than fetching it from registers.

Basic Concepts: Class Cycles Per Instruction • The Class Cycles Per Instruction (class CPI) is the average number of clock cycles required by instructions within a particular class. E.g.:
  # of cycles for ADD: 2 cycles
  # of cycles for SUB: 2 cycles
  # of cycles for MUL: 4 cycles
  # of cycles for DIV: 8 cycles
  ---------------
  Total: 16 cycles
  Average: 16/4 = 4 CPI
• So the class CPI for this class of instructions is 4.
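As a quick check of the averaging above, here is a minimal Python sketch. It follows the slide in taking a plain unweighted average over the instructions in the class; the helper name `class_cpi` is hypothetical.

```python
def class_cpi(cycles_by_instruction):
    """Unweighted average of the per-instruction cycle counts in one class."""
    if not cycles_by_instruction:
        raise ValueError("a class needs at least one instruction")
    return sum(cycles_by_instruction.values()) / len(cycles_by_instruction)

# Cycle counts taken from the slide's arithmetic-class example.
arithmetic = {"ADD": 2, "SUB": 2, "MUL": 4, "DIV": 8}
print(class_cpi(arithmetic))
```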

Basic Concepts: Instruction Frequency • A program (e.g. Microsoft Word) is made up of many instructions coming from each of the different classes of instructions. • The share of the program’s instructions that falls in each class is called the “instruction frequency” of that class. • This is often expressed as a percentage or as a fraction.

Basic Concepts: Overall Cycles Per Instruction • The class instruction frequency and the class CPI can be used to compute the overall Cycles Per Instruction, or overall CPI, of a particular program. • Each type of instruction would take a different number of clock cycles. • A program consists of several different types of instructions. • The overall CPI is the average number of cycles required to execute each instruction, across all types of instructions.

Calculating Overall CPI • Find the overall CPI of a program running on a processor with the class CPIs and instruction frequencies shown here:
  Class   CPI   Instruction Frequency
  A       3     0.4
  B       2     0.25
  C       4     0.15
  D       5     0.20

Calculating Overall CPI • Let’s assume that the total number of instructions is IC. Then there are 0.4IC instructions in class A, 0.25IC in class B, 0.15IC in class C and 0.2IC in class D. • Total number of clock cycles used by instructions in class A is 0.4IC x 3, class B is 0.25IC x 2, class C is 0.15IC x 4, class D is 0.2IC x 5 • Hence total number of clock cycles used by this program is 0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5 • Number of instructions is IC. Hence average number of cycles per instruction (average CPI) is (0.4IC x 3 + 0.25IC x 2 + 0.15IC x 4 + 0.2IC x 5)/1.0IC • IC cancels out, leaving 0.4 x 3 + 0.25 x 2 + 0.15 x 4 + 0.2 x 5 = 1.2 + 0.5 + 0.6 + 1.0, the famous “Overall CPI”. Final answer is 3.3.
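The derivation above can be replayed numerically. The sketch below computes total cycles and total instructions for a concrete IC and divides them, so the reader can see that the result does not depend on IC. The function name and the two sample IC values are illustrative assumptions.

```python
def overall_cpi_from_counts(instruction_count, classes):
    """classes: list of (class_cpi, frequency) pairs.

    Follows the slide's definition: total clock cycles / total instructions.
    """
    total_cycles = sum(freq * instruction_count * cpi for cpi, freq in classes)
    total_instructions = sum(freq * instruction_count for _, freq in classes)
    return total_cycles / total_instructions

# (class CPI, instruction frequency) for classes A, B, C, D from the table.
table = [(3, 0.4), (2, 0.25), (4, 0.15), (5, 0.20)]

# Two arbitrary instruction counts give the same overall CPI: IC cancels out.
for ic in (1_000, 15_000_000):
    print(ic, round(overall_cpi_from_counts(ic, table), 4))
```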

Calculating Overall CPI • Suppose the previous program was re-compiled with a different compiler, and the CPI/instruction frequency table is modified to the one below:
  Class   CPI   Instruction Frequency
  A       3     0.2
  B       2     0.35
  C       4     0.15
  D       5     0.20

Calculating Overall CPI • We take a short-cut and use the “famous formula”: Overall CPI = 3 x 0.2 + 2 x 0.35 + 4 x 0.15 + 5 x 0.2 = 2.9 • If we left the answer like this, it would be WRONG! • Reason: The instruction frequencies do not add up to 1.0! • Returning to the definitions, let’s compute the total number of clock cycles taken by this program: • Total Clock Cycles = 0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5 • Total number of instructions = 0.2IC + 0.35IC + 0.15IC + 0.2IC = 0.9IC

Calculating Overall CPI • Finding the overall CPI: (0.2IC x 3 + 0.35IC x 2 + 0.15IC x 4 + 0.2IC x 5) / (0.9IC) Canceling out IC, we get: (0.2 x 3 + 0.35 x 2 + 0.15 x 4 + 0.2 x 5) / 0.9 = 2.9 / 0.9 Final answer is 3.22 • Moral: Always divide the weighted sum you get by the total frequency. In the previous example, the total frequency was 1.0, and we didn’t have a problem. Here this is not the case.
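The moral above translates directly into code: compute the weighted sum, then divide by the total frequency. This is a minimal sketch; the function name `overall_cpi` is an assumption for illustration.

```python
def overall_cpi(classes):
    """classes: list of (class_cpi, frequency) pairs.

    The frequencies do not have to sum to 1.0; the weighted sum is
    normalized by whatever they do sum to.
    """
    weighted_sum = sum(cpi * freq for cpi, freq in classes)
    total_frequency = sum(freq for _, freq in classes)
    if total_frequency <= 0:
        raise ValueError("total frequency must be positive")
    return weighted_sum / total_frequency

# Re-compiled program: frequencies sum to 0.9, not 1.0.
recompiled = [(3, 0.2), (2, 0.35), (4, 0.15), (5, 0.20)]
print(round(overall_cpi(recompiled), 2))
```

When the frequencies do sum to 1.0 the division changes nothing, so the same function covers both cases.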

Calculating Peak CPI • The peak overall CPI is obtained when every instruction in a program is from the fastest class. Using our previous example, we will have peak performance if our instruction frequencies are as shown. • This will give us a peak CPI of 0.0 x 3 + 1.0 x 2 + 0.0 x 4 + 0.0 x 5 = 2.0
  Class   CPI   Instruction Frequency
  A       3     0.0
  B       2     1.0
  C       4     0.0
  D       5     0.0

Calculating Peak CPI • In general, the peak overall CPI will be the CPI of the fastest class. • It is not possible to modify the class CPIs without modifying the hardware organization itself. • However, by hacking the hardware, the peak class CPI can be as low as 0!
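Since the peak overall CPI is just the CPI of the fastest class, it can be read off as a minimum. A small sketch, with a hypothetical helper name and the class CPIs from the running example:

```python
def peak_cpi(class_cpis):
    """Best-case overall CPI: every instruction comes from the fastest class."""
    return min(class_cpis.values())

class_cpis = {"A": 3, "B": 2, "C": 4, "D": 5}
fastest = min(class_cpis, key=class_cpis.get)
print(fastest, peak_cpi(class_cpis))
```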

Basic Concepts: Clock Rate • We have seen how the processor coordinates the various instruction execution stages using a common “tick”, or clock cycle. • The number of ticks per second is called the “clock rate”, or “clock frequency”. • Obviously the higher the clock rate, the faster each stage has to complete, and therefore the faster the processor completes an instruction: • This implies that a higher clock rate will give you faster processors. • However there is a limit to how fast each stage can do something. • Cranking the clock rate beyond the capabilities of the hardware will cause execution to fail.

Basic Concepts: Clock Rate • To overcome speed limitations, processor designers often make compromises in the designs for each stage: • The compromises allow each stage to work faster than before, allowing you to crank up the clock rate faster than ever. • Such compromises give you faster execution rates under ideal circumstances, but may give you worse performance under normal circumstances. • This is because the compromises result in higher class CPIs. • Hence a faster clock rate may actually result in poorer performance. • This translates to longer execution times for a program. • The length of a clock cycle measured in seconds is called the clock cycle time or clock period. It is equal to the reciprocal of the frequency (i.e. cycle time = 1/(clock_rate)).
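The reciprocal relationship between clock rate and cycle time is easy to check numerically. In this sketch the function name is hypothetical and 500 MHz is used only as a sample value.

```python
def cycle_time_seconds(clock_rate_hz):
    """Clock period in seconds: the reciprocal of the clock rate in Hz."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz

# Sample value: a 500 MHz clock, reported in nanoseconds for readability.
period = cycle_time_seconds(500e6)
print(f"{period * 1e9:.2f} ns per clock cycle")
```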

Execution Time • The execution time T of a program is the amount of time a program takes to run to completion. • This will depend on the overall CPI, the total number of instructions executed (IC), and the clock rate (R) of the processor. • IC x CPI will give us the total number of clock cycles used to execute all the instructions in the program • (IC x CPI) / R will give us the execution time. • If my program takes 10,000 cycles, and if my clock produces 100,000 cycles per second, then my program would take 10,000/100,000 = 0.1 seconds to execute. • Hence T = (IC x CPI)/R

Execution Time • From the previous example (overall CPI = 0.4 x 3 + 0.25 x 2 + 0.15 x 4 + 0.2 x 5 = 3.3), suppose the program has a total of 15,000,000 instructions, and that the clock rate of the processor is 500 MHz. What is the total execution time of the program? • T = [(15 x 10^6) x 3.3] / (500 x 10^6) = 0.099 seconds.
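The formula T = (IC x CPI)/R can be written as a one-line function. The sketch below first replays the 10,000-cycle sanity check from the earlier slide, then evaluates the 15,000,000-instruction program, with the overall CPI computed from the first class table rather than typed in. The function and variable names are illustrative.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = (IC x CPI) / R, in seconds."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return instruction_count * cpi / clock_rate_hz

# Sanity check from the slides: 10,000 cycles on a 100,000 cycles/second clock.
# IC x CPI = 10,000 is modelled here as 10,000 instructions with CPI 1.
print(execution_time(10_000, 1, 100_000))

# (class CPI, frequency) pairs of the first table; frequencies sum to 1.0.
table = [(3, 0.4), (2, 0.25), (4, 0.15), (5, 0.20)]
cpi = sum(c * f for c, f in table)

t = execution_time(15_000_000, cpi, 500e6)
print(f"{t:.3f} seconds")
```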

Execution Time Issues • The execution time computed is unique only to this program. Other programs will have different execution times. • Execution time is affected by: • Hardware Organization: This affects individual class CPIs, and hence the overall CPI. • E.g. ADD instructions implemented using carry-propagate adders will have much higher CPIs than those implemented using carry-generate adders. • Compiler Technology: This affects the individual class frequencies • A good compiler will select more instructions from faster classes to accomplish the same objective.

Execution Time Issues • Execution Time is affected by (cont’d) • The program being run • Different programs will have different instruction distributions (i.e. different instruction class frequencies), resulting in different overall CPIs. • Different programs will have different instruction counts IC • Instruction Set Architecture • A richer ISA will give the compiler more choices of instructions to use to minimize IC, CPI or both. • All this will give you different execution time T.

Benchmarking • Benchmarks allow us to determine the performance of a system, usually relative to another system. • A common benchmark that we use is execution time. We take the same program and run it on two machines, and compare their execution times. • We cannot use overall CPI or clock frequencies alone as a basis for comparisons: • High clock frequency processors may make compromises that dramatically increase individual class CPIs, and hence overall CPI. • Instructions may have very low CPIs because clock cycle times are very long. • Long clock cycle times mean that the processor may be able to accomplish >1 step in 1 clock cycle, leading to lower cycle requirements. • Unfortunately due to low clock rates, performance may be poor.
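To see why neither clock rate nor CPI is a safe basis for comparison on its own, consider two made-up machines running the same program. All numbers below are hypothetical and chosen only to illustrate the point; they do not come from the slides.

```python
def seconds_per_instruction(cpi, clock_rate_hz):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi / clock_rate_hz

# Hypothetical machines (illustrative values only):
# X has the higher clock rate, Y has the lower CPI.
machine_x = {"clock_rate_hz": 1e9, "cpi": 4.0}
machine_y = {"clock_rate_hz": 600e6, "cpi": 2.0}

t_x = seconds_per_instruction(machine_x["cpi"], machine_x["clock_rate_hz"])
t_y = seconds_per_instruction(machine_y["cpi"], machine_y["clock_rate_hz"])

# With the same instruction count on both machines, the lower time per
# instruction wins, regardless of which machine has the faster clock.
faster = "X" if t_x < t_y else "Y"
print(f"X: {t_x * 1e9:.2f} ns/instr, Y: {t_y * 1e9:.2f} ns/instr, faster: {faster}")
```

In this made-up case the machine with the slower clock finishes first, which is why execution time is the quantity to compare.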

Benchmarking: Execution Time Example • The processor in the previous example is optimized, and the new class CPIs are shown below. Clock frequencies and instruction counts remain the same. How much faster is the new machine than the old?
  Class   CPI   Instruction Frequency
  A       2     0.4
  B       1     0.25
  C       5     0.15
  D       4     0.20

Benchmarking: Execution Time Example • Overall CPI = 2 x 0.4 + 1 x 0.25 + 5 x 0.15 + 4 x 0.2 = 2.6 • Execution Time = [2.6 x (15 x 10^6)] / (500 x 10^6) = 0.078 s • Previous Execution Time (overall CPI 3 x 0.4 + 2 x 0.25 + 4 x 0.15 + 5 x 0.2 = 3.3) = [3.3 x (15 x 10^6)] / (500 x 10^6) = 0.099 s • We can measure the speed-up by taking the old execution time and dividing it by the new: • Speedup = 0.099 / 0.078 = 1.27 • This figure of 1.27 means that the new design is 1.27 times faster than the old one.
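The speedup calculation can be reproduced from the two class tables. The sketch below also shows that, because IC and the clock rate are unchanged, the speedup reduces to the ratio of the two overall CPIs. Helper and variable names are assumptions for illustration.

```python
def weighted_cpi(table):
    """Overall CPI for (class_cpi, frequency) pairs whose frequencies sum to 1.0."""
    return sum(cpi * freq for cpi, freq in table)

old_table = [(3, 0.4), (2, 0.25), (4, 0.15), (5, 0.20)]
new_table = [(2, 0.4), (1, 0.25), (5, 0.15), (4, 0.20)]

IC = 15_000_000      # instruction count, same on both machines
R = 500e6            # clock rate in Hz, same on both machines

t_old = IC * weighted_cpi(old_table) / R
t_new = IC * weighted_cpi(new_table) / R
speedup = t_old / t_new

# IC and R cancel, so this is the same as old CPI / new CPI.
cpi_ratio = weighted_cpi(old_table) / weighted_cpi(new_table)

print(f"old {t_old:.3f} s, new {t_new:.3f} s, speedup {speedup:.2f}")
```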

Benchmarking: Instruction Throughput • Measuring how fast a machine can execute a particular program is just one way of determining performance. • Another good measure is instruction throughput, or how many instructions a processor can execute per second. • The most common measure for throughput is MIPS, which is short for “Millions of Instructions Per Second”. • This is not to be confused with the MIPS in “MIPS R2000”, where MIPS is actually a company’s name. • So we have two meanings for MIPS: • Millions of Instructions Per Second • The company that makes the R2000.

Benchmarking: MIPS Example • Find the MIPS rating for both machines used in these notes: • CPI for first machine: 3 x 0.4 + 2 x 0.25 + 4 x 0.15 + 5 x 0.2 = 3.3 • This means that every instruction requires, on average, 3.3 cycles. • The clock rate is 500 MHz, so each second there are 500 x 10^6 cycles. • Therefore you can execute 500 x 10^6 / 3.3 = 151.5 x 10^6 instructions per second, or 151.5 MIPS. • CPI for second machine: 2.6 • Clock rate remains the same at 500 x 10^6 Hz. • So throughput is 500 x 10^6 / 2.6 = 192.3 MIPS
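The MIPS rating follows from the clock rate and the overall CPI: cycles per second divided by cycles per instruction, scaled to millions. A minimal sketch with a hypothetical function name, shown here for the second machine (CPI 2.6 at 500 MHz):

```python
def mips_rating(clock_rate_hz, cpi):
    """Millions of instructions per second = R / (CPI x 10^6)."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_rate_hz / (cpi * 1e6)

# Second machine from the notes: CPI 2.6 at 500 MHz.
print(round(mips_rating(500e6, 2.6), 1))
```

Any other overall CPI can be passed in the same way to rate a different machine or program.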

Types of Benchmarks • Micro-Benchmarks • These are very small benchmarks aimed primarily at gauging the peak performance of a processor. • Kernel Benchmarks • These are very small benchmarks designed to measure processor performance (e.g. benchmarks to measure MIPS ratings). • Full Applications Benchmarks • These use actual applications (or simulations of actual applications) to measure the performance of CPU, memory and IO systems. Gives a good idea of how system will perform running such applications. • Target Workload • These use the actual programs that are going to be run on the system to measure performance.

Amdahl’s Law • Amdahl’s Law basically states that: • Execution time depends on a number of factors, such as the speeds of various classes of instructions. • If you improved the performance of one factor by X times, then the overall improvements in execution time will always be less than X. • If we were to improve the execution time of a particular class of instructions, then the new execution time is given by: New Ex Time = Ex Time of unaffected classes + (Ex Time of affected class / speedup)

Amdahl’s Law • Suppose a program runs in 100 seconds on a machine, and multiplies account for 80 seconds of this time. What improvement in execution time will we have if we (i) improved multiplies by 5 times, (ii) improved the other instructions by 10 times? • i) New ex time = unaffected time + affected time / speedup = 20 + 80/5 = 36 seconds • Improvement = 100/36 = 2.78 times faster. • ii) New ex time = 80 + 20/10 = 82 seconds • Improvement = 100 / 82 = 1.22 times faster • Moral: Always improve the common case to get the best increase in performance! • Here the common case is the multiply (80%). Improving multiplies by 5 times gives far better gains than improving the other instructions (20%) by 10 times!
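Both cases of the worked example can be checked with a direct transcription of the formula New Ex Time = unaffected time + (affected time / speedup). The function names below are made up for this sketch.

```python
def amdahl_new_time(total_time, affected_time, speedup):
    """New execution time after speeding up only the affected portion."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0 <= affected_time <= total_time:
        raise ValueError("affected time must lie within the total time")
    unaffected_time = total_time - affected_time
    return unaffected_time + affected_time / speedup

def overall_improvement(total_time, affected_time, speedup):
    """Old execution time divided by new execution time."""
    return total_time / amdahl_new_time(total_time, affected_time, speedup)

# (i) multiplies (80 s of 100 s) made 5 times faster
print(amdahl_new_time(100, 80, 5), round(overall_improvement(100, 80, 5), 2))
# (ii) the other instructions (20 s of 100 s) made 10 times faster
print(amdahl_new_time(100, 20, 10), round(overall_improvement(100, 20, 10), 2))
```

Trying larger and larger speedups for case (ii) shows the overall improvement never reaching 1.25, since the 80 seconds of multiplies are untouched.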

Summary • We looked at how instructions take several steps to execute, and each step is synchronized with the tick of a clock: a clock cycle. • Execution time is the only reliable way to tell which machine is faster. • Machine performance may also be measured using instruction throughput • How many instructions can this machine execute in 1 second? • Amdahl’s law allows us to see how much improvement we need to make to a class of instructions in order to achieve a desired improvement in overall performance.
