Qualifying Exam Review: CSCE 513 Computer Architecture (September 17, 2013) • Topics • Old questions • Major equations • Other questions • Comments on what was missing
Syllabus for Architecture Exam • Architecture (CSCE 513): Computer Architecture: A Quantitative Approach, 5th ed., Hennessy and Patterson, Morgan Kaufmann. Chapters 1-5, 8.1-8.5, Appendices A, B, C • Fundamentals of computer design • Instruction sets • Instruction-level parallelism • Loop unrolling and static techniques • Dynamic techniques: Tomasulo's algorithm, reorder buffer • Memory hierarchy design • Thread-level parallelism • Warehouse-scale computers • For further information see Dr. Matthews's CSCE 513 website (http://www.cse.sc.edu/~matthews/csce513.html) • http://www.cse.sc.edu/~fenner/qexam/index.html
Amdahl's Law with Fractional Use Factor • Example: Suppose we are considering an enhancement to a web server. The enhanced CPU is 10 times faster on computation but the same speed on I/O. Suppose also that 60% of the time is spent waiting on I/O, so only the remaining 40% of the time can benefit from the enhancement. Ref. CAAQA
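Working this example through Amdahl's Law, with fraction enhanced 0.4 and a speedup of 10 on that fraction:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$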
Appendix A – Instruction Set Architecture (ISA) • RISC vs. CISC • Uniform instruction length and format • Addressing modes • Operand selection • Instruction frequency • Integer programs: loads 26%, stores 10% • Floating point: loads 12%, stores 2%
Appendix B – Basic Memory Hierarchy Design • Caches are designed to take advantage of locality • Spatial locality • Temporal locality • Organization: address = tag / set-index / block-offset • b = log2(B), where B is the block size in bytes and b is the width of the block-offset field • L = CacheSize / B, where L is the number of lines in the cache • S = L / Associativity, where S is the number of sets • s = log2(S), where s is the width of the set-index field • Four organizational decisions • Block placement • Block identification • Block replacement • How do we handle writes?
Cache Example • Physical addresses are 13 bits wide. • The cache is 2-way set associative, with a 4-byte line size and 16 total lines. • Decompose physical address 0x0E34 into tag, set index, and block offset (a worked sketch follows).
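A minimal sketch of the field-width arithmetic for this example (the variable names are our own):

```python
# Decompose a physical address into tag / set-index / block-offset fields.
ADDR_BITS = 13        # physical address width
BLOCK_SIZE = 4        # bytes per line
NUM_LINES = 16        # total lines in the cache
ASSOC = 2             # 2-way set associative

b = BLOCK_SIZE.bit_length() - 1        # block-offset bits: log2(4) = 2
num_sets = NUM_LINES // ASSOC          # 16 / 2 = 8 sets
s = num_sets.bit_length() - 1          # set-index bits:    log2(8) = 3
t = ADDR_BITS - s - b                  # tag bits:          13-3-2  = 8

addr = 0x0E34
offset = addr & (BLOCK_SIZE - 1)       # low 2 bits   -> 0
index = (addr >> b) & (num_sets - 1)   # next 3 bits  -> set 5
tag = addr >> (b + s)                  # top 8 bits   -> 0x71

print(f"tag=0x{tag:02X}, set={index}, offset={offset}")  # tag=0x71, set=5, offset=0
```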
6 Basic Cache Optimizations • Three categories: • Reducing miss rate: • Larger block size • Larger cache size • Higher associativity • Reducing miss penalty: • Multilevel caches • Giving reads priority over writes • Reducing hit time: • Overlapping TLB address translation with cache access (avoiding translation during cache indexing)
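All six optimizations target one term of the average memory access time equation:

$$\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$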
2.2 - 10 Advanced Cache Optimizations • Five categories: • Reducing hit time: • Small and simple first-level caches • Way prediction • Both techniques also generally decrease power consumption. • Increasing cache bandwidth: • Pipelined caches • Multibanked caches • Nonblocking caches • These techniques have varying impacts on power consumption. • Reducing the miss penalty: • Critical word first • Merging write buffers • These optimizations have little impact on power. • Reducing the miss rate: • Compiler optimizations • Reducing the miss penalty or miss rate via parallelism: • Hardware prefetching • Compiler prefetching
Data Hazards - Forwarding • Forwarding options: • None • Through the register file (write in the first half of the cycle, read in the second half) • Full forwarding • Fig. C.27 – forwarding paths • Fig. C.26 – pipeline timing • Load/use hazards: even with full forwarding, an instruction that uses a loaded value in the cycle immediately after the load must stall one cycle, since the data is not available until the end of the MEM stage
Reorder Buffer (ROB) • Out-of-order execution • In-order commit: results update architectural state in program order, which supports precise exceptions and speculation recovery
Static Techniques – Loop Unrolling, VLIW • Original loop • Stalls • Unrolled (see the sketch below) • Scheduled
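A schematic illustration of 4x unrolling, shown in Python for brevity; on the exam this is done on MIPS assembly, where unrolling removes branch and loop-overhead instructions and exposes independent operations for scheduling:

```python
# Original loop: one element per iteration, loop overhead every element.
def saxpy(x, y, a, n):
    for i in range(n):
        y[i] = a * x[i] + y[i]

# Unrolled by 4: loop overhead amortized over four independent bodies,
# which a scheduler can reorder to hide operation latencies.
def saxpy_unrolled(x, y, a, n):
    i = 0
    while i + 4 <= n:
        y[i]     = a * x[i]     + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        y[i + 2] = a * x[i + 2] + y[i + 2]
        y[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:              # cleanup loop for n not divisible by 4
        y[i] = a * x[i] + y[i]
        i += 1
```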
Thread-Level Parallelism • Cache coherence (snooping and directory-based protocols)
Question 2, Spring 2010 • You are the lead architect of Intel's Haswell microarchitecture team. You're currently facing several design decisions involving the memory system for the upcoming line of processors. The processor will run at 3 GHz and have an average benchmark CPI of 0.2 excluding memory accesses. The only instructions that read or write data from memory are loads (20% of all instructions) and stores (5% of all instructions). The base memory system has the following characteristics:
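The general relation this style of question exercises, tying memory stalls into overall CPI (here data accesses per instruction are 0.20 + 0.05 = 0.25, and a penalty of t ns costs 3t cycles at 3 GHz):

$$\text{CPI}_{\text{total}} = \text{CPI}_{\text{base}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty (cycles)}$$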
Fall 2009 • 1. It's 1997 and you're a graduate student at Stanford named Larry Page. You're trying to build a new Internet search engine, and your strategy is to optimize its performance by ensuring that during a search neither the CPU nor its disk array is idle. • The search database is logically divided into 100 MB contiguous blocks. After the first block is read, the engine reads subsequent blocks while using the CPU to search the previously read block. It takes 100 ms for the CPU to search each block. • You decide to use disks that each rotate at 170 revolutions/sec, have an average seek time of 8 ms, a transfer rate of 50 MB/sec, and a controller overhead of 2 ms. • How many disks do you need in your disk array?
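One way to set the computation up (a sketch, assuming average rotational latency of half a revolution and that block reads striped across separate disks fully overlap):

```python
import math

# Time for one disk to deliver a 100 MB block.
seek_ms = 8.0
rotation_ms = 0.5 * 1000 / 170      # half a revolution, ~2.94 ms
transfer_ms = 100 / 50 * 1000       # 100 MB at 50 MB/s = 2000 ms
controller_ms = 2.0
block_read_ms = seek_ms + rotation_ms + transfer_ms + controller_ms  # ~2012.9 ms

cpu_search_ms = 100.0               # CPU time to search one block

# To keep the CPU busy, a new block must arrive every 100 ms,
# so reads must be spread across enough disks:
disks = math.ceil(block_read_ms / cpu_search_ms)
print(disks)                        # 21 disks
```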
Determine the total branch penalty for a branch-target buffer, assuming the penalty cycles for individual mispredictions shown below. • Make the following assumptions about the prediction accuracy and hit rate: • Prediction accuracy is 90% (for instructions in the buffer) • Hit rate in the buffer is 90% (for branches predicted taken)
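The expected penalty combines the two miss cases; writing the per-case penalty cycles from the table symbolically as $p_{\text{mispredict}}$ and $p_{\text{miss}}$, with buffer hit rate $h = 0.9$, prediction accuracy $a = 0.9$, and $f_{\text{taken}}$ the fraction of taken branches:

$$\text{Branch penalty} = h\,(1 - a)\,p_{\text{mispredict}} + (1 - h)\,f_{\text{taken}}\,p_{\text{miss}}$$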
You are building a system around a processor with in-order execution that runs at 4 GHz and has a CPI of 0.7 excluding memory accesses. The only instructions that read or write data from memory are loads (20% of all instructions) and stores (5% of all instructions). • The memory system for this computer is composed of a split L1 cache that imposes no penalty on hits. Both the I-cache and D-cache are direct-mapped and hold 64 KB each. • The I-cache has a 1% miss rate and 64-byte blocks and the D-cache is write-through with a 7% miss rate and 16-byte blocks. There is a write buffer on the D-cache that eliminates stalls for 95% of all writes.
3. continued • The 12 MB write-back, unified L2 cache has 64-byte blocks and an access time of 15 ns. It is connected to the L1 cache by a 128-bit data bus that runs at 266 MHz and can transfer one 128-bit word per bus cycle. Of all memory references sent to the L2 cache in this system, 80% are satisfied without going to main memory. Also, 50% of all blocks replaced are dirty. • The 128-bit-wide main memory has an access latency of 30 ns, after which any number of bus words may be transferred at the rate of one per cycle on the 128-bit-wide 133 MHz main memory bus.
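The two-level hierarchy in this problem is analyzed with the nested AMAT relation:

$$\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \bigl(\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}\bigr)$$

Note that the given bus parameters imply a 64-byte L2 block moves over the 128-bit, 266 MHz bus in 4 bus cycles (about 15 ns), on top of the 15 ns L2 access time.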
Spring 2009 Architecture 1 • Consider the following three hypothetical processors, which we characterize with a SPEC benchmark: • (a) A simple MIPS two-issue static pipe running at a clock rate of 2 GHz and achieving a pipeline CPI of 0.6. This processor has a cache system that yields 0.0025 misses per instruction on average. • (b) A deeply pipelined version of the two-issue MIPS processor with slightly smaller caches and a 2.5 GHz clock rate. The pipeline CPI of the processor is 0.8, and the smaller caches yield 0.0055 misses per instruction on average. • (c) A speculative superscalar with a 64-entry window that achieves an average issue rate of 3.5. This processor has the smallest caches, which lead to 0.01 misses per instruction, but it hides 25% of the miss penalty on every miss by dynamic scheduling. This processor has a 1.5 GHz clock. • Assume that the main memory time (which sets the miss penalty) is 50 ns. Determine the relative performance of the three processors. Hint: processor CPI can be computed by adding the pipeline CPI and cache CPI.
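A sketch of the arithmetic the hint suggests, taking the pipeline CPI of (c) as 1/issue rate (an assumption consistent with the hint):

```python
MEM_NS = 50.0   # main memory time, sets the miss penalty

def relative_perf(clock_ghz, pipeline_cpi, misses_per_instr, hidden=0.0):
    penalty_cycles = MEM_NS * clock_ghz * (1 - hidden)  # 50 ns in cycles
    cpi = pipeline_cpi + misses_per_instr * penalty_cycles
    return clock_ghz / cpi      # instruction throughput (relative units)

a = relative_perf(2.0, 0.6, 0.0025)            # ~2.35
b = relative_perf(2.5, 0.8, 0.0055)            # ~1.68
c = relative_perf(1.5, 1 / 3.5, 0.01, 0.25)    # ~1.77
print(a / b, a / c)             # processor (a) is the fastest of the three
```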
Spring 2009 Architecture 2 • Suppose we have an application running on a 32-processor multiprocessor, which has an 800 ns time to handle a reference to a remote memory. For this application, assume that all the references except those involving communication hit in the local memory hierarchy. Processors stall on a remote request, and the processor clock rate is 1 GHz. If the base IPC (assuming that all references hit in the cache) is 4, how much faster is the multiprocessor if there is no communication versus if 0.4% of the instructions involve a remote communication reference?
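A worked sketch: at 1 GHz the 800 ns remote reference costs 800 cycles, and the base CPI is 1/IPC = 0.25:

$$\text{CPI}_{\text{comm}} = 0.25 + 0.004 \times 800 = 3.45, \qquad \text{Speedup} = \frac{3.45}{0.25} = 13.8$$

So the application runs 13.8 times faster when there is no communication.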
Spring 2009 Architecture 3 • Three enhancements with the following speedups are proposed for a new architecture: • • Speedup1 = 20 • • Speedup2 = 10 • • Speedup3 = 8 • Only one enhancement is usable at a time. • (a) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10? • (b) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?
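A worked sketch using the multi-enhancement form of Amdahl's Law, with $f_3$ the unknown fraction in part (a):

$$\text{(a)}\quad \frac{1}{10} = (1 - 0.25 - 0.25 - f_3) + \frac{0.25}{20} + \frac{0.25}{10} + \frac{f_3}{8} = 0.5375 - 0.875\,f_3 \;\Rightarrow\; f_3 = 0.5$$

$$\text{(b)}\quad T_{\text{new}} = 0.30 + \frac{0.25}{20} + \frac{0.35}{10} + \frac{0.10}{8} = 0.36, \qquad \frac{0.30}{0.36} \approx 0.83$$

so roughly 83% of the reduced execution time uses no enhancement.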
Fall 2008 Architecture 1 • (Quan, Fall 2008) Your company has just bought a new dual-Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 75% of the resources, and the other only 25% of the resources. • (a) Given that 60% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation? • (b) Given that 95% of the second application is parallelizable, how much speedup would this application observe if run in isolation? • (c) Given that 60% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it, but not the second application? • (d) How much overall system speedup would you achieve if you parallelized both applications, given the information in parts (a) and (b)?
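A sketch of all four parts, assuming "parallelizable" means a factor-of-2 speedup on the parallel fraction (two processors) and that the 75%/25% split applies to total execution time:

```python
def amdahl(parallel_frac, speedup=2):
    return 1 / ((1 - parallel_frac) + parallel_frac / speedup)

a = amdahl(0.60)                   # (a) ~1.43
b = amdahl(0.95)                   # (b) ~1.90

# (c) only app 1 parallelized: new time = 0.75/a + 0.25
c = 1 / (0.75 / a + 0.25)          # ~1.29
# (d) both applications parallelized
d = 1 / (0.75 / a + 0.25 / b)      # ~1.52
print(a, b, c, d)
```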
Fall 2008 Architecture 2 • (Quan, Fall 2008) Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write back) and the code below. There is no forwarding.

    Loop: LW   R3, 0(R0)
          LW   R1, 0(R3)
          ADDI R1, R1, #1
          SUB  R4, R3, R2
          SW   R1, 0(R3)
          BNZ  R4, Loop
Fall 08 Arch. prob. 2 continued • (a) Show the phases of each instruction per clock cycle for one iteration of the above loop. • (b) How many clock cycles per loop iteration are lost to branch overhead? • (c) Assume a static branch predictor predicting always taken in the Decode stage. • Now how many clock cycles are wasted on branch overhead for this segment of code?
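A sketch of part (a) for the first three instructions, assuming the register file is written in the first half of a cycle and read in the second half, and that fetch stalls while decode is occupied:

```
Cycle:           1  2  3  4  5  6  7  8  9  10 11
LW   R3,0(R0)    F  D  X  M  W
LW   R1,0(R3)       F  .  .  D  X  M  W
ADDI R1,R1,#1          F  .  .  .  .  D  X  M  W
```

With no forwarding, LW R1 cannot read R3 until the cycle in which LW R3 writes it back (cycle 5), costing two stall cycles; ADDI stalls the same way waiting on R1.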
Fall 2008 Architecture 3 • (Quan, Fall 2008) Suppose you have a computer with the following characteristics: • 1) the processor pipeline can run an instruction each cycle • 2) the cache can provide data every cycle (i.e. no penalty for cache hits) • 3) the instruction cache miss rate is 1% • 4) the data cache miss rate is 5% • 5) 20% of instructions are memory instructions • 6) the cache miss penalty is 80 cycles.
Fall 08 Arch. prob. 3 continued • Assume that you have decided to purchase a new computer. For the budget allocated, you can either • 1) purchase a machine with a processor and cache that is twice as fast as your current one (memory speed is the same as the old machine, i.e., the cache miss penalty is 160 cycles), or • 2) purchase a machine with a processor and cache that is the same speed as your old machine but in which the cache is twice as large and the cache miss rate for the programs you run will drop by 40% with this larger cache. • Which computer are you best off purchasing? Explain in detail, showing the relative performance of each choice.
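A sketch comparing the two options in time per instruction, using the parameters above (the miss penalty doubles in cycles for option 1 because the clock is twice as fast while memory speed is unchanged):

```python
def cpi(i_miss, d_miss, mem_frac, penalty):
    # Base CPI of 1 plus instruction-fetch and data-access stall cycles.
    return 1 + i_miss * penalty + mem_frac * d_miss * penalty

old = cpi(0.01, 0.05, 0.20, 80)          # 2.6 cycles/instr at cycle time t
opt1 = cpi(0.01, 0.05, 0.20, 160) / 2    # faster clock: 4.2 * t/2 = 2.1 t
opt2 = cpi(0.006, 0.03, 0.20, 80)        # 40% lower miss rates: 1.96 t
print(old, opt1, opt2)                   # option 2 wins: 1.96 t < 2.1 t
```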
Fall 2004 Exam – True/False • Amdahl's Law • Machines with the same ISA can be compared by MIPS ratings • A large variety of memory addressing modes degrades performance due to increases in CPI or IC • The ideal speedup of a pipelined processor equals the number of stages • Structural hazards can be resolved by adding enough hardware, e.g., floating-point adders