
Graduate Computer Architecture I


Presentation Transcript


  1. Graduate Computer Architecture I Lecture 1: Introduction Young Cho

  2. Computer Architecture
  Forces that shape computer architecture:
  • Operating systems
  • Applications
  • Programming languages
  • History
  • Technology

  3. Instruction Set Architecture
  Q: What is an ISA? A: A set of elementary instructions (the slide's cartoon: a robot told "Draw me a basketball player!" must be driven with only those elementary commands; ISA v2.0).
  A "good" ISA:
  • Lasts through many generations (portability)
  • Can be used in many different ways (generality)
  • Provides convenient functionality to higher levels
  • Permits an efficient implementation at lower levels

  4. Instruction Set Architecture
  • Programmer's point of view
    • Organization of programmable storage
    • Data types and structures: encodings and representations
    • Instruction formats
    • Instruction (operation code) set
    • Modes of addressing and accessing data items and instructions
    • Exceptional conditions
  • Logic designer's point of view
    • Capabilities and performance characteristics of principal functional units (e.g., registers, ALU, shifters, logic units, ...)
    • Ways in which these components are interconnected
    • Information flows between components
    • Logic and means by which such information flow is controlled
    • Choreography of functional units to realize the ISA
    • Register Transfer Level (RTL) description

  5. Execution Cycle
  • Instruction Fetch: obtain the instruction from program storage
  • Instruction Decode: determine the required functionality
  • Operand Fetch: locate and obtain the operand data (processor registers, data storage)
  • Execute: compute the result or status in the functional units
  • Result Store: deposit results in storage for later use
  • Next Instruction: determine the next instruction
  (A minimal loop that walks these stages is sketched below.)
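A minimal sketch of the cycle as a C loop, for a hypothetical toy accumulator machine; the 16-bit instruction format and opcodes below are invented for illustration and are not part of the lecture:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical toy machine: each 16-bit instruction is
   [4-bit opcode | 12-bit address]. */
enum { OP_LOAD = 0, OP_ADD = 1, OP_STORE = 2, OP_HALT = 15 };

int main(void) {
    uint16_t mem[4096] = {0};           /* programmable storage          */
    uint16_t pc = 0, acc = 0;           /* program counter, accumulator  */

    /* tiny program: acc = mem[100] + mem[101]; mem[102] = acc; halt */
    mem[0] = (OP_LOAD  << 12) | 100;
    mem[1] = (OP_ADD   << 12) | 101;
    mem[2] = (OP_STORE << 12) | 102;
    mem[3] = (OP_HALT  << 12);
    mem[100] = 7; mem[101] = 35;

    for (;;) {
        uint16_t inst = mem[pc];            /* Instruction Fetch  */
        uint16_t op   = inst >> 12;         /* Instruction Decode */
        uint16_t addr = inst & 0x0FFF;
        uint16_t data = mem[addr];          /* Operand Fetch      */
        if (op == OP_HALT) break;
        if      (op == OP_LOAD)  acc = data;          /* Execute      */
        else if (op == OP_ADD)   acc = acc + data;
        else if (op == OP_STORE) mem[addr] = acc;     /* Result Store */
        pc = pc + 1;                        /* Next Instruction   */
    }
    printf("mem[102] = %d\n", mem[102]);    /* prints 42 */
    return 0;
}
```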

  6. Elements of ISA
  The ISA defines how the current logical state of the machine maps to the next logical state:
  • Set of machine-recognized data types: bytes, words, integers, floating point, strings, ...
  • Operations performed on those data types: add, sub, mul, div, xor, move, ...
  • Programmable storage: registers, PC, memory
  • Methods of identifying and obtaining data referenced by instructions (addressing modes): literal, register, absolute, relative, register + offset, ...
  • Format (encoding) of the instructions: opcode, operand fields, ...

  7. Example: MIPS R3000
  • Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs r1-r31 (r0 = 0); 32 x 32-bit FP registers (paired for double precision); HI, LO, PC
  • Data types? Formats? Addressing modes?
  • Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
  • Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
  • Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
  • All instructions are 32 bits wide, aligned on word boundaries
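As a small illustration of the fixed 32-bit format, the standard MIPS32 field layout can be pulled apart in a few lines of C; the two encoded words below are ordinary add and lw instructions chosen only as examples:

```c
#include <stdint.h>
#include <stdio.h>

/* Standard MIPS32 fields: R-type = op rs rt rd shamt funct,
   I-type = op rs rt imm16. */
static void decode(uint32_t inst) {
    uint32_t op    = (inst >> 26) & 0x3F;
    uint32_t rs    = (inst >> 21) & 0x1F;
    uint32_t rt    = (inst >> 16) & 0x1F;
    uint32_t rd    = (inst >> 11) & 0x1F;
    uint32_t shamt = (inst >>  6) & 0x1F;
    uint32_t funct =  inst        & 0x3F;
    int32_t  imm   = (int16_t)(inst & 0xFFFF);   /* sign-extended immediate */

    if (op == 0)   /* SPECIAL opcode -> R-type */
        printf("R-type: rs=%u rt=%u rd=%u shamt=%u funct=0x%02x\n",
               rs, rt, rd, shamt, funct);
    else
        printf("I-type: op=0x%02x rs=%u rt=%u imm=%d\n", op, rs, rt, imm);
}

int main(void) {
    decode(0x014B4820);   /* add $t1, $t2, $t3 */
    decode(0x8D280004);   /* lw  $t0, 4($t1)   */
    return 0;
}
```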

  8. History of ISA
  • Single accumulator (EDSAC, 1950)
  • Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953)
  • Separation of programming model from implementation
    • High-level-language based (stack) machines (B5000, 1963)
    • Concept of a family (IBM 360, 1964)
  • General-purpose register machines
    • Complex instruction sets (VAX, Intel 432, 1977-80)
    • Load/store architectures (CDC 6600, Cray-1, 1963-76)
  • RISC (MIPS, SPARC, HP-PA, IBM RS/6000, 1987) ... and Intel x86?

  9. Technology Trends: Microprocessor Capacity (Moore's Law)
  Transistor counts from the slide's chart: Itanium II: 241 million; Pentium 4: 55 million; Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million.
  CMOS improvements:
  • Die size: 2x every 3 years
  • Line width: halves every 7 years

  10. Memory Capacity (Single-Chip DRAM)
  Year   Size (Mb)   Cycle time
  1980   0.0625      250 ns
  1983   0.25        220 ns
  1986   1           190 ns
  1989   4           165 ns
  1992   16          145 ns
  1996   64          120 ns
  2000   256         100 ns
  2003   1024        60 ns

  11. Technology Trends
  • Clock rate: ~30% per year
  • Transistor density: ~35% per year
  • Chip area: ~15% per year
  • Transistors per chip: ~55% per year
  • Total performance capability: ~100% per year
  • Storage
    • DRAM: ~4x every 3-4 years
    • Disk density: ~60% per year
  • Network bandwidth: ~2x per year for the last decade
  (The small calculation below relates these yearly rates to doubling times.)
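A small sanity check, assuming simple compound growth; the rates are the ones quoted above and nothing here is additional data:

```c
#include <stdio.h>
#include <math.h>

/* Compound annual growth: capacity(t) = capacity(0) * (1 + rate)^t,
   so the doubling time is log 2 / log(1 + rate). */
int main(void) {
    double rates[]      = { 0.35, 0.55, 1.00 };
    const char *names[] = { "transistor density", "transistors per chip",
                            "total perf capability" };

    for (int i = 0; i < 3; i++)
        printf("%-22s ~%3.0f%%/yr -> doubles every %.1f years\n",
               names[i], rates[i] * 100.0, log(2.0) / log(1.0 + rates[i]));

    /* DRAM at ~4x every 3 years is about 4^(1/3) - 1, i.e. ~59% per year. */
    printf("DRAM 4x / 3 yrs  ~ %.0f%% per year\n",
           (pow(4.0, 1.0 / 3.0) - 1.0) * 100.0);
    return 0;
}
```

(Compile with -lm.)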

  12. Performance Trends based on Types

  13. Processor Performance
  Integer performance grew roughly 1.54x per year from 1987 to 1997. The chart's data points include MIPS M/120, Sun-4/260, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and DEC Alpha 21164/600.

  14. Definition: Performance
  • Performance is in units of things per second, so bigger is better
  • If we are primarily concerned with response time, then performance(X) = 1 / execution_time(X)
  • "X is n times faster than Y" means n = execution_time(Y) / execution_time(X) = performance(X) / performance(Y)

  15. Metrics of Performance
  Each level of the system has its own natural metric:
  • Application: answers per day/month
  • Programming language / compiler / ISA: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s)
  • Datapath / control: megabytes per second
  • Function units: cycles per second (clock rate)
  • Transistors, wires, pins

  16. Components of Performance
  CPU time = Instruction Count x CPI (cycles per instruction) x Cycle Time
  (A small worked comparison using this relation follows.)
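A minimal worked example of the relation, with made-up numbers for two hypothetical machines A and B (none of these figures come from the lecture):

```c
#include <stdio.h>

/* CPU time = instruction count * CPI * cycle time. */
int main(void) {
    double ic_a = 2.0e9, cpi_a = 2.0, cycle_a = 1.0 / 1.0e9;    /* 1 GHz   */
    double ic_b = 2.4e9, cpi_b = 1.2, cycle_b = 1.0 / 800.0e6;  /* 800 MHz */

    double t_a = ic_a * cpi_a * cycle_a;   /* 4.0 s */
    double t_b = ic_b * cpi_b * cycle_b;   /* 3.6 s */

    printf("A: %.2f s   B: %.2f s\n", t_a, t_b);
    printf("B is %.2f times faster than A\n", t_a / t_b);   /* ~1.11 */
    return 0;
}
```

Despite the lower clock rate and larger instruction count, machine B wins on the product of the three terms, which is the point of keeping all three components in view.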

  17. What's a Clock Cycle?
  • Old days: about 10 levels of gates between latches
  • Today: determined by numerous time-of-flight issues plus gate delays: clock propagation, wire lengths, drivers
  (Diagram: a latch or register feeding combinational logic, feeding the next latch.)

  18. Architecture as a Whole
  • Complete system integration: hardware, runtime system, compiler, and operating system working together
  • In networking, this is called the "end-to-end argument"
  • Computer architecture is not just about transistors, individual instructions, or particular implementations
  • The original RISC projects replaced exotic and complex instructions with a compiler plus simple instructions

  19. True Speed Measurement
  The slide's example: a 100-mile trip by sailboat (10 miles/hour, so 10 hours) followed by a 0.1-mile leg on land. Covering that last 0.1 mile by race car (100 miles/hour, 3.6 seconds) instead of running (5 miles/hour, 72 seconds) barely changes the total trip time: speeding up only a small fraction of the whole job yields only a small overall gain.

  20. Amdahl's Law
  Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
  Best you could ever hope to do (as Speedup_enhanced goes to infinity): Speedup_maximum = 1 / (1 - Fraction_enhanced)
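A short sketch applying the formula to the trip on the previous slide; the only inputs are the 10-hour sailing leg and the 72-second vs. 3.6-second land leg quoted there:

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the original
   execution time is sped up by a factor s. */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    double total = 10.0 * 3600.0 + 72.0;   /* seconds: sail + run         */
    double f     = 72.0 / total;           /* fraction that gets enhanced */
    double s     = 72.0 / 3.6;             /* running -> race car = 20x   */

    printf("fraction enhanced      = %.4f\n", f);            /* ~0.0020 */
    printf("overall speedup        = %.5f\n", amdahl(f, s)); /* ~1.0019 */
    printf("upper bound (s -> inf) = %.5f\n", 1.0 / (1.0 - f));
    return 0;
}
```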

  21. Graduate Computer Architecture I Administrative

  22. Course Information
  • Course web site: http://www.arl.wustl.edu/~young/cse560m
  • Prerequisites: CSE 361S/CS 306S and CSE 260M
  • Times and location: lecture Tuesday and Thursday, 2:30 PM - 4:00 PM, Cupples II, Room 200
  • Textbook: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann, 2003 (ISBN 1-55860-724-2)
  • (Optional) D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition
  • (Optional) P. Ashenden, The Student's Guide to VHDL, Morgan Kaufmann, 2003 (ISBN 1-55860-520-7)

  23. Course Goal
  • General overview of the course: three parallel tracks
    • Lectures: fast paced
    • Literature survey: write an exam-quality question and answer for each assigned paper
    • Project: leading up to a final project with a conference-format document and a PowerPoint slide presentation
  • Grading
    • Class participation: 10%
    • Survey homework: 20%
    • Exams/quizzes: 20%
    • Final project: 50%
  • Late policy: no late work except in special situations
  • Academic integrity: uphold the highest standards of scholarship and refrain from all forms of misconduct; plagiarism and other forms of cheating are not tolerated and will lead to disciplinary action

  24. Course Assignments
  • Literature
    • The first two papers are on-line: "What's Next?" and "System Research"
    • One Q&A for the assigned paper is due at the start of class
    • The best questions will be used for in-class quizzes/exams
  • Project
    • Will use Xilinx ISE, available in the computer lab
    • You may want to register and download WebPACK: http://www.xilinx.com/ise/logic_design_prod/webpack.htm (limited, but may suffice for smaller modules)
    • Complete the tutorials and start building the 16-bit structural VHDL ALU
  • Diagnostic quizzes
    • Quizzes A and B are take-home quizzes designed to help you
    • You may want to refer to "Computer Organization and Design: The Hardware/Software Interface"
    • Attempt all the problems to receive full credit
    • You should know most of their content by the end of the third week

  25. General Tips
  • Try not to miss a lecture
  • Get to know the instructor
  • Don't worry about the grades
  • Get help from other students and the professor
  • Produce professional results: quality documentation and presentation

  26. Graduate Computer Architecture I Course Preview

  27. Pipelined Instruction Execution
  (Pipeline diagram: successive instructions, in instruction order, each pass through Ifetch, Reg, ALU, DMem, and Reg write-back stages, overlapped across clock cycles 1 through 7.)

  28. Limits to Pipelining
  • Maintain the von Neumann "illusion" of one-instruction-at-a-time execution
  • Hazards prevent the next instruction from executing during its designated clock cycle:
    • Structural hazards: attempt to use the same hardware to do two different things at once
    • Data hazards: instruction depends on the result of a prior instruction still in the pipeline (see the sketch below)
    • Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)
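A minimal sketch of how a data (RAW) hazard can be detected between two in-flight instructions; the struct and the register numbers are hypothetical and do not model any particular pipeline:

```c
#include <stdbool.h>
#include <stdio.h>

/* Each in-flight instruction records which register it writes and which
   registers it reads; -1 means "none". */
typedef struct {
    int dest;
    int src1, src2;
} Instr;

/* RAW hazard: the younger instruction reads a register the older one
   has not written back yet. */
static bool raw_hazard(const Instr *older, const Instr *younger) {
    return older->dest != -1 &&
           (older->dest == younger->src1 || older->dest == younger->src2);
}

int main(void) {
    Instr in_ex = { .dest = 9, .src1 = 10, .src2 = 11 };  /* add r9, r10, r11 */
    Instr in_id = { .dest = 8, .src1 = 9,  .src2 = -1 };  /* lw  r8, 0(r9)    */

    if (raw_hazard(&in_ex, &in_id))
        printf("stall or forward: r%d is still in flight\n", in_ex.dest);
    return 0;
}
```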

  29. Instruction Level Parallelism
  • 1st-generation RISC: pipelined
    • A full 32-bit processor fit on a chip, issuing almost 1 instruction per cycle
    • Needs to access memory one or more times per cycle
    • Floating-point unit on another chip, cache controller on a third, off-chip cache
    • One board per processor, enabling multiprocessor systems
  • 2nd generation: superscalar
    • Processor and floating-point unit on one chip (and some cache)
    • Issuing only one instruction per cycle uses at most half the hardware
    • Fetch multiple instructions, issue a couple; issue width grows from 2 to 4 to 8 ...
    • How to manage dependencies among all these instructions? Where does the parallelism come from?
  • VLIW: expose some of the ILP to the compiler and allow it to schedule instructions to reduce dependences

  30. Modern ILP
  • Deep pipelines
  • Dynamically scheduled, out-of-order execution
  • Fetch many (tens of) instructions per cycle
  • Automatically resolve dependencies
  • Retains sequential instruction semantics with precise interrupt or exception handling
  • Huge complexity: is it still RISC?

  31. How is ILP Going to be Exploited?
  • Multi-threading
    • Thread: a locus of control with its own execution context
    • Fetch instructions from multiple threads at once and throw them all into the execution units
    • The concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)
  • Vector processing (SIMD): each instruction processes many data elements, e.g., MMX, 3DNow!, AltiVec, etc.
  • Multiple processors per chip (multi-processor chips, Tensilica configurable processors)
  • Higher levels of architecture: interconnects, memory architecture

  32. Performance Boost: Speculation
  • Programs make decisions as they go: conditionals, loops, and calls translate into branches and jumps (roughly 1 of every 5 instructions)
  • How do you determine which instructions to fetch when the ones before them haven't executed yet?
  • Branch prediction: lots of clever machine structures to predict the future based on history, plus machinery to back out of mispredictions (a classic two-bit scheme is sketched below)
  • Execute all the possible paths: likely to hit additional branches and to perform stores; speculative threads
  • What can hardware do to make programming (with performance) easier?
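A sketch of the classic two-bit saturating-counter predictor, one common history-based structure; the table size, branch address, and taken/not-taken pattern are invented for the example:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 10
/* One 2-bit counter per entry: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t counters[1 << TABLE_BITS];

static uint32_t slot(uint32_t pc) { return (pc >> 2) & ((1 << TABLE_BITS) - 1); }

static bool predict(uint32_t pc) { return counters[slot(pc)] >= 2; }

static void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[slot(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* A loop branch that is taken 9 times and then falls through once. */
    int correct = 0, total = 0;
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < 10; i++) {
            bool taken = (i != 9);
            correct += (predict(0x400100) == taken);
            train(0x400100, taken);
            total++;
        }
    printf("prediction accuracy: %.1f%%\n", 100.0 * correct / total);  /* ~90% */
    return 0;
}
```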

  33. The Memory Abstraction
  • Association of <name, value> pairs
    • Names are typically byte addresses
    • Values are often aligned on multiples of their size
  • Sequence of reads and writes
    • A write binds a value to an address
    • A read of an address returns the most recently written value bound to that address
  (Interface shown in the diagram: command (R/W), address (name), write data, read data, done.)
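A minimal sketch of that read/write contract, with a flat array standing in for the storage (the sizes and addresses are arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* A write binds a value to an address; a read returns the most
   recently written value bound to that address. */
#define MEM_BYTES 1024
static uint8_t storage[MEM_BYTES];

static void    mem_write(uint32_t addr, uint8_t value) { storage[addr % MEM_BYTES] = value; }
static uint8_t mem_read(uint32_t addr)                 { return storage[addr % MEM_BYTES]; }

int main(void) {
    mem_write(0x40, 17);
    mem_write(0x40, 99);                 /* rebinds the same address       */
    printf("%d\n", mem_read(0x40));      /* prints 99: most recent binding */
    return 0;
}
```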

  34. Processor-DRAM Memory Gap (Latency)
  From 1980 to 2000, processor performance improved ~60% per year (2x every 1.5 years, "Joy's Law"), while DRAM latency improved only ~9% per year (2x every 10 years), so the processor-memory performance gap grows about 50% per year.

  35. Levels of the Memory Hierarchy (from the upper, faster level to the lower, larger level)
  • Registers: 100s of bytes, << 1 ns; staged by the program/compiler in 1-8 byte operands
  • Cache: 10s-100s of KBytes, ~1 ns, on the order of $1s per MByte; staged by the cache controller in 8-128 byte blocks
  • Main memory: MBytes, 100-300 ns, < $1 per MByte; staged by the OS in 512-byte to 4-KB pages
  • Disk: 10s of GBytes, ~10 ms (10,000,000 ns), ~$0.001 per MByte; staged by the user/operator in MByte-scale files
  • Tape: effectively infinite capacity, seconds to minutes, ~$0.0014 per MByte

  36. Locality
  • The principle of locality: programs access a relatively small portion of the address space at any instant of time
  • Two different types of locality:
    • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access); see the sketch below
  • For the last 30 years, hardware (processor, cache, memory) has relied on locality for speed
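A small sketch of spatial locality in action: both loops below sum the same matrix, but the first walks memory sequentially (C arrays are row-major) while the second jumps a whole row between consecutive accesses, so on a cached machine the first is typically much faster:

```c
#include <stdio.h>

#define N 1024
static double a[N][N];   /* zero-initialized; the contents don't matter here */

int main(void) {
    double sum = 0.0;

    /* Row-major order: consecutive accesses fall in the same cache block. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major order: each access lands in a different cache block. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);   /* keeps the loops from being optimized away */
    return 0;
}
```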

  37. Cache Design
  • Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back
  • The optimal choice is a compromise
    • Depends on access characteristics: workload and use (I-cache, D-cache, TLB)
    • Depends on technology and cost
  • Simplicity often wins
  (The sketch below shows how cache size, block size, and associativity carve an address into tag, set index, and block offset.)
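A small sketch of how those dimensions partition a byte address; the 32 KB / 64-byte-block / 4-way parameters and the example address are arbitrary illustrative choices:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_BYTES (32 * 1024)
#define BLOCK_BYTES 64
#define WAYS        4
#define NUM_SETS    (CACHE_BYTES / (BLOCK_BYTES * WAYS))   /* 128 sets */

/* Integer log2 for power-of-two sizes. */
static unsigned ilog2(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned offset_bits = ilog2(BLOCK_BYTES);   /* 6 */
    unsigned index_bits  = ilog2(NUM_SETS);      /* 7 */

    uint32_t addr   = 0x0040A2C4;
    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr >> offset_bits) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("addr 0x%08x -> tag 0x%x, set %u, offset %u (%d ways searched)\n",
           addr, tag, index, offset, WAYS);
    return 0;
}
```

Bigger blocks grow the offset field, more sets grow the index field, and higher associativity trades index bits for more ways to search within each set.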

  38. Role of Memory • Modern microprocessors are almost all cache

  39. Memory in Parallel Systems
  (Diagrams: processors P1..Pn, each with its own cache, connected through an interconnection network; in one organization memory sits behind the shared network, in the other each node has its own memory.)
  • Maintaining the illusion of sequential memory access
  • Multiple processors accessing the same memory
  • Consistency in distributed and shared memory

  40. Inter-Communication
  The I/O path of a system (the slide shows a Pentium III chipset): processor, caches, busses, memory controllers, adapters, and I/O devices such as disks, displays, keyboards, and networks.

  41. Merging of HW/SW Design
  • Moore's law (more and more transistors) is all about volume and regularity
  • What if you could pour nano-acres of unspecified digital logic "stuff" onto silicon and do anything with it? Very regular, large volume
  • Field Programmable Gate Arrays (FPGAs)
    • The chip is covered with logic blocks with flip-flops, RAM blocks, and interconnect
    • All three are "programmable" by setting configuration bits
    • These devices are huge
  • Can each program have its own instruction set? Do we compile the program entirely into hardware?

  42. Tiny and Low Power System • Tiny and Cheap System • System on a chip • Resource efficiency • Real-estate, power, pins, …

  43. Conclusion
  • A lot of work, but good results
  • Should be a dynamic course
  • Get the introductory material and the book reading done ASAP
  • Hope to have a progressive and fun course
