
Introduction to Hardware/Architecture


Presentation Transcript


  1. Introduction to Hardware/Architecture David A. Patterson http://cs.berkeley.edu/~patterson/talks {patterson,kkeeton}@cs.berkeley.edu EECS, University of California Berkeley, CA 94720-1776

  2. What is a Computer System? • Coordination of many levels of abstraction • Software: Application (Netscape), Compiler, Operating System (Windows 98), Assembler • Interface: Instruction Set Architecture • Hardware: Processor (Datapath & Control), Memory, I/O system; below that, Digital Design, Circuit Design, transistors

  3. Levels of Representation • High Level Language Program (e.g., C): temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; • Compiler => Assembly Language Program (e.g., MIPS): lw $t0, 0($2); lw $t1, 4($2); sw $t1, 0($2); sw $t0, 4($2) • Assembler => Machine Language Program (MIPS): 0000 1001 1100 0110 1010 1111 0101 1000 / 1010 1111 0101 1000 0000 1001 1100 0110 / 1100 0110 1010 1111 0101 1000 0000 1001 / 0101 1000 0000 1001 1100 0110 1010 1111 • Machine Interpretation => Control Signal Specification

  4. The Instruction Set: a Critical Interface • The instruction set is the boundary between software and hardware

  5. Instruction Set Architecture (subset of Computer Architecture) • “... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964 • Organization of Programmable Storage • Data Types & Data Structures: Encodings & Representations • Instruction Set • Instruction Formats • Modes of Addressing and Accessing Data Items and Instructions • Exceptional Conditions

  6. Anatomy: 5 components of any Computer • Processor (active): Control (“brain”) and Datapath (“brawn”); often called (IBMese) “CPU” for “Central Processor Unit” • Memory (passive): where programs and data live when running • Input devices: keyboard, mouse • Output devices: display, printer • Disk: where programs and data live when not running

  7. Technology Trends: Microprocessor Capacity • Transistors per chip: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million • “Moore’s Law”: 2X transistors/chip every 1.5 years

  8. Technology Trends: Processor Performance • Processor performance increases about 1.54X per year; this growth is often mistakenly referred to as Moore’s Law (which describes transistors/chip)

  9. Computer Technology=>Dramatic Change • Processor • 2X in speed every 1.5 years; 1000X performance in last 15 years • Memory • DRAM capacity: 2x / 1.5 years; 1000X size in last 15 years • Cost per bit: improves about 25% per year • Disk • capacity: > 2X in size every 1.5 years • Cost per bit: improves about 60% per year • 120X size in last decade • State-of-the-art PC “when you graduate” (1997-2001) • Processor clock speed: 1500 MegaHertz (1.5 GigaHertz) • Memory capacity: 500 MegaByte (0.5 GigaBytes) • Disk capacity: 100 GigaBytes (0.1 TeraBytes) • New units! Mega => Giga, Giga => Tera
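
The "2X every 1.5 years" rates above compound into the "1000X in 15 years" figures. As a quick check of that arithmetic, here is a minimal C sketch of my own (not from the talk); the growth() helper and its inputs are illustrative.

```c
#include <math.h>
#include <stdio.h>

/* Growth factor after 'years' at a rate of 2X every 'doubling_years'. */
static double growth(double years, double doubling_years) {
    return pow(2.0, years / doubling_years);
}

int main(void) {
    /* Processor and DRAM: 2X per 1.5 years over 15 years => 2^10 = 1024X (~1000X). */
    printf("15 years at 2X/1.5yr: %.0fX\n", growth(15.0, 1.5));
    /* Disk: ">2X per 1.5 years" over a decade => at least 2^(10/1.5) ~ 101X,
       consistent with the ~120X-per-decade claim on the slide. */
    printf("10 years at 2X/1.5yr: %.0fX\n", growth(10.0, 1.5));
    return 0;
}
```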

  10. Integrated Circuit Costs • Die cost = Wafer cost / (Dies per Wafer × Die yield) • Die cost goes roughly with the cube of the die area: fewer dies per wafer, and yield worsens with die area

  11. Die Yield (1993 data)
  • Raw Dies Per Wafer, by wafer diameter and die area (mm²): 100 144 196 256 324 400
  • 6”/15cm: 139 90 62 44 32 23
  • 8”/20cm: 265 177 124 90 68 52
  • 10”/25cm: 431 290 206 153 116 90
  • Die yield: 23% 19% 16% 12% 11% 10%
  • Typical CMOS process: α = 2, wafer yield = 90%, defect density = 2/cm², 4 test sites/wafer
  • Good Dies Per Wafer (before testing!)
  • 6”/15cm: 31 16 9 5 3 2
  • 8”/20cm: 59 32 19 11 7 5
  • 10”/25cm: 96 53 32 20 13 9
  • Typical cost of an 8”, 4-metal-layer, 0.5µm CMOS wafer: ~$2000

  12. 1993 Real World Examples (chip: metal layers, line width (µm), wafer cost, defects/cm², die area (mm²), dies/wafer, yield, die cost)
  • 386DX: 2, 0.90, $900, 1.0, 43, 360, 71%, $4
  • 486DX2: 3, 0.80, $1200, 1.0, 81, 181, 54%, $12
  • PowerPC 601: 4, 0.80, $1700, 1.3, 121, 115, 28%, $53
  • HP PA 7100: 3, 0.80, $1300, 1.0, 196, 66, 27%, $73
  • DEC Alpha: 3, 0.70, $1500, 1.2, 234, 53, 19%, $149
  • SuperSPARC: 3, 0.70, $1700, 1.6, 256, 48, 13%, $272
  • Pentium: 3, 0.80, $1500, 1.5, 296, 40, 9%, $417
  • From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

  13. Other Costs • IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield • Packaging cost depends on pins, heat dissipation
  • (chip: die cost; package pins, type, cost; test & assembly; total cost)
  • 386DX: $4; 132, QFP, $1; $4; $9
  • 486DX2: $12; 168, PGA, $11; $12; $35
  • PowerPC 601: $53; 304, QFP, $3; $21; $77
  • HP PA 7100: $73; 504, PGA, $35; $16; $124
  • DEC Alpha: $149; 431, PGA, $30; $23; $202
  • SuperSPARC: $272; 293, PGA, $20; $34; $326
  • Pentium: $417; 273, PGA, $19; $37; $473
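
The die-cost and IC-cost formulas on slides 10 and 13 can be checked against the 1993 tables. The C sketch below is my own illustration, not code from the talk; the sample inputs come from the PowerPC 601 rows, and the final test yield of 1.0 is an assumption made only to keep the example simple.

```c
#include <stdio.h>

/* Die cost = Wafer cost / (Dies per wafer * Die yield) */
static double die_cost(double wafer_cost, double dies_per_wafer, double die_yield) {
    return wafer_cost / (dies_per_wafer * die_yield);
}

/* IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield */
static double ic_cost(double die, double test, double package, double final_test_yield) {
    return (die + test + package) / final_test_yield;
}

int main(void) {
    /* PowerPC 601, 1993: $1700 wafer, 115 dies/wafer, 28% die yield => ~$53 die */
    double die = die_cost(1700.0, 115.0, 0.28);
    /* Package cost $3, test & assembly $21 (from the table);
       final test yield = 1.0 is an illustrative assumption => ~$77 total. */
    double total = ic_cost(die, 21.0, 3.0, 1.0);
    printf("die cost = $%.0f, IC cost = $%.0f\n", die, total);
    return 0;
}
```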

  14. System Cost: 1995–96 Workstation • Cabinet: sheet metal, plastic 1%; power supply, fans 2%; cables, nuts, bolts 1% (subtotal 4%) • Motherboard: processor 6%; DRAM (64 MB) 36%; video system 14%; I/O system 3%; printed circuit board 1% (subtotal 60%) • I/O Devices: keyboard, mouse 1%; monitor 22%; hard disk (1 GB) 7%; tape drive (DAT) 6% (subtotal 36%)

  15. Cost v. Price (Workstation vs. PC) • Q: What % of company income goes to Research and Development (R&D)? • Component cost (25–31% of list price); input: chips, displays, ... • + ~33% => direct costs (8–10% of list price); making it: labor, scrap, returns, ... • + 25–100% gross margin (33–14% of list price) => average selling price; overhead: R&D, rent, marketing, profits, ... • + 50–80% => list price, with an average discount of 33–45%; commission: channel profit, volume discounts

  16. Outline • Review of Five Technologies: Processor, Memory, Disk, Network Systems • Description / History / Performance Model • State of the Art / Trends / Limits / Innovation • Common Themes across Technologies • Perform.: per access (latency) + per byte (bandwidth) • Fast: Capacity, BW, Cost; Slow: Latency, Interfaces • Moore’s Law affecting all chips in system

  17. Processor Trends / History • CPU time = Seconds/Program = (Instructions/Program) × (Clocks/Instruction) × (Seconds/Clock) • Microprocessor: main CPU of “all” computers • < 1986: +35%/yr performance increase (2X/2.3yr) • > 1987 (RISC): +60%/yr performance increase (2X/1.5yr) • Cost fixed at ~$500/chip, power whatever can cool • History of innovations to reach 2X/1.5yr: • Pipelining (helps seconds/clock, i.e. clock rate) • Out-of-Order Execution (helps clocks/instruction) • Superscalar (helps clocks/instruction) • Multilevel Caches (helps clocks/instruction)
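
To make the CPU time equation on slide 17 concrete, here is a minimal sketch; the instruction count, CPI, and clock rate are made-up example values of my own, not measurements from the talk.

```c
#include <stdio.h>

/* CPU time = (Instructions/Program) x (Clocks/Instruction) x (Seconds/Clock) */
int main(void) {
    double instructions = 1e9;    /* example: 1 billion dynamic instructions */
    double cpi          = 1.5;    /* example: average clocks per instruction */
    double clock_rate   = 600e6;  /* example: 600 MHz => 1/600e6 seconds per clock */

    double cpu_time = instructions * cpi * (1.0 / clock_rate);
    printf("CPU time = %.2f seconds\n", cpu_time);  /* 1e9 * 1.5 / 6e8 = 2.5 s */
    return 0;
}
```

Pipelining attacks the seconds/clock term, while out-of-order execution, superscalar issue, and multilevel caches attack the clocks/instruction term, as the slide's bullet list notes.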

  18. Pipelining is Natural! • Laundry Example • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutes to put clothes into drawers

  19. Sequential Laundry • Sequential laundry takes 8 hours for 4 loads: each load goes through all four 30-minute stages before the next load starts (6 PM to 2 AM)

  20. Pipelined Laundry: Start work ASAP • Pipelined laundry takes only 3.5 hours for 4 loads: a new load starts every 30 minutes while earlier loads move through the dryer, folder, and stasher
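
The 8-hour versus 3.5-hour numbers follow from simple pipeline arithmetic; this small C sketch is my own illustration of that calculation, not code from the talk.

```c
#include <stdio.h>

int main(void) {
    int loads  = 4;            /* Ann, Brian, Cathy, Dave */
    int stages = 4;            /* wash, dry, fold, put away */
    int stage_minutes = 30;

    /* Sequential: each load finishes all stages before the next begins. */
    int sequential = loads * stages * stage_minutes;          /* 480 min = 8.0 h */

    /* Pipelined: fill the pipeline once, then one load completes per stage time. */
    int pipelined = (stages + loads - 1) * stage_minutes;     /* 210 min = 3.5 h */

    printf("sequential: %d min (%.1f h)\n", sequential, sequential / 60.0);
    printf("pipelined : %d min (%.1f h)\n", pipelined, pipelined / 60.0);
    return 0;
}
```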

  21. Pipeline Hazard: Stall • A depends on D: insert a bubble and stall, since the folder is tied up

  22. Out-of-Order Laundry: Don’t Wait • A depends on D; the rest continue; need more resources to allow out-of-order execution

  23. Superscalar Laundry: Parallel per stage • More resources per stage (e.g., light, dark, and very dirty clothing handled in parallel); can the HW match the mix of parallel tasks?

  24. Superscalar Laundry: Mismatch Mix • The task mix (mostly light clothing) underutilizes the extra resources

  25. State of the Art: Alpha 21264 • 15M transistors • Two 64KB caches on chip; 16MB L2 cache off chip • Clock < 1.7 nsec, or > 600 MHz (fastest Cray supercomputer, the T90: 2.2 nsec) • 90 watts • Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle • Out-of-order execution

  26. Today’s Situation: Microprocessor (MIPS MPUs: R5000 vs. R10000, ratio 10k/5k)
  • Clock Rate: 200 MHz vs. 195 MHz (1.0x)
  • On-Chip Caches: 32K/32K vs. 32K/32K (1.0x)
  • Instructions/Cycle: 1 (+ FP) vs. 4 (4.0x)
  • Pipe stages: 5 vs. 5–7 (1.2x)
  • Model: in-order vs. out-of-order
  • Die Size (mm²): 84 vs. 298 (3.5x)
  • without cache, TLB: 32 vs. 205 (6.3x)
  • Development (man-years): 60 vs. 300 (5.0x)
  • SPECint_base95: 5.7 vs. 8.8 (1.6x)

  27. Memory History/Trends/State of Art • DRAM: main memory of all computers • Commodity chip industry: no company > 20% share • Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM) • State of the Art: $152, 128 MB DIMM (16 64-Mbit DRAMs), 10 ns x 64b (800 MB/sec) • Capacity: 4X/3 yrs (60%/yr) • Moore’s Law • MB/$: +25%/yr • Latency: –7%/year; Bandwidth: +20%/yr (so far) • source: www.pricewatch.com, 5/21/98

  28. Memory Summary • DRAM rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency • Processor-memory interface (cache+memory bus) is bottleneck to delivered bandwidth • Like network, memory “protocol” is major overhead

  29. Processor Innovations/Limits • Low cost, low power embedded processors • Lots of competition, innovation • Integer perf. of embedded proc. ~ 1/2 desktop processor • StrongARM 110: 233 MHz, 268 MIPS, 0.36W typ., $49 • Very Long Instruction Word (Intel, HP IA-64/Merced) • multiple ops/instruction, compiler controls parallelism • Consolidation of desktop industry? Innovation? x86, IA-64, SPARC, Alpha, PowerPC, MIPS, PA-RISC

  30. Processor Summary • SPEC performance doubling / 18 months • Growing CPU-DRAM performance gap & tax • Running out of ideas, competition? Back to 2X / 2.3 yrs? • Processor tricks not as useful for transactions? • Clock rate increase compensated by CPI increase? • When > 100 MIPS on TPC-C? • Cost fixed at ~$500/chip, power whatever can cool • Embedded processors promising • 1/10 cost, 1/100 power, 1/2 integer performance?

  31. Processor Limit: DRAM Gap • [Chart: processor performance (“Moore’s Law”, ~60%/yr) vs. DRAM (~7%/yr), 1980–2000; the processor-memory performance gap grows ~50%/year] • Alpha 21264 full cache miss, in instructions executed: 180 ns / 1.7 ns = 108 clks, × 4-wide issue = 432 instructions • Caches in Pentium Pro: 64% of area, 88% of transistors

  32. The Goal: Illusion of large, fast, cheap memory • Fact: Large memories are slow, fast memories are small • How do we create a memory that is large, cheap and fast (most of the time)? • Hierarchy of Levels • Similar to Principle of Abstraction: hide details of multiple levels

  33. Hierarchy Analogy: Term Paper in Library • Working on paper in library at a desk • Option 1: Every time need a book • Leave desk to go to shelves (or stacks) • Find the book • Bring one book back to desk • Read section interested in • When done with section, leave desk and go to shelves carrying book • Put the book back on shelf • Return to desk to work • Next time need a book, go to first step

  34. Memory Hierarchy Analogy: Library • Option 2: Every time need a book • Leave some books on desk after fetching them • Only go to shelves when need a new book • When go to shelves, bring back related books in case you need them; sometimes you’ll need to return books not used recently to make space for new books on desk • Return to desk to work • When done, replace books on shelves, carrying as many as you can per trip • Illusion: whole library on your desktop • Buzzword “cache” from French for hidden treasure

  35. Why Hierarchy Works: Natural Locality • [Figure: probability of reference across the address space, 0 to 2^n – 1] • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time • What programming constructs lead to the Principle of Locality?

  36. Memory Hierarchy: How Does It Work? • Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor • Library analogy: recently read books are kept on the desk • Block is the unit of transfer (like a book) • Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels • Library analogy: when fetching a book, bring back nearby books from the shelf; hope that you might need them later for your paper
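
As one answer to the question on slide 35, loops over arrays are the classic construct that produces both kinds of locality. The C sketch below is my own illustration (not from the slides): the row-major walk touches contiguous words (spatial locality) and reuses the accumulator (temporal locality), while the column-major walk strides across memory and typically misses far more often.

```c
#include <stdio.h>

#define N 1024
static double a[N][N];        /* C stores rows contiguously (row-major order) */

/* Stride-1 accesses: good spatial locality; 'sum' is reused: temporal locality. */
static double sum_row_major(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Same arithmetic, but stride-N accesses: little spatial locality. */
static double sum_column_major(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];    /* jumps N * sizeof(double) bytes each iteration */
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_column_major());
    return 0;
}
```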

  37. Memory Hierarchy Pyramid • Levels in the memory hierarchy: “Upper” Level 1, Level 2, Level 3, ... down to “Lower” Level n • Increasing distance from the Central Processor Unit (CPU) means decreasing cost/MB and increasing size of memory at each level • Data cannot be in level i unless it is also in level i+1

  38. Big Idea of Memory Hierarchy • Temporal locality: keep recently accessed data items closer to processor • Spatial locality: moving contiguous words in memory to upper levels of hierarchy • Uses smaller and faster memory technologies close to the processor • Fast hit time in highest level of hierarchy • Cheap, slow memory furthest from processor • If hit rate is high enough, hierarchy has access time close to the highest (and fastest) level and size equal to the lowest (and largest) level
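
The last bullet (a high enough hit rate gives the top level's speed with the bottom level's size) is the usual average-access-time calculation. The numbers in this sketch are my own examples, not figures from the talk.

```c
#include <stdio.h>

/* Average access time = hit time + miss rate * miss penalty */
int main(void) {
    double hit_time_ns     = 2.0;    /* example: small, fast upper level */
    double miss_penalty_ns = 180.0;  /* example: trip to large, slow memory */
    double hit_rate        = 0.98;   /* example: 98% of accesses hit */

    double avg_ns = hit_time_ns + (1.0 - hit_rate) * miss_penalty_ns;
    printf("average access time = %.1f ns\n", avg_ns);  /* 2 + 0.02*180 = 5.6 ns */
    return 0;
}
```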

  39. Focus on I/O • Recall: 5 components of any Computer • Processor (active): Control (“brain”), Datapath (“brawn”) • Memory (passive): where programs and data live when running • Input devices: keyboard, mouse • Output devices: display, printer • Now focus on the devices: disk, network

  40. Disk Description / History • [Figure: platter, arm, head, track, sector, cylinder, track buffer, embedded processor (ECC, SCSI)] • 1973: 1.7 Mbit/sq. in, 140 MBytes • 1979: 7.7 Mbit/sq. in, 2,300 MBytes • source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”

  41. Disk History • 1989: 63 Mbit/sq. in, 60,000 MBytes • 1997: 1450 Mbit/sq. in, 2300 MBytes (2.5” diameter) • 1997: 3090 Mbit/sq. in, 8100 MBytes (3.5” diameter) • 2000: 10,100 Mbit/sq. in, 25,000 MBytes • 2000: 11,000 Mbit/sq. in, 73,400 MBytes • source: N.Y. Times, 2/23/98, page C3

  42. State of the Art: Ultrastar 72ZX • Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Size / Bandwidth (per access + per byte) • [Figure: platter, arm, head, track, sector, cylinder, track buffer, embedded processor] • 73.4 GB, 3.5 inch disk • 2¢/MB • 16 MB track buffer • 11 platters, 22 surfaces • 15,110 cylinders • 7 Gbit/sq. in. areal density • 17 watts (idle) • 0.1 ms controller time • 5.3 ms avg. seek (seek 1 track => 0.6 ms) • 3 ms = 1/2 rotation • 37 to 22 MB/s to media • source: www.ibm.com; www.pricewatch.com; 2/14/00
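
Plugging the Ultrastar 72ZX numbers from this slide into the latency formula shows where the time goes; the sketch below assumes no queuing delay, a 16 KB request, and the slower 22 MB/s media rate, all of which are my own example choices.

```c
#include <stdio.h>

/* Latency = Queuing + Controller + Seek + Rotation + Size/Bandwidth */
int main(void) {
    double queuing_ms     = 0.0;   /* assumption: idle disk, no queue */
    double controller_ms  = 0.1;   /* from the slide */
    double seek_ms        = 5.3;   /* average seek, from the slide */
    double rotation_ms    = 3.0;   /* half rotation, from the slide */
    double size_kb        = 16.0;  /* example request size */
    double bandwidth_mb_s = 22.0;  /* slower media rate, from the slide */

    double transfer_ms = size_kb / 1024.0 / bandwidth_mb_s * 1000.0;
    double latency_ms  = queuing_ms + controller_ms + seek_ms + rotation_ms + transfer_ms;
    printf("latency = %.2f ms (transfer alone %.2f ms)\n", latency_ms, transfer_ms);
    /* ~9.1 ms total: dominated by seek + rotation, not by moving the bytes. */
    return 0;
}
```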

  43. Disk Limit • Continued advance in capacity (60%/yr) and bandwidth (40%/yr) • Slow improvement in seek, rotation (8%/yr) • Time to read whole disk: 1990: 4 minutes sequentially, 6 hours randomly; 2000: 12 minutes sequentially, 1 week randomly • Dynamically change data layout to reduce seek, rotation delay? Leverage space vs. spindles?

  44. A glimpse into the future? • IBM microdrive for digital cameras • 340 Mbytes • Disk target in 5-7 years? • building block: 2006 MicroDrive • 9GB disk, 50 MB/sec from disk • 10,000 nodes fit into one rack!

  45. Disk Summary • Continued advance in capacity, cost/bit, BW; slow improvement in seek, rotation • External I/O bus is a bottleneck to transfer rate, cost? => move to fast serial lines (FC-AL)? • What to do with the increasing speed of the embedded processor inside the disk?

  46. Connecting to Networks (and Other I/O) • Bus - shared medium of communication that can connect to many devices • Hierarchy of Buses in a PC

  47. Buses in a PC • [Figure: CPU and Memory on the memory bus; PCI internal (backplane) I/O bus to the Ethernet interface and the SCSI interface; SCSI external I/O bus to 1–15 disks; Ethernet to the Local Area Network] • Data rates: • Memory bus: 100 MHz, 8 bytes wide => 800 MB/s (peak) • PCI: 33 MHz, 4 bytes wide => 132 MB/s (peak) • SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes) => 80 MB/s (peak)
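
All three peak rates above come from the same clock-rate-times-width product; this quick sketch of the arithmetic is my own illustration using the slide's numbers.

```c
#include <stdio.h>

/* Peak bandwidth (MB/s) = clock rate (MHz, i.e. million transfers/s) * bytes per transfer */
static double peak_mb_per_s(double mhz, double bytes_wide) {
    return mhz * bytes_wide;
}

int main(void) {
    printf("memory bus      : %.0f MB/s\n", peak_mb_per_s(100.0, 8.0)); /* 800 */
    printf("PCI             : %.0f MB/s\n", peak_mb_per_s(33.0, 4.0));  /* 132 */
    printf("Ultra2 Wide SCSI: %.0f MB/s\n", peak_mb_per_s(40.0, 2.0));  /* 80  */
    return 0;
}
```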

  48. Why Networks? • Originally: sharing I/O devices between computers (e.g., printers) • Then: communicating between computers (e.g., file transfer protocol) • Then: communicating between people (e.g., email) • Then: communicating between networks of computers => Internet, WWW

  49. Types of Networks • Local Area Network (Ethernet) • Inside a building: up to 1 km • (peak) Data Rate: 10 Mbits/sec, 100 Mbits/sec, 1000 Mbits/sec • Run, installed by network administrators • Wide Area Network • Across a continent (10 km to 10,000 km) • (peak) Data Rate: 1.5 Mbits/sec to 2500 Mbits/sec • Run, installed by telephone companies

  50. ABCs of Networks: 2 Computers • Starting Point: Send bits between 2 computers • Queue (First In First Out) on each end • Can send both ways (“Full Duplex”) • Information sent called a “message” • Note: Messages also called packets
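
As a deliberately simplified picture of the send/receive queues on each end of the link, here is a small C sketch of my own: a fixed-size FIFO of messages, one per direction of the full-duplex connection. The structure and names are illustrative, not an API from the talk.

```c
#include <stdio.h>
#include <string.h>

#define QUEUE_SLOTS 8
#define MSG_BYTES   64

/* A first-in, first-out queue of fixed-size messages. */
struct msg_queue {
    char messages[QUEUE_SLOTS][MSG_BYTES];
    int  head, tail, count;
};

static int enqueue(struct msg_queue *q, const char *msg) {
    if (q->count == QUEUE_SLOTS) return -1;     /* full: sender must wait */
    strncpy(q->messages[q->tail], msg, MSG_BYTES - 1);
    q->messages[q->tail][MSG_BYTES - 1] = '\0';
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    q->count++;
    return 0;
}

static int dequeue(struct msg_queue *q, char *out) {
    if (q->count == 0) return -1;               /* nothing has arrived yet */
    memcpy(out, q->messages[q->head], MSG_BYTES);
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return 0;
}

int main(void) {
    struct msg_queue a_to_b = {0};   /* full duplex would add a second queue, b_to_a */
    char received[MSG_BYTES];

    enqueue(&a_to_b, "hello from A");           /* computer A sends a message    */
    if (dequeue(&a_to_b, received) == 0)        /* computer B receives, in order */
        printf("B received: %s\n", received);
    return 0;
}
```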
