Computers for the Post-PC Era

David Patterson, University of California at Berkeley, Patterson@cs.berkeley.edu • UC Berkeley IRAM Group • UC Berkeley ISTORE Group, istore-group@cs.berkeley.edu • May 2000

Perspective on Post-PC Era • Post-PC Era will be driven by 2 technologies: 1) “Gadgets”: Tiny Embedded or Mobile Devices • ubiquitous: in everything • e.g., successor to PDA, cell phone, wearable computers 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers

VIRAM-1 Block Diagram

VIRAM-1: System on a Chip • Prototype scheduled for tape-out mid 2000 • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines @ 200 MHz • 4 100 MB parallel I/O lines • 17x17 mm, 2 Watts • 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) • 1.6 Gflops (64-bit), 6.4 GOPs (16-bit) • [Figure: chip floorplan with CPU+$, 4 vector pipes/lanes, Xbar, I/O, and two memory blocks of 64 Mbits / 8 MBytes each]
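The peak figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not the design team's own derivation. It assumes (these are not stated on the slide) that a multiply-add counts as two operations per element per cycle, that each crossbar direction moves 256 bits per cycle, and that there are two crossbars, one per memory block. Under those assumptions the quoted 1.6 Gflops, 6.4 GOPs and 25.6 GB/s all fall out.

```python
# Figures quoted on the slide
CLOCK_HZ = 200e6      # vector unit clock
LANES = 4             # 64-bit vector pipelines
LANE_BITS = 64

# Assumptions made for this check (not stated on the slide)
OPS_PER_CYCLE = 2     # a multiply-add counted as two operations
XBAR_BITS = 256       # bits moved per cycle, per direction, per crossbar
XBARS = 2             # one crossbar per memory block
DIRECTIONS = 2        # load and store paths


def peak_ops(element_bits):
    """Peak operations per second for a given element width."""
    elements_per_lane = LANE_BITS // element_bits
    return LANES * elements_per_lane * OPS_PER_CYCLE * CLOCK_HZ


def memory_bandwidth():
    """Aggregate on-chip memory bandwidth in bytes per second."""
    per_direction = XBAR_BITS / 8 * CLOCK_HZ   # 6.4 GB/s
    return XBARS * DIRECTIONS * per_direction


print(peak_ops(64) / 1e9, "Gflops (64-bit)")
print(peak_ops(16) / 1e9, "GOPs (16-bit)")
print(memory_bandwidth() / 1e9, "GB/s")
```

If any of the three assumptions is wrong, the individual factors change, but the products still have to match the slide's totals.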

Problem: General Element Permutation • Hardware for a full vector permutation instruction (128 16b elements, 256b datapath) • Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2) • Control: 16 16-to-1 multiplexors; scales by O(N*logN) • Other problems • Consecutive result elements not written together; time/energy wasted on wide vector register file port

Simple Vector Permutations • Simple steps of butterfly permutations • A register provides the butterfly radix • Separate instructions for moving elements to left/right • Sufficient semantics for • Fast reductions of vector registers (dot products) • Fast FFT/DCT kernels
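To illustrate why butterfly steps are enough for reductions, here is a minimal software sketch. It models the "move left" instruction as "element i receives element i + radix"; that reading of the semantics, and the function names `butterfly_reduce` and `dot`, are assumptions for illustration rather than the VIRAM ISA definition.

```python
def butterfly_reduce(values):
    """Sum a power-of-two-length vector in log2(N) butterfly steps."""
    v = list(values)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    radix = n // 2
    while radix >= 1:
        # Hypothetical "move left by radix": bring the upper half down
        # so that it lines up with the lower half.
        moved = v[radix:2 * radix]
        # One ordinary vector add on a vector of half the previous length.
        v = [a + b for a, b in zip(v[:radix], moved)]
        radix //= 2
    return v[0]


def dot(a, b):
    """Dot product: elementwise multiply, then butterfly reduction."""
    return butterfly_reduce([x * y for x, y in zip(a, b)])


print(butterfly_reduce(range(16)))
print(dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

A 128-element register is reduced in 7 such steps, each needing only a fixed-distance move, which is the case the simple permutation hardware targets.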

Hardware for Simple Permutations • Hardware for 128 16b elements, 256b datapath • Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N) • Control: 6 control cases; scales by O(N) • Other benefits • Consecutive result elements written together • Buses used only for small radices

FFT: Straightforward • Problem: most time spent in short vectors in later stages of FFT

FFT: Transpose inside Vector Regs

FFT: Straightforward

VIRAM-1 Design Status • MIPS scalar core • Synthesizable RTL code received from MIPS • Cache RAMs to be compiled for IBM technology • FPU RTL code almost complete • Vector unit • RTL models for sub-blocks developed; currently being integrated and tested • Control logic to be compiled for IBM technology • Full-custom layout for multipliers/adders developed; layout for shifters to be developed • Memory system • Synthesizable model for DRAM controllers done • To be integrated with IBM DRAM macros • Full-custom layout for crossbar under development • Testing infrastructure • Environment developed for automatic test & validation • Directed tests for single/multiple instruction groups developed • Random instruction sequence generator developed

FPU Features • Executes MIPS IV ISA single-precision FP instructions • Thirty-two 32-bit Floating Point Registers • Two 32-bit Control Registers • One 3-cycle (division takes 10 cycles), fully pipelined, nearly fully IEEE-754-compliant execution unit (from Albert Ma @ MIT) • 6-stage pipeline (R-X-X-X-CDB-WB) • Support for partial out-of-order execution and precise exceptions • Scalar Core dispatches FP instructions to FPU using an interface that splits instructions into 3 classes: • Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S) • From Coprocessor Data Transfer instructions (SWC1, MFC1, CFC1) • To Coprocessor Data Transfer instructions (LWC1, MTC1, CTC1)

FPU Architecture

Multiplier Partitioning • 64-bit multiplier built from 16-bit multiplier subblocks • Subblocks combined with adders to perform larger multiplies • Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks • Performs 4 simultaneous 16-bit multiplies by using individual subblocks • Unused blocks turned off to conserve power
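The partitioning idea can be sketched in software: a wide unsigned multiply is the sum of shifted 16x16 partial products. This is a behavioral illustration only, under the assumption of unsigned operands; the names `mul16` and `wide_mul` are hypothetical, and the real datapath combines the partial products with hardware adders rather than Python integers.

```python
MASK16 = 0xFFFF


def mul16(a, b):
    """Model of one 16x16 -> 32-bit multiplier subblock."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def wide_mul(a, b, width):
    """Multiply two unsigned `width`-bit values using only mul16 subblocks.

    Returns (product, number_of_subblocks_used).
    """
    assert width % 16 == 0
    limbs = width // 16
    total = 0
    used = 0
    for i in range(limbs):
        for j in range(limbs):
            a_i = (a >> (16 * i)) & MASK16
            b_j = (b >> (16 * j)) & MASK16
            # Each partial product is shifted to its weight and accumulated.
            total += mul16(a_i, b_j) << (16 * (i + j))
            used += 1
    return total, used


product, blocks = wide_mul(0xDEADBEEF, 0x12345678, 32)
print(hex(product), blocks)
```

In this model a 32-bit multiply uses 4 subblocks, matching the slide's "grouping 4 subblocks", and a 16-bit multiply uses a single subblock.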

FPU Current Status • Current Functionality • Able to execute most instructions (all except C.cond.S, CFC1 and CTC1). • Supports precise exception semantics. • Functionality verification. • Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator. • What remains to be done • Instructions that use the Control Registers (C.cond.S, CFC1 and CTC1). • Exception generation. • Integrate execution pipeline with the rest of the design. • Synthesize, place and route. • Final assembly and verification of multiplier • Performance • Sustainable Throughput: 1 instruction/cycle (assuming no data hazards) • Instruction Latency: 6 cycles

UC-IBM Agreement • Biggest IRAM Obstacle: Intellectual Property Agreement between University of California and IBM • Can university accept free fab costs ($2.0M to $2.5M) in return for capped non-exclusive patent licensing fees for IBM if UC files for IRAM patents? • Process started with IBM March 1999 • IBM won’t give full process info until contract • UC started negotiating seriously Jan 2000 • Agreement June 1, 2000!

Other examples: IBM “Blue Gene” • 1 PetaFLOPS in 2005 for $100M? • Application: Protein Folding • Blue Gene Chip • 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single 20 x 20 mm chip • 1 GFLOPS / processor • 2’ x 2’ Board = 64 chips (2K CPUs) • Rack = 8 Boards (512 chips, 16K CPUs) • System = 64 Racks (512 boards, 32K chips, 1M CPUs) • Total 1 million processors in just 2000 sq. ft.
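The Blue Gene packaging numbers multiply out as follows. This is just the slide's arithmetic written down, with "K" and "M" read as binary multiples (1K = 1024), which is an assumption about the slide's notation.

```python
CPUS_PER_CHIP = 32
CHIPS_PER_BOARD = 64
BOARDS_PER_RACK = 8
RACKS = 64
GFLOPS_PER_CPU = 1

cpus_per_board = CPUS_PER_CHIP * CHIPS_PER_BOARD    # 2K CPUs
chips_per_rack = CHIPS_PER_BOARD * BOARDS_PER_RACK  # 512 chips
cpus_per_rack = cpus_per_board * BOARDS_PER_RACK    # 16K CPUs
boards = BOARDS_PER_RACK * RACKS                    # 512 boards
chips = chips_per_rack * RACKS                      # 32K chips
cpus = cpus_per_rack * RACKS                        # 1M CPUs

# 1 GFLOPS per processor times about a million processors gives
# roughly one PetaFLOPS of peak performance.
peak_pflops = cpus * GFLOPS_PER_CPU / 1e6

print(cpus_per_board, chips_per_rack, cpus_per_rack)
print(boards, chips, cpus)
print(peak_pflops, "PFLOPS peak")
```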

Other examples: Sony Playstation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games

Outline 1) Example microprocessor for PostPC gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Benchmarks for AME • Conclusions and future work

Lampson: Systems Challenges • Systems that work • Meeting their specs • Always available • Adapting to changing environment • Evolving while they run • Made from unreliable components • Growing without practical limit • Credible simulations or analysis • Writing good specs • Testing • Performance • Understanding when it doesn’t matter “Computer Systems Research-Past and Future” Keynote address, 17th SOSP, Dec. 1999 Butler Lampson Microsoft

Hennessy: What Should the “New World” Focus Be? • Availability • Both appliance & service • Maintainability • Two functions: • Enhancing availability by preventing failure • Ease of SW and HW upgrades • Scalability • Especially of service • Cost • per device and per service transaction • Performance • Remains important, but it’s not SPECint “Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999 John Hennessy Stanford

The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow

Principles for achieving AME (1) • No single points of failure • Redundancy everywhere • Performance robustness is more important than peak performance • “performance robustness” implies that real-world performance is comparable to best-case performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • compare: biological systems spend > 50% of resources on maintenance • can make up performance by scaling system

Principles for achieving AME (2) • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen

ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/256MB DRAM, 18GB disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick” • Portable PC CPU: Pentium II/266 + DRAM • Redundant NICs (4 100 Mb/s links) • Diagnostic Processor • Disk • Half-height canister • ISTORE Chassis • 80 nodes, 8 per tray • 2 levels of switches • 20 100 Mbit/s • 2 1 Gbit/s • Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors...

ISTORE-1 Status • 10 Nodes manufactured • Boots OS • Diagnostic Processor Interface SW complete • PCB backplane: not yet designed • Finish 80 node system: Summer 2000

Hardware techniques • Fully shared-nothing cluster organization • truly scalable architecture • architecture that tolerates partial failure • automatic hardware redundancy

Hardware techniques (2) • No Central Processor Unit: distribute processing with storage • Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus oriented systems • Most storage servers limited by speed of CPUs; why does this make sense? • Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network? • If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage

Hardware techniques (3) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network

Hardware techniques (4) • On-demand network partitioning/isolation • Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance • Allows testing, repair of online system • Managed by diagnostic processor and network switches via diagnostic network

Hardware techniques (5) • Built-in fault injection capabilities • Power control to individual node components • Injectable glitches into I/O and memory busses • Managed by diagnostic processor • Used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms • Important for AME benchmarking (see next slide)

“Hardware” techniques (6) • Benchmarking • One reason for 1000X processor performance was ability to measure (vs. debate) which is better • e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed? • Need AME benchmarks “what gets measured gets done” “benchmarks shape a field” “quantification brings rigor”

Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks

Benchmark Availability: Methodology for reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior • 99% confidence intervals calculated from no-fault runs
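The reporting step can be sketched as follows: build a "normal" band from repeated no-fault runs, then flag the time steps where a fault-injected run leaves that band. This is one plausible reading of the slide, not the ISTORE group's tooling. It assumes a normal-approximation 99% interval on the per-step mean (with only a handful of runs a Student-t quantile would be more appropriate), and the QoS numbers in the usage lines are made up for illustration.

```python
import statistics

Z_99 = 2.576  # two-sided 99% quantile of the standard normal


def normal_band(no_fault_runs):
    """Per time step, a 99% confidence interval for the mean QoS value.

    `no_fault_runs` is a list of runs, each a list of QoS samples
    (e.g. requests served per second) taken at the same time steps.
    """
    band = []
    for samples in zip(*no_fault_runs):
        mean = statistics.mean(samples)
        half_width = Z_99 * statistics.stdev(samples) / len(samples) ** 0.5
        band.append((mean - half_width, mean + half_width))
    return band


def deviations(fault_run, band):
    """Time steps at which the fault-injected run leaves the normal band."""
    return [
        t
        for t, (qos, (low, high)) in enumerate(zip(fault_run, band))
        if not low <= qos <= high
    ]


# Hypothetical data: three no-fault runs and one run with a fault at step 2.
baseline = [[100, 101, 99, 100], [102, 100, 100, 101], [101, 99, 101, 99]]
with_fault = [101, 100, 60, 100]
print(deviations(with_fault, normal_band(baseline)))
```

Plotting the fault-injected run against the band over time gives the kind of graph the slide describes, with the flagged steps showing where and for how long QoS degraded.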

Example results: multiple faults • Systems compared: Windows 2000/IIS and Linux/Apache • Windows reconstructs ~3x faster than Linux • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not

Conclusions (1): ISTORE • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks a powerful tool? • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

Conclusions (2) • IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth • Gadgets: Embedded/Mobile devices • Infrastructure: Intelligent Storage and Networks • PostPC infrastructure requires • New Goals: Availability, Maintainability, Evolution • New Principles: Introspection, Performance Robustness • New Techniques: Isolation/fault insertion, Software scrubbing • New Benchmarks: measure, compare AME metrics
