Computers for the Post-PC Era

David Patterson, University of California at Berkeley (Patterson@cs.berkeley.edu)
UC Berkeley IRAM Group
UC Berkeley ISTORE Group (istore-group@cs.berkeley.edu)
February 2000

Perspective on Post-PC Era • PostPC Era will be driven by 2 technologies: 1) “Gadgets”: Tiny Embedded or Mobile Devices • ubiquitous: in everything • e.g., successor to PDA, cell phone, wearable computers 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers

Outline 1) Example microprocessor for PostPC gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work

New Architecture Directions • “…media processing will become the dominant force in computer arch. and microprocessor design.” • “...new media-rich applications ... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, 32-bit integer and Fl. Pt.” • Needs include real-time response, continuous media data types (no temporal locality), fine grain parallelism, coarse grain parallelism, memory bandwidth • “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)

Intelligent RAM: IRAM
[Figure: block diagrams with the labels Logic fab, DRAM fab, Proc, $, L2$, Bus, DRAM, I/O]
Microprocessor & DRAM on a single chip: • 10X capacity vs. SRAM • on-chip memory latency 5-10X, bandwidth 50-100X • improve energy efficiency 2X-4X (no off-chip bus) • serial I/O 5-10X v. buses • smaller board area/volume IRAM advantages extend to: • a single chip system • a building block for larger systems

Revive Vector Architecture
• Cost: $1M each? → Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? → IRAM
• Code density? → Much smaller than VLIW
• Compilers? → For sale, mature (>20 years) (We retarget Cray compilers)
• Performance? → Easy scale speed with technology
• Power/Energy? → Parallel to save energy, keep performance
• Limited to scientific applications? → Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b

V-IRAM1: Low Power v. High Perf.
[Block diagram: 2-way superscalar processor with 16K I cache and 16K D cache; vector instruction queue; vector registers; vector functional units (+, x, ÷) and load/store unit, each usable as 4 x 64, 8 x 32, or 16 x 16; memory crossbar switch with 4 x 64 paths to the DRAM banks (M); serial I/O]

VIRAM-1: System on a Chip • Prototype scheduled for tape-out mid 2000 • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS Scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines @ 200 MHz • 4 100 MB parallel I/O lines • 17x17 mm, 2 Watts • 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) • 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
[Figure: chip floorplan showing CPU+$, 4 vector pipes/lanes, Xbar, I/O, and two memory blocks of 64 Mbits / 8 MBytes each]
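The peak rates quoted for VIRAM-1 can be sanity-checked with a few lines of arithmetic. The sketch below assumes each 64-bit lane retires two operations per cycle (for example one add and one multiply). That factor is not stated on the slide and is chosen because it reproduces the quoted figures. The function name is illustrative.

```python
# Back-of-the-envelope check of the VIRAM-1 peak rates quoted on the slide.
CLOCK_HZ = 200e6        # vector unit clock, from the slide
LANES = 4               # 64-bit vector pipelines, from the slide
LANE_WIDTH_BITS = 64
# Assumption (not stated on the slide): two operations per lane per cycle,
# e.g. one add and one multiply issued together.
OPS_PER_LANE_PER_CYCLE = 2


def peak_ops_per_second(element_bits: int) -> float:
    """Peak operations/second when each 64-bit lane is split into subwords."""
    subwords_per_lane = LANE_WIDTH_BITS // element_bits
    return CLOCK_HZ * LANES * subwords_per_lane * OPS_PER_LANE_PER_CYCLE


for bits in (64, 32, 16):
    print(f"{bits}-bit elements: {peak_ops_per_second(bits) / 1e9:.1f} G ops/s")
```

Under that assumption the 64-bit and 16-bit cases come out at the slide's 1.6 Gflops and 6.4 GOPs. Narrower elements raise throughput because each lane processes more subwords per cycle.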

Media Kernel Performance

Base-line system comparison • All numbers in cycles/pixel • MMX and VIS results assume all data in L1 cache

IRAM Chip Challenges • Merged Logic-DRAM process Cost: Cost of wafer, Impact on yield, testing cost of logic and DRAM • Price: on-chip DRAM v. separate DRAM chips? • Delay in transistor speeds, memory cell sizes in Merged process vs. Logic only or DRAM only • DRAM block: flexibility via DRAM “compiler” (vary size, width, no. subbanks) vs. fixed block • Apps: advantages in memory bandwidth, energy, system size to offset challenges?

Other examples: IBM “Blue Gene” • 1 PetaFLOPS in 2005 for $100M? • Application: Protein Folding • Blue Gene Chip • 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single 20 x 20 mm chip • 1 GFLOPS / processor • 2’ x 2’ Board = 64 chips (2K CPUs) • Rack = 8 Boards (512 chips,16K CPUs) • System = 64 Racks (512 boards,32K chips,1M CPUs) • Total 1 million processors in just 2000 sq. ft.
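The Blue Gene packaging numbers multiply out as follows. This is a minimal check that uses only the per-level counts from the slide; the variable names are illustrative.

```python
# Multiply out the Blue Gene packaging hierarchy given on the slide.
CPUS_PER_CHIP = 32
CHIPS_PER_BOARD = 64
BOARDS_PER_RACK = 8
RACKS = 64
GFLOPS_PER_CPU = 1

chips_per_rack = CHIPS_PER_BOARD * BOARDS_PER_RACK   # 512 chips per rack
boards = BOARDS_PER_RACK * RACKS                     # 512 boards in the system
chips = chips_per_rack * RACKS                       # 32K chips
cpus = chips * CPUS_PER_CHIP                         # 1M CPUs (2**20)
peak_pflops = cpus * GFLOPS_PER_CPU / 1e6            # GFLOPS -> PFLOPS

print(f"{boards} boards, {chips} chips, {cpus} CPUs, {peak_pflops:.2f} PFLOPS peak")
```

The "1M CPUs" on the slide is the binary million (1,048,576), so the peak is slightly above 1 PetaFLOPS at 1 GFLOPS per processor.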

Other examples: Sony Playstation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games

The problem space: big data • Big demand for enormous amounts of data • today: high-end enterprise and Internet applications • enterprise decision-support, data mining databases • online applications: e-commerce, mail, web, archives • future: infrastructure services, richer data • computational & storage back-ends for mobile devices • more multimedia content • more use of historical data to provide better services • Today’s SMP server designs can’t easily scale • Bigger scaling problems than performance!

Lampson: Systems Challenges • Systems that work • Meeting their specs • Always available • Adapting to changing environment • Evolving while they run • Made from unreliable components • Growing without practical limit • Credible simulations or analysis • Writing good specs • Testing • Performance • Understanding when it doesn’t matter “Computer Systems Research-Past and Future” Keynote address, 17th SOSP, Dec. 1999 Butler Lampson Microsoft

Hennessy: What Should the “New World” Focus Be? • Availability • Both appliance & service • Maintainability • Two functions: • Enhancing availability by preventing failure • Ease of SW and HW upgrades • Scalability • Especially of service • Cost • per device and per service transaction • Performance • Remains important, but it’s not SPECint “Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999 John Hennessy Stanford

The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow

The ISTORE project vision • Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that: • are highly available • require minimal maintenance • robustly handle evolutionary growth • are scalable to O(10000) nodes

Principles for achieving AME (1) • No single points of failure • Redundancy everywhere • Performance robustness is more important than peak performance • “performance robustness” implies that real-world performance is comparable to best-case performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • compare: biological systems spend > 50% of resources on maintenance • can make up performance by scaling system

Principles for achieving AME (2) • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen

Hardware techniques • Fully shared-nothing cluster organization • truly scalable architecture • architecture that tolerates partial failure • automatic hardware redundancy

Hardware techniques (2) • No Central Processor Unit: distribute processing with storage • Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus oriented systems • Most storage servers limited by speed of CPUs; why does this make sense? • Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network? • If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage

Hardware techniques (3) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network

Hardware techniques (4) • On-demand network partitioning/isolation • Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance • Allows testing, repair of online system • Managed by diagnostic processor and network switches via diagnostic network

Hardware techniques (5) • Built-in fault injection capabilities • Power control to individual node components • Injectable glitches into I/O and memory busses • Managed by diagnostic processor • Used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms • Important for AME benchmarking (see next slide)

“Hardware” techniques (6) • Benchmarking • One reason for 1000X processor performance was ability to measure (vs. debate) which is better • e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed? • Need AME benchmarks “what gets measured gets done” “benchmarks shape a field” “quantification brings rigor”
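The three quantities named in the benchmarking example are the factors of the standard processor performance equation: execution time = instructions executed x clocks per instruction / clock rate. The sketch below uses hypothetical numbers, chosen only for illustration, to show why measurement rather than debate is needed to pick among them.

```python
def cpu_time_seconds(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instructions executed x clocks per instruction / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical baseline and three hypothetical improvements (not from the talk).
baseline = cpu_time_seconds(instructions=1e9, cpi=2.0, clock_hz=200e6)
faster_clock = cpu_time_seconds(instructions=1e9, cpi=2.0, clock_hz=250e6)
lower_cpi = cpu_time_seconds(instructions=1e9, cpi=1.6, clock_hz=200e6)
fewer_instructions = cpu_time_seconds(instructions=0.8e9, cpi=2.0, clock_hz=200e6)

for name, t in [("baseline", baseline), ("faster clock", faster_clock),
                ("lower CPI", lower_cpi), ("fewer instructions", fewer_instructions)]:
    print(f"{name}: {t:.2f} s")
```

On paper the three improvements are interchangeable, so only a benchmark run on real workloads reveals which one a given design actually delivers. The slide argues that AME needs the same kind of measurable yardstick.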

ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/256MB DRAM, 18GB disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick”: disk in a half-height canister, portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor • ISTORE Chassis • 80 nodes, 8 per tray • 2 levels of switches: 20 100 Mbit/s, 2 1 Gbit/s • Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors...

A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point

Software techniques • Fully-distributed, shared-nothing code • centralization breaks as systems scale up to O(10000) • avoids single-point-of-failure front ends • Redundant data storage • required for high availability, simplifies self-testing • replication at the level of application objects • application can control consistency policy • more opportunity for data placement optimization

Software techniques (2) • “River” storage interfaces • NOW Sort experience: performance heterogeneity is the norm • e.g., disks: outer vs. inner track (1.5X), fragmentation • e.g., processors: load (1.5-5x) • So demand-driven delivery of data to apps • via distributed queues and graduated declustering • for apps that can handle unordered data delivery • Automatically adapts to variations in performance of producers and consumers • Also helps with evolutionary growth of cluster
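The benefit of demand-driven delivery can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not the River implementation: it models only a single shared queue from which consumers pull (no graduated declustering), and the consumer rates are hypothetical, with one consumer slowed 2x to mimic a loaded processor.

```python
import heapq


def static_partition_time(num_records, rates):
    """Finish time when records are split evenly, regardless of consumer speed."""
    share = num_records / len(rates)
    return max(share / rate for rate in rates)


def demand_driven_time(num_records, rates):
    """Finish time when each consumer pulls the next record as soon as it is free."""
    # Min-heap of (time at which the consumer becomes free, consumer index).
    free_at = [(0.0, i) for i in range(len(rates))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_records):
        start, consumer = heapq.heappop(free_at)
        done = start + 1.0 / rates[consumer]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, consumer))
    return finish


# Hypothetical rates in records per time unit; the last consumer is 2x slower.
RATES = [1.0, 1.0, 1.0, 0.5]
print("static partitioning:", static_partition_time(700, RATES))
print("demand-driven queue:", demand_driven_time(700, RATES))
```

With static partitioning the slow consumer determines the finish time. With the shared queue it simply takes fewer records, so total time tracks aggregate throughput. The same property lets a newly added, faster node contribute immediately, which is the link to evolutionary growth.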

Software techniques (3) • Reactive introspection • Use statistical techniques to identify normal behavior and detect deviations from it • Policy-driven automatic adaptation to abnormal behavior once detected • initially, rely on human administrator to specify policy • eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes • one candidate: reinforcement learning
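One simple form of "identify normal behavior and detect deviations" is a threshold on standard deviations from a baseline. The sketch below is only an illustration of that idea; the slides do not say which statistical technique ISTORE uses, and the function name, threshold, and temperature readings are hypothetical.

```python
from statistics import mean, stdev


def find_deviations(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` that lie more than
    `threshold` sample standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [(i, x) for i, x in enumerate(observed)
            if abs(x - mu) > threshold * sigma]


# Hypothetical disk temperature readings in degrees C.
normal_readings = [40.1, 39.8, 40.3, 40.0, 39.9, 40.2]
new_readings = [40.1, 40.4, 47.5, 39.7]

print(find_deviations(normal_readings, new_readings))
```

A policy layer would then map each flagged deviation to an action, initially one specified by a human administrator.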

Software techniques (4) • Proactive introspection • Continuous online self-testing of HW and SW • in deployed systems! • goal is to shake out “Heisenbugs” before they’re encountered in normal operation • needs data redundancy, node isolation, fault injection • Techniques: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: push HW/SW to their limits • scrubbing: periodic restoration of potentially “decaying” hardware or software state • self-scrubbing data structures (like MVS) • ECC scrubbing for disks and memory
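Scrubbing depends on the data redundancy mentioned above: a periodic pass verifies each copy and restores any that has decayed. The following is a minimal sketch of that loop, assuming in-memory replicas and a SHA-256 checksum recorded at write time. The function names and the storage model are illustrative, not ISTORE code.

```python
import hashlib


def checksum(data):
    """Checksum recorded when the object is written (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas, expected):
    """Verify every replica against `expected`; overwrite damaged replicas
    from an intact one. Returns the number of replicas repaired."""
    good = next((bytes(r) for r in replicas if checksum(r) == expected), None)
    if good is None:
        raise RuntimeError("no intact replica left; data is unrecoverable")
    repaired = 0
    for replica in replicas:
        if checksum(replica) != expected:
            replica[:] = good      # restore the decayed copy in place
            repaired += 1
    return repaired


# Two replicas of one object; flip a byte in the second to emulate silent decay.
replicas = [bytearray(b"payload"), bytearray(b"payload")]
expected = checksum(b"payload")
replicas[1][0] ^= 0xFF
print("repaired:", scrub(replicas, expected))
```

Running the pass continuously in a deployed system repairs latent errors while a good copy still exists, rather than discovering them during a later failure.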

Applications • ISTORE is not one super-system that demonstrates all these techniques! • Initially provide library to support AME goals • Initial application targets • cluster web/email servers • self-scrubbing data structures, online self-testing • statistical identification of normal behavior • decision-support database query execution system • River-based storage, replica management • information retrieval for multimedia data • self-scrubbing data structures, structuring performance-robust distributed computation

Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks

Methodology: reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior? • 99% confidence intervals calculated from no-fault runs • Graphs can be distilled into numbers?
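Comparing a fault run against "normal" behavior could be automated roughly as follows. The slides do not say how the 99% intervals were computed, so this sketch assumes a normal approximation over pooled no-fault samples and flags any measurement interval that falls outside the resulting band. All data values and function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_band(no_fault_samples, confidence=0.99):
    """Band expected to contain about `confidence` of no-fault measurements,
    assuming they are roughly normally distributed (an assumption)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    mu = mean(no_fault_samples)
    sigma = stdev(no_fault_samples)
    return mu - z * sigma, mu + z * sigma


def intervals_outside(trace, band):
    """Indices of measurement intervals whose QoS value leaves the band."""
    low, high = band
    return [i for i, value in enumerate(trace) if not low <= value <= high]


# Hypothetical hits/second per measurement interval.
no_fault = [200, 202, 198, 201, 199, 200, 203, 197]
fault_run = [201, 199, 150, 160, 185, 198]   # fault injected before interval 2

band = normal_band(no_fault)
print("band:", band)
print("degraded intervals:", intervals_outside(fault_run, band))
```

The count and depth of out-of-band intervals is one way a graph could be distilled into numbers, the open question the slide raises.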

Example results: software RAID-5 • Test systems: Linux/Apache and Win2000/IIS • SPECweb99 to measure hits/second as QoS metric • fault injection at disks based on empirical fault data • transient, correctable, uncorrectable, & timeout faults • 15 single-fault workloads injected per system • only 4 distinct behaviors observed: (A) no effect, (B) system hangs, (C) RAID enters degraded mode, (D) RAID enters degraded mode & starts reconstruction • both systems hung (B) on simulated disk hangs • Linux exhibited (D) on all other errors • Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors

Example results: multiple-faults
[Graphs: Windows 2000/IIS and Linux/Apache]
• Windows reconstructs ~3x faster than Linux • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not

Conclusions (1): Benchmarks • Linux and Windows take opposite approaches to managing benign and transient faults • Linux is paranoid and stops using a disk on any error • Windows ignores most benign/transient faults • Windows is more robust except when disk is truly failing • Linux and Windows have different reconstruction philosophies • Linux uses idle bandwidth for reconstruction • Windows steals app. bandwidth for reconstruction • Windows rebuilds fault-tolerance more quickly • Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance

Conclusions (2): ISTORE • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks a powerful tool? • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

Conclusions (3) • IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth • Gadgets: Embedded/Mobile devices • Infrastructure: Intelligent Storage and Networks • PostPC infrastructure requires • New Goals: Availability, Maintainability, Evolution • New Principles: Introspection, Performance Robustness • New Techniques: Isolation/fault insertion, Software scrubbing • New Benchmarks: measure, compare AME metrics

Berkeley Future work • IRAM: fab and test chip • ISTORE • implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications • select the best techniques and integrate into a generic runtime system with “AME API” • add maintainability benchmarks • can we quantify administrative work needed to maintain a certain level of availability? • Perhaps look at data security via encryption? • Even consider denial of service?

The UC Berkeley IRAM/ISTORE Projects: Computers for the PostPC Era For more information: http://iram.cs.berkeley.edu/istore istore-group@cs.berkeley.edu

Backup Slides (mostly in the area of benchmarking)

Case study • Software RAID-5 plus web server • Linux/Apache vs. Windows 2000/IIS • Why software RAID? • well-defined availability guarantees • RAID-5 volume should tolerate a single disk failure • reduced performance (degraded mode) after failure • may automatically rebuild redundancy onto spare disk • simple system • easy to inject storage faults • Why web server? • an application with measurable QoS metrics that depend on RAID availability and performance

Benchmark environment: metrics • QoS metrics measured • hits per second • roughly tracks response time in our experiments • degree of fault tolerance in storage system • Workload generator and data collector • SPECweb99 web benchmark • simulates realistic high-volume user load • mostly static read-only workload; some dynamic content • modified to run continuously and to measure average hits per second over each 2-minute interval

Benchmark environment: faults • Focus on faults in the storage system (disks) • How do disks fail? • according to Tertiary Disk project, failures include: • recovered media errors • uncorrectable write failures • hardware errors (e.g., diagnostic failures) • SCSI timeouts • SCSI parity errors • note: no head crashes, no fail-stop failures

Disk fault injection technique • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk • a PC that appears as a disk on the SCSI bus • I/O requests processed in software, reflected to local disk • fault injection performed by altering SCSI command processing in the emulation software • Types of emulated faults: • media errors (transient, correctable, uncorrectable) • hardware errors (firmware, mechanical) • parity errors • power failures • disk hangs/timeouts
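The idea of injecting faults by altering command processing in emulation software can be shown with a toy model. The real emulator speaks SCSI. The class below is only a hypothetical in-memory stand-in whose read path consults a fault table, covering the transient and sticky (uncorrectable) media-error cases.

```python
class DiskError(Exception):
    """Raised when the emulated disk reports a media error."""


class EmulatedDisk:
    """Toy block device with a fault table on the read path (illustrative only)."""

    def __init__(self):
        self.blocks = {}
        self.faults = {}   # block number -> remaining failures (None = sticky)

    def inject(self, block, failures=None):
        """Fail the next `failures` reads of `block`; None means fail forever."""
        self.faults[block] = failures

    def write(self, block, data):
        self.blocks[block] = data

    def read(self, block):
        remaining = self.faults.get(block, 0)
        if remaining is None:                     # sticky, uncorrectable fault
            raise DiskError(f"uncorrectable media error at block {block}")
        if remaining > 0:                         # transient fault: count down
            self.faults[block] = remaining - 1
            raise DiskError(f"transient media error at block {block}")
        return self.blocks[block]


disk = EmulatedDisk()
disk.write(7, b"data")
disk.inject(7, failures=1)        # one transient error, then normal service
try:
    disk.read(7)
except DiskError as err:
    print("first read:", err)
print("retry:", disk.read(7))
```

Because the fault table is explicit state, the same fault can be replayed identically against different systems under test, which is what makes the injected failures reproducible.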
