
Computers for the Post-PC Era

This presentation covers the post-PC era, which will be driven by tiny embedded devices and the infrastructure needed to support them. It explores the motivation behind the ISTORE project and its research principles, as well as proposed techniques for achieving availability, maintainability, and evolutionary growth. It also addresses the challenges and potential solutions for memory systems in microprocessors and the scalability problems in traditional server designs.


Presentation Transcript


  1. Computers for the Post-PC Era David Patterson University of California at Berkeley Patterson@cs.berkeley.edu UC Berkeley IRAM Group UC Berkeley ISTORE Group istore-group@cs.berkeley.edu 10 February 2000

  2. Perspective on Post-PC Era • Post-PC Era will be driven by 2 technologies: 1) Tiny Embedded or Mobile Consumer Devices • e.g., successor to PDA, cell phone, wearable computers • ubiquitous: in everything 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers

  3. Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work

  4. Revive Vector Architecture • Common objections to vector machines, and the IRAM answers: • Cost: $1M each? → single-chip CMOS MPU/IRAM • Low-latency, high-bandwidth memory system? → IRAM • Code density? → much smaller than VLIW • Compilers? → for sale, mature (>20 years); we retarget Cray compilers • Performance? → easy to scale speed with technology • Power/Energy? → parallel lanes save energy while keeping performance • Limited to scientific applications? → multimedia apps vectorizable too: N×64b, 2N×32b, 4N×16b
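
The last point, that one vector register file can be carved into N 64-bit, 2N 32-bit, or 4N 16-bit elements, can be illustrated with a short sketch. NumPy stands in for the vector unit here, and the 2048-bit register width is an assumption chosen for illustration, not the VIRAM specification.

```python
import numpy as np

REGISTER_BITS = 2048          # hypothetical vector register width, illustration only

def elements(width_bits: int) -> int:
    """How many packed elements of a given width fit in one vector register."""
    return REGISTER_BITS // width_bits

# The same register holds N 64-bit, 2N 32-bit, or 4N 16-bit elements.
for width in (64, 32, 16):
    print(f"{width}-bit elements per register: {elements(width)}")

# A media kernel such as saturating pixel addition touches every lane with one
# vector instruction; NumPy's element-wise operations play that role here.
a = np.arange(elements(16), dtype=np.uint16)
b = np.full(elements(16), 1000, dtype=np.uint16)
saturated = np.minimum(a.astype(np.uint32) + b, np.iinfo(np.uint16).max).astype(np.uint16)
```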

  5. V-IRAM1: Low Power v. High Perf. • [Block diagram: 2-way superscalar processor with vector instruction queue, load/store unit, and vector registers; 16K I-cache and 16K D-cache; vector datapath operating on 4×64-bit, 8×32-bit, or 16×16-bit elements; memory crossbar switch with 4×64 links to banks of DRAM macros; serial I/O and parallel I/O ports]

  6. VIRAM-1: System on a Chip • Prototype scheduled for tape-out mid-2000 • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines (lanes) @ 200 MHz • 4 parallel I/O lines at 100 MB/s • 17×17 mm, 2 Watts • 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar) • 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit) • [Floorplan: CPU + $, 4 vector pipes/lanes, crossbar, I/O, two 64-Mbit / 8-MByte memory halves]

  7. Media Kernel Performance

  8. Base-line system comparison • All numbers in cycles/pixel • MMX and VIS results assume all data in L1 cache

  9. IRAM Chip Challenges • Merged Logic-DRAM process: Cost of wafer, Impact on yield, testing cost of logic and DRAM • Price: on-chip DRAM v. separate DRAM chips? • Time delay of transistor speeds, memory cell sizes in Merged process vs. Logic only or DRAM only • DRAM block: flexibility via DRAM “compiler” (vary size, width, no. subbanks) vs. fixed block • Applications: advantages in memory bandwidth, energy, system size to offset above challenges?

  10. Other examples: Sony PlayStation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games!

  11. Other examples: IBM Blue Gene • Blue Gene Chip • 20 x 20 mm • 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single chip • 1 GFLOPS / processor • 2’ x 2’ Board = 64 chips • Tower = 8 Boards • System = 64 Towers • Total 1 million processors (2^5 x 2^6 x 2^3 x 2^6 = 2^20), in just 2000 sq. ft. • Cost: $100M • Goal: 1 PetaFLOPS in 2005? • Application: Protein Folding

  12. Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work

  13. The problem space: big data • Big demand for enormous amounts of data • today: high-end enterprise and Internet applications • enterprise decision-support, data mining databases • online applications: e-commerce, mail, web, archives • future: infrastructure services, richer data • computational & storage back-ends for mobile devices • more multimedia content • more use of historical data to provide better services • Today’s server designs can’t easily scale to meet these huge demands • bus bandwidth bottlenecks limit access to stored data • SMP designs are near their limits and don’t offer incremental growth path

  14. One approach: traditional NAS • Network-attached storage makes storage devices first-class citizens on the network • network file server appliances (NetApp, SNAP, ...) • storage-area networks (CMU NASD, NSIC OOD, ...) • active disks (CMU, UCSB, Berkeley IDISK) • These approaches primarily target performance scalability • scalable networks remove bus bandwidth limitations • migration of layout functionality to storage devices removes overhead of intermediate servers • There are bigger scaling problems than scalable performance!

  15. The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow

  16. The ISTORE project vision • Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that: • are highly available • require minimal maintenance • robustly handle evolutionary growth • are scalable to O(10000) nodes

  17. Principles for achieving AME (1) • No single points of failure • Redundancy everywhere • Performance robustness is more important than peak performance • “performance robustness” implies that real-world performance is comparable to best-case performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • compare: biological systems spend > 50% of resources on maintenance • can make up performance by scaling system

  18. Principles for achieving AME (2) • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen • Benchmarking • quantification brings rigor • requires new AME benchmarks “what gets measured gets done” “benchmarks shape a field”

  19. Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work

  20. Hardware techniques • Fully shared-nothing cluster organization • truly scalable architecture • architecture that can tolerate partial failure • automatic hardware redundancy • Storage distributed with computation nodes • distributed processing reduces data movement and avoids network bottlenecks • nodes are responsible for the health of the storage that they own • if AME is important, must provide resources to be used for AME

  21. Hardware techniques (2) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network

  22. Hardware techniques (3) • Built-in fault injection capabilities • power control to individual node components • injectable glitches into I/O and memory busses • on-demand network partitioning/isolation • managed by diagnostic processor and network switches via diagnostic network • used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms • important for AME benchmarking

  23. ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4 TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick”: portable-PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor, half-height disk canister • ISTORE Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 100 Mb/s, 2 1 Gb/s • Environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...

  24. ISTORE Brick Block Diagram • [Block diagram: Mobile Pentium II module with CPU, North Bridge, and 256 MB DRAM; PCI bus to SCSI (18 GB disk) and 4×100 Mb/s Ethernets; South Bridge, Super I/O, BIOS, Flash, RTC, RAM; dual UART linking a separate diagnostic processor (monitor & control) to the diagnostic net] • Sensors for heat and vibration • Control over power to individual nodes

  25. A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • 10,000+ nodes fit into one rack! • This scale is our ultimate design point

  26. Software techniques • Fully-distributed, shared-nothing code • centralization breaks down as systems scale up to O(10000) nodes • avoids single-point-of-failure front ends • Redundant data storage • required for high availability, simplifies self-testing • replication at the level of application objects • application can control consistency policy • more opportunity for data placement optimization

  27. Software techniques (2) • “River” storage interfaces • NOW Sort experience: performance heterogeneity is the norm • disks: inner vs. outer track (50%), fragmentation • processors: load (1.5-5x) • Solution: demand-driven delivery of data to apps, as sketched below • via distributed queues and graduated declustering • for apps that can handle unordered data delivery • automatically adapts to variations in performance of producers and consumers
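
A minimal sketch of this demand-driven ("River"-style) delivery for an application that tolerates unordered records: several producers of very different speeds feed one shared queue, and the consumer simply pulls whatever is ready, so faster producers naturally contribute more. The queue size, producer rates, and thread structure are illustrative assumptions, not the actual River implementation.

```python
import queue
import random
import threading
import time

records = queue.Queue(maxsize=64)            # shared, demand-driven queue

def producer(name: str, delay: float, n: int) -> None:
    """Emit n records at a producer-specific rate (heterogeneous disks/CPUs)."""
    for i in range(n):
        time.sleep(delay * random.uniform(0.5, 1.5))
        records.put((name, i))
    records.put((name, None))                 # end-of-stream marker

def consumer(n_producers: int) -> None:
    """Pull records in whatever order they arrive; no fixed producer schedule."""
    done = 0
    while done < n_producers:
        name, item = records.get()
        if item is None:
            done += 1
        # ... process the unordered record here ...

producers = [threading.Thread(target=producer, args=(f"disk{d}", 0.001 * (d + 1), 100))
             for d in range(4)]
sink = threading.Thread(target=consumer, args=(len(producers),))
for t in producers + [sink]:
    t.start()
for t in producers + [sink]:
    t.join()
```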

  28. Software techniques (3) • Reactive introspection • use statistical techniques to identify normal behavior and detect deviations from it • policy-driven automatic adaptation to abnormal behavior once detected • initially, rely on human administrator to specify policy • eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes • one candidate: reinforcement learning
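
A minimal sketch of the reactive half of this idea: learn what "normal" looks like for one metric from a sliding window of samples and flag statistical deviations so an adaptation policy can be invoked. The window size, warm-up length, and 3-sigma threshold are illustrative assumptions, not ISTORE parameters.

```python
import random
import statistics
from collections import deque

class DeviationDetector:
    """Flags samples that deviate from recently observed 'normal' behavior."""

    def __init__(self, window: int = 120, n_sigmas: float = 3.0):
        self.history = deque(maxlen=window)
        self.n_sigmas = n_sigmas

    def observe(self, value: float) -> bool:
        """Record one sample; return True if it deviates from learned normal."""
        abnormal = False
        if len(self.history) >= 30:                    # wait for enough history
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-9
            abnormal = abs(value - mean) > self.n_sigmas * std
        self.history.append(value)
        return abnormal

detector = DeviationDetector()
for t in range(300):
    latency_ms = random.gauss(10, 1)                   # normal per-request latency
    if t == 250:
        latency_ms = 80                                # simulated misbehaving node
    if detector.observe(latency_ms):
        print(f"t={t}: deviation detected ({latency_ms:.1f} ms) -> apply adaptation policy")
```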

  29. Software techniques (4) • Proactive introspection • continuous online self-testing of HW and SW • in deployed systems! • goal is to shake out “Heisenbugs” before they’re encountered in normal operation • needs data redundancy, node isolation, fault injection • techniques: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: push HW/SW to their limits • scrubbing: periodic restoration of potentially “decaying” hardware or software state • self-scrubbing data structures (like MVS) • ECC scrubbing for disks and memory
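
A minimal sketch of the scrubbing idea, assuming checksummed blocks with redundant copies: periodically recompute each block's checksum and restore any block that has silently decayed. A real system would walk disk sectors and ECC-protected memory words rather than Python byte strings.

```python
import zlib

def scrub(blocks, checksums, replicas):
    """Detect silently decayed blocks via stored checksums and repair from replicas."""
    repaired = []
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != checksums[i]:
            blocks[i] = replicas[i]                    # restore from the redundant copy
            repaired.append(i)
    return repaired

blocks = [b"payload-0", b"payload-1", b"payload-2"]
checksums = [zlib.crc32(b) for b in blocks]            # computed when the data was written
replicas = list(blocks)                                # redundant copies, e.g. on other nodes
blocks[1] = b"payl0ad-1"                               # simulate silent bit rot
print(scrub(blocks, checksums, replicas))              # -> [1]
```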

  30. Applications • ISTORE is not one super-system that demonstrates all these techniques! • Initially provide library to support AME goals • Initial application targets • cluster web/email servers • self-scrubbing data structures, online self-testing • statistical identification of normal behavior • decision-support database query execution system • River-based storage, replica management • information retrieval for multimedia data • self-scrubbing data structures, structuring performance-robust distributed computation

  31. Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work

  32. Availability benchmarks • Questions to answer • what factors affect the quality of service delivered by the system, and by how much/how long? • how well can systems survive typical failure scenarios? • Availability metrics • traditionally, percentage of time system is up • time-averaged, binary view of system state (up/down) • traditional metric is too inflexible • doesn’t capture spectrum of degraded states • time-averaging discards important temporal behavior • Solution: measure variation in system quality of service metrics over time • performance, fault-tolerance, completeness, accuracy

  33. Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks
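
A minimal sketch of a single-fault run under this methodology: sample the QoS metric every interval, inject exactly one fault partway through, and let the system run until it stabilizes. The two stub functions stand in for the real workload generator (e.g., SpecWeb99 averaged over 2-minute intervals) and the fault-injection harness; they are assumptions for illustration.

```python
import random

def measure_qos_interval() -> float:
    """Stand-in for hits/second averaged over one 2-minute interval of real load."""
    return random.gauss(220, 5)

def inject_fault(kind: str) -> None:
    """Stand-in for the fault-injection harness (disk, memory, network, power, ...)."""
    print(f"injecting fault: {kind}")

FAULT_INTERVAL = 10                        # inject the single fault in interval 10

trace = []
for interval in range(30):
    if interval == FAULT_INTERVAL:
        inject_fault("disk: uncorrectable read error")
    trace.append(measure_qos_interval())

# 'trace' is the raw material for the graphs and distilled numbers on the next slide.
```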

  34. Methodology: reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior • 99% confidence intervals calculated from no-fault runs • Graphs can be distilled into numbers • quantify distribution of deviations from normal behavior, compute area under curve for deviations, ...
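
A minimal sketch of this distillation step: build a 99% confidence band for "normal" per-interval QoS from several no-fault runs, then count out-of-band intervals and sum the shortfall below the band (an area-under-the-curve style number) for a faulty run. The z = 2.576 multiplier assumes a normal approximation, and the traces below are made up for illustration.

```python
import statistics

def normal_band(no_fault_runs):
    """99% confidence interval for each interval's QoS, from several no-fault runs."""
    band = []
    for samples in zip(*no_fault_runs):                # group samples by interval
        mean = statistics.fmean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        band.append((mean - 2.576 * sem, mean + 2.576 * sem))
    return band

def quantify_deviation(fault_run, band):
    """Count intervals below the band and sum how far below they fall."""
    out_of_band, shortfall = 0, 0.0
    for qos, (lo, hi) in zip(fault_run, band):
        if qos < lo:
            out_of_band += 1
            shortfall += lo - qos
    return out_of_band, shortfall

no_fault_runs = [[220, 221, 219, 222], [218, 220, 221, 220], [219, 222, 220, 221]]
fault_run = [220, 150, 160, 210]                       # hypothetical degraded-mode trace
print(quantify_deviation(fault_run, normal_band(no_fault_runs)))
```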

  35. Example results: software RAID-5 • Test systems: Linux/Apache and Win2000/IIS • SpecWeb99 to measure hits/second as QoS metric • fault injection at disks based on empirical fault data • transient, correctable, uncorrectable, & timeout faults • 15 single-fault workloads injected per system • only 4 distinct behaviors observed: (A) no effect (B) system hangs (C) RAID enters degraded mode (D) RAID enters degraded mode & starts reconstruction • both systems hung (B) on simulated disk hangs • Linux exhibited (D) on all other errors • Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors

  36. Example results: multiple-faults • [Graphs: QoS over time under multiple injected faults, for Windows 2000/IIS and Linux/Apache] • Windows reconstructs ~3x faster than Linux • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not

  37. Conclusions • IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth • Mobile consumer electronic devices • Scalable infrastructure • IRAM benchmarking result: faster than DSPs • ISTORE: hardware/software architecture for large-scale network services • Scaling systems requires • new continuous models of availability • performance not limited by the weakest link • self-* systems to reduce human interaction

  38. Benchmark conclusions • Linux and Windows take opposite approaches to managing benign and transient faults • Linux is paranoid and stops using a disk on any error • Windows ignores most benign/transient faults • Windows is more robust except when disk is truly failing • Linux and Windows have different reconstruction philosophies • Linux uses idle bandwidth for reconstruction • Windows steals app. bandwidth for reconstruction • Windows rebuilds fault-tolerance more quickly • Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance

  39. ISTORE conclusions • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks are a powerful tool • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

  40. Future work • ISTORE • implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications • select the best techniques and integrate into a generic runtime system with “AME API” • AME benchmarks • expand availability benchmarks to distributed apps • add maintainability • use methodology from availability benchmark • but include administrator’s response to faults • must develop model of typical administrator behavior • can we quantify administrative work needed to maintain a certain level of availability?

  41. The UC Berkeley ISTORE Project: bringing availability, maintainability, and evolutionary growth to storage-based clusters For more information: http://iram.cs.berkeley.edu/istore istore-group@cs.berkeley.edu

  42. Backup Slides (mostly in the area of benchmarking)

  43. Case study • Software RAID-5 plus web server • Linux/Apache vs. Windows 2000/IIS • Why software RAID? • well-defined availability guarantees • RAID-5 volume should tolerate a single disk failure • reduced performance (degraded mode) after failure • may automatically rebuild redundancy onto spare disk • simple system • easy to inject storage faults • Why web server? • an application with measurable QoS metrics that depend on RAID availability and performance

  44. Benchmark environment: metrics • QoS metrics measured • hits per second • roughly tracks response time in our experiments • degree of fault tolerance in storage system • Workload generator and data collector • SpecWeb99 web benchmark • simulates realistic high-volume user load • mostly static read-only workload; some dynamic content • modified to run continuously and to measure average hits per second over each 2-minute interval

  45. Benchmark environment: faults • Focus on faults in the storage system (disks) • How do disks fail? • according to Tertiary Disk project, failures include: • recovered media errors • uncorrectable write failures • hardware errors (e.g., diagnostic failures) • SCSI timeouts • SCSI parity errors • note: no head crashes, no fail-stop failures

  46. Disk fault injection technique • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk • a PC that appears as a disk on the SCSI bus • I/O requests processed in software, reflected to local disk • fault injection performed by altering SCSI command processing in the emulation software • Types of emulated faults: • media errors (transient, correctable, uncorrectable) • hardware errors (firmware, mechanical) • parity errors • power failures • disk hangs/timeouts
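
A minimal sketch of what command-level fault injection in such an emulated disk could look like: requests are served from a backing store in software, and a fault plan alters how particular commands are handled. The command names, status strings, and FaultPlan shape are assumptions for illustration, not the actual emulator's interface.

```python
class FaultPlan:
    def __init__(self, kind: str = "none", target_block: int = -1):
        self.kind = kind                    # "none", "media_uncorrectable", "timeout", ...
        self.target_block = target_block

class EmulatedDisk:
    """Software disk: serves requests from a backing store, injecting planned faults."""

    def __init__(self, blocks: int, plan: FaultPlan):
        self.store = bytearray(blocks * 512)            # stands in for the local backing disk
        self.plan = plan

    def read(self, block: int) -> tuple[str, bytes]:
        if self.plan.kind == "timeout" and block == self.plan.target_block:
            return ("TIMEOUT", b"")                     # host sees the command hang
        if self.plan.kind == "media_uncorrectable" and block == self.plan.target_block:
            return ("MEDIUM_ERROR", b"")                # uncorrectable media error
        return ("GOOD", bytes(self.store[block * 512:(block + 1) * 512]))

    def write(self, block: int, data: bytes) -> str:
        self.store[block * 512:(block + 1) * 512] = data.ljust(512, b"\0")[:512]
        return "GOOD"

disk = EmulatedDisk(blocks=1024, plan=FaultPlan("media_uncorrectable", target_block=7))
print(disk.read(3)[0], disk.read(7)[0])                 # GOOD MEDIUM_ERROR
```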

  47. System configuration • [Diagram: server PC (AMD K6-2-333, 64 MB DRAM, Linux or Win2000) with an IDE system disk and Adaptec 2940 UltraSCSI controllers attached to the IBM 18 GB 10k RPM RAID data disks, the emulated disk, and the emulated spare disk; disk-emulator PC (AMD K6-2-350, Windows NT 4.0, ASC VirtualSCSI lib., AdvStor ASC-U2W controller) with a SCSI system disk and an IBM 18 GB 10k RPM emulator backing disk (NTFS); disk links are Fast/Wide SCSI buses, 20 MB/sec] • RAID-5 Volume: 3 GB capacity, 1 GB used per disk • 3 physical disks, 1 emulated disk, 1 emulated spare disk • 2 web clients connected via 100 Mb switched Ethernet

  48. Results: single-fault experiments • One exp’t for each type of fault (15 total) • only one fault injected per experiment • no human intervention • system allowed to continue until stabilized or crashed • Four distinct system behaviors observed (A) no effect: system ignores fault (B) RAID system enters degraded mode (C) RAID system begins reconstruction onto spare disk (D) system failure (hang or crash)

  49. System behavior: single-fault • [QoS-over-time graphs for each observed behavior: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure]

  50. System behavior: single-fault (2) • Windows ignores benign faults • Windows can’t automatically rebuild • Linux reconstructs on all errors • Both systems fail when disk hangs
