ISTORE Update David Patterson University of California at Berkeley Patterson@cs.berkeley.edu UC Berkeley IRAM Group UC Berkeley ISTORE Group istore-group@cs.berkeley.edu May 2000
Lampson: Systems Challenges • Systems that work • Meeting their specs • Always available • Adapting to changing environment • Evolving while they run • Made from unreliable components • Growing without practical limit • Credible simulations or analysis • Writing good specs • Testing • Performance • Understanding when it doesn’t matter “Computer Systems Research-Past and Future” Keynote address, 17th SOSP, Dec. 1999 Butler Lampson Microsoft
Hennessy: What Should the “New World” Focus Be? • Availability • Both appliance & service • Maintainability • Two functions: • Enhancing availability by preventing failure • Ease of SW and HW upgrades • Scalability • Especially of service • Cost • per device and per service transaction • Performance • Remains important, but it’s not SPECint “Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999 John Hennessy Stanford
The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow
Principles for achieving AME (1) • No single points of failure • Redundancy everywhere • Performance robustness is more important than peak performance • “performance robustness” implies that real-world performance is comparable to best-case performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • compare: biological systems spend > 50% of resources on maintenance • can make up performance by scaling system
Principles for achieving AME (2) • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen
ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/256MB DRAM, 18GB half-height disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick”: portable-PC CPU (Pentium II/266) + DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor • ISTORE Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 x 100 Mbit/s, 2 x 1 Gbit/s • Environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...
ISTORE-1 Status • 10 nodes manufactured; 45 boards fabbed, 40 to go • Boots OS • Diagnostic Processor interface SW complete • PCB backplane: not yet designed • Finish 80-node system: Summer 2000
Hardware techniques • Fully shared-nothing cluster organization • truly scalable architecture • architecture that tolerates partial failure • automatic hardware redundancy
Hardware techniques (2) • No central processor unit: distribute processing with storage • Serial lines and switches are also growing with Moore’s Law; less need today to centralize vs. bus-oriented systems • Most storage servers are limited by the speed of their CPUs; why does this make sense? • Why not amortize the sheet metal, power, and cooling infrastructure for the disk to add processor, memory, and network? • If AME is important, must provide resources to be used to help AME: local processors responsible for the health and maintenance of their storage
Hardware techniques (3) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network
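A minimal sketch of what a diagnostic-processor monitoring loop with a “flight recorder” could look like; the sensor names, nominal ranges, and helper functions are illustrative assumptions, not the ISTORE firmware:

```python
import collections, time

# Sketch of a diagnostic-processor monitoring loop: sample environmental
# sensors, keep a "flight recorder" ring buffer (stand-in for non-volatile
# storage), and flag readings that fall outside nominal ranges.

FLIGHT_RECORDER = collections.deque(maxlen=10_000)   # last N samples retained
NOMINAL = {"temp_c": (10, 45), "vibration_g": (0.0, 0.5), "volts_5v": (4.75, 5.25)}

def read_sensors():
    # Placeholder: a real diagnostic processor would read I2C/ADC sensors here.
    return {"temp_c": 31.2, "vibration_g": 0.1, "volts_5v": 5.02}

def check_sample(sample):
    """Return the sensors whose readings fall outside their nominal bounds."""
    return [k for k, (lo, hi) in NOMINAL.items() if not lo <= sample[k] <= hi]

def monitor_loop(report_anomaly, period_s=1.0):
    while True:
        sample = dict(read_sensors(), t=time.time())
        FLIGHT_RECORDER.append(sample)
        bad = check_sample(sample)
        if bad:
            report_anomaly(sample, bad)   # e.g., send over the diagnostic network
        time.sleep(period_s)
```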
Hardware techniques (4) • On-demand network partitioning/isolation • Internet applications must remain available despite component failures, so a subset of nodes can be isolated for preventive maintenance • Allows testing and repair of an online system • Managed by diagnostic processor and network switches via diagnostic network
Hardware techniques (5) • Built-in fault injection capabilities • Power control to individual node components • Injectable glitches into I/O and memory busses • Managed by diagnostic processor • Used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms • Important for AME benchmarking (see next slide)
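A rough sketch of how a fault-injection campaign driven by the diagnostic processor might be scripted; the DiagnosticProcessor class, fault names, and check_recovery hook are hypothetical placeholders, not a real interface:

```python
import random

# Sketch of a fault-injection campaign: pick a node and a fault type, inject
# it, and record whether the system's error-recovery path handled it.

FAULT_TYPES = ["power_cycle_component", "io_bus_glitch", "memory_bus_glitch"]

class DiagnosticProcessor:
    def inject(self, node_id, fault):
        # A real implementation would send a command over the diagnostic network.
        print(f"injecting {fault} on node {node_id}")

def run_campaign(dp, nodes, trials, check_recovery):
    results = []
    for _ in range(trials):
        node, fault = random.choice(nodes), random.choice(FAULT_TYPES)
        dp.inject(node, fault)
        recovered = check_recovery(node)      # did the error-handling path work?
        results.append((node, fault, recovered))
    return results
```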
“Hardware” techniques (6) • Benchmarking • One reason for 1000X processor performance was ability to measure (vs. debate) which is better • e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed? • Need AME benchmarks “what gets measured gets done” “benchmarks shape a field” “quantification brings rigor”
Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks
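The methodology can be pictured as a small harness: run the performance workload in intervals, record a QoS metric per interval, and fire a fault partway through the run. A minimal sketch, assuming run_workload_interval and inject_fault hooks are supplied by the benchmark:

```python
import time, threading

# Availability-benchmark harness sketch: trace a QoS metric (e.g., requests
# completed per interval) over time while injecting a fault mid-run.

def availability_benchmark(run_workload_interval, inject_fault,
                           total_s=600, interval_s=5, fault_at_s=180):
    trace = []                      # (elapsed seconds, QoS metric) samples
    start = time.time()
    fault_done = False
    while (elapsed := time.time() - start) < total_s:
        if not fault_done and elapsed >= fault_at_s:
            threading.Thread(target=inject_fault).start()   # e.g., fail a disk
            fault_done = True
        qos = run_workload_interval(interval_s)   # run workload, return metric
        trace.append((elapsed, qos))
    return trace
```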
Benchmark Availability? Methodology for reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior? • 99% confidence intervals calculated from no-fault runs
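One plausible way to derive the “normal” band: compute a 99% confidence interval over QoS samples from the no-fault runs (normal approximation assumed; the sample values below are made up):

```python
import statistics

# Sketch: derive the "normal behavior" band from no-fault QoS samples using
# a 99% confidence interval (normal approximation, z = 2.576).

def normal_band(no_fault_samples, z=2.576):
    mean = statistics.mean(no_fault_samples)
    sem = statistics.stdev(no_fault_samples) / len(no_fault_samples) ** 0.5
    return mean - z * sem, mean + z * sem

lo, hi = normal_band([812, 798, 805, 820, 809, 801])   # illustrative samples
print(f"99% confidence band: {lo:.1f} .. {hi:.1f}")
```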
Example single-fault result • Compares Linux and Solaris reconstruction • Linux: minimal performance impact but longer window of vulnerability to second fault • Solaris: large perf. impact but restores redundancy fast • (figure: QoS over time during reconstruction, Linux vs. Solaris panels)
Reconstruction Policy • Linux: favors performance over data availability • automatically-initiated reconstruction, idle bandwidth • virtually no performance impact on application • very long window of vulnerability (>1hr for 3GB RAID) • Solaris: favors data availability over app. perf. • automatically-initiated reconstruction at high BW • as much as 34% drop in application performance • short window of vulnerability (10 minutes for 3GB) • Windows: favors neither! • manually-initiated reconstruction at moderate BW • as much as 18% app. performance drop • somewhat short window of vulnerability (23 min/3GB)
Transient error handling Policy • Linux is paranoid with respect to transients • stops using affected disk (and reconstructs) on any error, transient or not • fragile: system is more vulnerable to multiple faults • disk-inefficient: wastes two disks per transient • but no chance of slowly-failing disk impacting perf. • Solaris and Windows are more forgiving • both ignore most benign/transient faults • robust: less likely to lose data, more disk-efficient • less likely to catch slowly-failing disks and remove them • Neither policy is ideal! • need a hybrid that detects streams of transients
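A sketch of the hybrid policy the slide calls for, expressed as a per-disk error-rate threshold; the window and count limits are illustrative assumptions, not measured values:

```python
import time, collections

# Hybrid transient-error policy sketch: tolerate isolated transient errors,
# but eject a disk that produces a *stream* of them within a time window.

class TransientPolicy:
    def __init__(self, max_errors=5, window_s=600):
        self.max_errors, self.window_s = max_errors, window_s
        self.errors = collections.defaultdict(collections.deque)  # disk -> timestamps

    def on_error(self, disk, is_transient):
        if not is_transient:
            return "eject_and_reconstruct"          # hard errors: act like Linux
        ts, now = self.errors[disk], time.time()
        ts.append(now)
        while ts and now - ts[0] > self.window_s:   # drop errors outside the window
            ts.popleft()
        if len(ts) > self.max_errors:               # a stream of transients
            return "eject_and_reconstruct"
        return "retry_and_ignore"                   # isolated transient: act like Solaris
```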
Software techniques • Fully-distributed, shared-nothing code • centralization breaks as systems scale up to O(10,000) nodes • avoids single-point-of-failure front ends • Redundant data storage • required for high availability, simplifies self-testing • replication at the level of application objects • application can control consistency policy • more opportunity for data placement optimization
Software techniques (2) • “River” storage interfaces • NOW Sort experience: performance heterogeneity is the norm • e.g., disks: outer vs. inner track (1.5X), fragmentation • e.g., processors: load (1.5-5x) • So demand-driven delivery of data to apps • via distributed queues and graduated declustering • for apps that can handle unordered data delivery • Automatically adapts to variations in performance of producers and consumers • Also helps with evolutionary growth of cluster
Software techniques (3) • Reactive introspection • Use statistical techniques to identify normal behavior and detect deviations from it • Policy-driven automatic adaptation to abnormal behavior once detected • initially, rely on human administrator to specify policy • eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes • one candidate: reinforcement learning
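A minimal sketch of statistical deviation detection over a sliding window of some metric (e.g., per-node request latency); the window size and 3-sigma threshold are assumptions:

```python
import statistics, collections

# Reactive-introspection sketch: learn "normal" from a sliding window of a
# metric, then flag samples that fall more than k standard deviations away.

class DeviationDetector:
    def __init__(self, window=500, k=3.0):
        self.history = collections.deque(maxlen=window)
        self.k = k

    def observe(self, value):
        abnormal = False
        if len(self.history) >= 30:                   # need enough data first
            mean = statistics.mean(self.history)
            std = statistics.stdev(self.history) or 1e-9
            abnormal = abs(value - mean) > self.k * std
        self.history.append(value)
        return abnormal    # caller applies the adaptation policy when True
```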
Software techniques (4) • Proactive introspection • Continuous online self-testing of HW and SW • in deployed systems! • goal is to shake out “Heisenbugs” before they’re encountered in normal operation • needs data redundancy, node isolation, fault injection • Techniques: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: push HW/SW to their limits • scrubbing: periodic restoration of potentially “decaying” hardware or software state • self-scrubbing data structures (like MVS) • ECC scrubbing for disks and memory
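A sketch of what block-level data scrubbing could look like; read_block, read_replica, expected_checksum, and write_block are hypothetical helpers passed in by the storage layer:

```python
import hashlib

# Scrubbing sketch: periodically read every block, compare its checksum
# against stored metadata, and repair "decayed" data from a replica on mismatch.

def scrub(num_blocks, read_block, read_replica, expected_checksum, write_block):
    repaired = 0
    for b in range(num_blocks):
        data = read_block(b)
        if hashlib.sha1(data).hexdigest() != expected_checksum(b):
            write_block(b, read_replica(b))   # restore from a redundant copy
            repaired += 1
    return repaired
```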
Initial Applications • ISTORE is not one super-system that demonstrates all these techniques! • Initially provide middleware, library to support AME goals • Initial application targets • cluster web/email servers • self-scrubbing data structures, online self-testing • statistical identification of normal behavior • information retrieval for multimedia data • self-scrubbing data structures, structuring performance-robust distributed computation
A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • If low power, 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point
Future Targets • Maintenance in DoD application • Security in Computer Systems • Computer Vision
Maintenance in DoD systems • Introspective Middleware, Built-in Fault Injection, Diagnostic Computer, Isolatable Subsystems ... should reduce Maintenance of DoD Hardware and Software systems • Is Maintenance a major concern of DoD? • Does Improved Maintenance fit within the Goals of Polymorphous Computing Architecture?
Security in DoD Systems? • Separate Diagnostic Processor and Network gives interesting Security possibilities • Monitoring of behavior by separate computer • Isolation of portion of cluster from rest of network • Remote reboot, software installation
Attacking Computer Vision • Analogy: computer vision recognition in 2000 is like computer speech recognition in 1985 • Pre-1985, the community searched for good algorithms: classic AI vs. statistics? • By 1985, consensus reached on statistics • Field focuses and makes progress, uses special hardware • Systems become fast enough that they can be trained rather than preloaded with information, which accelerates progress • By 1995, speech recognition systems were starting to deploy • By 2000, widely used, available on PCs
Computer Vision at Berkeley • Jitendra Malik believes he has an approach that is very promising • 2-step process: 1) Segmentation: divide image into regions of coherent color, texture, and motion 2) Recognition: combine regions and search image database to find a match • Algorithms for 1) work well, just slowly (300 seconds per image on a PC) • Algorithms for 2) are being tested this summer using hundreds of PCs; this will determine accuracy
Human Quality Computer Vision • Suppose the algorithms work: what would it take to match human vision? • At 30 images per second: segmentation • Convolution and Vector-Matrix Multiply of Sparse Matrices (10,000 x 10,000, 10% nonzero/row) • 32-bit Floating Point • 300 seconds on a PC (assuming 333 MFLOPS) => 100 GFLOP/image • 30 Hz => 3000 GFLOPS machine to do segmentation
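For reference, the arithmetic behind the segmentation numbers, using the 333 MFLOPS PC assumed on the slide:

```latex
300\,\mathrm{s} \times 333\,\mathrm{MFLOPS} \approx 10^{11}\,\mathrm{FLOP}
  = 100\,\mathrm{GFLOP\ per\ image}, \qquad
100\,\mathrm{GFLOP/image} \times 30\,\mathrm{images/s} = 3000\,\mathrm{GFLOPS}.
```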
Human Quality Computer Vision • At 1 image / second: object recognition • Humans can remember 10,000 to 100,000 objects per category (e.g., 10k faces, 10k Chinese characters, high school vocabulary of 50k words, ...) • To recognize a 3D object, need ~10 2D views • 100 x 100 x 8 bits (or fewer) per view => 10,000 x 10 x 100 x 100 bytes, or 10^9 bytes • Pruning using color and texture and organizing shapes into an index reduces shape matches to 1000 • Compare 1000 candidate merged regions with 1000 candidate object images • If 10 hours on a PC (333 MFLOPS) => 12,000 GFLOPS
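And the corresponding storage and compute arithmetic for recognition, under the slide's assumptions (10,000 objects, 10 views each, 100 x 100 bytes per view, 333 MFLOPS PC):

```latex
10{,}000\ \mathrm{objects} \times 10\ \mathrm{views} \times (100 \times 100)\ \mathrm{bytes} = 10^{9}\ \mathrm{bytes}, \\
10\ \mathrm{h} \times 3600\ \mathrm{s/h} \times 333\ \mathrm{MFLOPS} \approx 1.2 \times 10^{13}\ \mathrm{FLOP\ per\ image}
  \;\Rightarrow\; 12{,}000\ \mathrm{GFLOPS\ at\ 1\ image/s}.
```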
ISTORE Successor does Human Quality Vision? • 10,000 nodes with System-On-A-Chip + Microdrive + network • 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS • High Bandwidth Network • 1 to 10 GB of Disk Storage per Node => can replicate images per node • Need Dependability, Maintainability advances to keep 10,000 nodes useful • Human quality vision useful for DoD Apps? Retrainable recognition?
Conclusions (1): ISTORE • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks a powerful tool? • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000