
Computers for the Post-PC Era


Presentation Transcript


1. Computers for the Post-PC Era
Aaron Brown, Jim Beck, Rich Martin, David Oppenheimer, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/istore
2000 Grad Visit Day

2. Berkeley Approach to Systems
• Find an important problem crossing the HW/SW interface, with a HW/SW prototype at the end, typically as part of graduate courses
• Assemble a band of 3-6 faculty, 12-20 grad students, and 1-3 staff to tackle it over 4 years
• Meet twice a year for 3-day retreats with invited outsiders
  • Builds team spirit
  • Get advice on direction, and change course
  • Offers milestones for project stages
  • Grad students give 6 to 8 talks → great speakers
• Write papers, go to conferences, get PhDs, jobs
• End-of-project party, reshuffle faculty, go to step 1

3. For Example, Projects I Have Worked On
• RISC I, II
  • Sequin, Ousterhout (CAD)
• SOAR (Smalltalk On A RISC)
  • Ousterhout (CAD)
• SPUR (Symbolic Processing Using RISCs)
  • Fateman, Hilfinger, Hodges, Katz, Ousterhout
• RAID I, II (Redundant Array of Inexpensive Disks)
  • Katz, Ousterhout, Stonebraker
• NOW I, II (Network of Workstations), Tertiary Disk (TD)
  • Culler, Anderson
• IRAM I (Intelligent RAM)
  • Yelick, Kubiatowicz, Wawrzynek
• ISTORE I, II (Intelligent Storage)
  • Yelick, Kubiatowicz

4. Symbolic Processing Using RISCs: '85-'89
• Before commercial RISC chips
• Built a workstation multiprocessor and operating system from scratch(!)
  • Sprite operating system
  • 3 chips: processor, cache controller, FPU
• Coined the term "snooping cache protocol"
• 3 C's of cache misses: compulsory, capacity, conflict

5. Group Photo (in souvenir jackets)
[Photo of SPUR alumni. Pictured: Jim Larus (Wisconsin, Microsoft), George Taylor (founder, ?), David Wood (Wisconsin), Dave Lee (founder, Silicon Image), John Ousterhout (founder, Scriptics), Ben Zorn (Colorado, Microsoft), Mark Hill (Wisconsin), Mendel Rosenblum (Stanford, founder of VMware), Susan Eggers (Washington), Brent Welch (founder, Scriptics), Shing Kong (Transmeta), Garth Gibson (CMU, founder, ?)]
• See www.cs.berkeley.edu/Projects/ARC to learn more about Berkeley systems projects

6. SPUR 10-Year Reunion, January '99
• Everyone from North America came!
• 19 PhDs: 9 to academia
  • 8 of 9 got tenure, 2 already full professors
  • 2 Romnes Fellows (3rd and 4th at Wisconsin)
  • 3 NSF Presidential Young Investigator winners
  • 2 ACM Dissertation Awards
  • They in turn have produced 30 PhDs (as of 1/99)
• 10 to industry
  • Founders of 5 startups (1 failed)
  • 2 department heads (AT&T Bell Labs, Microsoft)
• A very successful group; the SPUR project "gave them a taste of success, lifelong friends"

7. Network of Workstations (NOW), '94-'98
• Leveraged commodity workstations and OSes to harness the power of clustered machines connected via high-speed switched networks
• Constructed HW/SW prototypes: NOW-1 with 32 SuperSPARCs, and NOW-2 with 100 UltraSPARC 1s
• NOW-2 cluster held the world record for the fastest disk-to-disk sort for 2 years, 1997-1999
• NOW-2 cluster was the first to crack the 40-bit key in a key-cracking challenge offered by RSA, 1997
• NOW-2 made the list of the Top 200 supercomputers in 1997
• NOW was a foundation of the Virtual Interface (VI) Architecture, a standard from Compaq, Intel, and Microsoft that allows protected, direct user-level access to the network
• NOW technology led directly to one Internet startup (Inktomi), and many other Internet companies use cluster technology

8. Network of Workstations (NOW), '94-'98
• 12 PhDs; note that two-thirds of them went into academia, and that one-third are women:
  • Andrea Arpaci-Dusseau, Asst. Professor, Wisconsin-Madison
  • Remzi Arpaci-Dusseau, Asst. Professor, Wisconsin-Madison
  • Mike Dahlin, Asst. Professor, University of Texas, Austin
  • Jeanna Neefe Matthews, Asst. Professor, Clarkson University
  • Douglas Ghormley, Researcher, Los Alamos National Labs
  • Kim Keeton, Researcher, Hewlett-Packard Labs
  • Steve Lumetta, Asst. Professor, Illinois
  • Alan Mainwaring, Researcher, Sun Microsystems Labs
  • Rich Martin, Asst. Professor, Rutgers University
  • Nisha Talagala, Researcher, Network Storage, Sun Microsystems
  • Amin Vahdat, Asst. Professor, Duke University
  • Randy Wang, Asst. Professor, Princeton University

9. Research in Berkeley Courses
• RISC, SPUR, RAID, NOW, IRAM, and ISTORE all started in advanced graduate courses
• Students make the transition from undergraduate to researcher in first-year graduate courses
• In the first-year architecture and operating systems courses: select a topic, do research, write a paper, give a talk
  • The professor meets each team 1-on-1 about 3 times, plus TA help
  • Some papers get submitted and published
• Requires class size < 40 (e.g., Berkeley)
  • If the 1st-year course size is ~100 students, research cannot be done in 1st-year grad courses
  • If a school offers a combined BS/MS (e.g., MIT) or a professional MS via TV broadcast (e.g., Stanford), the effective class size is ~150-250

10. Outline
• Background: Berkeley Approach to Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
  • Hardware Architecture
  • Software Architecture
• Conclusions and Feedback

11. Perspective on Post-PC Era
• The PostPC era will be driven by 2 technologies:
  1) "Gadgets": tiny embedded or mobile devices
    • Ubiquitous: in everything
    • e.g., successors to the PDA, cell phone, wearable computers
  2) Infrastructure to support such devices
    • e.g., successors to big fat web servers, database servers

12. Intelligent RAM: IRAM
[Diagram: a conventional system built in a logic fab (processor, caches, L2$, buses, I/O) with separate DRAM chips, versus a single chip from a DRAM fab combining processor, bus, and DRAM]
Microprocessor & DRAM on a single chip:
• 10X capacity vs. SRAM
• On-chip memory: latency 5-10X better, bandwidth 50-100X better
• Improves energy efficiency 2X-4X (no off-chip bus)
• Serial I/O 5-10X vs. buses
• Smaller board area/volume
IRAM advantages extend to:
• A single-chip system
• A building block for larger systems

13. Revive Vector Architecture
Objections to vector machines, and the IRAM answers:
• Cost: $1M each?  ->  Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system?  ->  IRAM
• Code density?  ->  Much smaller than VLIW
• Compilers?  ->  For sale, mature (>20 years); we retarget Cray compilers
• Performance?  ->  Easy to scale speed with technology
• Power/Energy?  ->  Parallel to save energy, keep performance
• Limited to scientific applications?  ->  Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b (see the sketch below)
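To make the last point concrete: a vector register of fixed width holds N 64-bit elements, 2N 32-bit elements, or 4N 16-bit elements, so the narrow data types common in multimedia get proportionally more parallelism per instruction. A minimal sketch, where the register width is an illustrative assumption rather than a VIRAM parameter:

```python
# Hypothetical illustration: subword parallelism in a fixed-width vector
# register. REGISTER_BITS is an assumption, not the VIRAM register size.

REGISTER_BITS = 2048      # assume N = 32 elements at 64 bits

for width in (64, 32, 16):
    elements = REGISTER_BITS // width
    print(f"{width:2d}-bit elements per vector register: {elements}")
# prints 32 (N), 64 (2N), 128 (4N)
```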

14. VIRAM-1: System on a Chip
[Floorplan: MIPS scalar CPU + caches, 4 vector pipes/lanes, crossbar (Xbar), I/O, and two 64-Mbit (8-MByte) DRAM memory halves]
• Prototype scheduled for the end of Summer 2000
• 0.18 um EDL process
• 16 MB DRAM in 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100-MB/s parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
• 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
• 140 M transistors (> Intel?)
(A back-of-the-envelope check of these peak numbers follows.)
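The quoted peaks fall out of the lane count and clock rate. The factor of 2 ops per lane per cycle (a multiply-add) and the "2 crossbar halves x 2 directions" breakdown of the memory bandwidth are assumptions made to show the arithmetic, not documented VIRAM-1 parameters:

```python
# Back-of-the-envelope check of the peak numbers quoted on this slide.

clock_hz = 200e6
lanes = 4

gflops_64 = lanes * clock_hz * 2 / 1e9        # 1 fused multiply-add per lane/cycle (assumed)
gops_16   = lanes * 4 * clock_hz * 2 / 1e9    # each 64-bit lane split into 4 x 16-bit subwords
mem_gbps  = 6.4 * 2 * 2                       # 6.4 GB/s per direction, per Xbar (assumed 2 of each)

print(gflops_64, gops_16, mem_gbps)           # 1.6, 6.4, 25.6
```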

15. Outline
• Background: Berkeley's Past
• PostPC Motivation
• PostPC Device Microprocessor: IRAM
• PostPC Infrastructure Motivation
• ISTORE Goals
  • Hardware Architecture
  • Software Architecture
• Conclusions and Feedback

16. Background: Tertiary Disk (part of NOW)
• Tertiary Disk (1997): a cluster of 20 PCs hosting 364 3.5-inch, 8.4-GB IBM disks in 7 racks (19" x 33" x 84"), about 3 TB in all
  • The 200-MHz, 96-MB P6 PCs run FreeBSD; a switched 100-Mb/s Ethernet connects the hosts; 4 UPS units
• Hosts the world's largest art database: 80,000 images, in cooperation with the San Francisco Fine Arts Museum (try www.thinker.org)

17. Tertiary Disk HW Failure Experience
Reliability of hardware components over 20 months:
• 7 IBM SCSI disk failures (out of 364, or 2%)
• 6 IDE (internal) disk failures (out of 20, or 30%)
• 1 SCSI controller failure (out of 44, or 2%)
• 1 SCSI cable failure (out of 39, or 3%)
• 1 Ethernet card failure (out of 20, or 5%)
• 1 Ethernet switch failure (out of 2, or 50%)
• 3 enclosure power supply failures (out of 92, or 3%)
• 1 short power outage (covered by UPS)
Did not match expectations: the SCSI disks were more reliable than the SCSI cables! That is the difference between simulation and prototypes. (A rough annualization of these counts follows.)
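A quick way to compare these components is to annualize the 20-month counts. This is a simple linear extrapolation of the figures above; with counts this small the per-year estimates are noisy:

```python
# Annualize the observed Tertiary Disk component failure counts.
OBSERVATION_MONTHS = 20
failures = {                       # component: (failed, population)
    "IBM SCSI disk":     (7, 364),
    "IDE disk":          (6, 20),
    "SCSI controller":   (1, 44),
    "SCSI cable":        (1, 39),
    "Ethernet card":     (1, 20),
    "Ethernet switch":   (1, 2),
    "Enclosure PSU":     (3, 92),
}

for part, (failed, total) in failures.items():
    observed = failed / total
    annualized = observed * 12 / OBSERVATION_MONTHS
    print(f"{part:16s} {observed:5.1%} over 20 months (~{annualized:.1%} per year)")
```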

18. SCSI Time-Outs + Hardware Failures (m11)
[Chart: SCSI time-out and hardware-error messages over time on SCSI Bus 0 of machine m11]

19. Can we predict a disk failure?
• Yes: look for Hardware Error messages
  • These messages lasted for 8 days, between 8-17-98 and 8-25-98
  • On disk 9 there were:
    • 1763 Hardware Error messages, and
    • 297 SCSI Timed Out messages
• On 8-28-98, disk 9 on SCSI Bus 0 of m11 was "fired", i.e., it appeared to be about to fail, so it was swapped
(A sketch of this log-scanning idea follows.)
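A minimal sketch of the prediction idea: count per-disk "Hardware Error" and "SCSI Timed Out" messages in a syslog-style file and flag disks whose counts spike. The log path, device-name pattern, and threshold are illustrative assumptions, not the project's actual tooling:

```python
import re
from collections import Counter

LOG_FILE = "/var/log/messages"                 # hypothetical log location
DISK_RE = re.compile(r"\b(da\d+|sd\d+)\b")     # assumed disk device names
THRESHOLD = 100                                # assumed alert threshold

hw_errors, timeouts = Counter(), Counter()
with open(LOG_FILE) as log:
    for line in log:
        match = DISK_RE.search(line)
        if not match:
            continue
        disk = match.group(1)
        if "Hardware Error" in line:
            hw_errors[disk] += 1
        elif "SCSI Timed Out" in line:
            timeouts[disk] += 1

for disk in sorted(set(hw_errors) | set(timeouts)):
    if hw_errors[disk] + timeouts[disk] > THRESHOLD:
        print(f"{disk}: {hw_errors[disk]} hardware errors, "
              f"{timeouts[disk]} timeouts -- candidate for replacement")
```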

20. Lessons from the Tertiary Disk Project
• Maintenance is hard on current systems
  • Hard to know what is going on, or who is to blame
• Everything can break
  • It's not what you expect in advance
  • Follow the rule of no single point of failure
• Nothing fails fast
  • Eventually a component behaves badly enough that the operator "fires" the poor performer, but it doesn't "quit" on its own
• Most failures may be predicted

21. Outline
• Background: Berkeley Approach to Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
  • Hardware Architecture
  • Software Architecture
• Conclusions and Feedback

22. The problem space: big data
• Big demand for enormous amounts of data
  • Today: high-end enterprise and Internet applications
    • Enterprise decision-support and data-mining databases
    • Online applications: e-commerce, mail, web, archives
  • Future: infrastructure services, richer data
    • Computational and storage back-ends for mobile devices
    • More multimedia content
    • More use of historical data to provide better services
• Today's SMP server designs can't easily scale
  • Bigger scaling problems than performance!

23. The real scalability problems: AME
• Availability: systems should continue to meet quality-of-service goals despite hardware and software failures
• Maintainability: systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary growth: systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown, upgraded, and expanded
• These are problems at today's scales, and will only get worse as systems grow

24. Principles for achieving AME (1)
• No single points of failure; redundancy everywhere
• Performance robustness is more important than peak performance
  • "Performance robustness" means that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
  • Resources should be dedicated to AME
  • Compare: biological systems spend more than 50% of their resources on maintenance
  • Performance can be made up by scaling the system
(One way to quantify robustness is sketched below.)
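One illustrative way to make "performance robustness" measurable (an assumption for exposition, not a metric defined by the ISTORE project) is the fraction of best-case performance retained under adverse conditions:

```python
# Hypothetical robustness metric: degraded performance / peak performance.
def robustness(degraded_throughput: float, peak_throughput: float) -> float:
    """Fraction of peak performance retained under faults or heavy load."""
    return degraded_throughput / peak_throughput

# e.g., a server doing 900 req/s during RAID reconstruction
# vs. 1000 req/s normally retains 90% of its peak performance.
print(robustness(900.0, 1000.0))   # 0.9
```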

25. Principles for achieving AME (2)
• Introspection
  • Reactive techniques to detect and adapt to failures, workload variations, and system evolution
  • Proactive (preventative) techniques to anticipate and avert problems before they happen

26. Hardware techniques (2)
• No central processing unit: distribute processing with storage
  • Serial lines and switches are also growing with Moore's Law, so there is less need today to centralize than in bus-oriented systems
  • Most storage servers are limited by the speed of their CPUs; why does this make sense?
  • Why not amortize the sheet metal, power, and cooling infrastructure for the disk to add a processor, memory, and network?
  • If AME is important, the system must provide resources to help AME: local processors responsible for the health and maintenance of their storage

27. ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4 TB storage
  • Cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    • A single field-replaceable unit to simplify maintenance
  • Each node is a full x86 PC with 256 MB DRAM and an 18 GB disk
  • More CPU than NAS; fewer disks per node than a cluster
• Intelligent disk "brick": disk in a half-height canister, portable-PC CPU (Pentium II/266) + DRAM, redundant NICs (4 100-Mb/s links), diagnostic processor
• ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (20 at 100 Mbit/s, 2 at 1 Gbit/s); environment monitoring (UPS, redundant power supplies, fans, heat and vibration sensors, ...)

28. A glimpse into the future?
• System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
• ISTORE HW in 5-7 years:
  • Building block: a 2006 MicroDrive integrated with IRAM
    • 9 GB disk, 50 MB/sec from disk
  • Connected via crossbar switch
  • 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point

29. Development techniques
• Benchmarking
  • One reason for the 1000X improvement in processor performance was the ability to measure (vs. debate) which design is better
    • e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed? (see the worked example below)
  • Need AME benchmarks
    • "What gets measured gets done"
    • "Benchmarks shape a field"
    • "Quantification brings rigor"
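The clock-rate vs. CPI vs. instruction-count question is answered with the standard CPU-time equation, time = instructions x CPI / clock rate. The numbers below are made up purely to show how the comparison is quantified:

```python
# Worked example of the classic CPU-time equation.
def cpu_time(instructions: float, cpi: float, clock_hz: float) -> float:
    return instructions * cpi / clock_hz

base         = cpu_time(1e9, 2.0, 500e6)   # baseline: 4.0 s
faster_clock = cpu_time(1e9, 2.0, 750e6)   # 1.5x clock rate -> ~2.67 s
better_cpi   = cpu_time(1e9, 1.0, 500e6)   # halve CPI       -> 2.0 s
print(base, faster_clock, better_cpi)
```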

30. Example results: multiple faults
[Graphs: application performance during software-RAID reconstruction after a disk fault, for Windows 2000/IIS and Linux/Apache]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
(A sketch of the measurement methodology follows.)
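A sketch of the availability-benchmark methodology behind graphs like these: sample an application-level quality-of-service metric over time, inject a disk fault partway through, and watch how performance degrades and recovers during reconstruction. The helpers measure_throughput() and inject_disk_fault() are placeholders for workload- and platform-specific code, and the durations are arbitrary:

```python
import time

def run_availability_benchmark(measure_throughput, inject_disk_fault,
                               duration_s=600, fault_at_s=120, interval_s=5):
    samples = []                              # (elapsed seconds, throughput)
    start = time.time()
    fault_injected = False
    while (elapsed := time.time() - start) < duration_s:
        if not fault_injected and elapsed >= fault_at_s:
            inject_disk_fault()               # e.g., take one disk of the RAID set offline
            fault_injected = True
        samples.append((elapsed, measure_throughput()))
        time.sleep(interval_s)
    return samples                            # plot throughput vs. time, as in the graphs above
```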

31. Software techniques (1)
• Proactive introspection
  • Continuous online self-testing of HW and SW, in deployed systems!
  • Goal is to shake out "Heisenbugs" before they're encountered in normal operation
  • Needs data redundancy, node isolation, fault injection
• Techniques:
  • Fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
  • Stress testing: pushing HW/SW to their limits
  • Scrubbing: periodic restoration of potentially "decaying" hardware or software state
    • Self-scrubbing data structures (as in MVS)
    • ECC scrubbing for disks and memory
(A minimal scrubbing sketch follows.)
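A minimal sketch of the scrubbing technique listed above: periodically re-read stored blocks, compare them against saved checksums, and repair any silently decayed block from a replica. The storage interface here (read_block / write_block / checksum table) is an assumption for illustration, not ISTORE's actual API:

```python
import hashlib

def scrub(read_block, write_block, read_replica_block, checksums):
    """checksums: dict mapping block id -> expected SHA-1 hex digest."""
    repaired = []
    for block_id, expected in checksums.items():
        data = read_block(block_id)
        if hashlib.sha1(data).hexdigest() != expected:
            good = read_replica_block(block_id)   # fetch a known-good copy
            write_block(block_id, good)           # restore the decayed state
            repaired.append(block_id)
    return repaired                               # run periodically, e.g. from a cron-style loop
```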

32. Conclusions (1): ISTORE
• Availability, maintainability, and evolutionary growth are key challenges for server systems
  • More important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  • Via clusters of network-attached, computationally enhanced storage nodes running distributed code
  • Via hardware and software introspection
  • We are currently performing application studies to investigate and compare techniques
• Availability benchmarks are a powerful tool?
  • They revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

33. Conclusions (2)
• IRAM is attractive for two Post-PC applications because of its low power, small size, and high memory bandwidth
  • Gadgets: embedded/mobile devices
  • Infrastructure: intelligent storage and networks
• PostPC infrastructure requires:
  • New goals: availability, maintainability, evolution
  • New principles: introspection, performance robustness
  • New techniques: isolation/fault insertion, software scrubbing
  • New benchmarks: measure and compare AME metrics

34. Berkeley Future Work
• IRAM: fab and test the chip
• ISTORE:
  • Implement AME-enhancing techniques in a variety of Internet, enterprise, and information-retrieval applications
  • Select the best techniques and integrate them into a generic runtime system with an "AME API"
  • Add maintainability benchmarks
    • Can we quantify the administrative work needed to maintain a certain level of availability?
  • Perhaps look at data security via encryption?
  • Even consider denial of service?
