Berkeley RAD Lab: Robust, Adaptive, Distributed Systems

Berkeley RAD Lab: Robust, Adaptive, Distributed Systems. Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica. November 2005

RAD Lab The 5-year Vision: A single person can go from vision to a next-generation IT service (“the Fortune 1 million”) • E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0 The Vehicle: An interdisciplinary center creates core technical competency to demo 10X to 100X • Researchers are leaders in machine learning, networking, and systems • Industrial Participants: leading companies in HW, systems SW, and online services • Called “RAD Lab” for Reliable, Adaptable, Distributed systems

RAD Lab [Figure: pedestal with cap, dado (the section of a pedestal between cap and base), and base] The Science: Both shorter-term and longer-term solutions • Develop using primitives to enable functions (MapReduce) and services (Craigslist) • Assess/debug using deterministic replay and finding new metrics • Deploy using “Internet-in-a-Box” via FPGAs under failure/slowdown workloads • Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools Added Value to Industrial Participants: • Working with leading people and companies from different industries on long-range, pre-competitive technology • Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants • Working with researchers with a successful track record of rapid transfer of new technology

Steps vs. Process [Figure: two Develop/Assess/Deploy/Operate diagrams. Steps: traditional, static handoff model, N groups. Process: support DADO evolution, 1 group.]

DADO - Develop • Create abstractions, primitives, & toolkit for large-scale systems that make it easy to invent/deploy functions (e.g., MapReduce) • For example, Distributed Hash Tables (OpenDHT) • Already setting the trend for IETF standards
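
As a rough illustration of what a "primitive" in the MapReduce style looks like, here is a minimal single-process sketch in Python. It is not the RAD Lab toolkit or any real MapReduce implementation; the names `map_reduce`, `word_count_mapper`, and `sum_reducer` are hypothetical, and a real system would distribute the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, single-process map/shuffle/reduce (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values emitted for the same key.
            groups[key].append(value)
    # Reduce phase: fold each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Hypothetical reducer: add up the counts for one word."""
    return sum(values)


counts = map_reduce(
    ["deploy assess deploy", "operate deploy"],
    word_count_mapper,
    sum_reducer,
)
```

The point of the abstraction is that the service author writes only the mapper and the reducer; everything inside `map_reduce` is what a toolkit would supply.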

DADO - Assess • “We improve what we can measure” • Inspect box visibility into networks, usually data poor • Servers data rich; data often discarded • Statistical and Machine Learning (SML) to the rescue. It works well when • You have lots of raw data • You have reason to believe the raw data is related to some high-level effect you’re interested in • You don’t have a model of what that relationship is • Note: SML advances enable fast analysis

DADO - Deploy • Re-engineer RAMP to act like a 1000+ node distributed system under realistic failure and slowdown workloads • RAMP emulates data center & wide area systems as well as MPP • Collect and apply failure data from the real world • RAMP vs. Clusters: Larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share • Explore via repeatable experiments as parameters and configurations vary, vs. observations on a single (aging) cluster that is often idiosyncratic
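
The idea of repeatable experiments under failure and slowdown workloads can be sketched in software. The Python below is a toy simulation only, not RAMP (which is FPGA-based); the function `run_experiment`, its parameters, and the latency numbers are all assumptions made for illustration. The key property it shows is that a fixed seed makes a run reproducible, so one parameter can be varied while everything else is held constant.

```python
import random
from dataclasses import dataclass


@dataclass
class RunResult:
    failed_nodes: int
    slow_nodes: int
    mean_latency_ms: float


def run_experiment(num_nodes, failure_rate, slowdown_rate, seed,
                   base_latency_ms=10.0, slowdown_factor=5.0):
    """Toy fault-injection run; all parameters are hypothetical."""
    rng = random.Random(seed)  # private RNG so runs are repeatable
    failed = 0
    slow = 0
    latencies = []
    for _ in range(num_nodes):
        draw = rng.random()
        if draw < failure_rate:
            failed += 1  # failed node serves no requests
        elif draw < failure_rate + slowdown_rate:
            slow += 1  # slow node answers, but with inflated latency
            latencies.append(base_latency_ms * slowdown_factor)
        else:
            latencies.append(base_latency_ms)
    mean = sum(latencies) / len(latencies) if latencies else float("nan")
    return RunResult(failed, slow, mean)


# Same seed and parameters: identical outcome, hence repeatable.
first = run_experiment(1000, 0.01, 0.05, seed=42)
second = run_experiment(1000, 0.01, 0.05, seed=42)

# Vary one parameter while holding the seed fixed.
sweep = {rate: run_experiment(1000, rate, 0.05, seed=42)
         for rate in (0.0, 0.01, 0.1)}
```

On a shared physical cluster, neither the repeatability of `first` versus `second` nor the controlled sweep is easy to obtain, which is the contrast the slide draws.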

DADO - Operate • Idea: when a site misbehaves, users notice and change their behavior; use this as a “failure detector” • Approach: combine visualization with Statistical and Machine Learning analysis so operators see anomalies too • Experiment: does the distribution of hits to various pages match the “historical” distribution? • Each minute, compare hit counts of the top N pages to hit counts over the last 6 hours using Bayesian networks and a χ² test, on real Ebates data To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
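
The χ² comparison of per-minute hit counts against a historical distribution can be sketched as follows. This is a minimal illustration of a Pearson χ² goodness-of-fit statistic, not the method from the cited paper (which also uses Bayesian networks); the function name, page paths, hit counts, and threshold are all made up for the example.

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square statistic of current vs. historical hit shares.

    Both arguments map page -> hit count. Pages are taken from the
    historical window (e.g., the top N pages over the last 6 hours).
    """
    pages = list(historical_hits)
    hist_total = sum(historical_hits[p] for p in pages)
    cur_total = sum(current_hits.get(p, 0) for p in pages)
    if hist_total == 0 or cur_total == 0:
        return 0.0  # nothing to compare
    stat = 0.0
    for p in pages:
        # Expected hits this minute if traffic followed historical shares.
        expected = cur_total * historical_hits[p] / hist_total
        if expected > 0:
            observed = current_hits.get(p, 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical data: three pages, so 2 degrees of freedom.
historical = {"/home": 6000, "/account": 3000, "/cart": 1000}
normal_minute = {"/home": 60, "/account": 30, "/cart": 10}
odd_minute = {"/home": 90, "/account": 2, "/cart": 8}

# Critical value of chi-square with 2 degrees of freedom at p = 0.01.
CRITICAL_DF2_P01 = 9.21

normal_score = chi_square_statistic(normal_minute, historical)
odd_score = chi_square_statistic(odd_minute, historical)
is_anomalous = odd_score > CRITICAL_DF2_P01
```

In this made-up data the account page nearly disappears from the traffic mix in `odd_minute`, so its score exceeds the threshold while `normal_minute` scores zero. The threshold depends on the number of pages compared (degrees of freedom = N - 1), so a different N needs a different critical value.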

Novel Visualization: “I see and understand” • Winning operator trust [Figure: anomaly score over time, annotated with an account page problem, 11:07am start of anomalies, and 11:33am – 11:56am site crash.]

Founding the RAD Lab; Start 12/1 • Looking for 3 to 5 founding companies to fund 5 years @ cost of $0.5M / year • 25 grad students + 15 undergrads + 6 faculty + 2 staff • Founding companies: Google, Microsoft, Sun Microsystems • RADS Consortium model • Preference to founding partner technology in prototypes • Designate employees to act as consultants • Head start for participants on research results • Putting IP in the public domain so partners can use it and not be sued • Press release of founding RAD Lab partners December 1? • Mid-project review after 3 years by founding partners

RAD Lab Opportunity: New Research Model • Chance to Partner with the Top University in Computer Systems on the “Next Great Thing” • National Academy of Engineering mentions Berkeley in 7 of 19 $1B+ industries that came from IT research • NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3 • The industries: Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband • Berkeley one of the top suppliers of systems students to industry and academia • US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford

Capability (Desired): 1 person can invent & run the next-gen IT service • Develop using primitives to enable functions and services • Assess using deterministic replay and statistical and machine learning (SML) • Deploy via “Internet-in-a-Box” FPGAs • Operate using SML-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools Base Technology: Server Hardware, System Software, Middleware, Networking RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems • Working with different industries on long-range, pre-competitive technology • Training of dozens of future leaders of IT, plus their recruitment • Working with researchers with track records of successful technology transfer

Backup Slides

References To learn more, see • “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, and David Patterson. In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005. • “Microreboot -- A Technique for Cheap Recovery,” George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. In Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004. • “Path-Based Failure and Evolution Management,” Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. In Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004. • “Scalable Statistical Bug Isolation,” Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. PLDI, 2005.

Sustaining Innovation/Training Engine in 21st Century • Replicate research centers based primarily on industrial funding to expand IT market and to train next generation of IT leaders • Berkeley Wireless Research Center (BWRC): 50 grad students, 30 undergrads @ $5M per year • Stanford Network Research Center (SNRC): 50 Grad students @ $5M per year • MIT Tparty $4M per year (100% $ from Quanta) • Industry largely funds • N companies, where N is 5? • Exciting, long term technical vision • Demonstrated by prototype(s)

State of Research Funding Today • Most industry research shorter term • DARPA exiting long-term (exp.) IT research • ’03-’05 BAAs IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability, all have 12 to 18 month “go/no go” milestones • Academic-led funding reduced 50% (so far) 2001 to 2004 • Faculty ≈ consultants in consortia led by a defense contractor, get grants ≈ support 1-2 students (~ NSF funding level) • NSF swamped with proposals, conservative • 2000 to 6500 proposals in 5 years • IT has lowest acceptance rate at NSF (between 8% and 16%) • “Ambitious proposal” is a negative review • Even if you get NSF funding, the proposal is reduced to stretch NSF $, e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years • (To learn more, see www.cra.org/research)

RAD Lab Timeline • 2005 Launch RAD Lab 12/1 • 2006 Collect workloads, Internet in a Box • 2007 SLT/CT distributed architectures, Iboxes, annotative layer, class testing • 2008 Development toolkit 1.0, tuple space, class testing; Mid Project Review • 2009 RAD Lab software suite 1.0, class testing • 2010 End of Project Party
