
Recent Progress on Scaleable Servers Jim Gray, Microsoft Research

Presentation Transcript


  1. Recent Progress on Scaleable Servers. Jim Gray, Microsoft Research. Substantial progress has been made towards the goal of building supercomputers by composing arrays of commodity processors, disks, and networks into a cluster that provides a single system image. True, vector supers are still 10x faster than commodity processors on certain floating-point computations, but they cost disproportionately more. Indeed, the highest-performance computations are now performed by processor arrays. In the broader context of business and internet computing, processor arrays long ago surpassed mainframe performance, and for a tiny fraction of the cost. This talk first reviews this history and describes the current landscape of scaleable servers in the commercial, internet, and scientific segments. The talk then discusses the Achilles' heels of scaleable systems: programming tools and system management. There has been relatively little progress in either area. This suggests some important research areas for computer systems research.

  2. Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects

  3. Scaleability: Scale Up and Scale Out • Grow up with SMP (a 4xP6 is now standard) • Grow out with a cluster (a cluster is built from inexpensive parts) [Diagram: scale-up from Personal System to Departmental Server to SMP Super Server; scale-out to a cluster of PCs]

  4. Key Technologies • Hardware: commodity processors, nUMA, smart storage, SAN/VIA • Software: directory services, security domains, process/data migration, load balancing, fault tolerance, RPC/objects, streams/rivers

  5. MAPS - The Problems • Manageability: N machines are N times harder to manage • Availability: N machines fail N times more often • Programmability: N machines are 2N times harder to program • Scaleability: N machines cost N times more but do little more work.

  6. Manageability • Goal: Systems self managing • N systems as easy to manage as one system • Some progress: • Distributed name servers (gives transparent naming) • Distributed security • Auto cooling of disks • Auto scheduling and load balancing • Global event log (reporting) • Automate most routine tasks • Still very hard and app-specific

  7. Availability • Redundancy allows failover/migration (processes, disks, links) • Good progress on technology (theory and practice) • Migration also good for load balancing • Transaction concept helps exception handling

  8. Programmability & Scaleability • That’s what the rest of this talk is about • Success on embarrassingly parallel jobs • file server, mail, transactions, web, crypto • Limited success on “batch” • relational DBMSs, PVM, …

  9. Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects

  10. Scaleup Has Limits (chart courtesy of Catharine Van Ingen) • Vector supers ~ 10x supers: ~3 GFlops, bus/memory ~20 GBps, IO ~1 GBps • Supers ~ 10x PCs: ~300 MFlops, bus/memory ~2 GBps, IO ~1 GBps • PCs are slow: ~30 MFlops, bus/memory ~200 MBps, IO ~100 MBps

  11. Loki: Pentium Clusters for Science http://loki-www.lanl.gov/ • 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GB RAM + 50 GB disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (but that is the 1996 price) • The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html • Scientists want cheap mips.

  12. Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000 x 1-node PPro • LLNL/IBM: 512 x 8 PowerPC (SP2) • LANL/Cray: ? • Maui Supercomputer Center: 512 x 1 SP2

  13. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA) [Chart: number of TOP500 systems by vendor, Jun-93 through Jun-98; vendors shown include CRI, SGI, IBM, Convex, HP, Sun, TMC, DEC, Intel, Japanese vector machines, and others] TOP500 Reports: http://www.netlib.org/benchmark/top500.html

  14. NCSA Super Cluster • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP + Myricom + Windows NT • A supercomputer for $3M • Classic Fortran/MPI programming • DCOM programming model http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

  15. A Variety of Discipline Codes: Single-Processor Performance, Origin vs. T3E (nUMA vs. UMA) (courtesy of Larry Smarr, NCSA) [Chart only]

  16. Basket of Applications: Average Performance as a Percentage of Linpack Performance (courtesy of Larry Smarr, NCSA) • Application codes: CFD, Biomolecular, Chemistry, Materials, QCD • [Chart: per-code percentages of Linpack performance; values shown are 22%, 25%, 19%, 14%, 33%, 26%]

  17. Observations • Uniprocessor RAP << PAP (real application performance << peak advertised performance) • Growth has slowed. Bell Prize: 1987: 0.5 GFLOPS; 1988: 1.0 GFLOPS (1 year later); 1990: 14 GFLOPS (2 years); 1994: 140 GFLOPS (4 years); 1998: 604 GFLOPS; xxx: 1 TFLOPS (5 years?) • Time gap between milestones ~ 2^(N-1) or 2^N - 1, where N = log(performance) - 9

  18. “Commercial” Clusters • 16-node Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)

  19. Oracle/NT • 27,383 tpmC • 71.50 $/tpmC • 4 x 6 cpus • 384 disks=2.7 TB

  20. 24 cpu, 384 disks (=2.7TB)

  21. Microsoft.com: ~150 x 4-cpu nodes [Diagram: the microsoft.com web farm, spanning Building 11, an internal data center, MOSWest, a European Data Center, and a Japan Data Center; a typical server is a 4xP5 or 4xP6 with 256–512 MB RAM and 12–160 GB of disk; roles include www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, msid.msn.com, activex.microsoft.com, cdm.microsoft.com, FTP/HTTP download, staging, replication, and SQL servers; average server cost $25K–$83K; nodes are connected by switched Ethernet, FDDI rings, and Gigaswitches, with Internet connectivity over OC3 (100 Mb/s each) and DS3 (45 Mb/s each) links through routers]

  22. The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8 x 400 MHz Alpha cpus • 10 GB DRAM • 324 x 9.2 GB StorageWorks disks (3 TB raw, 2.4 TB of RAID5) • STK 9710 tape robot (4 TB) • Windows NT 4 EE, SQL Server 7.0

  23. TerraServer: Example of Lots of Web Hits • Hits: 913 m total, 10.3 m average, 29 m peak • Queries: 735 m total, 8.0 m average, 18 m peak • Images: 359 m total, 3.0 m average, 9 m peak • Page Views: 405 m total, 5.0 m average, 9 m peak • 1 TB, largest SQL DB on the Web • 99.95% uptime since 1 July 1998 • No downtime in August • No NT failures (ever) • most downtime is for SQL software upgrades

  24. HotMail: ~400 Computers

  25. Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects

  26. Two Generic Kinds of computing • Many little • embarrassingly parallel • Fit RPC model • Fit partitioned data and computation model • Random works OK • OLTP, File Server, Email, Web,….. • Few big • sometimes not obviously parallel • Do not fit RPC model (BIG rpcs) • Scientific, simulation, data mining, ...

  27. Many Little Programming Model • many small requests • route requests to data • encapsulate data with procedures (objects) • three-tier computing • RPC is a convenient/appropriate model • Transactions are a big help in error handling • Auto partition (e.g. hash data and computation) • Works fine. • Software CyberBricks
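
As an illustration of the "route requests to data" and "auto partition" bullets above, here is a minimal C++ sketch (the node addresses and key names are hypothetical, not from the talk): requests are hash-partitioned on their key, so each request becomes an RPC to the node that owns that key's partition, keeping data and computation together.

```cpp
// Sketch only: hash-partitioned request routing (hypothetical example).
// Each key maps to one node; the client sends its RPC to that node.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string address;  // where the RPC would be sent
};

class PartitionMap {
public:
    explicit PartitionMap(std::vector<Node> nodes) : nodes_(std::move(nodes)) {}

    // Hash the key and pick the owning node (simple modulo placement).
    const Node& Route(const std::string& key) const {
        std::size_t h = std::hash<std::string>{}(key);
        return nodes_[h % nodes_.size()];
    }

private:
    std::vector<Node> nodes_;
};

int main() {
    PartitionMap map({{"node0:4000"}, {"node1:4000"}, {"node2:4000"}});
    for (const std::string key : {"order-17", "order-18", "customer-9"}) {
        // In a real system this would be an RPC (HTTP, DCOM, ...) to that node.
        std::cout << key << " -> " << map.Route(key).address << "\n";
    }
    return 0;
}
```

The same placement function serves both data and computation, which is why the "many little" workloads scale so naturally.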

  28. Object-Oriented Programming: Parallelism From Many Little Jobs • Gives location transparency • ORB/web/TP monitor multiplexes clients to servers • Enables distribution • Exploits embarrassingly parallel apps (transactions) • HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis [Diagram: many clients connect through a TP monitor / ORB / web server to a pool of servers]

  29. Few Big Programming Model • Finding parallelism is hard • Pipelines are short (3x …6x speedup) • Spreading objects/data is easy, but getting locality is HARD • Mapping big job onto cluster is hard • Scheduling is hard • coarse grained (job) and fine grain (co-schedule) • Fault tolerance is hard

  30. Kinds of Parallel Execution • Pipeline parallelism: the output of one sequential program feeds the input of the next (any sequential program can be a stage) • Partition parallelism: inputs are split N ways, processed by copies of a sequential program, and the outputs are merged M ways

  31. Why Parallel Access To Data? • Bandwidth: at 10 MB/s it takes 1.2 days to scan 1 TB; with 1,000-way parallelism the same scan takes 100 seconds • Parallelism: divide a big problem into many smaller ones to be solved in parallel.

  32. Why Are Relational Operators Successful for Parallelism? • The relational data model gives uniform operators on uniform data streams • Operators are closed under composition • Each operator consumes 1 or 2 input streams • Each stream is a uniform collection of data • Sequential data in and out: pure dataflow • Partitioning some operators (e.g. aggregates, non-equi-join, sort, …) requires innovation • The payoff: AUTOMATIC PARALLELISM
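
To make the closure-under-composition point concrete, here is a small C++ sketch (hypothetical, not from any DBMS): each operator consumes a record stream and produces a record stream, so a scan, a filter, and an aggregate snap together into a pure dataflow pipeline that a planner could later partition.

```cpp
// Sketch only: relational-style operators as composable stream transformers.
// Every operator is "records in, records out", so operators compose freely.
#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <optional>
#include <vector>

struct Record { int key; double value; };

// A pull-based stream: call it for the next record, std::nullopt at end.
using Stream = std::function<std::optional<Record>()>;

// Leaf operator: scan an (in-memory) table.
Stream Scan(std::vector<Record> table) {
    auto data = std::make_shared<std::vector<Record>>(std::move(table));
    auto pos = std::make_shared<std::size_t>(0);
    return [data, pos]() -> std::optional<Record> {
        if (*pos >= data->size()) return std::nullopt;
        return (*data)[(*pos)++];
    };
}

// Filter consumes one stream and yields another: closed under composition.
Stream Filter(Stream in, std::function<bool(const Record&)> pred) {
    return [in, pred]() -> std::optional<Record> {
        while (auto r = in())
            if (pred(*r)) return r;
        return std::nullopt;
    };
}

// A simple aggregate that drains a stream.
double SumValues(Stream in) {
    double total = 0;
    while (auto r = in()) total += r->value;
    return total;
}

int main() {
    // scan -> filter -> aggregate: a pure dataflow pipeline.
    Stream plan = Filter(Scan({{1, 2.0}, {2, 5.0}, {3, 7.0}}),
                         [](const Record& r) { return r.value > 3.0; });
    std::cout << "sum = " << SumValues(plan) << "\n";  // 5.0 + 7.0 = 12
    return 0;
}
```

Because every stage has the same stream-in/stream-out shape, a planner can insert split/merge (river) operators between stages without changing the operators themselves.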

  33. Database Systems “Hide” Parallelism • Automate system management via tools • data placement • data organization (indexing) • periodic tasks (dump / recover / reorganize) • Automatic fault tolerance • duplex & failover • transactions • Automatic parallelism • among transactions (locking) • within a transaction (parallel execution)

  34. SQL: a Non-Procedural Programming Language • SQL is a functional programming language: it describes the answer set • The optimizer picks the best execution plan • picks the data-flow web (pipeline) • the degree of parallelism (partitioning) • other execution parameters (process placement, memory, ...) [Diagram components: GUI, schema, optimizer, plan, execution planning, monitor, executors, rivers]

  35. Partitioned Execution • Spreads computation and IO among processors • Partitioned data gives NATURAL parallelism

  36. N x M Way Parallelism • N inputs, M outputs, no bottlenecks • Partitioned data • Partitioned and pipelined data flows

  37. Automatic Parallel Object-Relational DB • Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) > .7; • The query combines temporal (date), spatial (location), and image (snow_cover) tests • Assign one process per processor/disk: • find images with the right date & location • analyze the image; if 70% snow, return it [Diagram: the landsat table (date, location, image) is scanned in parallel, applying the date, location, & image tests to produce the answer]

  38. Data Rivers: Split + Merge Streams • A river is N x M data streams between N producers and M consumers • Producers add records to the river; consumers consume records from the river • Purely sequential programming • The river does flow control and buffering • does partition and merge of data records • River = Split/Merge in Gamma = Exchange operator in Volcano / SQL Server.
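
A minimal single-process sketch of the river idea (an illustration, not the Gamma or Volcano code): producers push records into the river, the river hashes each record to one of M bounded queues, and each consumer drains its own queue purely sequentially; flow control falls out of the bounded queues.

```cpp
// Sketch only: a tiny in-process "river" that splits N producers' records
// across M consumers by hash, with bounded queues for flow control.
#include <condition_variable>
#include <cstddef>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Record { int key; int payload; };

// Bounded, thread-safe queue: Push blocks when full (flow control);
// Pop returns false once the queue is closed and drained.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void Push(const Record& r) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(r);
        not_empty_.notify_one();
    }

    bool Pop(Record& r) {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;  // closed and drained
        r = q_.front();
        q_.pop();
        not_full_.notify_one();
        return true;
    }

    void Close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::size_t cap_;
    bool closed_ = false;
    std::queue<Record> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

int main() {
    const int kProducers = 2, kConsumers = 3;
    std::vector<std::unique_ptr<BoundedQueue>> river;  // one queue per consumer
    for (int c = 0; c < kConsumers; ++c)
        river.push_back(std::make_unique<BoundedQueue>(8));

    // N producers: the split step; a hash of the key picks the consumer queue.
    std::vector<std::thread> producers;
    for (int p = 0; p < kProducers; ++p)
        producers.emplace_back([&, p] {
            for (int i = 0; i < 10; ++i)
                river[(p * 100 + i) % kConsumers]->Push(Record{p * 100 + i, i});
        });

    // M consumers: each drains its own stream purely sequentially.
    std::vector<std::thread> consumers;
    for (int c = 0; c < kConsumers; ++c)
        consumers.emplace_back([&, c] {
            Record r;
            long sum = 0;
            while (river[c]->Pop(r)) sum += r.payload;
            std::cout << "consumer " << c << " saw payload sum " << sum << "\n";
        });

    for (auto& t : producers) t.join();
    for (auto& q : river) q->Close();  // signal end of stream
    for (auto& t : consumers) t.join();
    return 0;
}
```

Across a cluster the queues would be replaced by network connections, but the producer and consumer code would stay sequential, which is the point of the river abstraction.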

  39. Generalization: Object-oriented Rivers • Rivers transport sub-class of record-set (= stream of objects) • record type and partitioning are part of subclass • Node transformers are data pumps • an object with river inputs and outputs • do late-binding to record-type • Programming becomes data flow programming • specify the pipelines • Compiler/Scheduler does data partitioning and “transformer” placement

  40. NT Cluster Sort as a Prototype • Using • data generation and • sort as a prototypical app • “Hello world” of distributed processing • goal: easy install & execute

  41. PennySort • Hardware • 266 Mhz Intel PPro • 64 MB SDRAM (10ns) • Dual Fujitsu DMA 3.2GB EIDE • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec

  42. Remote Install • Add a Registry entry to each remote node using RegConnectRegistry() and RegCreateKeyEx()
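
A hedged sketch of the step named on the slide, using the two Win32 calls it lists; the machine name and key path below are hypothetical placeholders, and error handling is minimal.

```cpp
// Sketch only: create a registry key on a remote node so the cluster app
// can be configured there. Machine name and key path are hypothetical.
#include <windows.h>
#include <iostream>

bool AddRemoteRegistryEntry(const wchar_t* machine, const wchar_t* subkey) {
    HKEY remoteRoot = nullptr;
    // Connect to HKEY_LOCAL_MACHINE on the remote node (requires admin rights
    // and the remote registry service running on that node).
    LONG rc = RegConnectRegistryW(machine, HKEY_LOCAL_MACHINE, &remoteRoot);
    if (rc != ERROR_SUCCESS) {
        std::wcerr << L"RegConnectRegistry failed: " << rc << L"\n";
        return false;
    }

    HKEY appKey = nullptr;
    DWORD disposition = 0;
    rc = RegCreateKeyExW(remoteRoot, subkey, 0, nullptr,
                         REG_OPTION_NON_VOLATILE, KEY_WRITE, nullptr,
                         &appKey, &disposition);
    if (rc == ERROR_SUCCESS) {
        // A real installer would now RegSetValueEx() the configuration values.
        RegCloseKey(appKey);
    }
    RegCloseKey(remoteRoot);
    return rc == ERROR_SUCCESS;
}

int main() {
    // Hypothetical node name and key path, for illustration only.
    bool ok = AddRemoteRegistryEntry(L"\\\\NODE01",
                                     L"SOFTWARE\\ClusterSortDemo");
    std::cout << (ok ? "entry created" : "failed") << "\n";
    return 0;
}
```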

  43. Cluster Startup / Execution • Setup: fill in a MULTI_QI struct and a COSERVERINFO struct, then call CoCreateInstanceEx() • Retrieve the remote object handle from the MULTI_QI struct • Invoke methods as usual [Diagram: the launcher holds a handle to a Sort() object on each node]
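
A hedged sketch of the DCOM activation sequence the slide describes; CLSID_ClusterSort, IID_ISort, and the node name are hypothetical placeholders for whatever sort component the cluster actually registers.

```cpp
// Sketch only: activate a (hypothetical) sort component on a remote node
// with CoCreateInstanceEx, then pull the interface pointer out of MULTI_QI.
#include <objbase.h>
#include <iostream>

// Placeholder GUIDs: a real app would use the GUIDs of its registered component.
static const CLSID CLSID_ClusterSort =
    {0x12345678, 0x1234, 0x1234, {0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0}};
static const IID IID_ISort =
    {0x87654321, 0x4321, 0x4321, {0x0f,0xed,0xcb,0xa9,0x87,0x65,0x43,0x21}};

int main() {
    HRESULT hr = CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    if (FAILED(hr)) return 1;

    // COSERVERINFO names the node that should host the object.
    COSERVERINFO server = {};
    server.pwszName = const_cast<wchar_t*>(L"NODE01");  // hypothetical node

    // One MULTI_QI entry per interface we want back from the activation.
    MULTI_QI qi = {};
    qi.pIID = &IID_ISort;

    hr = CoCreateInstanceEx(CLSID_ClusterSort, nullptr, CLSCTX_REMOTE_SERVER,
                            &server, 1, &qi);
    if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
        IUnknown* sorter = qi.pItf;  // handle to the remote object
        // With a real ISort definition we would cast this pointer and invoke
        // methods as usual; the DCOM proxy marshals each call to NODE01.
        sorter->Release();
    } else {
        std::cerr << "remote activation failed: 0x" << std::hex << hr << "\n";
    }

    CoUninitialize();
    return 0;
}
```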

  44. Cluster Sort Conceptual Model • Multiple data sources, multiple data destinations, multiple nodes • Disks -> Sockets -> Disk -> Disk [Diagram: each source node holds a mix of A, B, and C records; records are routed by key range over sockets so that each destination node ends up with all the records of its own range]
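
A single-process sketch of this conceptual model (hypothetical data; the real prototype moves records disk to socket to disk across NT nodes): every source partitions its records by key range, every destination gathers one range, and each destination then sorts its range locally.

```cpp
// Sketch only: in-process simulation of the cluster sort data flow.
// Sources hold mixed records (A*, B*, C*); records are routed by key range
// to the destination that owns that range, which then sorts locally.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Three sources, each with an unsorted mix of key ranges.
    std::vector<std::vector<std::string>> sources = {
        {"B7", "A3", "C1"}, {"C9", "A1", "B2"}, {"A8", "C4", "B5"}};

    // Destination index by key range: 'A' -> 0, 'B' -> 1, 'C' -> 2.
    auto dest_of = [](const std::string& key) { return key[0] - 'A'; };

    // "Split" phase: every source scatters its records to the destinations.
    std::vector<std::vector<std::string>> destinations(3);
    for (const auto& source : sources)
        for (const auto& rec : source)
            destinations[dest_of(rec)].push_back(rec);  // socket send in real life

    // "Sort" phase: each destination sorts the range it owns and writes it out.
    for (std::size_t d = 0; d < destinations.size(); ++d) {
        std::sort(destinations[d].begin(), destinations[d].end());
        std::cout << "destination " << d << ":";
        for (const auto& rec : destinations[d]) std::cout << " " << rec;
        std::cout << "\n";
    }
    // Concatenating destinations 0, 1, 2 yields the globally sorted output.
    return 0;
}
```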

  45. Summary • Clusters of hardware CyberBricks • all nodes are very intelligent • Processing migrates to where the power is • Disk, network, display controllers have a full-blown OS • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA) • The computer is a federated distributed system. • Software CyberBricks • a standard way to interconnect intelligent nodes • needs an execution model • partition & pipeline • RPC and Rivers • needs parallelism

  46. Recent Progress on Scaleable Servers. Jim Gray, Microsoft Research. Substantial progress has been made towards the goal of building supercomputers by composing arrays of commodity processors, disks, and networks into a cluster that provides a single system image. True, vector supers are still 10x faster than commodity processors on certain floating-point computations, but they cost disproportionately more. Indeed, the highest-performance computations are now performed by processor arrays. In the broader context of business and internet computing, processor arrays long ago surpassed mainframe performance, and for a tiny fraction of the cost. This talk first reviews this history and describes the current landscape of scaleable servers in the commercial, internet, and scientific segments. The talk then discusses the Achilles' heels of scaleable systems: programming tools and system management. There has been relatively little progress in either area. This suggests some important research areas for computer systems research.

  47. end

  48. What I’m Doing • TerraServer: Photo of the planet on the web • a database (not a file system) • 1TB now, 15 PB in 10 years • http://www.TerraServer.microsoft.com/ • Sloan Digital Sky Survey: picture of the universe • just getting started, cyberbricks for astronomers • http://www.sdss.org/ • Sorting: • one node pennysort (http://research.microsoft.com/barc/SortBenchmark/) • multinode: NT Cluster sort (shows off SAN and DCOM)

  49. What I’m Doing • NT Clusters: • failover: fault tolerance within a cluster • NT Cluster Sort: a balanced IO, cpu, and network benchmark • AlwaysUp: geographical fault tolerance • RAGS: random testing of SQL systems • a bug finder • Telepresence • Working with Gordon Bell on “the killer app” • FileCast and PowerCast • Cyberversity (an international, on-demand, free university)

  50. Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, fileserver, mail,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects
