
What Happens When Processing, Storage, Bandwidth are Free and Infinite?


Presentation Transcript


  1. What Happens When Processing, Storage, Bandwidth are Free and Infinite? Jim Gray, Microsoft Research

  2. Outline • Clusters of Hardware CyberBricks • all nodes are very intelligent • Software CyberBricks • standard way to interconnect intelligent nodes • What next? • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them • Computer is a federated distributed system.

  3. When Computers & Communication are Free • Traditional computer industry is 0 B$/year • All the costs are in • Content (good) • System Management (bad) • A vendor claims it costs 8$/MB/year to manage disk storage. • => WebTV (1GB drive) costs 8,000$/year to manage! • => 10 PB DB costs 80 Billion $/year to manage! • Automatic management is ESSENTIAL • In the meantime…
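A quick sanity check of that arithmetic, as a sketch in Java (the $8/MB/year figure and the 1 GB and 10 PB sizes come from the slide; decimal units are assumed):

```java
// Back-of-the-envelope check of the management-cost claim on slide 3.
public class StorageManagementCost {
    public static void main(String[] args) {
        double dollarsPerMBPerYear = 8.0;   // vendor claim quoted on the slide
        double webTvMB = 1_000.0;           // 1 GB drive, decimal units assumed
        double tenPetabyteMB = 1e10;        // 10 PB = 10^10 MB
        System.out.printf("WebTV drive: $%,.0f/year%n", dollarsPerMBPerYear * webTvMB);        // $8,000
        System.out.printf("10 PB DB:    $%,.0f/year%n", dollarsPerMBPerYear * tenPetabyteMB);  // $80,000,000,000
    }
}
```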

  4. 1980 Rule of Thumb • You need a systems programmer per MIPS • You need a Data Administrator per 10 GB

  5. One Person per MegaBuck • 1 breadbox ~ 5x a 1987 machine room • 48 GB is hand-held • One person does all the work (hardware, OS, net, DB, and app expert in one) • Cost/tps is 1,000x less: 25 micro-dollars per transaction • A megabuck buys 40 of these!!! [Figure: one box with 4x200 MHz cpus, 1/2 GB DRAM, 12 x 4GB disks, plus 3 x 7 x 4GB disk arrays]

  6. All God’s Children Have Clusters! Buying Computing By the Slice • People are buying computers by the gross • After all, they only cost 1k$/slice! • Clustering them together

  7. A cluster is a cluster is a cluster • It’s so natural, even mainframes cluster! • Looking closer at usage patterns, a few models emerge • Looking closer at sites, hierarchies, bunches, and functional specialization emerge • Which are the roses? Which are the briars?

  8. “Commercial” NT Clusters • 16-node Tandem Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Compaq Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)

  9. Tandem Oracle/NT • 27,383 tpmC • 71.50 $/tpmC • 4 x 6 cpus • 384 disks=2.7 TB

  10. Microsoft.com: ~150x4 nodes [Figure: The Microsoft.Com site diagram. Building 11 holds staging servers, internal WWW, log processing, SQL consolidators/reporting, and FTP/HTTP download servers; the main data center, the European and Japan data centers, and MOSWest host the www, home, premium, search, register, support, msid.msn, cdm, activex, and FTP.microsoft.com server groups plus live SQL servers. A typical node is a 4-processor P5/P6 with 256 MB-1 GB RAM and 12-180 GB of disk, averaging roughly $24K-$128K per server, with per-group FY98 growth forecasts. The groups are tied together by switched Ethernet, FDDI rings, and Gigaswitches, and reach the Internet through routers over 2 OC3 (100 Mbps each) and 13 DS3 (45 Mbps each) links.]

  11. HotMail: ~400 Computers

  12. Inktomi (HotBot), WebTV: > 200 nodes • Inktomi: ~250 UltraSparcs • web crawl • index crawled web and save index • Return search results on demand • Track ads and click-thrus • ACID vs BASE (Basically Available, Soft-state, Eventually consistent) • WebTV • ~200 UltraSparcs • Render pages, provide email • ~4 Network Appliance NFS file servers • A large Oracle app tracking customers

  13. Loki: Pentium Clusters for Science http://loki-www.lanl.gov/ • 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GB RAM + 50 GB disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (but that is the 1996 price) • Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html • Scientists want cheap mips.
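For what it is worth, the price/performance those two figures imply can be worked out directly (a sketch; only the $63,000 and 1.2 Gflops numbers come from the slide):

```java
// Price/performance of the Loki cluster as quoted on slide 13.
public class LokiPricePerformance {
    public static void main(String[] args) {
        double priceDollars = 63_000.0;   // 1996 price, from the slide
        double sustainedGflops = 1.2;     // "real Gflops", from the slide
        System.out.printf("$%,.0f per sustained Gflop%n", priceDollars / sustainedGflops);              // $52,500
        System.out.printf("%,.0f sustained flops per dollar%n", sustainedGflops * 1e9 / priceDollars);  // ~19,000
    }
}
```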

  14. Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000x1-node PPro • LLNL/IBM: 512x8 PowerPC (SP2) • LANL/Cray: ? • Maui Supercomputer Center: 512x1 SP2

  15. Berkeley NOW (Network Of Workstations) Project http://now.cs.berkeley.edu/ • 105 nodes • Sun UltraSparc 170, 128 MB, 2x2GB disk • Myrinet interconnect (2x160MBps per node) • SBus (30MBps) limited • GLUNIX layer above Solaris • Inktomi (HotBot search) • NAS Parallel Benchmarks • Crypto cracker • Sort 9 GB per second

  16. Wisconsin COW • 40 UltraSparcs, 64 MB + 2x2GB disk + Myrinet • SunOS • Used as a compute engine

  17. Andrew Chien’s JBOB http://www-csag.cs.uiuc.edu/individual/achien.html • 48 nodes • 36 HP Kayak boxes: 2xPII, 128 MB, 1 disk • 10 Compaq Workstation 6000 boxes: 2xPII, 128 MB, 1 disk • 32 Myrinet-connected & 16 ServerNet-connected • Operational • All running NT

  18. NCSA Cluster • The National Center for Supercomputing Applications, University of Illinois @ Urbana • 500 Pentium cpus, 2k disks, SAN • Compaq + HP + Myricom • A Super Computer for 3M$ • Classic Fortran/MPI programming • NT + DCOM programming model

  19. The Bricks of Cyberspace: 4 B PCs (1 Bips, 0.1 GB DRAM, 10 GB disk, 1 Gbps net; B=G) • Cost 1,000 $ • Come with • NT • DBMS • High speed net • System management • GUI / OOUI • Tools • Compatible with everyone else • CyberBricks

  20. Super Server: 4T Machine • Array of 1,000 4B machines • 1 Bips processors • 1 BB DRAM • 10 BB disks • 1 Bbps comm lines • 1 TB tape robot • A few megabucks • Challenge: • Manageability • Programmability • Security • Availability • Scaleability • Affordability • As easy as a single system • Future servers are CLUSTERS of processors, discs • Distributed database techniques make clusters work [Figure: a rack of CyberBricks, each a 4B machine labeled CPU, 5 GB RAM, 50 GB disc]

  21. Cluster VisionBuying Computers by the Slice • Rack & Stack • Mail-order components • Plug them into the cluster • Modular growth without limits • Grow by adding small modules • Fault tolerance: • Spare modules mask failures • Parallel execution & data search • Use multiple processors and disks • Clients and servers made from the same stuff • Inexpensive: built with commodity CyberBricks

  22. Nostalgia: Behemoth in the Basement • Today’s PC is yesterday’s supercomputer • Can use LOTS of them • Main apps changed: scientific → commercial → web • Web & transaction servers • Data mining, web farming

  23. SMP → nUMA: BIG FAT SERVERS • Directory-based caching lets you build large SMPs • Every vendor building a HUGE SMP (256 way) • 3x slower remote memory • 8-level memory hierarchy: L1, L2 cache, DRAM, remote DRAM (3, 6, 9, …), disk cache, disk, tape cache, tape • Needs 64 bit addressing • nUMA-sensitive OS (not clear who will do it) • Or hypervisor like IBM LSF, Stanford Disco www-flash.stanford.edu/Hive/papers.html • You get an expensive cluster-in-a-box with a very fast network

  24. Thesis: Many little beat few big [Figure: the computer spectrum from mainframe ($1 million, 14") through mini ($100K, 9") and micro ($10K, 5.25"/3.5") to nano and pico processors (2.5"/1.8"), with capacities from 1 MB to 100 TB and access times from 10 ps and 10 ns RAM through 10 µs RAM and 10 ms disc to a 10 s tape archive] • 1 M SPECmarks, 1 TFLOP • 10^6 clocks to bulk ram • Event-horizon on chip • VM reincarnated • Multi-program cache, on-chip SMP • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?

  25. A Hypothetical Question: Taking things to the limit • Moore’s law 100x per decade: • Exa-instructions per second in 30 years • Exa-bit memory chips • Exa-byte disks • Gilder’s Law of the Telecosm: 3x/year more bandwidth = 60,000x per decade! • 40 Gbps per fiber today
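Checking those extrapolations (a sketch; the 100x-per-decade and 3x-per-year growth rates are the slide's):

```java
// Growth-rate arithmetic behind slide 25.
public class GrowthLaws {
    public static void main(String[] args) {
        double moorePerDecade = 100.0;    // Moore's law, per the slide
        double gilderPerYear = 3.0;       // Gilder's law, per the slide
        System.out.printf("Moore over 30 years:   %,.0fx%n", Math.pow(moorePerDecade, 3));  // 1,000,000x
        System.out.printf("Gilder over a decade:  %,.0fx%n", Math.pow(gilderPerYear, 10));  // 59,049x, i.e. ~60,000x
    }
}
```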

  26. Grove’s Law • Link Bandwidth doubles every 100 years! • Not much has happened to telephones lately • Still twisted pair

  27. Gilder’s Telecosm Law: 3x bandwidth/year for 25 more years • Today: • 10 Gbps per channel • 4 channels per fiber: 40 Gbps • 32 fibers/bundle = 1.2 Tbps/bundle • In lab: 3 Tbps/fiber (400 x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth

  28. Networking: BIG!! changes coming! • CHALLENGE: reduce the software tax on messages • Today: 30 K ins + 10 ins/byte • Goal: 1 K ins + .01 ins/byte • Best bet: SAN/VIA • Smart NICs • Special protocol • User-level net IO (like disk) • Technology • 10 GBps bus “now” • 1 Gbps links “now” • 1 Tbps links in 10 years • Fast & cheap switches • Standard interconnects: processor-processor, processor-device (=processor) • Deregulation WILL work someday
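To make the software tax concrete, here is a hedged sketch of what the two cost models imply. The 30 K + 10 ins/byte and 1 K + 0.01 ins/byte figures are the slide's; the 64 KB message size and the 40 MBps target rate are illustrative assumptions:

```java
// Instruction cost of messaging under the "today" and "goal" models from slide 28.
public class SoftwareTax {
    // cost model: fixed per-message instructions + per-byte instructions
    static double instructions(double perMessage, double perByte, double bytes) {
        return perMessage + perByte * bytes;
    }

    public static void main(String[] args) {
        double msgBytes = 64 * 1024;                           // illustrative 64 KB message
        double today = instructions(30_000, 10, msgBytes);     // ~685 K instructions
        double goal  = instructions(1_000, 0.01, msgBytes);    // ~1.7 K instructions
        System.out.printf("64 KB message: today %,.0f ins, goal %,.0f ins (%.0fx better)%n",
                today, goal, today / goal);

        // Cost of sustaining 40 MBps in 64 KB messages:
        double msgsPerSec = 40e6 / msgBytes;
        System.out.printf("40 MBps: today %,.0f Mips, goal %,.1f Mips%n",
                today * msgsPerSec / 1e6, goal * msgsPerSec / 1e6);
    }
}
```

Under these assumptions the "today" model burns a few hundred Mips on messaging alone at 40 MBps, which lines up with the next slide's "100% cpu @ 40MBps" observation.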

  29. What if Networking Was as Cheap As Disk IO? • TCP/IP on Unix/NT: 100% cpu @ 40MBps • Disk on Unix/NT: 8% cpu @ 40MBps • Why the difference? • Host does TCP/IP packetizing, checksum, …, flow control, small buffers • Host Bus Adapter does SCSI packetizing, checksum, …, flow control, DMA

  30. The Promise of SAN/VIA: 10x better in 2 years • Today: • wires are 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency is ~300 µs • In two years: • wires are 100 MBps (1 Gbps Ethernet, ServerNet, …) • tcp/ip ~100 MBps at 10% of each processor • round-trip latency is 20 µs • works in lab today; assumes app uses zero-copy Winsock2 api. See http://www.viarch.org/

  31. Functionally Specialized Cards • Storage • Network • Display [Figure: a card with a P-mips processor, M MB DRAM, and an ASIC; today P = 50 mips, M = 2 MB; in a few years P = 200 mips, M = 64 MB]

  32. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get • several network interfaces • a Postscript engine • cpu • memory • software • a spooler (soon) • and… a print engine.

  33. System On A Chip • Integrate Processing with memory on one chip • chip is 75% memory now • 1MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM,… projects abound • Integrate Networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet,.. Logic on chip. • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.

  34. All Device Controllers will be Cray 1’s • TODAY • Disk controller is a 10 mips risc engine with 2MB DRAM • NIC is similar power • SOON • Will become 100 mips systems with 100 MB DRAM. • They are nodes in a federation (can run Oracle on NT in the disk controller). • Advantages • Uniform programming model • Great tools • Security • Economics (cyberbricks) • Move computation to data (minimize traffic) [Figure: devices hang off a Tera Byte Backplane alongside the central processor & memory]
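One way to picture "move computation to data": instead of shipping raw blocks to the host and filtering there, the host ships a predicate to the controller and only the matching records cross the interconnect. A minimal Java sketch of that idea; SmartDiskController, DiskRecord, and the select call are hypothetical illustrations, not any real controller API:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical interface a "Cray-1 class" disk controller might export to its host.
interface SmartDiskController {
    byte[] readBlock(long lba);                                     // conventional path: raw blocks go to the host
    List<DiskRecord> select(String table, Predicate<DiskRecord> p); // radical path: the filter runs in the controller
}

// Hypothetical record type returned by the controller.
class DiskRecord {
    final long key;
    final String payload;
    DiskRecord(long key, String payload) { this.key = key; this.payload = payload; }
}

class MoveComputationToData {
    // Only matching records cross the interconnect, so traffic shrinks with
    // the selectivity of the predicate instead of growing with the table size.
    static List<DiskRecord> seattleCustomers(SmartDiskController disk) {
        return disk.select("customers", r -> r.payload.contains("Seattle"));
    }
}
```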

  35. With Tera Byte Interconnect and Super Computer Adapters • Processing is incidental to • Networking • Storage • UI • Disk controller/NIC is • faster than the device • close to the device • Can borrow device package & power • So use idle capacity for computation. • Run app in device. [Figure: Tera Byte Backplane]

  36. Implications • Conventional: • Offload device handling to NIC/HBA • higher level protocols: I2O, NASD, VIA… • SMP and cluster parallelism is important. • Radical: • Move app to NIC/device controller • higher-higher level protocols: CORBA / DCOM. • Cluster parallelism is VERY important. [Figure: Tera Byte Backplane with central processor & memory]

  37. How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation. • Each node does not completely trust the others. • Nodes use RPC to talk to each other • CORBA? DCOM? IIOP? RMI? • One or all of the above. • Huge leverage in high-level interfaces. • Same old distributed system story. [Figure: two application stacks talking over the wire(s): applications on top, then RPC / streams / datagrams, then VIAL/VIPL]
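As one concrete flavor of that RPC layer, here is a minimal Java RMI sketch (RMI is one of the options the slide names; the DiskService interface, its filter method, and the registry name are hypothetical):

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical high-level interface a smart device/node might export over RMI.
interface DiskService extends Remote {
    String[] filter(String table, String predicate) throws RemoteException;
}

class FederationClient {
    public static void main(String[] args) throws Exception {
        // Look up the remote node by name and invoke it like a local object;
        // the stubs and marshalling stay hidden behind the interface.
        DiskService disk = (DiskService) Naming.lookup("rmi://disk-controller-7/DiskService");
        String[] rows = disk.filter("customers", "city = 'Seattle'");
        System.out.println(rows.length + " matching rows returned by the device");
    }
}
```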

  38. Punch Line • The huge clusters we saw are prototypes for this: • A federation of functionally specialized nodes • Each node shrinks to a “point” device with embedded processing. • Each node / device is autonomous • Each talks a high-level protocol

  39. Outline • Hardware CyberBricks • all nodes are very intelligent • Software CyberBricks • standard way to interconnect intelligent nodes • What next? • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them • Computer is a federated distributed system.

  40. Software CyberBricks: Objects! • It’s a zoo • Objects and 3-tier computing (transactions) • Give natural distribution & parallelism • Give remote management! • TP & Web: Dispatch RPCs to pool of object servers • Components are a 1B$ business today!

  41. The COMponent Promise • Objects are Software CyberBricks • productivity breakthrough (plug ins) • manageability breakthrough (modules) • Microsoft promise: DCOM + ActiveX + … • IBM/Sun/Oracle/Netscape promise: CORBA + OpenDoc + Java Beans + … • Both promise • parallel distributed execution • centralized management of distributed system • Both camps share key goals: • Encapsulation: hide implementation • Polymorphism: generic ops (key to GUI and reuse) • Uniform naming • Discovery: finding a service • Fault handling: transactions • Versioning: allow upgrades • Transparency: local/remote • Security: who has authority • Shrink-wrap: minimal inheritance • Automation: easy

  42. History and Alphabet Soup • Microsoft DCOM is based on OSF-DCE technology • DCOM and ActiveX extend it [Figure: 1985-1995 timeline of standards bodies and technologies: UNIX International, Open Software Foundation (OSF), X/Open, Open Group; OSF DCE (DCE RPC, GUIDs, IDL, DNS, Kerberos); ODBC, XA / TX; Object Management Group (OMG), CORBA, Solaris; NT, COM, DCOM]

  43. Objects Meet Databases: basis for universal data servers, access, & integration • Object-oriented (COM oriented) interface to data • Breaks DBMS into components • Anything can be a data source • Optimization/navigation “on top of” other data sources • Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects [Figure: a DBMS engine fronting databases, spreadsheets, photos, mail, maps, documents]

  44. The BIG Picture: Components and transactions • Software modules are objects • Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers) • Standard interfaces allow software plug-ins • Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
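To ground the transaction point, here is a minimal JDBC sketch of tying two updates into one all-or-nothing unit (JDBC stands in here for whatever the ORB/TP monitor would coordinate; the connection URL and table are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

class AtomicJob {
    public static void main(String[] args) throws SQLException {
        // Hypothetical data source; any JDBC-accessible server would do.
        try (Connection con = DriverManager.getConnection("jdbc:example://server/bank")) {
            con.setAutoCommit(false);              // treat the following work as one unit
            try (Statement s = con.createStatement()) {
                s.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
                s.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
                con.commit();                      // all-or-nothing: both updates become durable together
            } catch (SQLException e) {
                con.rollback();                    // on failure, neither update is visible
                throw e;
            }
        }
    }
}
```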

  45. The OO Points So Far • Objects are software Cyber Bricks • Object interconnect standards are emerging • Cyber Bricks become Federated Systems. • Next points: • put processing close to data • do parallel processing.

  46. Transaction Processing Evolution to Three Tier: Intelligence migrated to clients • Mainframe batch processing (centralized) • Dumb terminals & Remote Job Entry • Intelligent terminals, database backends • Workflow systems, Object Request Brokers, Application Generators [Figure: mainframe with cards → green-screen 3270 terminals with a TP Monitor → active clients with an ORB and server]

  47. Web Evolution to Three Tier: Intelligence migrated to clients (like TP) • Character-mode clients, smart servers • GUI browsers - web file servers • GUI plugins - web dispatchers - CGI • Smart clients - web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server [Figure: clients evolve from green screens and archie/gopher/WAIS through Mosaic to NS & IE and active clients talking to a web server]

  48. PC Evolution to Three Tier: Intelligence migrated to server • Stand-alone PC (centralized) • PC + file & print server: message per I/O • PC + database server: message per SQL statement • PC + app server: message per transaction • ActiveX client, ORB ActiveX server, Xscript

  49. Why Did Everyone Go To Three-Tier? • Manageability • Business rules must be with data • Middleware operations tools • Performance (scaleability) • Server resources are precious • ORB dispatches requests to server pools • Technology & physics • Put UI processing near the user • Put shared data processing near the shared data • Minimizes data moves • Encapsulate / modularity [Figure: presentation → workflow → business objects → database]

  50. Why Put Business Objects at Server? • MOM’s Business Objects: • Customer comes to store with list • Gives list to clerk • Clerk gets goods, makes invoice • Customer pays clerk, gets goods • Easy to manage • Clerk controls access • Encapsulation • DAD’s Raw Data: • Customer comes to store • Takes what he wants • Fills out invoice • Leaves money for goods • Easy to build • No clerks
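In code, the contrast looks roughly like this hedged sketch; both interfaces and every name in it are hypothetical illustrations of the two styles, not any particular product API:

```java
import java.math.BigDecimal;
import java.util.List;

// "DAD's raw data": the client sees rows and must enforce the rules itself.
interface RawOrderData {
    List<String> readRows(String table);          // easy to build, but every client re-implements the clerk
    void writeRow(String table, String row);
}

// "MOM's business objects": the server-side clerk encapsulates the rules.
interface OrderClerk {
    // One call per business action; access checks, invoicing, and payment
    // all happen behind this interface, next to the data.
    String placeOrder(String customerId, List<String> shoppingList, BigDecimal payment);
}

class ThreeTierClient {
    static String buy(OrderClerk clerk) {
        // The client states intent; it never touches raw rows.
        return clerk.placeOrder("cust-42", List.of("milk", "bread"), new BigDecimal("7.50"));
    }
}
```

The raw-data interface is trivially easy to build, but every client must get the clerk's rules right; the business-object interface keeps the rules and the access control in one place, close to the data.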
