
Scaleability


Presentation Transcript


  1. Scaleability
     Jim Gray, Gray@Microsoft.com
     (with help from Gordon Bell, George Spix, Catharine van Ingen)
     Course schedule:
            Mon        Tue           Wed          Thur             Fri
     9:00   Overview   TP mons       Log          Files & Buffers  B-tree
     11:00  Faults     Lock Theory   ResMgr       COM+             Access Paths
     1:30   Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
     3:30   T Models   Queues        Adv TM       Replication      Benchmark
     7:00   Party      Workflow      Cyberbrick                    Party

  2. A peta-op business app? • P&G and friends pay for the web (like they paid for broadcast television): no new money, but given Moore's law, traditional advertising revenues can pay for all of our connectivity: voice, video, data… (presuming we figure out how to allow them to brand the experience) • Advertisers pay for impressions and the ability to analyze same • A terabyte sort a minute, growing to one a second • Bisection bandwidth of ~20 GBps, growing to ~200 GBps • Really a tera-op business app (today's portals)
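
A quick check of the first figure (mine, not on the slide): a terabyte sort moves every byte across the machine's bisection at least once, so a sort a minute needs

$$ \frac{10^{12}\ \text{bytes}}{60\ \text{s}} \approx 17\ \text{GB/s}, $$

in line with the ~20 GBps quoted.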

  3. Scaleability: Scale Up and Scale Out
     • Grow up with SMP: 4xP6 is now standard
     • Grow out with clusters: a cluster has inexpensive parts
     [Figure: scale-up from Personal System to Departmental Server to SMP Super
     Server; scale-out to a cluster of PCs]

  4. There'll be Billions, Trillions of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous

  5. Billions, Trillions of Clients Need Millions of Servers
     • All clients are networked to servers
       • May be nomadic or on-demand
     • Fast clients want faster servers
     • Servers provide
       • Shared data
       • Control
       • Coordination
       • Communication
     [Figure: mobile and fixed clients, servers, and super servers]

  6. Thesis: Many Little Beat Few Big
     [Figure: processor spectrum (pico, nano, micro, mini, mainframe; price points
     $10 K, $100 K, $1 million), disk form factors (1.8", 2.5", 3.5", 5.25", 9", 14"),
     and a storage hierarchy from 10 pico-second RAM (1 MB) through 10 nano-second RAM
     (100 MB), 10 microsecond RAM (10 GB), 10 millisecond disc (1 TB), to 10 second
     tape archive (100 TB)]
     • The coming processor: 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM;
       event-horizon on chip; VM reincarnated; multi-program cache, on-chip SMP
     • A "smoking, hairy golf ball"
     • How to connect the many little parts?
     • How to program the many little parts?
     • Fault tolerance & management?

  7. 4 B PC’s (1 Bips, .1GB dram, 10 GB disk 1 Gbps Net, B=G)The Bricks of Cyberspace • Cost 1,000 $ • Come with • NT • DBMS • High speed Net • System management • GUI / OOUI • Tools • Compatible with everyone else • CyberBricks

  8. Computers Shrink to a Point • Disks: 100x in 10 years; a 2 TB 3.5” drive • Shrunk to 1”, that is 200 GB • The disk is a super computer! • This is already true of printers and “terminals” [Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
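
A rough check of that figure (mine, not on the slide), assuming capacity scales with platter area:

$$ 2\ \text{TB} \times \left(\tfrac{1}{3.5}\right)^{2} \approx 163\ \text{GB}, $$

in the neighborhood of the 200 GB quoted.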

  9. Super Server: 4T Machine
     • Array of 1,000 4B machines
       • 1 Bips processors
       • 1 BB DRAM
       • 10 BB disks
       • 1 Bbps comm lines
       • 1 TB tape robot
     • A few megabucks
     • Challenge: manageability, programmability, security, availability,
       scaleability, affordability
     • As easy as a single system
     [Figure: the CyberBrick, a 4B machine: CPU, 5 GB RAM, 50 GB disc]
     • Future servers are CLUSTERS of processors, discs
     • Distributed database techniques make clusters work

  10. Cluster Vision: Buying Computers by the Slice • Rack & stack • Mail-order components • Plug them into the cluster • Modular growth without limits • Grow by adding small modules • Fault tolerance: • Spare modules mask failures • Parallel execution & data search • Use multiple processors and disks • Clients and servers made from the same stuff • Inexpensive: built with commodity CyberBricks

  11. Systems 30 Years Ago • MegaBuck per Mega Instruction Per Second (mips) • MegaBuck per MegaByte • Sys Admin & Data Admin per MegaBuck

  12. Disks of 30 Years Ago • 10 MB • Failed every few weeks

  13. 1988: IBM DB2 + CICS Mainframe, 65 tps
      • IBM 4391
      • Simulated network of 800 clients
      • 2 M$ computer
      • Staff of 6 to do the benchmark
      [Figure: 2 x 3725 network controllers; refrigerator-sized CPU;
      16 GB disk farm (4 x 8 x .5 GB)]

  14. 1987: Tandem Mini @ 256 tps
      • 14 M$ computer (Tandem)
      • A dozen people (1.8 M$/y): admin, performance, hardware, network, DB, and OS
        experts, plus auditor and manager
      • False floor, 2 rooms of machines
      [Figure: 32-node processor array; 40 GB disk array (80 drives);
      simulated 25,600 clients]

  15. 1997: 9 Years Later, 1 Person and 1 Box = 1,250 tps
      • 1 breadbox: ~5x the 1987 machine room
      • 23 GB is hand-held
      • One person does all the work (the hardware, OS, net, DB, and app expert)
      • Cost/tps is 100,000x less: 5 micro-dollars per transaction
      [Figure: 4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk, 3 x 7 x 4 GB disk arrays]

  16. What Happened? Where Did the 100,000x Come From?
      [Chart: price vs time for mainframe, mini, and micro]
      • Moore's law: 100X (at most)
      • Software improvements: 10X (at most)
      • Commodity pricing: 100X (at least)
      • Total: 100,000X
      • 100x from commodity:
        • DBMS was 100 K$ to start; now 1 K$ to start
        • IBM 390 MIPS is 7.5 K$ today
        • Intel MIPS is 10 $ today
        • Commodity disk is 50 $/GB vs 1,500 $/GB
        • ...
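
The multipliers compound, which is the whole arithmetic:

$$ 100 \times 10 \times 100 = 10^{5} = 100{,}000\times. $$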

  17. Web & Server Farms, Server Consolidation
      Density per sq ft:
                   SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2    PoPC
      cpus             2.1     4.7         7.0        4.7       5.0    13.3
      specint         29.0    60.5       132.7       79.3      72.3   253.3
      ram (GB)         4.1     4.7         7.0        0.6       5.0     6.8
      disks            1.3     0.5         5.2        0.0       2.5    13.3
      • http://www.exodus.com charges by mbps times sqft
      • Standard package, full height, fully populated, 3.5” disks
      • HP, DELL, Compaq are trading places wrt the rack-mount lead
      • PoPC: Celeron NLX shoeboxes (on-chip, at-speed L2): 1,000 nodes in
        48 (24x2) sq ft, $650 K from Arrow (3-yr warranty!)

  18. Application Taxonomy
      • Technical:
        • General purpose, non-parallelizable codes: PCs have it!
        • Vectorizable, and vectorizable & parallelizable (supers & small DSMs)
        • Hand-tuned, one-of: MPP coarse grain, MPP embarrassingly parallel
          (clusters of PCs)
      • Commercial:
        • Database, Database/TP
        • Web host
        • Stream audio/video
      • If central control & rich, then IBM or large SMPs; else PC clusters

  19. Peta Scale Computing: peta scale w/ traditional balance
                                        2000                       2010
      1 PIPS processors (10^15 ips)     10^6 cpus @ 10^9 ips       10^4 cpus @ 10^11 ips
      10 PB of DRAM                     10^8 chips @ 10^7 bytes    10^6 chips @ 10^9 bytes
      10 PBps memory bandwidth
      1 PBps IO bandwidth               10^8 disks @ 10^7 Bps      10^7 disks @ 10^8 Bps
      100 PB of disk storage            10^5 disks @ 10^10 B       10^3 disks @ 10^12 B
      10 EB of tape storage             10^7 tapes @ 10^10 B       10^5 tapes @ 10^12 B
      • 10x every 5 years, 100x every 10 (1000x in 20 if SC)
      • Except: memory & IO bandwidth

  20. “I think there is a world market for maybe five computers.” (Thomas Watson Senior, Chairman of IBM, 1943)

  21. Microsoft.com: ~150x4 Nodes: A Crowd
      [Diagram: the Microsoft.com server farm: www, home, premium, search, register,
      msid, support, activex, cdm, and FTP/download servers plus SQL clusters, staging
      servers, and feeder LANs, spread across the internal site, MOSWest, and the
      European and Japan data centers; a typical node is a 4xP6 with 256-512 MB RAM
      and 28-160 GB of disk, at an average cost of $25 K-$83 K; sites connect through
      FDDI rings and Gigaswitches to the Internet over multiple DS3 (45 Mbps) and
      OC3 (100 Mbps) links]

  22. HotMail (a year ago): a ~400-computer crowd (now 2x bigger)

  23. DB Clusters (crowds) • 16-node Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)

  24. The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8 x 400 MHz Alpha cpus • 10 GB DRAM • 324 9.2 GB StorageWorks disks (3 TB raw, 2.4 TB of RAID5) • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0

  25. TerraServer: Lots of Web Hits
                    Total      Average    Peak
      Hits          1,065 m    8.1 m      29 m
      Queries         877 m    6.7 m      18 m
      Page Views      742 m    5.6 m      15 m
      Images          170 m    1.3 m      6.6 m
      Users           6.4 m    48 k       76 k
      Sessions         10 m    77 k       125 k
      [Chart: daily hit counts, page views, images, and DB queries, 6/22/98-10/26/98]
      • A billion web hits!
      • 1 TB, largest SQL DB on the Web
      • 100 Qps average, 1,000 Qps peak
      • 877 M SQL queries so far

  26. TerraServer Availability • Operating for 13 months • Unscheduled outage: 2.9 hrs • Scheduled outage: 2.0 hrs (software upgrades) • Availability: 99.93% overall up • No NT failures (ever) • One SQL7 Beta2 bug • One major operator-assisted outage

  27. Backup / Restore

  28. Windows NT Versus UNIX: Best Results on an SMP • Semi-log plot shows a 3x (~2 year) lead by UNIX • Does not show Oracle/Alpha Cluster at 100,000 tpmC • All these numbers are off-scale huge (40,000 active users?)

  29. TPC-C Improvements (MS SQL) • 250%/year on price, 100%/year on performance • Bottleneck is the 3 GB address space • 40% hardware, 100% software, 100% PC technology

  30. UNIX (dis) Economy Of Scale

  31. Two Different Pricing Regimes (these are late-1998 prices)

  32. Storage Latency: How Far Away Is the Data?
      (access time in clock ticks, with an everyday analogy)
      Registers            1       My head       1 min
      On-chip cache        2       This room
      On-board cache       10      This resort   10 min
      Memory               100     Los Angeles   1.5 hr
      Disk                 10^6    Pluto         2 years
      Tape/optical robot   10^9    Andromeda     2,000 years
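
The analogy has a simple rule behind it, worth making explicit: one clock tick is scaled to one minute of human time. A small sketch (mine, not from the slide) that regenerates the right-hand column:

```python
# Scale each access time so one clock tick = one minute of human time;
# the storage hierarchy then reads as travel time.
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365

latency_ticks = {            # clock ticks, from the slide
    "registers (my head)": 1,
    "on-chip cache (this room)": 2,
    "on-board cache (this resort)": 10,
    "memory (Los Angeles)": 100,
    "disk (Pluto)": 10**6,
    "tape robot (Andromeda)": 10**9,
}

for level, ticks in latency_ticks.items():
    if ticks >= MINUTES_PER_YEAR:
        human = f"{ticks / MINUTES_PER_YEAR:,.0f} years"
    elif ticks >= MINUTES_PER_HOUR:
        human = f"{ticks / MINUTES_PER_HOUR:.1f} hours"
    else:
        human = f"{ticks} min"
    print(f"{level:30s} {ticks:>13,d} ticks ~ {human} away")
```

Memory at 100 ticks comes out to about 1.7 hours and tape to about 1,900 years, matching the slide's Los Angeles and Andromeda figures.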

  33. Thesis: Performance = Storage Accesses, not Instructions Executed
      • In the “old days” we counted instructions and IOs
      • Now we count memory references
      • Processors wait most of the time
      [Chart: where the time goes: clock ticks used by AlphaSort: disc wait, sort
      (I-cache miss, D-cache miss, B-cache data miss), OS, memory wait]
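
An illustrative micro-benchmark of the thesis (mine, not from the talk): the two passes below do the same number of additions, but the random-order pass misses the cache on almost every reference and runs many times slower.

```python
# Same arithmetic work, very different memory behavior: streaming access
# is prefetch-friendly; a random visiting order defeats the caches.
import time
import numpy as np

N = 20_000_000                        # ~160 MB of float64, far larger than any cache
data = np.ones(N)

t0 = time.perf_counter()
data.sum()                            # sequential pass: memory streams in
t1 = time.perf_counter()

idx = np.random.permutation(N)        # a random visiting order
t2 = time.perf_counter()
data[idx].sum()                       # same adds, roughly one cache miss per element
t3 = time.perf_counter()

print(f"sequential: {t1 - t0:.2f}s   random order: {t3 - t2:.2f}s")
```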

  34. Storage Hierarchy (10 levels) • Registers; cache L1, L2 • Main (1, 2, 3 if NUMA) • Disk (1 cached, 2) • Tape (1 mounted, 2)

  35. Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
      [Charts: size vs speed and price vs speed for cache, main memory, secondary
      (online disc), nearline tape, and offline tape; typical system capacity runs
      10^3 to 10^15 bytes and price runs 10^4 down to 10^-4 $/MB, against access
      times of 10^-9 to 10^3 seconds]

  36. Meta-Message: Technology Ratios Are Important • If everything gets faster & cheaper at the same rate THEN nothing really changes. • Things getting MUCH BETTER: • communication speed & cost 1,000x • processor speed & cost 100x • storage size & cost 100x • Things staying about the same • speed of light (more or less constant) • people (10x more expensive) • storage speed (only 10x better)

  37. Storage Ratios Changed • 10x better access time • 10x more bandwidth • 4,000x lower media price • DRAM:disk media price ratio went from 100:1 to 10:1 to 50:1

  38. The Pico Processor • 1 M SPECmarks • 10^6 clocks/fault to bulk RAM • Event-horizon on chip • VM reincarnated • Multi-program cache • Terror Bytes!

  39. Bottleneck Analysis
      • Drawn to linear scale:
        Theoretical bus bandwidth   422 MBps (= 66 MHz x 64 bits)
        Memory read/write          ~150 MBps
        MemCopy                     ~50 MBps
        Disk R/W                     ~9 MBps
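
The same numbers, rendered the way the slide title suggests (a sketch, not from the talk): one printed bar per bandwidth, on a linear scale, which makes the disk bottleneck hard to miss.

```python
# Print the slide's bandwidths as linear-scale bars: one '#' per 5 MBps.
bandwidths_MBps = {
    "theoretical bus (66 MHz x 64 bits)": 422,
    "memory read/write": 150,
    "memcopy": 50,
    "disk read/write": 9,
}

for name, mbps in bandwidths_MBps.items():
    bar = "#" * max(1, mbps // 5)
    print(f"{name:36s} {mbps:>4d} MBps  {bar}")
```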

  40. Bottleneck Analysis: NTFS Read/Write
      • 18 Ultra 3 SCSI disks on 4 strings (2x4 and 2x5), 3 PCI 64
      • ~155 MBps unbuffered read (175 raw)
      • ~95 MBps unbuffered write
      • Good, but 10x down from our UNIX brethren (SGI, SUN)
      [Figure: data path: adapters ~70 MBps each, PCI ~110 MBps,
      memory read/write ~250 MBps]

  41. PennySort • Hardware • 266 MHz Intel PPro • 64 MB SDRAM (10 ns) • Dual Fujitsu DMA 3.2 GB EIDE disks • Software • NT Workstation 4.3 • NT 5 sort • Performance • Sort 15 M 100-byte records (~1.5 GB) • Disk to disk • Elapsed time 820 sec • CPU time = 404 sec

  42. PennySort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
      • How much can you sort for a penny?
        • Hardware and software cost, depreciated over 3 years
        • 1 M$ system gets about 1 second
        • 1 K$ system gets about 1,000 seconds
        • Time (seconds) = 946,080 / SystemPrice ($)
      • Input and output are disk resident
      • Input is
        • 100-byte records (random data)
        • Key is the first 10 bytes
      • Must create an output file and fill it with a sorted version of the input file
      • Daytona (product) and Indy (special) categories
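
The budget arithmetic behind that constant, as a minimal sketch (the function name is mine): three years is 94,608,000 seconds and a dollar is 100 pennies, so one penny buys 946,080 / price seconds, matching both examples above.

```python
# Seconds of machine time one penny buys, with the system price
# depreciated over 3 years (94,608,000 seconds).
SECONDS_PER_3_YEARS = 3 * 365 * 24 * 3600      # 94,608,000
PENNIES_PER_DOLLAR = 100

def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of system time that one penny buys."""
    return SECONDS_PER_3_YEARS / (PENNIES_PER_DOLLAR * system_price_dollars)

print(penny_budget_seconds(1_000_000))   # ~0.95 s: a 1 M$ system gets about a second
print(penny_budget_seconds(1_000))       # ~946 s:  a 1 K$ system gets about 1,000 seconds
```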

  43. How Good Is NT5 Sort?
      • CPU and IO not overlapped
      • System should be able to sort 2x more
      • RAM has spare capacity
      • Disk is space saturated (1.5 GB in, 1.5 GB out on a 3 GB drive);
        need an extra 3 GB drive or a >6 GB drive
      [Chart: utilization of disk (fixed), CPU, and RAM]
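
The missing overlap is simple double buffering: start reading the next chunk while the CPU sorts the current one. A minimal sketch of the idea (helper names and chunk size are mine, not the NT5 sort code), assuming Sort Benchmark records of 100 bytes with a 10-byte key:

```python
# Overlap disk I/O with computation: a background thread reads the next
# chunk of records while the main thread sorts the current one.
import threading
from typing import BinaryIO, Iterator, List

RECORD = 100        # Sort Benchmark record size (bytes)
KEY = 10            # key is the first 10 bytes
CHUNK = 100_000     # records per in-memory run (hypothetical tuning knob)

def read_chunk(f: BinaryIO) -> bytes:
    return f.read(RECORD * CHUNK)

def sorted_runs(f: BinaryIO) -> Iterator[List[bytes]]:
    """Yield sorted runs, reading ahead while sorting."""
    buf = read_chunk(f)
    while buf:
        nxt: List[bytes] = []
        reader = threading.Thread(target=lambda: nxt.append(read_chunk(f)))
        reader.start()                          # the next read proceeds in background...
        recs = [buf[i:i + RECORD] for i in range(0, len(buf), RECORD)]
        recs.sort(key=lambda r: r[:KEY])        # ...while the CPU sorts this run
        yield recs
        reader.join()
        buf = nxt[0]
```

A real external sort would still merge the runs onto the output disk; the point here is only that read latency can hide behind sort time, which is the 2x the slide says was left on the table.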

  44. Sandia/Compaq/ServerNet/NT Sort • Sort 1.1 terabytes (13 billion records) in 47 minutes • 68 nodes (dual 450 MHz processors), 543 disks, 1.5 M$ • 1.2 GBps network rap (2.8 GBps pap) • 5.2 GBps of disk rap (same as pap) • (rap = real application performance, pap = peak advertised performance)

  45. SP sort • 2 – 4 GBps!

  46. Progress on Sorting: NT now leads both price and performance • Speedup comes from Moore’s law 40%/year • Processor/Disk/Network arrays: 60%/year (this is a software speedup).

  47. Recent Results • NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute • MilleniumSort: 16x Dell NT cluster: 100 MB in 1.18 sec (Datamation) • Tandem/Sandia Sort: 68-cpu ServerNet: 1 TB in 47 minutes • IBM SPsort: 408 nodes, 1,952 cpus, 2,168 disks: 17.6 minutes = 1,057 sec (all for 1/3 of 94 M$; slice price is 64 K$ for 4 cpus, 2 GB RAM, 6 9-GB disks + interconnect)

  48. Data Gravity: Processing Moves to Transducers • Move processing to data sources • Move to where the power (and sheet metal) is • Processor in • Modem • Display • Microphones (speech recognition) & cameras (vision) • Storage: data storage and analysis • System is “distributed” (a cluster/mob)

  49. SAN: Standard Interconnect
      • LAN faster than memory bus?
      • 1 GBps links in lab
      • 100$ port cost soon
      • Port is computer
      • Winsock: 110 MBps (10% cpu utilization at each end)
      [Figure: bandwidth ladder: Gbps SAN 110 MBps, PCI 70 MBps, UW SCSI 40 MBps,
      FW SCSI 20 MBps, SCSI 5 MBps; tombstones: RIP FDDI, RIP ATM, RIP FC, RIP SCI,
      RIP SCSI, RIP ?]

  50. Disk = Node
      • Has magnetic storage (100 GB?)
      • Has processor & DRAM
      • Has SAN attachment
      • Has execution environment
      [Figure: the software stack on the disk: OS kernel; disk driver; SAN driver;
      file system; RPC, ...; DBMS; services; applications]
