
  1. PACT 98 http://www.research.microsoft.com/barc/gbell/pact.ppt

  2. What architectures? Compilers? Run-time environments? Programming models? … Any apps? Parallel Architectures and Compilation Techniques (PACT), Paris, 14 October 1998. Gordon Bell, Microsoft

  3. Talk plan
  • Where are we today?
  • History… predicting the future
    • Ancient
    • Strategic Computing Initiative and ASCI
    • Bell Prize since 1987
  • Apps & architecture taxonomy
  • Petaflops: when, … how, how much
  • New ideas: Grid, Globus, Legion
  • Bonus: input to Thursday panel

  4. 1998: ISVs, buyers, & users?
  • Technical: supers dying; DSM (and SMPs) trying
    • Mainline: user & ISV apps ported to PCs & workstations
    • Supers (legacy code) market lives on...
    • Vector apps (e.g. ISVs) ported to DSM (& SMP)
    • MPI for custom and a few leading-edge ISVs
    • Leading-edge, one-of-a-kind apps: clusters of 16, 256, ... 1000s built from uni, SMP, or DSM
  • Commercial: mainframes, SMPs (& DSMs), and clusters are interchangeable (control is the issue)
    • Dbase & TP: SMPs compete with mainframes if central control is an issue, else clusters
    • Data warehousing: may emerge… just a Dbase
    • High-growth web and stream servers: clusters have the advantage

  5. c2000 Architecture Taxonomy
  • SMP (mainline): Xpt-connected SMPs, Xpt-SMP vector, Xpt-multithread (Tera), "multi", Xpt-"multi" hybrid
  • DSM: DSM-SCI (commodity), DSM (high bandwidth), proprietary DSMs
  • Multicomputers aka clusters ... MPP (mainline), 16-(64)-10K processors: commodity "multis" & switches, or proprietary "multis" & switches

  6. [Chart] TOP500 technical systems by vendor, Jun 1993 - Jun 1998 (sans PC and mainframe clusters). Vendors tracked: CRI, SGI, Convex, IBM, HP, Sun, TMC, Intel, DEC, Japanese, other.

  7. [Pie charts] Parallelism of jobs on the NCSA Origin cluster, by # of jobs and by CPU-hours delivered, binned by # CPUs (1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128). 20 weeks of data, March 16 - Aug 2, 1998: 15,028 jobs / 883,777 CPU-hrs.

  8. [Chart] How are users using the Origin Array? CPU-hours delivered, by # CPUs (1 through 65-128) and memory/CPU (0-64 MB through 512+ MB).

  9. National academic community large project requests, September 1998: over 5 million NUs requested. One NU = one XMP processor-hour. Source: National Resource Allocation Committee

  10. Gordon's WAG: GB's estimate of parallelism in engineering & scientific applications. [Chart: log(# apps) vs. granularity & degree of coupling (comp./comm.), spanning PCs, WSs, supers, and clusters aka MPPs aka multicomputers (scalable multiprocessors); dusty decks for supers vs. new or scaled-up apps.] Scalar 60%; vector 15%; vector & // 5%; one-of >>// 5%; embarrassingly & perfectly parallel 15%.

  11. Application Taxonomy
  • Technical: general-purpose, non-parallelizable codes (PCs have it!); vectorizable; vectorizable & //able (supers & small DSMs); hand-tuned one-of, MPP coarse-grain, MPP embarrassingly // (clusters of PCs...)
  • Commercial: database, database/TP, web host, stream audio/video. If central control & rich, then IBM or large SMPs, else PC clusters

  12. [Chart] One-processor performance as % of Linpack, by application area: CFD, biomolecular, chemistry, materials, QCD (values shown: 22%, 25%, 19%, 14%, 33%, 26%).

  13. Gordon's WAG. [Chart] 10-processor Linpack (Gflops); 10-P apps x10; apps as % of 1-P Linpack; apps as % of 10-P Linpack.

  14. Ancient history

  15. [Chart] Growth in computational resources used for UK weather forecasting, 1950-2000: from roughly 100 ops (Leo, Mercury, KDF9) through the 195, 205, and YMP toward 10 Tops. 10^10 over 50 yrs = 1.58^50.
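A quick check of the slide's growth figure (my arithmetic, not part of the deck): a 10^10 increase over 50 years implies a yearly growth factor of about 1.58, i.e. roughly 58% per year.

```python
# 10^10 total growth spread over 50 years: the per-year factor is 10^(10/50).
rate = 10 ** (10 / 50)
print(round(rate, 2))        # ~1.58, matching the slide's 1.58^50
print(rate ** 50)            # recovers the 10^10 total gain
```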

  16. Harvard Mark I aka IBM ASCC

  17. “I think there is a world market for maybe five computers.” -- Thomas Watson Senior, Chairman of IBM, 1943

  18. The scientific market is still about that size… 3 computers
  • When scientific processing was 100% of the industry, a good predictor
  • $3 billion: 6 vendors, 7 architectures
  • DOE buys 3 very big ($100-$200M) machines every 3-4 years

  19. NCSA cluster of 6 x 128-processor SGI Origins

  20. Our tax dollars at work: ASCI for Stockpile Stewardship
  • Intel/Sandia: 9000 x 1-node Ppro
  • LLNL/IBM: 512 x 8 PowerPC (SP2)
  • LANL/Cray: ?
  • Maui Supercomputer Center: 512 x 1 SP2

  21. “LARC doesn’t need 30,000 words!” -- von Neumann, 1955
  • “During the review, someone said: ‘von Neumann was right. 30,000 words was too much IF all the users were as skilled as von Neumann... for ordinary people, 30,000 was barely enough!’” -- Edward Teller, 1995
  • The memory was approved.
  • Memory solves many problems!

  22. “Parallel processing computer architectures will be in use by 1975.” -- Navy Delphi Panel, 1969

  23. “In Dec. 1995 computers with 1,000 processors will do most of the scientific processing.” -- Danny Hillis, 1990 (1 paper or 1 company)

  24. The Bell-Hillis bet: massive parallelism in 1995. [Table: TMC vs. world-wide supers, compared on applications, petaflops/mo., and revenue]

  25. Bell-Hillis bet: wasn't paid off!
  • My goal was not necessarily to just win the bet!
  • Hennessy and Patterson were to evaluate what was really happening…
  • Wanted to understand the degree of MPP progress and programmability

  26. DARPA, 1985 Strategic Computing Initiative (SCI)
  • “A 50X LISP machine” -- Tom Knight, Symbolics
  • “A 1,000 node multiprocessor... a Teraflops by 1995” -- Gordon Bell, Encore
  → All of ~20 HPCC projects failed!

  27. SCI (c1980s): the Strategic Computing Initiative funded ATT/Columbia (Non-Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Encore, ESL, GE (like Connection Machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, and Thinking Machines (Connection Machine).

  28. Those who gave up their lives in SCI's search for parallelism: Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC (independent of ETA), Cogent, Culler, Cydrome, Denelcor, Elxsi, ETA, Evans & Sutherland Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, Multiflow, Myrias, Pixar, Prisma, SAXPY, SCS, Supertek (part of Cray), Suprenum (German national effort), Stardent (Ardent + Stellar), Supercomputer Systems Inc., Synapse, Vitec, Vitesse, Wavetracer.

  29. Worlton: the "bandwagon effect" explains massive parallelism
  • Bandwagon: a propaganda device by which the purported acceptance of an idea... is claimed in order to win further public acceptance
  • Pullers: vendors, CS community
  • Pushers: funding bureaucrats & deficit
  • Riders: innovators and early adopters
  • 4 flat tires: training, system software, applications, and "guideposts"
  • Spectators: most users, 3rd-party ISVs

  30. Parallel processing is a constant distance away.
  • “Our vision... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.” -- Grimshaw, Wulf, et al., “Legion,” CACM, Jan. 1997

  31. Progress: "Parallelism is a journey." -- Paul Borrill

  32. Let us not forget: “The purpose of computing is insight, not numbers.” -- R. W. Hamming

  33. Progress 1987-1998

  34. [Chart] Bell Prize peak Gflops vs. time

  35. Bell Prize: 1000x, 1987-1998
  • 1987 Ncube, 1,000 computers: showed that with more memory, apps scaled
  • 1987 Cray XMP, 4 proc. @ 200 Mflops/proc.
  • 1996 Intel, 9,000 proc. @ 200 Mflops/proc.
  • 1998 Bell Prize: 600 RAP Gflops
  • Parallelism gains
    • 10x in parallelism over Ncube
    • 2000x in parallelism over XMP
  • Spend 2-4x more
  • Cost effectiveness: 5x; ECL → CMOS; SRAM → DRAM
  • Moore's Law = 100x
  • Clock: 2-10x; CMOS-ECL speed cross-over
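The headline "1000x" follows from the endpoints the slide gives; a rough check (my arithmetic, not Bell's exact accounting):

```python
# 1987 Cray XMP: 4 processors at 200 Mflops each = 0.8 Gflops.
xmp_1987 = 4 * 0.2
# 1998 Bell Prize winner: 600 Gflops (real application performance).
bell_1998 = 600.0
print(round(bell_1998 / xmp_1987))   # ~750x in 11 years, i.e. roughly 1000x/decade
```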

  36. No more 1000x/decade. We are now (hopefully) only limited by Moore's Law and not by memory access.
  • 1 GF to 10 GF took 2 years
  • 10 GF to 100 GF took 3 years
  • 100 GF to 1 TF took >5 years
  • 2n+1 or 2^(n-1)+1?

  37. [Chart] Commercial perf/$

  38. [Chart] Commercial perf.

  39. 1998 observations vs. 1989 predictions for technical computing
  • Got a TFlops PAP 12/1996 vs. 1995. Really impressive progress! (RAP < 1 TF)
  • More diversity… results in NO software!
  • Predicted: SIMD, mC; hoped for scalable SMP
  • Got: supers, mCv, mC, SMP, SMP/DSM; SIMD disappeared
  • $3B (un-profitable?) industry; 10 platforms
  • PCs and workstations diverted users
  • MPP apps DID NOT materialize

  40. Observation: CMOS supers replaced ECL in Japan
  • 2.2 Gflops vector units have dual use
    • in traditional mPv supers
    • as the basis for computers in mC
  • Software apps are present
  • Vector processor out-performs n micros for many scientific apps
  • It's memory bandwidth, cache prediction, and inter-communication

  41. Observation: price & performance
  • Breaking the $30M barrier increases PAP
  • Eliminating "state computers" increased prices, but got fewer, more committed suppliers, less variation, and more focus
  • Commodity micros aka Intel are critical to improvement. DEC, IBM, and Sun are ??
  • Conjecture: supers and MPPs may be equally cost-effective despite PAP
  • Memory bandwidth determines performance & price
  • "You get what you pay for" aka "there's no free lunch"

  42. Observation: MPPs 1, users <1
  • MPPs with relatively low-speed micros and lower memory bandwidth ran over supers, but didn't kill 'em.
  • Did the U.S. industry enter an abyss?
    • Is crying "unfair trade" hypocritical?
    • Are users denied tools?
    • Are users not "getting with the program"?
  • Challenge: we must learn to program clusters...
    • Cache idiosyncrasies
    • Limited memory bandwidth
    • Long inter-communication delays
    • Very large numbers of computers

  43. Strong recommendation: utilize in situ workstations!
  • NoW (Berkeley) set the sort record; decrypting
  • Grid, Globus, Condor and other projects
  • Need a "standard" interface and programming model for clusters using "commodity" platforms & fast switches
  • Giga- and tera-bit links and switches allow geo-distributed systems
  • Each PC in a computational environment should have an additional 1GB/9GB!
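The "standard programming model for clusters" the slide calls for is essentially a scatter/compute/gather pattern. A minimal sketch: real in-situ-workstation systems (NoW, Condor, Globus) distribute work over a network, but this stand-in uses local processes so the shape of the code is runnable anywhere; the 4-way split and the summing task are illustrative choices of mine.

```python
# Scatter a dataset across "nodes", compute locally, gather partial results.
from multiprocessing import Pool

def work(chunk):
    # Placeholder per-node task: sum a slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]   # scatter across 4 "nodes"
    with Pool(4) as pool:
        partials = pool.map(work, chunks)     # each node computes its piece
    print(sum(partials))                      # gather and combine
```

On a real cluster the `Pool` would be replaced by message passing (e.g. MPI) over the fast switches the slide mentions, but the scatter/gather structure is the same.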

  44. “Petaflops by 2010” -- DOE Accelerated Strategic Computing Initiative (ASCI)

  45. DOE's 1997 "PathForward" Accelerated Strategic Computing Initiative (ASCI)
  • 1997: 1-2 Tflops, $100M
  • 1999-2001: 10-30 Tflops, $200M??
  • 2004: 100 Tflops
  • 2010: Petaflops

  46. When is a petaflops possible? At what price? (Gordon Bell, ACM 1997)
  • Moore's Law: 100x. But how fast can the clock tick?
  • Increase parallelism 10K → 100K: 10x
  • Spend more ($100M → $500M): 5x
  • Centralize center or fast network: 3x
  • Commoditization (competition): 3x
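Multiplying the slide's factors out (my arithmetic, not Bell's): from a ~1 Tflops 1997 baseline only a 1000x gain is needed, and the combined factors overshoot it comfortably, which is why only a subset needs to materialize.

```python
# The slide's five petaflops levers and their claimed multipliers.
factors = {"Moore's Law": 100, "parallelism": 10, "spending": 5,
           "centralization": 3, "commoditization": 3}
total = 1
for gain in factors.values():
    total *= gain
print(total)   # combined headroom vs. the 1000x needed from 1 Tflops
```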

  47. [Chart] Micros' gains at 20, 40, & 60%/year, plotted 1995-2045 on a 10^6 to 10^21 ops scale: 20%/yr reaches teraops, 40%/yr petaops, 60%/yr exaops.
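The chart's three curves can be reproduced with compound growth; the ~1 Gops 1995 baseline is my assumption (the slide only shows the axes).

```python
# Compound micro performance growth from an assumed 1 Gops baseline in 1995.
base = 1e9
for rate, label in [(0.20, "teraops"), (0.40, "petaops"), (0.60, "exaops")]:
    level = base * (1 + rate) ** 50          # 50 years: 1995 -> 2045
    print(f"{round(rate * 100)}%/yr -> {level:.1e} ops in 2045 (~{label})")
```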

  48. Processor limit: the DRAM gap
  • [Chart] "Moore's Law": µProc performance grows 60%/yr vs. DRAM 7%/yr, 1980-2000; the processor-memory performance gap grows 50%/year
  • Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
  • Caches in Pentium Pro: 64% of area, 88% of transistors
  • Taken from Patterson-Keeton talk to SIGMOD
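The Alpha 21264 bullet is simple arithmetic: a full miss costs the memory latency divided by the cycle time, times the issue width. Redoing it exactly gives slightly lower numbers than the slide, which evidently rounds the clock count up before multiplying.

```python
# Cost of a full cache miss in issued-instruction slots on the Alpha 21264.
miss_ns, cycle_ns, issue_width = 180, 1.7, 4
clocks = miss_ns / cycle_ns
print(round(clocks))                  # ~106 clocks (slide shows 108)
print(round(clocks * issue_width))    # ~424 instruction slots (slide shows 432)
```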

  49. Five scalabilities
  • Size scalable: designed from a few components, with no bottlenecks
  • Generation scaling: no rewrite/recompile is required across generations of computers
  • Reliability scaling
  • Geographic scaling: compute anywhere (e.g. multiple sites or in situ workstation sites)
  • Problem x machine scalability: the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer. The problem x machine space maps (problem scale, machine scale (#p)) to run time, which implies speedup and efficiency.
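The speedup and efficiency mentioned in the last bullet follow directly from run times; a sketch with the standard definitions (the sample times are illustrative, not from the talk):

```python
# Standard definitions: speedup = T1/Tp, efficiency = speedup / #processors.
def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p

t1, tp, p = 100.0, 8.0, 16   # serial time, parallel time, processor count
print(speedup(t1, tp))       # how much faster the parallel run is
print(efficiency(t1, tp, p)) # fraction of the machine doing useful work
```

Scaled-problem runs aim to hold efficiency roughly flat as both problem scale and #p grow.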

  50. Gordon's WAG: The Law of Massive Parallelism (mine) is based on application scaling
  • There exists a problem that can be made sufficiently large such that any network of computers can run it efficiently, given enough memory, searching, & work -- but this problem may be unrelated to any other.
  • Any parallel problem can be scaled to run efficiently on an arbitrary network of computers, given enough memory and time… but it may be completely impractical.
  • Challenge to theoreticians and tool builders: how well will an algorithm run?
  • Challenge for software and programmers: can a package be scalable & portable? Are there models?
  • Challenge to users: do larger scale, faster, longer run times increase problem insight, and not just total flops?
  • Challenge to funders: is the cost justified?