Advanced Computer Architecture 5MD00 / 5Z033: TOP 500 supercomputers
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
h.corporaal@tue.nl
TU Eindhoven 2011
Topics
• How to cross the Petaflop boundary
• Ranking
  • Nov 2008: crossing the Petaflop/s boundary
  • Nov 2009 / Nov 2010: what has changed
  • Nov 2011: Japan's "K Computer" on top: 10.51 Petaflop/s on Linpack using 705024 SPARC64 cores
  • 2nd: Chinese Tianhe-1A: 2.57 Petaflop/s
• Examples
  • Roadrunner (IBM)
  • Jaguar (Cray)
  • SGI Altix
  • BlueGene
ACA H.Corporaal
How to build a Petaflop supercomputer?
Some examples from 2008:
• Opteron cluster (e.g. ~2X Ranger/TACC)
  • 32,000 quad-core Opterons (130K cores)
• Cray XT3/4 (e.g. Baker/ORNL sooner)
  • 32,000 quad-core Opterons (130K cores)
• IBM BlueGene/P (bigger sooner)
  • 80,000 BG/P PPC processors (320K cores)
• IBM Cell-accelerated Roadrunner cluster
  • 10,000 Cells (80K Cell SPUs)
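The core counts quoted above follow from simple multiplication. A minimal sketch, assuming 4 cores per quad-core Opteron and per BG/P processor and 8 SPUs per Cell; the dictionary and function names are illustrative and not from the slides:

```python
# Processor count and cores per processor for each 2008 petaflop option.
# The per-processor core counts (4, 4, 4, 8) are assumptions read off the slide.
options = {
    "Opteron cluster": (32_000, 4),
    "Cray XT3/4": (32_000, 4),
    "IBM BlueGene/P": (80_000, 4),
    "Roadrunner (Cell SPUs)": (10_000, 8),
}

def total_cores(processors, cores_per_processor):
    """Total number of cores (or SPUs) in the machine."""
    return processors * cores_per_processor

core_counts = {name: total_cores(*cfg) for name, cfg in options.items()}

for name, cores in core_counts.items():
    print(f"{name}: {cores // 1000}K cores")
```

The Opteron options come out at 128K cores, which the slide rounds to 130K.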
Supercomputer Ranking
• Started in 1993
  • Jack Dongarra, University of Tennessee
• Based on LINPACK benchmark
  • linear algebra (LU factorization)
• Superseded by LAPACK
  • based on BLAS (Basic Linear Algebra Subprograms)
  • exploits caches
• Measures floating-point performance
• Fortran code
• See http://www.top500.org
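To make the benchmark idea concrete, here is a minimal sketch in pure Python of solving a dense linear system by Gaussian elimination with partial pivoting, the computation that LU factorization organizes. This is not the LINPACK or HPL code; the function names are hypothetical, and the flop count uses only the leading-order term 2n^3/3, ignoring lower-order terms.

```python
import time

def lu_solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / a[i][i]
    return x

def approx_flops(n):
    """Leading-order operation count of the elimination (assumption: 2n^3/3)."""
    return 2.0 * n ** 3 / 3.0

def measure_mflops(n=100):
    """Time one solve of a well-conditioned n x n system and report MFlop/s."""
    a = [[1.0 / (i + j + 1) + (n if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [1.0] * n
    start = time.perf_counter()
    lu_solve(a, b)
    elapsed = time.perf_counter() - start
    return approx_flops(n) / elapsed / 1e6

print(f"{measure_mflops():.1f} MFlop/s in pure Python")
```

A tuned implementation on top of BLAS reaches a far higher rate on the same hardware, mainly because it blocks the computation to exploit caches.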
Single-Chip GPU vs. Fastest Supercomputers
ref: http://www.llnl.gov/str/JanFeb05/Seager.html
Performance Ranking Nov. 2008
Performance Ranking 2008: we crossed the Petaflop boundary
Update November 2009
Update November 2010
Update Nov 2011
• 1st: K Computer
  • 10.51 Petaflop/s on Linpack
  • 705024 SPARC64 cores (Fujitsu design)
  • Tofu interconnect (6-D torus)
  • 12.7 MW
• 2nd: Chinese Tianhe-1A
  • 2.57 Petaflop/s
  • 186368 cores (Xeon + NVIDIA processors)
  • 4.0 MW
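The figures above allow a quick comparison of per-core performance and power efficiency. A minimal sketch using only the numbers on the slide; the dictionary layout and function names are illustrative:

```python
# Nov 2011 figures as quoted on the slide.
systems = {
    "K Computer": {"rmax_flops": 10.51e15, "cores": 705_024, "power_w": 12.7e6},
    "Tianhe-1A": {"rmax_flops": 2.57e15, "cores": 186_368, "power_w": 4.0e6},
}

def gflops_per_core(system):
    """Linpack performance divided by the quoted core count, in GFlop/s."""
    return system["rmax_flops"] / system["cores"] / 1e9

def mflops_per_watt(system):
    """Linpack performance per watt, in MFlop/s per W."""
    return system["rmax_flops"] / system["power_w"] / 1e6

for name, system in systems.items():
    print(f"{name}: {gflops_per_core(system):.1f} GFlop/s per core, "
          f"{mflops_per_watt(system):.0f} MFlops/W")
```

Note that the Tianhe-1A core count mixes CPU and accelerator cores, so its per-core number is only a rough average.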
Alternative ranking: Green500
• Most power-efficient supercomputers
• See www.green500.org
• 2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation
• 2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation
  • Cell cluster, ranking 110 in top500
• 2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation
  • IBM BlueGene/Q prototype, ranking 101 in top500, peak performance: 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
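The energy-per-operation figures are just the reciprocal of the efficiency, since 1 W = 1 J/s. A minimal sketch of the conversion; the function name is illustrative:

```python
def nanojoules_per_flop(mflops_per_watt):
    """Convert MFlops/Watt into nJ per floating-point operation.

    1 W = 1 J/s, so energy per operation = 1 / (flops per second per watt).
    """
    flops_per_joule = mflops_per_watt * 1e6
    return 1e9 / flops_per_joule

best_results = {2008: 536, 2009: 723, 2010: 1684}  # MFlops/Watt, from the slide

for year, efficiency in best_results.items():
    print(f"{year}: {efficiency} MFlops/W = "
          f"{nanojoules_per_flop(efficiency):.3f} nJ/flop")
```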
Energy cost
At ~$1M per MW per year, energy costs are substantial
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 possible in 200 MW with “usual” scaling
• 1 exaflop in 2018 at 20 MW is DOE target
Figure legend: normal scaling vs. desired scaling
Source: Kathy Yelick, Berkeley
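The scenarios above translate directly into a yearly energy bill and a required efficiency. A minimal sketch, assuming the ~$1M figure is a cost per MW per year (the slide does not state the period); the names are illustrative:

```python
COST_PER_MW_YEAR = 1.0e6  # dollars; ASSUMPTION: the slide's ~$1M is per MW per year

def annual_energy_cost(power_mw):
    """Yearly energy cost in dollars for a machine drawing power_mw megawatts."""
    return power_mw * COST_PER_MW_YEAR

def required_gflops_per_watt(flops, power_mw):
    """Efficiency needed to deliver `flops` within a power budget of power_mw."""
    return flops / (power_mw * 1e6) / 1e9

scenarios = {
    "1 petaflop, 2010": (1e15, 3),
    "1 exaflop, usual scaling": (1e18, 200),
    "1 exaflop, DOE target": (1e18, 20),
}

for name, (flops, power_mw) in scenarios.items():
    print(f"{name}: ${annual_energy_cost(power_mw) / 1e6:.0f}M per year, "
          f"{required_gflops_per_watt(flops, power_mw):.2f} GFlops/W")
```

Under these assumptions the DOE target corresponds to 50 GFlops/W, roughly 150 times the efficiency of the 3 MW petaflop machine.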
Nr1 (2008): Roadrunner
• IBM cluster
• 6480 nodes with
  • Dual-core Opteron 1.8 GHz
  • 2 * PowerXCell 8i 3.2 GHz (12.8 GFlops)
• InfiniBand connection fabric (16 Gbit/s per link)
• Fat-tree interconnect
• 100 TB DRAM memory
• 216 I/O nodes
• MPI programming
• 2.35 MW power !!
• Size: 296 racks, 5500 ft²
This is huge !!
Cell/B.E.: the architecture
• 1 x PPE 64-bit PowerPC
  • L1: 32 KB I$ + 32 KB D$
  • L2: 512 KB
• 8 x SPE cores:
  • Local store: 256 KB
  • 128 x 128-bit vector registers
• Hybrid memory model:
  • PPE: read/write
  • SPEs: asynchronous DMA
• EIB: 205 GB/s sustained aggregate bandwidth
• Processor-to-memory bandwidth: 25.6 GB/s
• Processor-to-processor: 20 GB/s in each direction
Roadrunner: TriBlade = 2 nodes
For more details: presentation slides of Ken Koch, March 2008
Nr2 (2008): Jaguar Cray XT5 QC
• Presumably 5 times the following:
  • 7832 quad-core 2.1 GHz AMD Opteron
  • 62 TB memory (= 2 GB / core)
  • 600 TB file system
  • 250 TFlop
• In total 150152 cores
• SeaStar2+ interconnect (from Cray)
• Note 2009: quad-cores replaced by six-cores
  • now nr 1
  • 224,256 cores
  • peak 1.75 PetaFlop
• Paper: Bland A.S., Kendall R.A., Kothe D.B., Rogers J.H., Shipman G.M., "Jaguar: The World's Most Powerful Computer"
Jaguar
Nr3 (2008): SGI Altix ICE 8200
• 92 racks of Altix ICE 8200EX
  • with 3.0 GHz Intel Xeon quad-core processors
  • 47,104 cores
• 8 racks of Altix ICE 8200
  • with 2.66 GHz Intel quad-core processors
  • 4096 cores
• 51 TB main memory
• DDR InfiniBand
Nr4 (2008): BlueGene/L (IBM)
• Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops
• 105,496 nodes
• 3D torus interconnect for point-to-point communication + collective network
Figures: 3D torus; complete system rack
BlueGene/L ASIC node
BlueGene/L node board
• 16 cards with 2 ASICs each
• 8 GB
• 180 GFlops
2009: BlueGene/P
• System: 256 racks, 3.56 PFlops, up to 1 PB
• Rack: 32 node cards, 13.9 TFlops, 2-4 TB
• Node card: 32 processor cards, 435 GFlops, 64-128 GB
• Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB
• ASIC: 13.6 GFlops, 8 MB EDRAM
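The peak figures at each packaging level follow from multiplying the ASIC peak by the component counts. A minimal sketch that rebuilds them bottom-up from the numbers on the slide; the variable and function names are illustrative:

```python
ASIC_GFLOPS = 13.6  # peak per ASIC, from the slide

# (level name, number of units of the previous level it contains)
levels = [
    ("processor card", 1),   # one 4-processor chip
    ("node card", 32),       # 32 processor cards
    ("rack", 32),            # 32 node cards
    ("system", 256),         # 256 racks
]

def peak_by_level(asic_gflops, levels):
    """Accumulate peak GFlops from the ASIC up to the full system."""
    peaks = {}
    gflops = asic_gflops
    for name, count in levels:
        gflops *= count
        peaks[name] = gflops
    return peaks

peaks = peak_by_level(ASIC_GFLOPS, levels)
for name, gflops in peaks.items():
    print(f"{name}: {gflops:,.1f} GFlops")
```

The results (435.2 GFlops, about 13.9 TFlops, about 3.56 PFlops) match the rounded values quoted per level.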
BlueGene/P ASIC
PPC450: Exploiting SIMD
• Two FPUs
  • 2 x 32 64-bit registers
• SIMD
  • Datapath width = 16 bytes
  • Feeds two FPUs, with 8 bytes each, every cycle
• Two FP multiply-add operations per cycle
  • 3.4 GFLOP/s peak performance
BlueGene/P ASIC
• 208M transistors
• 850 MHz
• 16 W
• 90 nm
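The 3.4 GFLOP/s per-core peak of the PPC450 and the 13.6 GFlops per ASIC can be recovered from the 850 MHz clock. A minimal sketch, assuming each multiply-add counts as 2 floating-point operations and that the ASIC holds 4 cores (the "4-processor chip" of the BlueGene/P packaging slide); the constant names are illustrative:

```python
CLOCK_HZ = 850e6            # from the ASIC slide
FPUS_PER_CORE = 2           # two FPUs, one multiply-add each per cycle
FLOPS_PER_MULTIPLY_ADD = 2  # ASSUMPTION: a multiply-add counts as 2 flops
CORES_PER_ASIC = 4          # ASSUMPTION: the "4-processor chip"
ASIC_POWER_W = 16           # from the ASIC slide

core_peak_gflops = CLOCK_HZ * FPUS_PER_CORE * FLOPS_PER_MULTIPLY_ADD / 1e9
asic_peak_gflops = core_peak_gflops * CORES_PER_ASIC
asic_gflops_per_watt = asic_peak_gflops / ASIC_POWER_W

print(f"core peak: {core_peak_gflops:.1f} GFlops")
print(f"ASIC peak: {asic_peak_gflops:.1f} GFlops")
print(f"ASIC efficiency: {asic_gflops_per_watt:.2f} GFlops/W (chip power only)")
```

The efficiency figure covers the chip alone; memory, interconnect, and cooling are not included.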
BlueGene/P node card
Next: BlueGene/Q
• 10 PFlops in 2011-2012
• See www.research.ibm.com/bluegene
Can we match the human brain?
• Performance = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 200 (2 * 10^2) calculations per second per connection = 2 * 10^16 calculations per second
• Memory = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 10 bytes (information about connection strength, address of output neuron, and type of synapse) = 10^15 bytes = 1 PB = 1000 TB
How far off are we?
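The estimate can be checked and set against the Nov 2011 number one machine. A minimal sketch; equating one "calculation" with one Linpack floating-point operation is a crude assumption made only for this comparison, and the constant names are illustrative:

```python
NEURONS = 10**11
CONNECTIONS_PER_NEURON = 10**3
CALCS_PER_SECOND_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_calcs_per_second = (NEURONS * CONNECTIONS_PER_NEURON
                          * CALCS_PER_SECOND_PER_CONNECTION)
brain_memory_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

# K Computer Linpack result from the Nov 2011 update slide.
K_COMPUTER_FLOPS = 10.51e15

# ASSUMPTION: one "calculation" is treated as one floating-point operation.
gap = brain_calcs_per_second / K_COMPUTER_FLOPS

print(f"brain: {brain_calcs_per_second:.1e} calculations/s")
print(f"brain memory: {brain_memory_bytes / 10**15:.0f} PB")
print(f"brain / K Computer: {gap:.1f}x")
```

On this rough measure the estimate sits within a factor of two of the fastest machine of 2011.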
Blue Brain research
• Software replica of one column of the neocortex
• Cortex: 85% of the brain's total mass; required for language, learning, memory and complex thought
• The essential first step to simulating the whole brain
• Next: include circuitry from other brain regions and eventually the whole brain
Latest news: factorization of RSA768
• RSA is used to encrypt text using a public and a private key
• EPFL, CWI and others have broken RSA768
• This means: factorizing a 768-bit number into 2 primes
• Using 1700 AMD 2.2 GHz cores for 1 year => about 15 million core-hours (single-core compute time)
• Current RSA standard uses 1024 bits
  • still safe for some years
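The single-core compute time follows from the core count and the duration. A minimal sketch, assuming a 365-day year; the names are illustrative:

```python
CORES = 1700
HOURS_PER_YEAR = 365 * 24  # ASSUMPTION: a 365-day year

core_hours = CORES * HOURS_PER_YEAR
core_years = core_hours / HOURS_PER_YEAR

print(f"{core_hours:,} core-hours (about {core_hours / 1e6:.0f} million)")
print(f"{core_years:.0f} years on a single core")
```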
RSA (Rivest, Shamir, Adleman)
• Choose 2 (large) primes p and q
• n = p*q
• Choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors)
• Choose d such that d*e = 1 mod ((p-1)(q-1))
• Public key = (n,e), private key = (n,d)
• Encryption of message m: c = m^e mod n
• Decryption of cipher c: m = c^d mod n
• See Wikipedia for details and a working example
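The steps above can be sketched with toy-sized numbers. This is an illustration only: the primes 61 and 53 and the exponent 17 are arbitrary small values chosen for readability, the function names are hypothetical, and real RSA uses far larger primes plus padding. The modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Build an RSA key pair from primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p-1)(q-1)")
    d = pow(e, -1, phi)  # d*e = 1 mod (p-1)(q-1)
    return (n, e), (n, d)

def encrypt(m, public_key):
    """c = m^e mod n"""
    n, e = public_key
    return pow(m, e, n)

def decrypt(c, private_key):
    """m = c^d mod n"""
    n, d = private_key
    return pow(c, d, n)

# Toy example with small primes (illustrative values only).
public_key, private_key = make_keys(61, 53, 17)
message = 65
cipher = encrypt(message, public_key)
recovered = decrypt(cipher, private_key)

print("public key:", public_key)
print("private key:", private_key)
print("round trip ok:", recovered == message)
```

Anyone who can factor n into p and q can recompute d the same way `make_keys` does, which is why factoring RSA768 breaks keys of that size.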
RSA factorization result
• Factorization of RSA768, the following 768-bit, 232-digit number from RSA's challenge list:
1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413
= 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489
* 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917